FakeGL: A Synthetic General Ledger and Trial Balance
January 2017

My work involves processing a lot of General Ledgers and I wanted to build and test various automation and analytical techniques to see how my workflow could be improved. In order to do that in a free and fun way, I would need fake data, so I set out to build a process to generate a fake General Ledger (GL) and a corresponding Trial Balance (TB).

Motivation and scope

Initially I didn’t know how comprehensive I needed the GL to be - modern systems are complex and store data for a wide variety of uses. I resolved to start with something simple and iterate for as long as I wanted.

The journals produced below satisfy the following general accounting principles:

  1. Each journal contains equal debits and credits
  2. Opening and closing Trial Balances net to 0
  3. Profit and Loss (P&L) accounts start the year with 0 balance, Balance Sheet (BS) accounts do not.
  4. Each transaction hits both the P&L and the BS (i.e. If an account on the P&L is credited, then the other side of the transaction is a debit to the BS)
  5. Distinguish between manual and automatic journals

The GL:

  1. Contains journals posted evenly throughout the year (this isnt realistic, but is a simple way to generate date data)
  2. Receives journals from subledgers
  3. Identifies if a journal is manual depending on which subledger it originated with
  4. Records which user posted the journal if the journal is manual

The script below allows the user to specify:

  1. The number of accounts on the GL/TB
  2. The number of journals in the GL
  3. A mean and variance for the number of lines in each journal
  4. A mean and variance for the functional amounts posted to accounts
  5. How many different users post manual journals
  6. The beginning of the financial year
  7. The criteria for a manual journal, based on subledger
  8. The proportion of manuals which are manual
  9. The proportion of accounts which hit the P&L or BS
  10. An arbitrary list of subledgers

The Jupyter Notebook below shows the annotated Python 3 code I wrote:

Notebook set-up

Load the various libraries used to easily add the required features. Two libraries to note:

  • Pandas is pythons ubiquitous data handling tool
  • Faker is a useful tool to generate fake data, and is an easy way to bootstrap a database
In [1]:
from random import gauss
from faker import Factory
import random
import numpy as np
import time
from datetime import timedelta
import datetime
from natsort import natsorted, ns
import pandas as pd

Choose parameters and values for the GL and TB

In [2]:
# ****** Number of different accounts in the GL *********
x = 111 


# ****** Number of journals in the GL *******************
j = 15713

# Setup posting date 
d0 = '20160101' # first day, data generated over 1 year.
d1 = datetime.datetime.strptime(d0, "%Y%m%d")


# ****** Distribution of lines per journals *************
jl_1 = 21 # mean
jl_2 = 10 # variance
j_l = lambda x, y: abs(int(gauss(jl_1,jl_2)))


# ****** Number of different users posting journals *****
fake = Factory.create('en_GB')
U = 10
ul = []
for _ in range(0,U): ul.append(fake.name())

# ****** Functional amount values ***********************
q1 = 700 # mean
q2 = 104 # variance
def q(q1,q2):
    p = random.random() < 0.5 # True implies 
    if p: i = -1
    else: i = 1
    
    out = i * round(gauss(q1,q2),2)
    return out


# ****** Proportion of journals which are manual ********
Mp = 0.23

# ****** Proportion of accounts that are P&L accounts ***
Pp = 0.3

# ****** Subledger names *********
source_feeds = ['sl1','sl2','sl3']

Functions

Identify if an account feeds into the P&L or BS:

In [3]:
def isPandLaccount(element):
    if len(element) > 0:
        return element[2] == 'P'
        return False
    
def isBSaccount(element):
    if len(element) > 0:
        return element[2] == 'B'
        return False

Generate account codes:

In [4]:
def account_names(x):
    b_names = []
    p_names = []
    a_names = []
    
    p = 'ACP'
    b = 'ACB'
    
    for i in range(x):
        A = random.random() < Pp
        if A: 
            y = b+str(i+1).zfill(5)
            b_names.append(y)
        else: 
            y = p+str(i+1).zfill(5)
            p_names.append(y)
        
    if len(b_names) % 2 != 0: del b_names[-1]
    if len(p_names) % 2 != 0: del p_names[-1]
    
    a_names = b_names + p_names
        
       
    return(a_names)

Generate journal names and lengths:

In [5]:
def journal_names(j):
    d0 = '20160101' # first day, data generated over 1 year.
    d1 = datetime.datetime.strptime(d0, "%Y%m%d")
    a_n = []
    
    for i in range(j):   
        n = d1.strftime("%Y%m%d_")
        y = 'J_' + n + str(i+1).zfill(1)
        d1 = d1 + datetime.timedelta(days=365/j)
        a_n.append(y)
    
    j_names = dict((el, int( j_l(jl_1,jl_2) / 2 )) for el in a_n) # determine how many lines are in each journal. 
    return j_names

Create the list of journal names and account codes

In [6]:
j_names = journal_names(j) 
a_names = account_names(x)

Create the fake General Ledger and save it to a text file

In [7]:
# Output format
glf = 'Journal_ID|Line|Type|Date|User|Account|Source|Functional_Currency|Functional_Amount'

f = open('gl.txt', 'w')
f.write(glf + '\n')

for key in natsorted(j_names, key=lambda y: y.lower()):
    line_no = -1
    i = 0
    
    # Assign each journal a source feed
    source_id = random.choice(source_feeds)
     
    # Assign each journal a posting date   
    posting_date = d1.strftime("%Y%m%d")
    d1 = d1 + datetime.timedelta(days=365/j)
      
    # Make journal either M or A, if M assign user
    t = random.random() < Mp # True implies 
    p=random.triangular(0, U, 3*U/4)
    if t: 
        man_ind = 'M'
        u_name = ul[int(p)]
    else: 
        man_ind = 'A'
        u_name = ''
    
    # Assign functional amount to each line
    while i < j_names[key]:
        i = i + 2
        line_no = line_no + 2
        line_no2 = line_no + 1
        dr = q(q1,q2)
        cr = -1 * dr
        
        a_names_p = list(filter(isPandLaccount, a_names))
        a_names_b = list(filter(isBSaccount, a_names))

        an1 = np.random.choice(a_names_p)  
        an2 = np.random.choice(a_names_b)
        
        l_1 = key + '|' + str(line_no) + '|' + man_ind + '|' + posting_date + '|' + u_name + '|' + an1 + '|' + source_id + '|' + 'GBP' + '|' + str(dr)
        l_2 = key + '|' + str(line_no2) + '|' +man_ind + '|' + posting_date + '|' + u_name + '|' + an2 + '|' + source_id + '|' + 'GBP' + '|' + str(cr)
        f.write(l_1 + '\n')
        f.write(l_2 + '\n')

f.close()

Create the Trial Balance and save it to a text file

In [8]:
# Use gl to calc movement on each account
gl = pd.read_csv('gl.txt', sep = '|')
tb = gl[['Account','Functional_Amount']]

# Calc net movement on each account
tb = tb.groupby(['Account']).sum().round(2)
tb.reset_index(level=0, inplace=True)
tb.columns = ['Account','Movement']
tb.sort_values(by = 'Account')

# Assign account type
# Set b/f balances to 0 for P&L accounts
tb.loc[tb.Account.str[2] == 'P', 'Balance b/f'] = 0
tb.loc[tb.Account.str[2] == 'P', 'Type'] = 'P&L'
tb.loc[tb.Account.str[2] == 'B', 'Type'] = 'BS'
tb['Balance dummy']=tb['Balance b/f']

# if b/f balance is != 0, generate a balance for that account
i = 0
for index, row in tb.iterrows():
    if row['Balance dummy'] != 0:
        row['Balance b/f'] = round(gauss(q1,q2),2)
        bal = round(gauss(q1,q2),2)
        tb.loc[i,'Balance b/f'] = bal
        tb.loc[i+1,'Balance b/f'] = -1 * bal
    i += 2    
del tb['Balance dummy']

# create c/f field
tb['Balance c/f'] = ( tb['Balance b/f'] + tb['Movement'] ).round(2)

# create 'date of balance' column
tb['Balance date'] = d1.strftime("%Y%m%d")

# Arrange columns
tb = tb[['Account', 'Type', 'Balance b/f' , 'Balance c/f', 'Balance date']]

# print TB to file
tb.to_csv('tb.txt', sep='|', encoding='utf-8', header=True, index=False)

Load the text files back in and display their top 10 rows

Verify that the files have been produced correctly and that the TB balances as expected

In [9]:
gl = pd.read_csv('gl.txt', sep = '|')
tb = pd.read_csv('tb.txt', sep = '|')
In [10]:
tb.head(10)
Out[10]:
Account Type Balance b/f Balance c/f Balance date
0 ACB00003 BS 673.10 72348.68 20161231
1 ACB00007 BS -673.10 20045.09 20161231
2 ACB00010 BS 748.16 -30340.79 20161231
3 ACB00012 BS -748.16 188.39 20161231
4 ACB00015 BS 814.96 48294.11 20161231
5 ACB00017 BS -814.96 10659.80 20161231
6 ACB00021 BS 835.56 18357.33 20161231
7 ACB00032 BS -835.56 3406.80 20161231
8 ACB00034 BS 759.60 26505.40 20161231
9 ACB00036 BS -759.60 -39128.80 20161231
In [11]:
gl.head(10)
Out[11]:
Journal_ID Line Type Date User Account Source Functional_Currency Functional_Amount
0 J_20160101_1 1 M 20160101 Iain Gardiner ACP00054 sl2 GBP -587.49
1 J_20160101_1 2 M 20160101 Iain Gardiner ACB00064 sl2 GBP 587.49
2 J_20160101_1 3 M 20160101 Iain Gardiner ACP00022 sl2 GBP 816.17
3 J_20160101_1 4 M 20160101 Iain Gardiner ACB00017 sl2 GBP -816.17
4 J_20160101_1 5 M 20160101 Iain Gardiner ACP00088 sl2 GBP 628.60
5 J_20160101_1 6 M 20160101 Iain Gardiner ACB00062 sl2 GBP -628.60
6 J_20160101_1 7 M 20160101 Iain Gardiner ACP00079 sl2 GBP -672.34
7 J_20160101_1 8 M 20160101 Iain Gardiner ACB00017 sl2 GBP 672.34
8 J_20160101_2 1 A 20160101 NaN ACP00005 sl1 GBP -683.52
9 J_20160101_2 2 A 20160101 NaN ACB00036 sl1 GBP 683.52
In [12]:
print('Net Opening TB:',"%.2f" % tb['Balance b/f'].sum())
print('Net Closing TB:',"%.2f" % tb['Balance c/f'].sum())
Net Opening TB: 0.00
Net Closing TB: 0.00