My work involves processing a lot of General Ledgers and I wanted to build and test various automation and analytical techniques to see how my workflow could be improved. In order to do that in a free and fun way, I would need fake data, so I set out to build a process to generate a fake General Ledger (GL) and a corresponding Trial Balance (TB).

Motivation and scope¶

Initially I didn’t know how comprehensive I needed the GL to be - modern systems are complex and store data for a wide variety of uses. I resolved to start with something simple and iterate for as long as I wanted.

The journals produced below satisfy the following general accounting principles:

Each journal contains equal debits and credits
Opening and closing Trial Balances net to 0
Profit and Loss (P&L) accounts start the year with 0 balance, Balance Sheet (BS) accounts do not.
Each transaction hits both the P&L and the BS (i.e. If an account on the P&L is credited, then the other side of the transaction is a debit to the BS)
Distinguish between manual and automatic journals

The GL:

Contains journals posted evenly throughout the year (this isnt realistic, but is a simple way to generate date data)
Receives journals from subledgers
Identifies if a journal is manual depending on which subledger it originated with
Records which user posted the journal if the journal is manual

The script below allows the user to specify:

The number of accounts on the GL/TB
The number of journals in the GL
A mean and variance for the number of lines in each journal
A mean and variance for the functional amounts posted to accounts
How many different users post manual journals
The beginning of the financial year
The criteria for a manual journal, based on subledger
The proportion of manuals which are manual
The proportion of accounts which hit the P&L or BS
An arbitrary list of subledgers

The Jupyter Notebook below shows the annotated Python 3 code I wrote:

Notebook set-up¶

Load the various libraries used to easily add the required features. Two libraries to note:

Pandas is pythons ubiquitous data handling tool
Faker is a useful tool to generate fake data, and is an easy way to bootstrap a database

In [1]:

from random import gauss
from faker import Factory
import random
import numpy as np
import time
from datetime import timedelta
import datetime
from natsort import natsorted, ns
import pandas as pd

Choose parameters and values for the GL and TB¶

In [2]:

# ****** Number of different accounts in the GL *********
x = 111 


# ****** Number of journals in the GL *******************
j = 15713

# Setup posting date 
d0 = '20160101' # first day, data generated over 1 year.
d1 = datetime.datetime.strptime(d0, "%Y%m%d")


# ****** Distribution of lines per journals *************
jl_1 = 21 # mean
jl_2 = 10 # variance
j_l = lambda x, y: abs(int(gauss(jl_1,jl_2)))


# ****** Number of different users posting journals *****
fake = Factory.create('en_GB')
U = 10
ul = []
for _ in range(0,U): ul.append(fake.name())

# ****** Functional amount values ***********************
q1 = 700 # mean
q2 = 104 # variance
def q(q1,q2):
    p = random.random() < 0.5 # True implies 
    if p: i = -1
    else: i = 1
    
    out = i * round(gauss(q1,q2),2)
    return out


# ****** Proportion of journals which are manual ********
Mp = 0.23

# ****** Proportion of accounts that are P&L accounts ***
Pp = 0.3

# ****** Subledger names *********
source_feeds = ['sl1','sl2','sl3']

Functions¶

Identify if an account feeds into the P&L or BS:

In [3]:

def isPandLaccount(element):
    if len(element) > 0:
        return element[2] == 'P'
        return False
    
def isBSaccount(element):
    if len(element) > 0:
        return element[2] == 'B'
        return False

Generate account codes:

In [4]:

def account_names(x):
    b_names = []
    p_names = []
    a_names = []
    
    p = 'ACP'
    b = 'ACB'
    
    for i in range(x):
        A = random.random() < Pp
        if A: 
            y = b+str(i+1).zfill(5)
            b_names.append(y)
        else: 
            y = p+str(i+1).zfill(5)
            p_names.append(y)
        
    if len(b_names) % 2 != 0: del b_names[-1]
    if len(p_names) % 2 != 0: del p_names[-1]
    
    a_names = b_names + p_names
        
       
    return(a_names)

Generate journal names and lengths:

In [5]:

def journal_names(j):
    d0 = '20160101' # first day, data generated over 1 year.
    d1 = datetime.datetime.strptime(d0, "%Y%m%d")
    a_n = []
    
    for i in range(j):   
        n = d1.strftime("%Y%m%d_")
        y = 'J_' + n + str(i+1).zfill(1)
        d1 = d1 + datetime.timedelta(days=365/j)
        a_n.append(y)
    
    j_names = dict((el, int( j_l(jl_1,jl_2) / 2 )) for el in a_n) # determine how many lines are in each journal. 
    return j_names

Create the list of journal names and account codes¶

In [6]:

j_names = journal_names(j) 
a_names = account_names(x)

Create the fake General Ledger and save it to a text file¶

In [7]:

# Output format
glf = 'Journal_ID|Line|Type|Date|User|Account|Source|Functional_Currency|Functional_Amount'

f = open('gl.txt', 'w')
f.write(glf + '\n')

for key in natsorted(j_names, key=lambda y: y.lower()):
    line_no = -1
    i = 0
    
    # Assign each journal a source feed
    source_id = random.choice(source_feeds)
     
    # Assign each journal a posting date   
    posting_date = d1.strftime("%Y%m%d")
    d1 = d1 + datetime.timedelta(days=365/j)
      
    # Make journal either M or A, if M assign user
    t = random.random() < Mp # True implies 
    p=random.triangular(0, U, 3*U/4)
    if t: 
        man_ind = 'M'
        u_name = ul[int(p)]
    else: 
        man_ind = 'A'
        u_name = ''
    
    # Assign functional amount to each line
    while i < j_names[key]:
        i = i + 2
        line_no = line_no + 2
        line_no2 = line_no + 1
        dr = q(q1,q2)
        cr = -1 * dr
        
        a_names_p = list(filter(isPandLaccount, a_names))
        a_names_b = list(filter(isBSaccount, a_names))

        an1 = np.random.choice(a_names_p)  
        an2 = np.random.choice(a_names_b)
        
        l_1 = key + '|' + str(line_no) + '|' + man_ind + '|' + posting_date + '|' + u_name + '|' + an1 + '|' + source_id + '|' + 'GBP' + '|' + str(dr)
        l_2 = key + '|' + str(line_no2) + '|' +man_ind + '|' + posting_date + '|' + u_name + '|' + an2 + '|' + source_id + '|' + 'GBP' + '|' + str(cr)
        f.write(l_1 + '\n')
        f.write(l_2 + '\n')

f.close()

Create the Trial Balance and save it to a text file¶

In [8]:

# Use gl to calc movement on each account
gl = pd.read_csv('gl.txt', sep = '|')
tb = gl[['Account','Functional_Amount']]

# Calc net movement on each account
tb = tb.groupby(['Account']).sum().round(2)
tb.reset_index(level=0, inplace=True)
tb.columns = ['Account','Movement']
tb.sort_values(by = 'Account')

# Assign account type
# Set b/f balances to 0 for P&L accounts
tb.loc[tb.Account.str[2] == 'P', 'Balance b/f'] = 0
tb.loc[tb.Account.str[2] == 'P', 'Type'] = 'P&L'
tb.loc[tb.Account.str[2] == 'B', 'Type'] = 'BS'
tb['Balance dummy']=tb['Balance b/f']

# if b/f balance is != 0, generate a balance for that account
i = 0
for index, row in tb.iterrows():
    if row['Balance dummy'] != 0:
        row['Balance b/f'] = round(gauss(q1,q2),2)
        bal = round(gauss(q1,q2),2)
        tb.loc[i,'Balance b/f'] = bal
        tb.loc[i+1,'Balance b/f'] = -1 * bal
    i += 2    
del tb['Balance dummy']

# create c/f field
tb['Balance c/f'] = ( tb['Balance b/f'] + tb['Movement'] ).round(2)

# create 'date of balance' column
tb['Balance date'] = d1.strftime("%Y%m%d")

# Arrange columns
tb = tb[['Account', 'Type', 'Balance b/f' , 'Balance c/f', 'Balance date']]

# print TB to file
tb.to_csv('tb.txt', sep='|', encoding='utf-8', header=True, index=False)

Load the text files back in and display their top 10 rows¶

Verify that the files have been produced correctly and that the TB balances as expected

In [9]:

gl = pd.read_csv('gl.txt', sep = '|')
tb = pd.read_csv('tb.txt', sep = '|')

In [10]:

tb.head(10)

Out[10]:

	Account	Type	Balance b/f	Balance c/f	Balance date
0	ACB00003	BS	673.10	72348.68	20161231
1	ACB00007	BS	-673.10	20045.09	20161231
2	ACB00010	BS	748.16	-30340.79	20161231
3	ACB00012	BS	-748.16	188.39	20161231
4	ACB00015	BS	814.96	48294.11	20161231
5	ACB00017	BS	-814.96	10659.80	20161231
6	ACB00021	BS	835.56	18357.33	20161231
7	ACB00032	BS	-835.56	3406.80	20161231
8	ACB00034	BS	759.60	26505.40	20161231
9	ACB00036	BS	-759.60	-39128.80	20161231

In [11]:

gl.head(10)

Out[11]:

	Journal_ID	Line	Type	Date	User	Account	Source	Functional_Currency	Functional_Amount
0	J_20160101_1	1	M	20160101	Iain Gardiner	ACP00054	sl2	GBP	-587.49
1	J_20160101_1	2	M	20160101	Iain Gardiner	ACB00064	sl2	GBP	587.49
2	J_20160101_1	3	M	20160101	Iain Gardiner	ACP00022	sl2	GBP	816.17
3	J_20160101_1	4	M	20160101	Iain Gardiner	ACB00017	sl2	GBP	-816.17
4	J_20160101_1	5	M	20160101	Iain Gardiner	ACP00088	sl2	GBP	628.60
5	J_20160101_1	6	M	20160101	Iain Gardiner	ACB00062	sl2	GBP	-628.60
6	J_20160101_1	7	M	20160101	Iain Gardiner	ACP00079	sl2	GBP	-672.34
7	J_20160101_1	8	M	20160101	Iain Gardiner	ACB00017	sl2	GBP	672.34
8	J_20160101_2	1	A	20160101	NaN	ACP00005	sl1	GBP	-683.52
9	J_20160101_2	2	A	20160101	NaN	ACB00036	sl1	GBP	683.52

In [12]:

print('Net Opening TB:',"%.2f" % tb['Balance b/f'].sum())
print('Net Closing TB:',"%.2f" % tb['Balance c/f'].sum())

Net Opening TB: 0.00
Net Closing TB: 0.00