My work involves processing a lot of General Ledgers and I wanted to build and test various automation and analytical techniques to see how my workflow could be improved. In order to do that in a free and fun way, I would need fake data, so I set out to build a process to generate a fake General Ledger (GL) and a corresponding Trial Balance (TB).
Motivation and scope¶
Initially I didn’t know how comprehensive I needed the GL to be - modern systems are complex and store data for a wide variety of uses. I resolved to start with something simple and iterate for as long as I wanted.
The journals produced below satisfy the following general accounting principles:
- Each journal contains equal debits and credits
- Opening and closing Trial Balances net to 0
- Profit and Loss (P&L) accounts start the year with 0 balance, Balance Sheet (BS) accounts do not.
- Each transaction hits both the P&L and the BS (i.e. If an account on the P&L is credited, then the other side of the transaction is a debit to the BS)
- Distinguish between manual and automatic journals
The GL:
- Contains journals posted evenly throughout the year (this isnt realistic, but is a simple way to generate date data)
- Receives journals from subledgers
- Identifies if a journal is manual depending on which subledger it originated with
- Records which user posted the journal if the journal is manual
The script below allows the user to specify:
- The number of accounts on the GL/TB
- The number of journals in the GL
- A mean and variance for the number of lines in each journal
- A mean and variance for the functional amounts posted to accounts
- How many different users post manual journals
- The beginning of the financial year
- The criteria for a manual journal, based on subledger
- The proportion of manuals which are manual
- The proportion of accounts which hit the P&L or BS
- An arbitrary list of subledgers
The Jupyter Notebook below shows the annotated Python 3 code I wrote:
Notebook set-up¶
Load the various libraries used to easily add the required features. Two libraries to note:
- Pandas is pythons ubiquitous data handling tool
- Faker is a useful tool to generate fake data, and is an easy way to bootstrap a database
from random import gauss
from faker import Factory
import random
import numpy as np
import time
from datetime import timedelta
import datetime
from natsort import natsorted, ns
import pandas as pd
Choose parameters and values for the GL and TB¶
# ****** Number of different accounts in the GL *********
x = 111
# ****** Number of journals in the GL *******************
j = 15713
# Setup posting date
d0 = '20160101' # first day, data generated over 1 year.
d1 = datetime.datetime.strptime(d0, "%Y%m%d")
# ****** Distribution of lines per journals *************
jl_1 = 21 # mean
jl_2 = 10 # variance
j_l = lambda x, y: abs(int(gauss(jl_1,jl_2)))
# ****** Number of different users posting journals *****
fake = Factory.create('en_GB')
U = 10
ul = []
for _ in range(0,U): ul.append(fake.name())
# ****** Functional amount values ***********************
q1 = 700 # mean
q2 = 104 # variance
def q(q1,q2):
p = random.random() < 0.5 # True implies
if p: i = -1
else: i = 1
out = i * round(gauss(q1,q2),2)
return out
# ****** Proportion of journals which are manual ********
Mp = 0.23
# ****** Proportion of accounts that are P&L accounts ***
Pp = 0.3
# ****** Subledger names *********
source_feeds = ['sl1','sl2','sl3']
Functions¶
Identify if an account feeds into the P&L or BS:
def isPandLaccount(element):
if len(element) > 0:
return element[2] == 'P'
return False
def isBSaccount(element):
if len(element) > 0:
return element[2] == 'B'
return False
Generate account codes:
def account_names(x):
b_names = []
p_names = []
a_names = []
p = 'ACP'
b = 'ACB'
for i in range(x):
A = random.random() < Pp
if A:
y = b+str(i+1).zfill(5)
b_names.append(y)
else:
y = p+str(i+1).zfill(5)
p_names.append(y)
if len(b_names) % 2 != 0: del b_names[-1]
if len(p_names) % 2 != 0: del p_names[-1]
a_names = b_names + p_names
return(a_names)
Generate journal names and lengths:
def journal_names(j):
d0 = '20160101' # first day, data generated over 1 year.
d1 = datetime.datetime.strptime(d0, "%Y%m%d")
a_n = []
for i in range(j):
n = d1.strftime("%Y%m%d_")
y = 'J_' + n + str(i+1).zfill(1)
d1 = d1 + datetime.timedelta(days=365/j)
a_n.append(y)
j_names = dict((el, int( j_l(jl_1,jl_2) / 2 )) for el in a_n) # determine how many lines are in each journal.
return j_names
Create the list of journal names and account codes¶
j_names = journal_names(j)
a_names = account_names(x)
Create the fake General Ledger and save it to a text file¶
# Output format
glf = 'Journal_ID|Line|Type|Date|User|Account|Source|Functional_Currency|Functional_Amount'
f = open('gl.txt', 'w')
f.write(glf + '\n')
for key in natsorted(j_names, key=lambda y: y.lower()):
line_no = -1
i = 0
# Assign each journal a source feed
source_id = random.choice(source_feeds)
# Assign each journal a posting date
posting_date = d1.strftime("%Y%m%d")
d1 = d1 + datetime.timedelta(days=365/j)
# Make journal either M or A, if M assign user
t = random.random() < Mp # True implies
p=random.triangular(0, U, 3*U/4)
if t:
man_ind = 'M'
u_name = ul[int(p)]
else:
man_ind = 'A'
u_name = ''
# Assign functional amount to each line
while i < j_names[key]:
i = i + 2
line_no = line_no + 2
line_no2 = line_no + 1
dr = q(q1,q2)
cr = -1 * dr
a_names_p = list(filter(isPandLaccount, a_names))
a_names_b = list(filter(isBSaccount, a_names))
an1 = np.random.choice(a_names_p)
an2 = np.random.choice(a_names_b)
l_1 = key + '|' + str(line_no) + '|' + man_ind + '|' + posting_date + '|' + u_name + '|' + an1 + '|' + source_id + '|' + 'GBP' + '|' + str(dr)
l_2 = key + '|' + str(line_no2) + '|' +man_ind + '|' + posting_date + '|' + u_name + '|' + an2 + '|' + source_id + '|' + 'GBP' + '|' + str(cr)
f.write(l_1 + '\n')
f.write(l_2 + '\n')
f.close()
Create the Trial Balance and save it to a text file¶
# Use gl to calc movement on each account
gl = pd.read_csv('gl.txt', sep = '|')
tb = gl[['Account','Functional_Amount']]
# Calc net movement on each account
tb = tb.groupby(['Account']).sum().round(2)
tb.reset_index(level=0, inplace=True)
tb.columns = ['Account','Movement']
tb.sort_values(by = 'Account')
# Assign account type
# Set b/f balances to 0 for P&L accounts
tb.loc[tb.Account.str[2] == 'P', 'Balance b/f'] = 0
tb.loc[tb.Account.str[2] == 'P', 'Type'] = 'P&L'
tb.loc[tb.Account.str[2] == 'B', 'Type'] = 'BS'
tb['Balance dummy']=tb['Balance b/f']
# if b/f balance is != 0, generate a balance for that account
i = 0
for index, row in tb.iterrows():
if row['Balance dummy'] != 0:
row['Balance b/f'] = round(gauss(q1,q2),2)
bal = round(gauss(q1,q2),2)
tb.loc[i,'Balance b/f'] = bal
tb.loc[i+1,'Balance b/f'] = -1 * bal
i += 2
del tb['Balance dummy']
# create c/f field
tb['Balance c/f'] = ( tb['Balance b/f'] + tb['Movement'] ).round(2)
# create 'date of balance' column
tb['Balance date'] = d1.strftime("%Y%m%d")
# Arrange columns
tb = tb[['Account', 'Type', 'Balance b/f' , 'Balance c/f', 'Balance date']]
# print TB to file
tb.to_csv('tb.txt', sep='|', encoding='utf-8', header=True, index=False)
Load the text files back in and display their top 10 rows¶
Verify that the files have been produced correctly and that the TB balances as expected
gl = pd.read_csv('gl.txt', sep = '|')
tb = pd.read_csv('tb.txt', sep = '|')
tb.head(10)
gl.head(10)
print('Net Opening TB:',"%.2f" % tb['Balance b/f'].sum())
print('Net Closing TB:',"%.2f" % tb['Balance c/f'].sum())