A friend of mine has collected books for many years and has recently begun to catalogue them. In this post, I show some simple analysis of the catalogue and then query an ISBN database to fill in some missing data.
from IPython.display import HTML
HTML('''<script>
code_show=false;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<font color=#33cc33><i>To toggle the visibility of the code blocks, click <a href="javascript:code_toggle()">here</a>.</i></font>
<script>
window.onload = function(){
//time is set in milliseconds
setTimeout(code_toggle, 10000)
};
</script>
''')
Set-up and data preparation¶
Set some settings and import some packages:
# Display plot results inline, not in a separate window
%matplotlib inline
%pylab inline
# Set the size of all figures
pylab.rcParams['figure.figsize'] = (14, 5)
import pandas as pd
import re
import bibtexparser
import numpy as np
import matplotlib.pyplot as plt
Load the catalogue file:
table = pd.read_excel('Library.xlsx')
table = table[0:9188]
df = table
orig_rows = (df.shape[0])
print("There are %d rows in the catalogue" % (df.shape[0]) )
Data formatting and tidying:¶
View the top 5 rows to see how the data is arranged and how many cells are complete.
df.head()
Set float format to two decimal places (currency). Not all rows can become a float:
pd.options.display.float_format = '{:,.2f}'.format
def to_number(s):
try:
s1 = round(float(s),2)
return s1
except ValueError:
return s
df.Price = df.Price.map(lambda f : to_number(f))
df.Value = df.Value.map(lambda f : to_number(f))
Find and remove blank rows:
# How many rows are all NaN values
df = df.dropna(how='all') # drop a row only if ALL columns are NaN
print('%d row removed ' % (orig_rows - df.shape[0]) ) # 1 row contained all NaN and has been removed
List the number of rows in each column which are empty:
# How many rows in each column are NaN
df.isnull().sum().sort_values()
Based on these results, title and publisher are the most complete columns
Split a column containing two types of data:
The “Publisher” column contains both the publisher and the year it was published. This should be split into two columns.
pd.options.mode.chained_assignment = None # default='warn'
df['PubYear'] = df['Publisher'].str.extract('(\d\d\d\d)', expand=True) # regex is confusing
df['Publisher'] = df['Publisher'].str.extract('(((?!\d).)*)', expand=True)
Improve the format of the ‘Date’ column:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
The data frame is now in the columns I want it to be in, and the top 5 rows are:
df.head()
Insights¶
Distribution of books by year published:
The bar chart below shows how many books in the library were published in a given decade. The list below shows the 5 oldest books.
pd.options.display.float_format = '{:,}'.format
df.PubYear = pd.to_numeric(df.PubYear, errors='ignore')
fig = df.groupby(df.PubYear // 10 * 10).size().plot(kind='bar', xlim=[1000,2016], logy = False)
fig.set_xlabel("Decade of Publication")
fig.set_ylabel("Number of Books")
fig
View the 5 oldest titles:
df['PubYear'] = df['PubYear'].fillna(0.0).astype(int)
df2 = df[df['PubYear'] != 0.0]
df2.sort_values(by='PubYear').head()
List the number of books in each location:
df3 = df
df3["Location"] = df3["Location"].astype(str).map(str.strip).str.upper().str.slice(0,5)
df3.groupby(df3.Location).size().sort_values(ascending=False)
Create a list of the differnet subjects, order the list by the most frequent subjects:
df4 = df
df4["Subject"] = df4["Subject"].astype(str).map(str.strip).str.upper()
df4.groupby(df4.Subject).size().sort_values(ascending=False).head(50)
Create a list of authors in the library. Order the list by number of books:
df5 = df
df5["Author"] = df5["Author"].astype(str).map(str.strip).str.upper() #.str.slice(0,20)
df5.groupby(df5.Author).size().sort_values(ascending=False).head(50)
Distribution of book length by number of pages:
def blank_or_number(s):
try:
s1 = round(float(s),2)
return s1
except ValueError:
return ''
df.Pages = df.Pages.map(lambda f : blank_or_number(f))
df6 = df.Pages[df.Pages != '']
plt.xticks(np.arange(0, 2000, 100.0))
fig = df6.hist(bins=100, range=[0, 2000])
fig.set_xlabel("Number of Pages")
fig.set_ylabel("Number of Books")
Query an ISBN database to find missing data:
Lastly, I thought it would be a fun challenge to fill in gaps in the data. The table below shows rows with ISBN number but missing either Author, Title or Publisher.
It turns out that there are only 10 rows that meet this criteria, and in all cases it is the publisher that is missing.
df7 = df[(df['ISBN?'].notnull()) & ((df['Author'] == '') | (df['Title'] == '') | (df['Publisher'] == ''))]
df7
from isbnlib import *
from isbnlib.config import *
from isbnlib.registry import *
import bibtexparser
def has_isbn(isbn):
SERVICE = 'isbndb'
APIKEY = 'IZXL3ESD' # YOUR key
add_apikey(SERVICE, APIKEY) # register your key
bibtex = bibformatters['bibtex']
isbn = clean(isbn)
try:
a = bibtex(meta(EAN13(isbn), SERVICE))
return a
except:
return 'isbn is invalid'
def get_pub(isbn):
bibtex_str = has_isbn(isbn)
try:
bib_db = bibtexparser.loads(bibtex_str)
dic = bib_db.entries[0]
return dic['publisher']
except:
return
df7['ISBN?'] = df7['ISBN?'].astype(str)
df7.Publisher = df7['ISBN?'].map(lambda f : get_pub(f))
The table below shows the results of the isbnlib query. I thought it odd that all the ‘missing’ publishers names began with a number. It turns out that the regex method I used to split publisher name and year of publication into separate columns doesnt work when there are numbers in the publishers name. Rather than go back and correct this, I’ll leave the script as it is to show how to use the isbnlib library.
df7