Token Analysis and the Bible

About a year and a half ago I interviewed with Monetate (we're hiring!). As part of the interview process I was asked to prepare a presentation on something I am passionate about, which I would then give in front of my soon-to-be workmates.

Some of you might have seen Matt Daniels' analysis of rapper vocabulary. Matt compares today's most popular hip hop artists and their lyrics to the writings of Shakespeare. Because I am passionate about the Bible, my presentation involved a similar analysis but with Bible writers. This post attempts to build further on that. Let's find the most eloquent Bible author, using variety of vocabulary as our yardstick!

Setup

The first thing we need to do is get access to the Bible in a format that is easily analyzed. Thankfully, this work has been done for us already. This GitHub repo contains the 1769 Cambridge Edition of the King James Version in CSV format. Have at it!

To start, let's pull in the CSVs that we need.

In [1]:
import pandas as pd

CSV_DATA_ROOT = "https://raw.githubusercontent.com/souliberty/MetaV/master/CSV/"

verses = pd.read_csv("{}Verses.csv".format(CSV_DATA_ROOT))
verses.head()
Out[1]:
VerseID BookID Chapter VerseNum VerseText
0 1 1 1 1 In the beginning God created the heaven and th...
1 2 1 1 2 And the earth was without form, and void; and ...
2 3 1 1 3 And God said, Let there be light: and there wa...
3 4 1 1 4 And God saw the light, that it was good: and G...
4 5 1 1 5 And God called the light Day, and the darkness...

Perfect! The first five verses of the Bible!

However, we'll need a little more information to get useful insight out of these CSVs. Let's join the verses data with Bible author data. But first, let's explore the writers data.

In [2]:
writers = pd.read_csv("{}Writers.csv".format(CSV_DATA_ROOT))
writers
Out[2]:
BookID Writer
0 1 Moses
1 2 Moses
2 3 Moses
3 4 Moses
4 5 Moses
5 6 Joshua
6 7 Samuel
7 8 Samuel
8 9 Samuel, Gad, Nathan
9 10 Gad, Nathan
10 11 Jeremiah
11 12 Jeremiah
12 13 Ezra
13 14 Ezra
14 15 Ezra
15 16 Nehemiah
16 17 Mordecai
17 18 Moses
18 19 David and others
19 20 Solomon, Agur, Lemuel
20 21 Solomon
21 22 Solomon
22 23 Isaiah
23 24 Jeremiah
24 25 Jeremiah
25 26 Ezekiel
26 27 Daniel
27 28 Hosea
28 29 Joel
29 30 Amos
... ... ...
36 37 Haggai
37 38 Zechariah
38 39 Malachi
39 40 Matthew
40 41 Mark
41 42 Luke
42 43 John
43 44 Luke
44 45 Paul
45 46 Paul
46 47 Paul
47 48 Paul
48 49 Paul
49 50 Paul
50 51 Paul
51 52 Paul
52 53 Paul
53 54 Paul
54 55 Paul
55 56 Paul
56 57 Paul
57 58 Paul
58 59 James
59 60 Peter
60 61 Peter
61 62 John
62 63 John
63 64 John
64 65 Jude
65 66 John

66 rows × 2 columns

Simple enough. We can merge the verses DataFrame with the writers DataFrame on the BookID column. Let's see what that looks like in pandas.

In [3]:
verses_writers = verses.merge(writers, on=["BookID"], how="left")
verses_writers.head()
Out[3]:
VerseID BookID Chapter VerseNum VerseText Writer
0 1 1 1 1 In the beginning God created the heaven and th... Moses
1 2 1 1 2 And the earth was without form, and void; and ... Moses
2 3 1 1 3 And God said, Let there be light: and there wa... Moses
3 4 1 1 4 And God saw the light, that it was good: and G... Moses
4 5 1 1 5 And God called the light Day, and the darkness... Moses
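
A left merge quietly leaves NaN in the Writer column for any verse whose BookID is missing from the writers table, so it can be worth checking that nothing slipped through. Here is a minimal sketch of one way to do that, using pandas' validate and indicator arguments on small made-up tables (toy_verses and toy_writers are stand-ins, not the real CSVs):

```python
import pandas as pd

# Toy stand-ins for the verses and writers tables (not the real CSVs)
toy_verses = pd.DataFrame({
    "VerseID": [1, 2, 3],
    "BookID": [1, 1, 2],
    "VerseText": ["first verse", "second verse", "third verse"],
})
toy_writers = pd.DataFrame({"BookID": [1, 2], "Writer": ["Moses", "Moses"]})

merged = toy_verses.merge(
    toy_writers,
    on="BookID",
    how="left",
    validate="many_to_one",  # raises MergeError if a BookID repeats in toy_writers
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Verses whose BookID had no match in the writers table
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If unmatched is empty and the row count equals the number of verses, the merge did what we wanted.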

Cool. But we also need the book name, which is in a different CSV, so let's merge that data too. First, we explore that dataset to make sure we have the right keys to merge on.

In [4]:
books = pd.read_csv("{}Books.csv".format(CSV_DATA_ROOT))
books
Out[4]:
BookID BookName NumOfChapters
0 1 Genesis 50
1 2 Exodus 40
2 3 Leviticus 27
3 4 Numbers 36
4 5 Deuteronomy 34
5 6 Joshua 24
6 7 Judges 21
7 8 Ruth 4
8 9 1 Samuel 31
9 10 2 Samuel 24
10 11 1 Kings 22
11 12 2 Kings 25
12 13 1 Chronicles 29
13 14 2 Chronicles 36
14 15 Ezra 10
15 16 Nehemiah 13
16 17 Esther 10
17 18 Job 42
18 19 Psalms 150
19 20 Proverbs 31
20 21 Ecclesiastes 12
21 22 Song of Solomon 8
22 23 Isaiah 66
23 24 Jeremiah 52
24 25 Lamentations 5
25 26 Ezekiel 48
26 27 Daniel 12
27 28 Hosea 14
28 29 Joel 3
29 30 Amos 9
... ... ... ...
36 37 Haggai 2
37 38 Zechariah 14
38 39 Malachi 4
39 40 Matthew 28
40 41 Mark 16
41 42 Luke 24
42 43 John 21
43 44 Acts 28
44 45 Romans 16
45 46 1 Corinthians 16
46 47 2 Corinthians 13
47 48 Galatians 6
48 49 Ephesians 6
49 50 Philippians 4
50 51 Colossians 4
51 52 1 Thessalonians 5
52 53 2 Thessalonians 3
53 54 1 Timothy 6
54 55 2 Timothy 4
55 56 Titus 3
56 57 Philemon 1
57 58 Hebrews 13
58 59 James 5
59 60 1 Peter 5
60 61 2 Peter 3
61 62 1 John 5
62 63 2 John 1
63 64 3 John 1
64 65 Jude 1
65 66 Revelation 22

66 rows × 3 columns

We can now proceed with our merge.

In [5]:
bible = verses_writers.merge(books, on=["BookID"], how="left")

# let's only grab the columns we need
bible = bible[['BookID', 'BookName', 'Writer', 'Chapter', 'VerseID', 'VerseNum', 'VerseText']]
bible.head()
Out[5]:
BookID BookName Writer Chapter VerseID VerseNum VerseText
0 1 Genesis Moses 1 1 1 In the beginning God created the heaven and th...
1 1 Genesis Moses 1 2 2 And the earth was without form, and void; and ...
2 1 Genesis Moses 1 3 3 And God said, Let there be light: and there wa...
3 1 Genesis Moses 1 4 4 And God saw the light, that it was good: and G...
4 1 Genesis Moses 1 5 5 And God called the light Day, and the darkness...

Now we have a bible DataFrame with all the data we need. Let's have some fun!

Number of writers

For simplicity's sake, we'll treat each multi-author entry in the Writer column (for example "Samuel, Gad, Nathan") as a single writer.

In [6]:
# remember, our base dataset has one row per verse
# so we need a unique count of the writers
bible['Writer'].nunique()
Out[6]:
35
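
If we ever wanted to count individual names instead, one option would be to split the comma-separated entries and give each name its own row. This is only a sketch on a made-up three-row table (toy is hypothetical, though the Writer strings are taken from the data above). It would not split an entry like "David and others", and since the data doesn't say who wrote which verse, every co-author would be credited with the whole book.

```python
import pandas as pd

# Toy table reusing three multi-author entries from the Writer column
toy = pd.DataFrame({
    "BookID": [9, 10, 20],
    "Writer": ["Samuel, Gad, Nathan", "Gad, Nathan", "Solomon, Agur, Lemuel"],
})

# Split "A, B, C" into a list of names, then give each name its own row
individual = (
    toy.assign(Writer=toy["Writer"].str.split(", "))
       .explode("Writer")
)

print(individual["Writer"].nunique())
```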

Books by writer

In [7]:
# again we need to use nunique since we are working with verses
bible.groupby(['Writer']).BookID.nunique().sort_values()
Out[7]:
Writer
Amos                      1
Solomon, Agur, Lemuel     1
Samuel, Gad, Nathan       1
Obadiah                   1
Nehemiah                  1
Nahum                     1
Mordecai                  1
Micah                     1
Matthew                   1
Mark                      1
Malachi                   1
Zechariah                 1
Jude                      1
Joshua                    1
Zephaniah                 1
Gad, Nathan               1
Joel                      1
Daniel                    1
James                     1
Isaiah                    1
Hosea                     1
Haggai                    1
Habakkuk                  1
Jonah                     1
David and others          1
Ezekiel                   1
Solomon                   2
Samuel                    2
Luke                      2
Peter                     2
Ezra                      3
Jeremiah                  4
John                      5
Moses                     6
Paul                     14
Name: BookID, dtype: int64

Paul comes out on top with a total of 14 books.

Verses by writer

In [8]:
bible.groupby('Writer').Writer.count().sort_values()
Out[8]:
Writer
Obadiah                    21
Jude                       25
Haggai                     38
Nahum                      47
Jonah                      48
Zephaniah                  53
Malachi                    55
Habakkuk                   56
Joel                       73
Micah                     105
James                     108
Amos                      146
Peter                     166
Mordecai                  167
Hosea                     197
Zechariah                 211
Solomon                   339
Daniel                    357
Nehemiah                  406
Joshua                    658
Mark                      678
Gad, Nathan               695
Samuel                    703
Samuel, Gad, Nathan       810
Solomon, Agur, Lemuel     915
Matthew                  1071
Ezekiel                  1273
Isaiah                   1292
John                     1415
Ezra                     2044
Luke                     2158
Paul                     2336
David and others         2461
Jeremiah                 3053
Moses                    6922
Name: Writer, dtype: int64

Interestingly enough, despite Paul being the writer with the most books written, he's not the writer with the most verses written in the Bible. Moses comes out on top here. Let's find out why.

In [9]:
print("Books written by Moses and the number of verses:")
print(bible[bible['Writer'] == 'Moses'].groupby('BookName').BookName.count())

print("Books written by Paul and the number of verses:")
print(bible[bible['Writer'] == 'Paul'].groupby('BookName').BookName.count())
Books written by Moses and the number of verses:
BookName
Deuteronomy     959
Exodus         1213
Genesis        1533
Job            1070
Leviticus       859
Numbers        1288
Name: BookName, dtype: int64
Books written by Paul and the number of verses:
BookName
1 Corinthians      437
1 Thessalonians     89
1 Timothy          113
2 Corinthians      257
2 Thessalonians     47
2 Timothy           83
Colossians          95
Ephesians          155
Galatians          149
Hebrews            303
Philemon            25
Philippians        104
Romans             433
Titus               46
Name: BookName, dtype: int64

Moses is credited with the first 5 books of the Bible as well as with the book of Job. Paul writes 14 of the letters in the Greek Scriptures. Moses' writings are lengthier than Paul's by nature of their content.
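
To put a number on that, we can divide each writer's verse total by his book count. The sketch below just reuses the totals printed above (6 books and 6922 verses for Moses, 14 books and 2336 verses for Paul) in a tiny hand-built DataFrame called summary:

```python
import pandas as pd

# Totals copied from the outputs above
summary = pd.DataFrame(
    {"books": [6, 14], "verses": [6922, 2336]},
    index=["Moses", "Paul"],
)

# Average number of verses per book for each writer
summary["verses_per_book"] = (summary["verses"] / summary["books"]).round(1)
print(summary)
```

That works out to roughly 1,154 verses per book for Moses against roughly 167 for Paul.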

Verses by book

In [10]:
bible.groupby('BookName').BookName.count().sort_values()
Out[10]:
BookName
2 John               13
3 John               14
Obadiah              21
Jude                 25
Philemon             25
Haggai               38
Titus                46
Nahum                47
2 Thessalonians      47
Jonah                48
Zephaniah            53
Malachi              55
Habakkuk             56
2 Peter              61
Joel                 73
2 Timothy            83
Ruth                 85
1 Thessalonians      89
Colossians           95
Philippians         104
1 John              105
Micah               105
1 Peter             105
James               108
1 Timothy           113
Song of Solomon     117
Amos                146
Galatians           149
Lamentations        154
Ephesians           155
                   ... 
Hebrews             303
Daniel              357
Revelation          404
Nehemiah            406
Romans              433
1 Corinthians       437
Judges              618
Joshua              658
Mark                678
2 Samuel            695
2 Kings             719
1 Samuel            810
1 Kings             816
2 Chronicles        822
Leviticus           859
John                879
Proverbs            915
1 Chronicles        942
Deuteronomy         959
Acts               1007
Job                1070
Matthew            1071
Luke               1151
Exodus             1213
Ezekiel            1273
Numbers            1288
Isaiah             1292
Jeremiah           1364
Genesis            1533
Psalms             2461
Name: BookName, dtype: int64
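
As an aside, the groupby-then-count pattern from the last few cells can also be written with value_counts. A small sketch on made-up data (the counts in toy are not the real verse counts):

```python
import pandas as pd

# Toy table: one row per "verse", with invented counts per book
toy = pd.DataFrame({"BookName": ["Jude", "Jude", "Ruth", "Ruth", "Ruth", "Obadiah"]})

# The pattern used in the cells above
by_groupby = toy.groupby("BookName").BookName.count().sort_values()

# The same counts in one call, smallest first
by_value_counts = toy["BookName"].value_counts(ascending=True)

print(by_value_counts)
```

Both give the same counts per book; which one reads better is mostly a matter of taste.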

Unique words by book

This is where things start to get even more interesting. Let's take a look at the books with the most unique words used.

In [11]:
# since v.VerseText is a pandas Series we join it into one long string
# (note: ''.join puts no separator between consecutive verses)
# then we split that string on whitespace to get a list of words
# then cast that list into a set so we can get only the unique words
# finally we get the length of that set to get the total
unique_words = lambda v: len(set(''.join(v.VerseText).split()))

bible.groupby('BookName').apply(unique_words).sort_values()
Out[11]:
BookName
2 John              150
3 John              169
Philemon            211
Obadiah             274
Jude                322
Haggai              360
2 Thessalonians     374
Titus               421
Jonah               468
Nahum               538
Zephaniah           560
1 John              562
1 Thessalonians     586
2 Peter             601
Habakkuk            601
Malachi             602
2 Timothy           649
Colossians          662
Joel                664
Philippians         700
Ruth                767
1 Timothy           836
James               848
1 Peter             851
Song of Solomon     861
Ephesians           883
Galatians           911
Micah               958
Lamentations       1074
Amos               1117
                   ... 
Hebrews            1763
1 Corinthians      2003
Romans             2084
Daniel             2088
Revelation         2113
Nehemiah           2209
Leviticus          2546
Joshua             2689
Mark               2826
Judges             2867
John               2900
2 Kings            3109
Proverbs           3114
2 Samuel           3206
1 Kings            3325
1 Samuel           3397
2 Chronicles       3469
Deuteronomy        3608
Numbers            3718
Exodus             3727
Job                3839
Acts               3909
Matthew            3980
1 Chronicles       3995
Luke               4222
Ezekiel            4505
Genesis            4717
Jeremiah           5081
Isaiah             5806
Psalms             6276
dtype: int64

So Psalms, the book with the most verses, also has the highest number of unique words, and 2 John, the book with the fewest verses, has the fewest unique words.
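
A word on how the counting works. The cell above joins the verses with an empty string, so the last word of one verse and the first word of the next end up as a single token, and "earth", "earth." and "Earth" would all count as different words. The sketch below compares three ways of counting on two sample strings (the second is the opening of Genesis 1:2, cut short). The helper names are my own, and the numbers in this post come from the original cell, not from these variants.

```python
import re

sample = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
]

def count_unique_joined(texts, sep):
    """Unique whitespace-separated tokens after joining texts with sep."""
    return len(set(sep.join(texts).split()))

def count_unique_normalized(texts):
    """Unique words after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    return len(set(words))

fused = count_unique_joined(sample, "")    # "earth.And" becomes one token
spaced = count_unique_joined(sample, " ")  # verses kept apart
normalized = count_unique_normalized(sample)
print(fused, spaced, normalized)
```

On this sample the three approaches give 13, 14 and 12 unique words, so the choice of tokenizer does move the totals.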

Unique words by writer

In [12]:
# same dance we did above
unique_words = lambda v: len(set(''.join(v.VerseText).split()))

bible.groupby('Writer').apply(unique_words).sort_values()
Out[12]:
Writer
Obadiah                    274
Jude                       322
Haggai                     360
Jonah                      468
Nahum                      538
Zephaniah                  560
Habakkuk                   601
Malachi                    602
Joel                       664
James                      848
Micah                      958
Amos                      1117
Mordecai                  1172
Peter                     1236
Hosea                     1377
Zechariah                 1390
Solomon                   1955
Daniel                    2088
Nehemiah                  2209
Joshua                    2689
Mark                      2826
Solomon, Agur, Lemuel     3114
Samuel                    3192
Gad, Nathan               3206
Samuel, Gad, Nathan       3397
Matthew                   3980
Ezekiel                   4505
John                      4539
Isaiah                    5806
David and others          6276
Luke                      6613
Ezra                      6764
Paul                      7042
Jeremiah                  8612
Moses                    13101
dtype: int64

Here Moses comes out on top. But wait a second. This analysis isn't fair for a couple of reasons:

  • Some books are much larger than others
  • Some authors wrote a lot more than others

Let's even out these odds!

In [13]:
# this time we'll get the percentage 
# of unique words as compared to all words
def per_unique(v):
    verse_arr = ''.join(v.VerseText).split()
    return len(set(verse_arr)) * 100 / len(verse_arr)

bible.groupby('Writer').apply(per_unique).sort_values()
Out[13]:
Writer
Moses                     7.802255
Jeremiah                  9.459267
Ezekiel                  11.815154
Ezra                     13.050105
Luke                     13.769338
John                     13.845168
Samuel, Gad, Nathan      14.014605
Paul                     14.709446
Joshua                   14.777161
Samuel                   15.318169
David and others         15.593321
Gad, Nathan              16.105697
Isaiah                   16.242831
Matthew                  17.599717
Daniel                   18.566601
Mark                     19.504452
Mordecai                 21.437717
Nehemiah                 21.925558
Solomon, Agur, Lemuel    22.047579
Zechariah                22.300658
Solomon                  24.749968
Amos                     27.437976
Hosea                    27.661712
Micah                    31.430446
Peter                    31.987578
Haggai                   32.936871
Joel                     33.860275
Malachi                  34.858135
Zephaniah                35.805627
Jonah                    36.763551
James                    38.598088
Obadiah                  42.218798
Habakkuk                 42.323944
Nahum                    43.457189
Jude                     55.136986
dtype: float64
In [14]:
from IPython.display import Image
Image(url='http://media4.giphy.com/media/5aLrlDiJPMPFS/giphy.gif')
Out[14]:

Now all of a sudden Moses is on the bottom! But is our data analysis correct?

Exodus 4:10 reads:

In [15]:
verse = bible[(bible['BookName'] == 'Exodus') & (bible['Chapter'] == 4) & (bible['VerseNum'] == 10)]
verse['VerseText'].values[0]
Out[15]:
'And Moses said unto the LORD, O my Lord, I am not eloquent, neither heretofore, nor since thou hast spoken unto thy servant: but I am slow of speech, and of a slow tongue.'

Moses himself admits to not being eloquent and acknowledges being 'slow of speech, and of a slow tongue'. Our analysis of his writing is consistent with that.

But what about our most eloquent Bible writer? What do we know about Jude? Unfortunately, almost nothing, except that he was the brother of James and hence a half brother of Jesus. Nonetheless, Jude comes out on top as our most eloquent Bible writer, with more than 55% of the letter that carries his name consisting of unique words.
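
One thing to keep in mind about the percentage we used: the share of unique words tends to shrink as a text gets longer, because common words keep repeating, so short letters get a head start over long books. A possible way to level that, sketched here on invented data, is to compute the percentage over only the first n tokens of each writer. The function ttr_first_n and the toy DataFrame are hypothetical, not part of the analysis above.

```python
import pandas as pd

def ttr_first_n(texts, n_tokens):
    """Percentage of unique tokens among the first n_tokens tokens."""
    tokens = " ".join(texts).split()[:n_tokens]
    return len(set(tokens)) * 100 / len(tokens)

# Invented data: writer A has three times as much text as writer B
toy = pd.DataFrame({
    "Writer": ["A", "A", "B"],
    "VerseText": [
        "the cat sat on the mat",
        "the dog sat on the log",
        "a quick brown fox",
    ],
})

grouped = toy.groupby("Writer")["VerseText"]

# Percentage over everything each writer wrote (a huge cap means no cap)
full = grouped.apply(lambda s: ttr_first_n(s, 10**9))

# Percentage over the first 4 tokens only, the same sample size for both
capped = grouped.apply(lambda s: ttr_first_n(s, 4))

print(full.round(1).to_dict(), capped.to_dict())
```

On the toy data, writer A drops to about 58% on the full text but ties writer B at 100% once both are capped at four tokens. Whether Jude would still lead under a cap like this is something I haven't checked.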

Questions? Doubts? Complaints? Leave a comment below.
