One of my favorite Simpsons episodes has a scene where Mr. Burns brings Homer to his mansion (YouTube Video).  One of his rooms holds a thousand monkeys at a thousand typewriters.  One of the monkeys types out a slightly incorrect line from Charles Dickens: ‘It was the best of times, it was blurst of times.’  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did (or tried to do).  I created millions of monkeys on Amazon and put them at virtual typewriters (a.k.a. the Infinite Monkey Theorem).  An old New York Times article gives a general account of computers as authors.

I have been learning Hadoop lately and wanted a project to try some things out.  I have also been wanting to try out Amazon Web Services.  This project brought the two pieces together by using Hadoop’s MapReduce with Amazon’s Elastic MapReduce.  The best part of all is how cheap it is to try things like this: a small instance costs $0.10 and a High-CPU Medium instance costs $0.20 per instance-hour.

Since I don’t have real monkeys, I had to create fake Amazonian Map Monkeys.  The Map Monkeys create random ASCII data using the lowercase letters a through z.  They use Sean Luke’s Mersenne Twister implementation to make sure I have fast, random, well-behaved monkeys.  Once the monkeys’ output is mapped, it is passed to the reducer, which runs the characters through a Bloom filter membership test.  If the monkey output passes the membership test, it is checked against the Shakespearean works using a string comparison.  If that passes too, a genius monkey has written 10 characters of Shakespeare.
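The real job runs as Java MapReduce on Hadoop, but the map/reduce pipeline can be sketched in a few lines of Python.  Everything here is illustrative: the BloomFilter class, the tiny stand-in source string, and the 100,000-monkey run are all made up for the example, and Python’s random module stands in for Sean Luke’s Mersenne Twister (conveniently, random is itself a Mersenne Twister).

```python
import hashlib
import random
import string

class BloomFilter:
    """A minimal Bloom filter: k hash positions over a fixed-size bit array."""
    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting the item and hashing.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build the filter from every 10-character window of the stripped source.
# "source" is a tiny stand-in for the cleaned Shakespeare corpus.
source = "itwasthebestoftimesitwastheblurstoftimes"
GROUP = 10
shakespeare_groups = {source[i:i + GROUP]
                      for i in range(len(source) - GROUP + 1)}
bloom = BloomFilter()
for group in shakespeare_groups:
    bloom.add(group)

def monkey_types(rng):
    """One 'map monkey': emit 10 random lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(GROUP))

# The 'reducer': cheap Bloom test first, exact comparison only on a hit.
rng = random.Random(42)
hits = [s for s in (monkey_types(rng) for _ in range(100_000))
        if bloom.might_contain(s) and s in shakespeare_groups]
```

The ordering matters: the Bloom test is cheap and filters out essentially all of the random output, so the exact string comparison only ever runs on the rare candidates that survive it.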

The source material is all of Shakespeare’s works as taken from Project Gutenberg.  My monkeys leave it to their editors to format things well: everything is stripped down to a-z, with all whitespace and newlines removed from the source material.  Otherwise, the text is left intact.  I had to limit the length of the monkeys’ output to keep the number of combinations down, given the exponential nature of the problem.  The table below shows the exponential growth and computational difficulty of the problem.

# Characters    Combinations
1               26
2               676
3               17,576
4               456,976
5               11,881,376
6               308,915,776
7               8,031,810,176
8               208,827,064,576
9               5,429,503,678,976
10              141,167,095,653,376
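The stripping step described above is simple.  As a sketch (the real preprocessing presumably happens once, before the Hadoop job runs), in Python:

```python
import re

def strip_source(text):
    """Lowercase the text and keep only a-z, dropping whitespace,
    newlines, punctuation, and digits."""
    return re.sub(r"[^a-z]", "", text.lower())

stripped = strip_source("It was the best of times,\nit was the blurst of times.")
# -> "itwasthebestoftimesitwastheblurstoftimes"
```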

There are several benefits to using Hadoop (MapReduce) and Amazon.  One is that other works, or even every book in Project Gutenberg, could be searched.  The beauty of a Bloom filter is that it allows a membership test without having to load the entire text into memory, and that the test is very fast compared to comparing every character.  Another benefit is that this scales very well on Amazon: adding another computer to the cluster increases the output in a linear fashion.  However, my pocketbook doesn’t scale as well; maybe Amazon will give me some free monkeys.  With only a few minor changes, the Monkey MapReduce program could be changed to generate 24 random characters, check against every written work by man, and scale to 100+ computers.
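For the curious, the quick Bloom test comes at the price of occasional false positives (which is why the exact string comparison runs afterwards).  The standard estimate for a filter with m bits, k hash functions, and n inserted items is p ≈ (1 − e^(−kn/m))^k.  A quick sketch — the filter sizing below is a made-up example, not necessarily what my monkeys use:

```python
import math

def fp_rate(m_bits, k_hashes, n_items):
    """Estimated Bloom filter false-positive rate: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Hypothetical sizing for the 3,696,339 ten-character groups in the corpus:
# a 2^27-bit (16 MB) filter with 5 hashes keeps false positives very rare.
rate = fp_rate(2 ** 27, 5, 3_696_339)
```

Even a tiny in-memory filter makes nearly every check a fast bit lookup, which is what lets the reducers keep up with billions of monkey keystrokes.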

Now for some numbers.  There are 26^10 (141,167,095,653,376) combinations of 10 characters drawn from the 26 letters of the English alphabet.  In my trimmed-down Shakespeare, there are 3,696,339 non-distinct 10-character groups.  As of today, I have run 20,164,000,000 maps, or 10-character checks, using Elastic MapReduce.  This comes out to 0.014% of the possible combinations.  For the curious, I am running 2 instances of “High-CPU Medium”.  A day’s worth of monkeys costs $19.20 and performs 6,980,000,000 checks.
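Those figures are easy to sanity-check:

```python
# Total combinations of 10 characters over a 26-letter alphabet.
total = 26 ** 10          # 141,167,095,653,376

# Fraction of the space the checks so far have covered.
checks = 20_164_000_000
coverage_pct = checks / total * 100   # roughly 0.014%
```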

Even though this project ranks up there with the eighth wonder of the world and is as important as curing cancer, I don’t have a bunch of money to put towards it.  You can get in touch with me via the contact form to help out and carry on this monumental work.  We need to act now, before the endangered Amazonian Hadoop Monkeys go extinct.  I have an even better idea for creating Shakespeare with Amazonian Monkeys, and a very cool way to visualize it.

