Update 5: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project.
Update 6: I created a new visualization of the monkeys’ data.
Update 4: The monkeys recreated “A Lover’s Complaint”. Check out the write up.
Update 3: Welcome Slashdot, Fox News, Engadget and Japanese Engadget. So far, the monkeys have run through 7.5 trillion character groups.
Earlier counts: 6.5 trillion, 5 trillion (2011-09-22), 4 trillion (2011-09-16), 3 trillion (2011-09-10), 2.5 trillion (2011-09-07), 2 trillion (2011-09-05), 1.5 trillion (2011-09-01), 1 trillion (2011-08-28), 515,912,000,000 (2011-08-25).
In a recent post, I described a project to recreate Shakespeare using Hadoop and Amazon EC2. This time, I am going to recreate every work of Shakespeare randomly.
This project comes from one of my favorite Simpsons episodes, which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: “It was the best of times, it was the blurst of times.” The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare (aka the Infinite Monkey Theorem). And that is what I did (am doing). I created millions of monkeys on Amazon and put them at virtual typewriters.
Less Technical Explanation
Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.
For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux. Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys. The Map Monkeys create random data in ASCII between a and z. They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys. Once a monkey’s output is mapped, it is passed to the reducer, which runs the characters through a Bloom filter membership test. If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison. If that comparison also passes, a genius monkey has written 9 characters of Shakespeare. The source material is all of Shakespeare’s works as taken from Project Gutenberg.
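The check described above (random 9-letter group, then a Bloom filter membership test, then an exact string comparison) can be sketched in a few lines. This is only a minimal single-machine illustration in Python, not the Hadoop job itself. The names `normalize`, `BloomFilter`, `monkey`, and `is_shakespeare`, the filter size, the SHA-256 hashing, and the step of stripping everything except a-z are all assumptions made for the example. Python's `random` module happens to use a Mersenne Twister as well, though it is not Sean Luke's Java implementation.

```python
import hashlib
import random
import re

GROUP_LEN = 9  # the post counts a match as 9 characters of Shakespeare
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def normalize(text):
    """Lowercase and keep only a-z (an assumed preprocessing step)."""
    return re.sub(r"[^a-z]", "", text.lower())


class BloomFilter:
    """Tiny Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_filter(text, n=GROUP_LEN):
    """Add every n-letter window of the source text to a Bloom filter."""
    bloom = BloomFilter()
    for i in range(len(text) - n + 1):
        bloom.add(text[i:i + n])
    return bloom


def monkey(rng, n=GROUP_LEN):
    """One monkey keystroke session: n random letters between a and z."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def is_shakespeare(group, bloom, text):
    """Cheap Bloom test first; exact string comparison only for candidates."""
    return group in bloom and group in text


# Stand-in for the Project Gutenberg source text.
source = normalize("To be, or not to be, that is the question")
bloom = build_filter(source)
rng = random.Random(2011)  # seeded so the sketch is repeatable
sample = monkey(rng)
```

The Bloom filter is there to reject almost all gibberish cheaply. Only groups that pass it pay for the exact comparison, which also weeds out the filter's occasional false positives.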
The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes. The images below show green for every character group that was found and white for those that are still missing. The image output is kind of like the animations for defrag utilities. As the monkeys progress through the works, more and more character groups will be found and show green.
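The green and white bookkeeping can be pictured as one flag per character of the source text. The sketch below is an illustration under that assumption and is not the code that draws the images on this site. `mark_found`, `percent_found`, and `render` are hypothetical helper names, and `#` and `.` stand in for green and white.

```python
def mark_found(text, group, found):
    """Flag every character covered by an occurrence of `group` in `text`."""
    start = text.find(group)
    while start != -1:
        for i in range(start, start + len(group)):
            found[i] = True
        start = text.find(group, start + 1)


def percent_found(found):
    """Share of characters covered so far, as a percentage."""
    return 100.0 * sum(found) / len(found) if found else 0.0


def render(found, width=40):
    """Text stand-in for the image: '#' is green (found), '.' is white."""
    cells = "".join("#" if hit else "." for hit in found)
    return "\n".join(cells[i:i + width] for i in range(0, len(cells), width))


text = "tobeornottobe"
found = [False] * len(text)
mark_found(text, "tobe", found)  # a short group keeps the example small
print(render(found))
print(f"{sum(found)} of {len(found)} characters, {percent_found(found):.1f}%")
```

Each new matching group flips more flags, so the picture fills in gradually, much like a defrag display.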
The Tabular Output Of What Has Been Found
Every Work Of Shakespeare
Progress Through Individual Works Of Shakespeare
Update: I was running this on a free micro instance (600 MB RAM) from Amazon. Alas, the monkeys needed more RAM than the free micro instance had, and the processes got out-of-memory errors. I have moved the Hadoop server to my home computer, which is much faster and has more memory.
Update 2: I updated the Hadoop configuration to have less idle CPU time. This will significantly increase the monkey power and find more character groups.
Update 4: I made a small change to how memory is allocated for the random character groups. It should help speed things up again.