Million Monkeys Visualization

At last weekend’s Hack4Reno, I created a new visualization of the Million Monkeys’ data.  It allows you to choose your favorite work of Shakespeare and find out how a particular character was found.  You simply place or hover the mouse over a character and the box to the right will show the number of times that character was found.

For more information on the Million Monkeys Project, go here.

To make this visualization possible, I took the ~3GB of raw monkey data and generated a JSON output.  This was tricky because I had to break the complete works of Shakespeare down into individual works.  Once I had the JSON data, I wrote some JavaScript that used jQuery to show the data and allow the interactions.

NOTE: It was a 24-hour hackathon, so there are a few bugs.

A Few Million Monkeys Randomly Recreate Every Work Of Shakespeare

All the world’s a stage,
And all the monkeys merely players;
They have their typos and their hits,
And one monkey in his time plays many parts,
His acts being 38 works of Shakespeare.
- Monkey As You Like It

Update: I created a new visualization of the monkeys’ data.

The monkeys accomplished their goal of recreating all 38 works of Shakespeare. The last work, The Taming Of The Shrew, was completed at 2 AM PST on October 6, 2011. This is the first time every work of Shakespeare has actually been randomly reproduced. Furthermore, this is the largest work ever randomly reproduced. It is one small step for a monkey, one giant leap for virtual primates everywhere. This page shows what day each work of Shakespeare was completed on.

The Million Monkeys project went viral, but not in the cool, apocalyptic way. It started on September 25, 2011 and went into full swing on September 26, 2011. On that day, over 25,000 unique visitors viewed the Million Monkeys project, 300 sites referred traffic, and people viewed it from 119 countries. This post will contain some of my thoughts and reactions on going viral. If this article about going viral goes viral, it will create an infinite loop that will bring about the destruction of the world.

NOTE: I apologize in advance for having to use the term “go viral” so much, but that really explains the phenomenon.

I am proud to announce that I have open sourced the Million Monkeys project. The source code is available here.

This project originally started on August 21, 2011.  Over the course of the project, over 7.5 trillion character groups have been randomly generated and checked against the roughly 5.4 trillion (5,429,503,678,976) possible combinations.

Update: The monkeys are not RFC 2795 compliant. A Slashdot user pointed out that I forgot to talk about the similarities between this project and Richard Dawkins’ Weasel Experiment.

If you would like to do a story, please contact me via the Contact page.

Thoughts on Going Viral

As I mentioned before, the Million Monkeys project went viral on September 26, 2011. This was partly due to me spending a few hours E-mailing every news outlet I could think of. Another part was people using Twitter and Facebook to promote the project. On that day alone, over 2,300 visitors came to the site through Facebook and Twitter.

The first round of the project had no recognition, even among my friends. I thought the concept was cool and I kept with it. During a conversation with a friend of mine, we came up with a new concept for the project.

I went back to the drawing board for the second round of the project with the ideas from the new concept. I started using a smaller group size, 9-character groups instead of 24-character groups. This would allow the project to complete without an infinite amount of resources. I added near real-time updates of the site so people could see the progress of the monkeys. I wanted people to be able to come back to the site to watch their favorite work being recreated. This round received some recognition and landed on the front pages of Fox News and Engadget.

I knew I was on the right track. I was getting some media attention and people were starting to see the site. My goal was to do another media blitz once the monkeys completed their first work. My goal was to get an Associated Press article and, if I was lucky enough, get on the front page of Slashdot. I thought I had a good idea, but I had no delusions of the project going viral.

On Sunday night September 25, 2011, I was reading through my RSS feeds on Google Reader. Some new Slashdot stories appeared and I dutifully started reading them. When I started reading about myself and my project, I started to think I had clicked on the wrong feed or I had erred in some fashion. I could not believe I was reading about myself on Slashdot after many years of reading it. My wife was next to me at the time and I tried to explain why I was so ecstatic to be on Slashdot. Explaining to a non-geek about Slashdot is difficult, but I think she could see it was important to me. If the media blitz had died at that point, I would have been happy. It didn’t. Over the course of the next day, the story kept on gaining momentum, getting more news stories, and more hits on the website.

All glory may be fleeting, but not everyone liked the project. I received my share of hate mail, hate comments, and hate blog posts. I was informed that I didn’t understand Infinite Monkey Theorem (I do), that I was conning people (I’m not, the source code and data are available), and that the project was boring (beauty is in the eye of the beholder). Before you decide to create a project on the Internet, you had better have a thick skin to put up with people’s comments. I responded to the people I thought were genuinely asking a question or those that seemed to be open to a discussion about the project. Most people responded and most people were nice.

Pre-Viral Checklist

You should create as many social objects as possible. I have several YouTube videos where I explain the project in various levels of detail. These YouTube videos, in turn, were embedded by the various sites in their postings. The blog postings themselves were great social objects. I could see by the direct traffic that people were E-mailing the link to their friends. My Twitter feed allowed me to converse with people who had questions about the project. It also allowed me to tweet the URLs of interviews, articles and radio shows about the project.

To gain the most media attention, make your project and/or post as media friendly as possible. Many of the sites wrote their articles only using the posts as source material. I put a lot of effort into making the site as straightforward as possible and as quotable as possible. When doing a technical project like this, not all of your readers will be technically minded people. I recommend creating sections for technical and non-technical people. The non-technical people may glaze over at a very technical explanation of your project and a technical person will want more technical detail.

The site itself needs to be technically ready for a huge increase in traffic. Many sites go down during a Slashdotting. Fortunately for me, DreamHost kept my site going without stoppage. It’s usually too late to change your site once it goes viral. Make sure you have some metrics for your site to track the usage. In my case, I use Google Analytics for WordPress. Having a decent looking site also helps. If you are not a designer, use your good taste and find a good theme for your site. I used ElegantTheme’s Minimal theme for this site. To handle a Slashdotting, your site needs to be optimized. From the beginning of this project, I tried to optimize the site. The images showing the progress through Shakespeare were indexed PNGs. They provided the smallest file size and therefore the best scalability. Much to my lament, the comments are not working on this site. One of the CAPTCHA plugins I installed messed things up and it is still not working even after I uninstalled all of them.

Make sure your site makes it as easy as possible to connect with your users socially. The previous posts did not have the Facebook likes and Tweets when they were on Engadget and Fox News. I made it more difficult than it should have been for people to tell their friends about the project. From the start of this round, I have had the “like” buttons for the major social players. The site’s traffic and the number of people “liking” it show how much better the story made its rounds.

Was It A Success?

I always do a postmortem at the end of every project. This is the Million Monkeys project postmortem. I think the project was a resounding success. It achieved its primary goal of recreating every work of Shakespeare. People saw my work. While I might have received over 25,000 unique visitors to my site, millions and millions of people read about my work on mainstream news, blogs, print and radio. My personal branding (which is what this website is) went through the roof. On Google, the search term “jesse anderson” used to appear as the 45th link. Now, I have links 4-6. The top 3 spots belong to an anime character named Jesse Anderson (Andersen). The project also brought me recognition within my own company, Intuit.

This success was not the result of luck or random chance, but the result of countless hours of hard work. Even though the Million Monkeys project took 40-60 hours of my time to write, it took countless hours before that to become a better programmer and learn new technologies like Hadoop. A lot of time was spent submitting the story and working with reporters on stories.

In a way, the Million Monkeys is the current culmination of this time spent.

Miscellaneous Thoughts

A lot of reporters asked me what I wanted to accomplish with this project. For me it is performance art with monkeys and computers. I wanted to make it engaging and have people coming back to check the monkeys’ progress, so I did near real-time updates of the site. People did just that as was reflected through the usage logs. People were coming back and they were E-mailing it around to their friends. They were tweeting it and liking it on Facebook. I consider that the most gratifying part of the project; people enjoyed it.

As time went on, I began to anthropomorphise the monkeys more and more. Instead of thinking of them as a PRNG (pseudo random number generator) and a computer program, I was talking about them as if they were really monkeys. I began to identify with them and think of them like a pet. Maybe I spent too much time curating their work.

Going back to thick skin, I have a list of people to contact to get approval of projects. If anyone wants this list before they start their project, please E-mail me so we can get their approval. It’s of utmost importance that any project contact them before starting any work.

Reading about yourself in the news is one of the craziest things that can happen to you. There is kind of a disembodied realization that it is you, but it does not feel like you did it. That first week seemed like it was a month long. I was doing a lot of interviews and every moment seemed like an eternity.

I could not get the local media in Reno to do any stories on the project. It was incredibly funny because I would E-mail them saying the project has been on BBC, CNN, etc and I never even got a reply. I will take international coverage over local coverage any day, but it was funny that local didn’t follow international. Update: I finally got some local press.

Some More Numbers

The monkeys ran 180,000,000,000 character groups a day. An average iteration lasted 30 minutes 33 seconds and ran 5,000,000,000 character groups. The monkeys found 1,982,507 distinct character groups and those character groups were found 3,788,175 times for a ratio of 1.8718555. The monkeys ran 7,445,912,000,000 total character groups out of the 5,429,503,678,976 possible combinations for a ratio of 1.3713.

There are 2 technologies I think set the Monkeys Project apart from previous endeavors. The first is Hadoop, which scales well and can handle exponential problems like the Infinite Monkey Theorem. The second is a Bloom Filter. I ran a test last night comparing the speed of the Bloom Filter to that of String.indexOf. The Bloom Filter took 25 seconds to run every work of Shakespeare and I stopped the String.indexOf run after 2 hours. If not for the Bloom Filter, the project would be nowhere near the number of character groups it has checked and would be far from complete. I think the same would be true, though to a lesser degree, of using Lucene or Sphinx.
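To make the Bloom Filter idea concrete, here is a minimal Python sketch. It is not the project's Hadoop code; the `BloomFilter` class, its bit-array size, the number of hashes, and the SHA-256 based hashing are all assumptions picked for illustration. It shows the property that matters here: a quick check that never misses a real match, followed by an exact string search only when the filter says "maybe".

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Sizes are arbitrary assumptions, not the project's settings.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filter(text, group_size=9):
    """Add every overlapping character group of the text to a filter."""
    bloom = BloomFilter()
    for i in range(len(text) - group_size + 1):
        bloom.add(text[i:i + group_size])
    return bloom


text = "tobeornottobethatisthequestion"
bloom = build_filter(text)

candidate = "ornottobe"
# Cheap filter check first; the exact search only runs on a "maybe".
if candidate in bloom and text.find(candidate) != -1:
    print("monkey hit:", candidate)
```

Because almost every random group fails the filter check, the expensive scan of the text is skipped for nearly all of the monkeys' output.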

The Inspiration

This project comes from one of my favorite Simpsons episodes which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: “It was the best of times, it was the blurst of times.”  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did.  I created millions of monkeys on Amazon EC2 (then my home computer) and put them at virtual typewriters (aka Infinite Monkey Theorem).

Less Technical Explanation

Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.

Technical Explanation

For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in ASCII between a and z.  They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys.  Once the monkeys’ output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from Project Gutenberg.
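The flow described above can be sketched in a single Python process. This is only an illustration under stated assumptions: there is no Hadoop here, a plain `set` stands in for the Bloom Filter, the function names (`normalize`, `random_group`, `run_monkeys`) are hypothetical, and lowercasing before stripping to a-z is my guess at the cleanup step. Python's `random.Random` happens to be a Mersenne Twister as well, though it is not Sean Luke's Java implementation.

```python
import random
import re
import string

GROUP_SIZE = 9  # the project's group size in this round


def normalize(raw):
    # Assumption: lowercase first, then drop everything outside a-z.
    return re.sub(r"[^a-z]", "", raw.lower())


def random_group(rng, size=GROUP_SIZE):
    # One "monkey" typing `size` random letters between a and z.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(size))


def run_monkeys(text, attempts, seed=0, size=GROUP_SIZE):
    # A set of known groups stands in for the Bloom Filter membership test.
    known = {text[i:i + size] for i in range(len(text) - size + 1)}
    rng = random.Random(seed)
    found = set()
    for _ in range(attempts):
        group = random_group(rng, size)
        # Stage 1: quick membership test. Stage 2: exact string check.
        if group in known and text.find(group) != -1:
            found.add(group)
    return found


text = normalize("To be, or not to be: that is the question.")
# A group size of 2 keeps the demo fast; 9 letters would need far more attempts.
found = run_monkeys(text, attempts=5000, size=2)
print(len(found), "distinct groups found")
```

The tiny group size in the usage lines is deliberate: it shows the same two-stage check while keeping the search space small enough to finish instantly.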


This chart shows the total number of character groups as more and more iterations of the checks are run.


This chart shows percent complete as more and more iterations are run for each story.

For the curious, the computer I ran the monkeys on is a Core 2 Duo 2.66GHZ with 4 GB RAM running Ubuntu 10.10 64-bit.

A Few Words To Try and Prevent The Usual Comments

I realize there are different interpretations to this saying/theorem and I have done 2 different ones already.  I understand the definition of infinite and infinite monkey theorem and I realize that this project does not have infinite resources.  This project was funded and written by myself and was not supported by any grant money or federal money.  No monkeys were harmed during the making of this code.  This project is my attempt to find a creative way to attain an answer without infinite resources.  It is a fun side project.  If you still feel angry or slighted or feel the need to set me straight, please read this sign:


Thanks to my wife Sara, daughter Ashley, David Weinberg, Ryan Polk, and Tim Dailey.

A Few Million Monkeys Randomly Recreate Shakespeare

Friends, Romans, countrymen, lend me your ears;
I come to recreate Shakespeare, not to praise him.
- Monkey Julius Caesar

Update 1: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project.

Update 2: I created a new visualization of the monkeys’ data.

Today (2011-09-23) at 2:30 PST the monkeys successfully randomly recreated A Lover’s Complaint, followed by The Tempest (2011-09-26), As You Like It (2011-09-28), Loves Labours Lost (2011-09-29), Much Ado About Nothing (2011-09-29), The Merchant Of Venice (2011-09-29), The Sonnets (2011-09-29), The Third Part Of King Henry The Sixth (2011-09-29), The Two Gentlemen Of Verona (2011-09-29), A Midsummer Nights Dream (2011-09-30), As You Like It (2011-09-30), The Life Of King Henry The Fifth (2011-09-30), The First Part Of Henry The Sixth (2011-09-30), The Tragedy Of Titus Andronicus (2011-09-30), The Winters Tale (2011-09-30), Measure for Measure (2011-10-01), The First Part Of King Henry The Fourth (2011-10-01), The History Of Troilus And Cressida (2011-10-01), Cymbeline (2011-10-02), King Richard The Second (2011-10-02), The Comedy Of Errors (2011-10-02), The Life Of Timon Of Athens (2011-10-02), The Tragedy Of Macbeth (2011-10-02), The Tragedy Of Othello Moor Of Venice (2011-10-02), Twelfth Night Or What You Will (2011-10-02), Alls Well That Ends Well (2011-10-03), King Henry The Eighth (2011-10-03), The Second Part Of King Henry The Sixth (2011-10-03), The Tragedy Of Hamlet Prince Of Denmark (2011-10-03), The Tragedy Of Julius Caesar (2011-10-03), The Tragedy Of Romeo And Juliet (2011-10-03), King John (2011-10-04), King Richard III (2011-10-04), Second Part Of King Henry IV (2011-10-04), The Tragedy Of Antony And Cleopatra (2011-10-04), The Tragedy Of Coriolanus (2011-10-04), The Tragedy Of King Lear (2011-10-04), and The Taming Of The Shrew (2011-10-06). This is the first time a work of Shakespeare has actually been randomly reproduced.  Furthermore, this is the largest work ever randomly reproduced. It is one small step for a monkey, one giant leap for virtual primates everywhere.

The monkeys will continue typing away until every work of Shakespeare is randomly created.  Until then, you can continue to view the monkeys’ progress on that page.  I am making the raw data available to anyone who wants it.  Please use the Contact page to ask for the URL. If you have a Hadoop cluster that I could run the monkeys project on, please contact me as well.

This project originally started on August 21, 2011.  Over the course of the project, over 6.5 trillion character groups have been randomly generated and checked against the roughly 5.4 trillion possible combinations.

So far, the project has appeared on Slashdot, Fox News, Engadget, Japanese Engadget, and Solidot. The radio interviews are Australian Broadcasting Company, Little Tommy, Jeff and Jer in San Diego and Radio New Zealand.  If you would like to do a story, please contact me via the Contact page.

The Inspiration

This project comes from one of my favorite Simpsons episodes which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: “It was the best of times, it was the blurst of times.”  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did.  I created millions of monkeys on Amazon EC2 (then my home computer) and put them at virtual typewriters (aka Infinite Monkey Theorem).

Less Technical Explanation

Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.

Technical Explanation

For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in ASCII between a and z.  They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys.  Once the monkeys’ output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from Project Gutenberg.

The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes.  The images below show green for every character group that was found and white for those that are still missing.  The images output is kind of like the animations for defrag utilities.  As the monkeys progress through the works, more and more character groups will be found and show green.
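The green-and-white progress images can be approximated with a per-character coverage mask. The sketch below is an assumption about how such a picture could be derived, not the site's actual image code; `coverage_mask` and `render` are hypothetical names, and `#` and `.` stand in for green and white pixels.

```python
def coverage_mask(text, found_groups, size):
    """Mark every character that lies inside a group the monkeys found."""
    mask = [False] * len(text)
    for i in range(len(text) - size + 1):
        if text[i:i + size] in found_groups:
            for j in range(i, i + size):
                mask[j] = True
    return mask


def render(mask):
    # '#' plays the role of green (found), '.' the role of white (missing).
    return "".join("#" if hit else "." for hit in mask)


text = "tobeornottobe"
found = {"tobe", "nott"}  # pretend the monkeys typed these 4-letter groups
mask = coverage_mask(text, found, size=4)

print(render(mask))
print(f"{100 * sum(mask) / len(mask):.1f}% found")
```

Note that a found group colors every place it occurs in the text, which is why "tobe" fills in both the start and the end of the sample string.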

This chart shows the total number of character groups as more and more iterations of the checks are run.

This chart shows percent complete as more and more iterations are run for each story.

For the curious, the computer I ran the monkeys on is a Core 2 Duo 2.66GHZ with 4 GB RAM running Ubuntu 10.10 64-bit.

A Few Words To Try and Prevent The Usual Comments

I realize there are different interpretations to this saying/theorem and I have done 2 different ones already.  I understand the definition of infinite and infinite monkey theorem and I realize that this project does not have infinite resources.  This project was funded and written by myself and was not supported by any grant money or federal money.  No monkeys were harmed during the making of this code.  This project is my attempt to find a creative way to attain an answer without infinite resources.  It is a fun side project.  If you still feel angry or slighted or feel the need to set me straight, please read this sign:


A Few More Million Amazonian Monkeys

Update 5: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project.

Update 6: I created a new visualization of the monkeys’ data.

Update 4: The monkeys recreated “A Lover’s Complaint”. Check out the write up.

Update 3: Welcome Slashdot, Fox News, Engadget and Japanese Engadget. So far, the monkeys have run through 7.5 trillion character groups. Earlier totals: 6.5 trillion, 5 trillion (2011-09-22), 4 trillion (2011-09-16), 3 trillion (2011-09-10), 2.5 trillion (2011-09-07), 2 trillion (2011-09-05), 1.5 trillion (2011-09-01), 1 trillion (2011-08-28), and 515,912,000,000 (2011-08-25).

In a recent post, I described a project to recreate Shakespeare using Hadoop and Amazon EC2.  This time, I am going to recreate every work of Shakespeare randomly.

This project comes from one of my favorite Simpsons episodes which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: ‘It was the best of times, it was the blurst of times.’  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did (am doing).  I created millions of monkeys on Amazon and put them at virtual typewriters (aka Infinite Monkey Theorem).

Less Technical Explanation

Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.

Technical Explanation

For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in ASCII between a and z.  They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys.  Once the monkeys’ output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from Project Gutenberg.

The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes.  The images below show green for every character group that was found and white for those that are still missing.  The images output is kind of like the animations for defrag utilities.  As the monkeys progress through the works, more and more character groups will be found and show green.

The Tabular Output Of What Has Been Found

Every Work Of Shakespeare

All Works of Shakespeare

Progress Through Individual Works Of Shakespeare

A Lovers Complaint

Loves Labours Lost

The Merchant Of Venice

The Tragedy Of Julius Caesar

A Midsummer Nights Dream

Measure For Measure

The Merry Wives Of Windsor

The Tragedy Of King Lear

Much Ado About Nothing

The Tragedy Of Macbeth

Alls Well That Ends Well

The Sonnets

The Tragedy Of Othello Moor Of Venice

As You Like It

The Comedy Of Errors

The Taming Of The Shrew

The Tragedy Of Romeo And Juliet

Cymbeline

The Tempest

The Tragedy Of Titus Andronicus

King Henry The Eighth

The First Part Of King Henry The Fourth

Second Part Of King Henry IV

The First Part Of Henry The Sixth

The Second Part Of King Henry The Sixth

The Third Part Of King Henry The Sixth

The Two Gentlemen Of Verona

King John

The History Of Troilus And Cressida

The Tragedy Of Antony And Cleopatra

The Winters Tale

King Richard III

The Life Of King Henry The Fifth

The Tragedy Of Coriolanus

Twelfth Night Or What You Will

King Richard The Second

The Life Of Timon Of Athens

The Tragedy Of Hamlet Prince Of Denmark

Update: I was running this on a free micro instance (600 MB RAM) from Amazon. Alas, the monkeys needed more RAM than the free micro instance had and the processes got out-of-memory errors. I have moved the Hadoop server to my home computer, which is much faster and has more memory.

Update 2: I updated the Hadoop configuration to have less idle CPU time. This will significantly increase the monkey power and find more character groups.

Update 4: I made a small change to how memory is allocated for the random character groups. It should help speed things up again.

A Million Amazonian Monkeys

One of my favorite Simpsons episodes has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: ‘It was the best of times, it was the blurst of times.’  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did (tried to do).  I created millions of monkeys on Amazon and put them at virtual typewriters (aka Infinite Monkey Theorem).  An old New York Times article gives a general account of computers as authors.

I have been learning Hadoop lately and wanted a project to try some things out.  Also, I have been wanting to try Amazon’s Web Services out.  This project brought the 2 pieces together by using Hadoop’s MapReduce with Amazon’s Elastic MapReduce.  The best part of all is how cheap it is to try things like this.  A small instance costs $0.10 and a medium CPU instance costs $0.20 per instance hour.

Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in ASCII between a and z.  They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys.  Once the monkeys’ output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 10 characters of Shakespeare.

The source material is all of Shakespeare’s works as taken from Project Gutenberg.  My monkeys leave it to their editors to format things well.  Everything is stripped down to a-z; all whitespace and newlines are taken out of the source material.  Otherwise, things are left intact.  I had to shorten the monkeys’ output to reduce the number of combinations due to the exponential nature of the problem.  The table below shows the exponential growth and computational difficulty of the problem.

# Characters    Combinations
10              141,167,095,653,376
9               5,429,503,678,976
8               208,827,064,576
7               8,031,810,176
6               308,915,776
5               11,881,376
4               456,976
3               17,576
2               676
1               26
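The numbers in the table are simply 26 raised to the group length, since each position can hold any of the 26 letters. A short Python check (the `combinations` helper is a name made up for this illustration) reproduces the rows:

```python
ALPHABET_SIZE = 26  # the letters a through z


def combinations(length):
    """Number of distinct letter groups of the given length."""
    return ALPHABET_SIZE ** length


# Print the table from 10 characters down to 1.
for length in range(10, 0, -1):
    print(f"{length:>2}  {combinations(length):>23,}")
```

Each extra character multiplies the search space by 26, which is why dropping from 10 characters to 9 makes such a difference.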

There are several benefits to using Hadoop (MapReduce) and Amazon.  One is that other works or every book in Project Gutenberg could be searched.  The beauty of a Bloom Filter is that it allows a membership test without having to load the entire text into memory, and the test is very quick because it does not have to compare every character.  Another benefit is that one can scale this very well via Amazon, where adding another computer to the cluster increases the output in a linear fashion.  However, my pocketbook doesn’t scale well; maybe Amazon will give me some free monkeys.  With only a few minor changes, the Monkey MapReduce program could be changed to 24 random characters, check against every written work by man, and scale to 100+ computers.

Now for some numbers.  There are 26^10 (141,167,095,653,376) combinations of 10 characters with the 26 letters of the English alphabet.  In my trimmed down Shakespeare, there are 3,696,339 non-distinct possible 10 character groups.  As of today, I have run 20,164,000,000 maps or 10 character checks using Elastic MapReduce.  This comes out to 0.014% of the possible combinations.  For the curious, I am running 2 instances of “HighCPU – Medium”.  A day’s worth of monkeys costs $19.20 and does 6,980,000,000 checks.
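As a rough back-of-the-envelope check on these figures, the Python below recomputes the percentage and estimates what it would take to run as many checks as there are 10-character combinations at the quoted daily rate and cost. The variable names are mine, and the estimate is only illustrative: random monkeys repeat themselves, so that many checks would still not cover every combination.

```python
TOTAL_COMBINATIONS = 26 ** 10          # 141,167,095,653,376
checks_so_far = 20_164_000_000         # maps run so far, from the post
checks_per_day = 6_980_000_000         # daily throughput, from the post
cost_per_day = 19.20                   # dollars per day, from the post

# Share of the search space represented by the checks run so far.
percent_done = 100 * checks_so_far / TOTAL_COMBINATIONS

# Time and money for a number of checks equal to the number of combinations.
days_for_one_pass = TOTAL_COMBINATIONS / checks_per_day
years_for_one_pass = days_for_one_pass / 365
cost_for_one_pass = days_for_one_pass * cost_per_day

print(f"{percent_done:.3f}% of combinations checked")
print(f"{days_for_one_pass:,.0f} days (about {years_for_one_pass:.0f} years)")
print(f"${cost_for_one_pass:,.0f} at the current daily cost")
```

At this rate the 10-character version would take decades on two instances, which puts the move to shorter character groups in perspective.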

Even though this project ranks up there as the eighth wonder of the world and is as important as curing cancer, I don’t have a bunch of money to put towards it.  You can get in touch with me via the contact form to help out and carry on this monumental work.  We need to act now before the endangered Amazonian Hadoop Monkeys go extinct.  I have an even better idea for creating Shakespeare with Amazonian Monkeys and a very cool way to visualize it.

© 2011-2014 Jesse Anderson All Rights Reserved