A Few More Million Amazonian Monkeys

Jesse Anderson
August 21, 2011

Blog Summary:

The project aims to recreate every work of Shakespeare randomly using virtual, computerized monkeys that output random gibberish.
The computer program compares the monkey's gibberish to every work of Shakespeare to see if it matches a small portion of what Shakespeare wrote.
The monkeys' data from Amazon's cloud is updated on the website every 30 minutes.
The project uses Hadoop, Amazon EC2, and Ubuntu Linux.
The source material is all of Shakespeare's works as taken from Project Gutenberg.

Update 5: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project.

Update 6: I created a new visualization of the monkeys’ data.

Update 4: The monkeys recreated “A Lover’s Complaint”. Check out the write-up.

Update 3: Welcome Slashdot, Fox News, Engadget and Japanese Engadget. So far, the monkeys have run through 7.5 trillion ~~6.5 trillion~~ ~~5 trillion (2011-09-22)~~ ~~4 trillion (2011-09-16)~~ ~~3 trillion (2011-09-10)~~ ~~2.5 trillion (2011-09-07)~~ ~~2 trillion (2011-09-05)~~ ~~1.5 trillion (2011-09-01)~~ ~~1 trillion (2011-08-28)~~ ~~515,912,000,000 (2011-08-25)~~ character groups.

In a recent post, I described a project to recreate Shakespeare using Hadoop and Amazon EC2. This time, I am going to recreate every work of Shakespeare randomly.

This project comes from one of my favorite Simpsons episodes, which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: “It was the best of times, it was the blurst of times.” The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare (aka the Infinite Monkey Theorem). And that is what I did (am doing): I created millions of monkeys on Amazon and put them at virtual typewriters.

Less Technical Explanation

Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.

Technical Explanation

For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux. Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys. The Map Monkeys create random ASCII data using the letters a through z. They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys. Once a monkey’s output is mapped, it is passed to the reducer, which runs the characters through a Bloom filter membership test. If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison. If that passes, a genius monkey has written 9 characters of Shakespeare. The source material is all of Shakespeare’s works as taken from Project Gutenberg.
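To make the map and reduce steps above more concrete, here is a minimal Python sketch of the same idea. It is not the project's code: the real job runs on Hadoop and uses Sean Luke's Mersenne Twister, while this sketch uses Python's `random` module (which is also a Mersenne Twister). The names `BloomFilter`, `normalize`, `character_groups`, `monkey_types`, and `check_group` are made up for illustration, and the preprocessing (lowercasing, dropping everything except a to z, overlapping 9-letter groups) is an assumption rather than something this post specifies.

```python
import hashlib
import random

GROUP_LEN = 9  # the post says a match is 9 characters of Shakespeare


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from salted SHA-256 digests.
        # (An illustrative choice, not necessarily what the project uses.)
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def normalize(text):
    """Keep only the letters a-z, lowercased (assumed preprocessing)."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


def character_groups(text, length=GROUP_LEN):
    """Every overlapping run of `length` letters (overlap is an assumption)."""
    letters = normalize(text)
    return {letters[i:i + length] for i in range(len(letters) - length + 1)}


def monkey_types(rng, length=GROUP_LEN):
    """Map step: one monkey emits a random group of lowercase letters."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))


def check_group(group, bloom, known_groups):
    """Reduce step: cheap Bloom test first, exact comparison only on a hit."""
    return group in bloom and group in known_groups
```

The Bloom filter is there to reject almost all gibberish cheaply, and the exact comparison weeds out its occasional false positives. With 26 letters and 9 positions there are 26^9 = 5,429,503,678,976 possible groups, so any single random group is very unlikely to match a given target.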

The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes. The images below show green for every character group that was found and white for those that are still missing. The image output is kind of like the animations for defrag utilities. As the monkeys progress through the works, more and more character groups will be found and show green.
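As a rough illustration of how found and missing character groups could be turned into a progress picture and a percentage, here is a small Python sketch. It is not how the site renders its images: the function names `mark_found`, `percent_found`, and `render` are hypothetical, it assumes the text has already been reduced to lowercase letters, and it prints `#` and `.` where the real images use green and white.

```python
def mark_found(text_letters, found_groups, length=9):
    """Return one boolean per letter: True if any found group covers it."""
    covered = [False] * len(text_letters)
    for i in range(len(text_letters) - length + 1):
        if text_letters[i:i + length] in found_groups:
            for j in range(i, i + length):
                covered[j] = True
    return covered


def percent_found(covered):
    """Percentage of letters covered by at least one found group."""
    return 100.0 * sum(covered) / len(covered) if covered else 0.0


def render(covered, width=40):
    """Text stand-in for the image: '#' means found, '.' means missing."""
    cells = "".join("#" if c else "." for c in covered)
    return "\n".join(cells[i:i + width] for i in range(0, len(cells), width))
```

For example, `render(mark_found("tobeornottobe", {"tobeornot"}), width=13)` gives `#########....`, since the single found group covers the first nine letters.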

The Tabular Output Of What Has Been Found

(The results table will only load on jesse-anderson.com due to browser security restrictions.)

Every Work Of Shakespeare

All Works of Shakespeare

Progress Through Individual Works Of Shakespeare

A Lovers Complaint

Loves Labours Lost

The Merchant Of Venice

The Tragedy Of Julius Caesar

A Midsummer Nights Dream

Measure For Measure

The Merry Wives Of Windsor

The Tragedy Of King Lear

Much Ado About Nothing

The Tragedy Of Macbeth

Alls Well That Ends Well

The Sonnets

The Tragedy Of Othello Moor Of Venice

As You Like It

The Comedy Of Errors

The Taming Of The Shrew

The Tragedy Of Romeo And Juliet

Cymbeline

The Tempest

The Tragedy Of Titus Andronicus

King Henry The Eighth

The First Part Of King Henry The Fourth

Second Part Of King Henry IV

The First Part Of Henry The Sixth

The Second Part Of King Henry The Sixth

The Third Part Of King Henry The Sixth

The Two Gentlemen Of Verona

King John

The History Of Troilus And Cressida

The Tragedy Of Antony And Cleopatra

The Winters Tale

King Richard III

The Life Of King Henry The Fifth

The Tragedy Of Coriolanus

Twelfth Night Or What You Will

King Richard The Second

The Life Of Timon Of Athens

The Tragedy Of Hamlet Prince Of Denmark

Update: I was running this on a free micro instance (600 MB RAM) from Amazon. Alas, the monkeys needed more RAM than the free micro instance had, and the processes were hitting out-of-memory errors. I have moved the Hadoop server to my home computer, which is much faster and has more memory.

Update 2: I updated the Hadoop configuration to have less idle CPU time. This will significantly increase the monkey power and find more character groups.

Update 4: I made a small change to how memory is allocated for the random character groups. It should help speed things up again.
