EC2 Performance, Spot Instance ROI and EMR Scalability
Note: This is a very long, technical and detailed discussion of Amazon Web Services. You can watch the YouTube video below for a less technical explanation or skip to the conclusion to get the results.
Introduction
In 2006, Amazon introduced Elastic Compute Cloud (EC2) to Amazon Web Services (AWS). In 2009, Amazon introduced Elastic MapReduce (EMR). EMR uses Hadoop to run MapReduce jobs on EC2 instances with Simple Storage Service (S3) as the permanent storage mechanism. In 2011, Amazon added Spot Instance support for EMR jobs. Spot Instances allow you to bid on EMR or EC2 instances that are not in use. The pricing page under the Spot Instances heading gives up-to-date data on EMR and EC2 instance prices.
In 2011, I created the Million Monkeys Project (source code). It is a good metric for CPU and memory speed in a Hadoop cluster because its character group testing is very CPU and memory intensive. In this project, I use the Million Monkeys code to profile the various EC2 instance types and the scalability of EMR and Hadoop. I also cover the cost savings from running EMR jobs as Spot Instances (bid price) instead of On Demand instances (full price). This post should help engineers choose the right EC2 instance types based on the amount of work or computation needed.
When I originally ran the Million Monkeys Project to recreate every work of Shakespeare, I lacked the resources to run it entirely on EMR. I started the project on an EC2 micro instance, but the instance lacked enough RAM to run everything I needed. This time, I have the resources to run the entire project and recreate every work of Shakespeare on EMR using a 20 node EMR cluster.
Setting Up An EMR Cluster
To run a Hadoop cluster, various Hadoop services like the Task Tracker and DFS service need to be running. This is in addition to the Map and Reduce tasks that do the actual work. In an EMR cluster, the various Hadoop services run on a master instance group. The Map and Reduce tasks run on a core instance group. The core instance group is made up of 1 or more EC2 instances. When creating the EMR cluster, you can choose a different instance type for the master and core nodes. You can use the information in this post to decide which instance type to use for a given task.
An EMR cluster is built on EC2 instances and these instances run various parts of the Hadoop cluster. The data can reside in S3 and be loaded from S3 into the Hadoop Distributed File System (HDFS). The compiled code or JAR and any input files are stored in an S3 bucket. When the master instance group terminates at the end of a job, any files that are not in S3 are lost. Therefore, you should make sure that the code places any important output in S3. In the Million Monkeys code, I created a prefix that could be added to a file’s path to place it directly on S3.
Table 1.1 The Breakdown of Various EC2 Instances Specifications
| Instance Name | Memory | EC2 Compute Units and Cores | Platform | I/O Performance |
| Small | 1.7 GB | 1 ECU on 1 Core | 32-bit | Moderate |
| Large | 7.5 GB | 4 ECU on 2 Cores | 64-bit | High |
| Extra Large | 15 GB | 8 ECU on 8 Cores | 64-bit | High |
| High-CPU Medium | 1.7 GB | 5 ECU on 2 Cores | 32-bit | Moderate |
| High-CPU Large | 7 GB | 20 ECU on 8 Cores | 64-bit | High |
| Quadruple Extra Large | 23 GB | 33.5 ECU on 8 Cores | 64-bit | Very High |
Source Note: The High-Memory Extra Large, High-Memory Double Extra Large and High-Memory Quadruple Extra Large instances were not tested and are not included in this table.
Instance Testing
EC2 has various instance types, each with its own performance specifications. These EC2 instances are analogous to running a virtual machine in the cloud. As shown in Table 1.1, each instance type varies in the number of EC2 Compute Units (ECU), the number of virtual cores, the amount of RAM, 32- or 64-bit platform, the amount of disk space, and network or I/O performance. Some of these descriptions are quite nebulous. For example, this is the description from Amazon regarding the definition of an ECU (Source):
EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
That description of CPU capacity does not really help in making capacity decisions or even in guessing how to scale an application. To this end, I ran various tests to give a concrete idea of how each instance compares to the others when running the same workload.
For these tests, I ran the Million Monkeys program for 5 continuous hours. During this time, the Million Monkeys code runs in a loop and the total number of character groups is counted. A character group is a group of 9 characters that is randomly generated using a Mersenne Twister and whose existence is checked against every work of Shakespeare. Each run lasted slightly over 5 hours, so the number of character groups is pro-rated to a 5 hour period.
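As a rough illustration of the loop described above, here is a minimal Python sketch that generates 9-character groups with a Mersenne Twister (Python's `random.Random` uses MT19937) and checks each one against a text. It is not the project's actual Hadoop code. The sample sentence, the lowercase a-z alphabet, the stripping of spaces and punctuation, and all function names are assumptions made for illustration. The `prorate` helper shows one way a run of slightly over 5 hours could be scaled to exactly 5 hours.

```python
import random
import string

GROUP_SIZE = 9  # characters per group, as described in the post
ALPHABET = string.ascii_lowercase  # assumption: lowercase a-z only


def build_group_set(text, size=GROUP_SIZE):
    """Collect every run of `size` consecutive letters found in the text."""
    letters = "".join(ch for ch in text.lower() if ch in ALPHABET)
    return {letters[i:i + size] for i in range(len(letters) - size + 1)}


def random_group(rng, size=GROUP_SIZE):
    """Generate one random character group."""
    return "".join(rng.choice(ALPHABET) for _ in range(size))


def prorate(groups_checked, actual_hours, target_hours=5.0):
    """Scale a count from a run of `actual_hours` to `target_hours`."""
    return groups_checked * target_hours / actual_hours


rng = random.Random(42)  # random.Random is a Mersenne Twister (MT19937)
known = build_group_set("To be, or not to be, that is the question")
checked = 10_000
hits = sum(1 for _ in range(checked) if random_group(rng) in known)
print(f"checked {checked} groups, {hits} found in the text")
```

With only a handful of known groups out of trillions of possibilities, hits are extremely rare, which is why the real project needed trillions of attempts.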
Chart 1.1 Total Character Groups Checked In a 5 Hour Pro-rated Period
Chart 1.1 does not present any surprises. The Small instance obviously has the fewest character groups, followed by Hi-CPU Medium. Large comes in third and Extra Large and Hi-CPU Large are a virtual tie, with Extra Large coming out slightly higher. Quadruple Extra Large is the obvious winner with the highest total character groups. Chart 1.1 gives an idea of the raw computing power of each instance. It is not until we start looking at price per unit that we get a handle on cost efficiency of a particular instance.
In the original Million Monkeys project, I ran the entire Hadoop cluster on my home computer, an Intel Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit. In 5 hours, my home computer ran 50,000,000,000 character groups. One of the main differences between my home computer and the EC2 instances is that my home computer was not running in a virtualized environment. I have seen a 10-30% decrease in efficiency when using virtualization. Also, all processing was done locally with the Hadoop services and MapReduce tasks running on the same computer.
Chart 1.2 Spot Instance Savings Per Hour When Compared to On Demand
Spot Instances help reduce the cost of running an EMR cluster. The Spot Instance prices will fluctuate as the market price changes. Chart 1.2 represents the Spot Instance (bid) prices relative to their On Demand (full) prices when I ran the tests. The savings in this test were very even across all instances at about 65% off their On Demand prices. With a little bit of forward planning, an EMR cluster can save a lot of money using Spot Instances.
I should point out that running on a Spot Instance does not require a code change per se. However, an EMR job flow’s Spot Instances can be taken away because of market price fluctuations. A MapReduce job flow may need to be changed to accommodate an unplanned stoppage. This might include saving the job state and adding the ability to start back up where it left off at the last save point. The Million Monkeys code already did this and could take advantage of the Spot Instances without any code changes.
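The save-and-resume idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not the Million Monkeys implementation: the state fields, the function names, and the use of a local JSON file are assumptions. On EMR the checkpoint would have to live somewhere that survives the instances, such as S3.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "groups_checked": 0}


def save_state(path, state):
    """Write the checkpoint atomically so a sudden stop cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, total_iterations, groups_per_iteration=1000):
    """Resume from the last checkpoint and work up to `total_iterations`."""
    state = load_state(path)
    while state["iteration"] < total_iterations:
        # Stand-in for one iteration of real work.
        state["groups_checked"] += groups_per_iteration
        state["iteration"] += 1
        save_state(path, state)
    return state
```

If the process is killed partway through, calling `run` again with the same path picks up at the last saved iteration instead of starting over.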
Chart 1.3 Cost Per Hour For On Demand and Spot Instances

Chart 1.3 shows another cost breakdown by hour of usage and total costs. The total cost for a single node cluster with EMR can be calculated as laid out in Table 1.2. For an On Demand instance, the total cost per hour is the master node price plus the core instance group price, plus the EMR charge for every instance. For a Spot instance, the total cost per hour is the master node price plus the core instance group Spot price, plus the EMR charge for every instance.
For example, when I ran the Hi-CPU Medium instance testing, I paid a Spot price for the core instance group of $0.06 per hour ($0.17 On Demand). I also had to pay for the EMR cluster’s master node ($0.17 per hour), which was a Hi-CPU Medium instance as well. On top of that, I had to pay the EMR price per hour ($0.03) for both the master and the core node.
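The arithmetic for the Hi-CPU Medium example can be written out as a short Python sketch. The function name and parameters are made up for illustration, and it assumes the $0.03 EMR charge applies equally to the master and to each core instance, as the example describes.

```python
def cluster_cost_per_hour(master_price, core_price, emr_price, core_nodes=1):
    """Hourly cost: one master node plus `core_nodes` core nodes,
    with the EMR charge applied to every instance."""
    instances = 1 + core_nodes
    return master_price + core_price * core_nodes + emr_price * instances


# Prices taken from the Hi-CPU Medium example above.
spot_total = cluster_cost_per_hour(0.17, 0.06, 0.03)
on_demand_total = cluster_cost_per_hour(0.17, 0.17, 0.03)
print(f"Spot: ${spot_total:.2f}/hour, On Demand: ${on_demand_total:.2f}/hour")
# prints: Spot: $0.29/hour, On Demand: $0.40/hour
```

Passing a larger `core_nodes` value gives the same calculation for a multi-node core group.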
To help illustrate the total pricing, Table 1.2 details the breakdown of total price per hour for Spot and On Demand instances.
Table 1.2 Spot Instance and On Demand Price Calculation
| Price Description | Spot Instance Price | On Demand Price |
| Master Node | 0 | 0 |
| Master Node EMR | 0 | 0 |
| Core Node | 0 | 0 |
| Core Node EMR | 0 | 0 |
| Total Price Per Hour | 0 | 0 |
It is possible to run the master node as a Spot instance instead of an On Demand instance. Amazon recommends running the master node as an On Demand instance to prevent market price from taking out your master node and stopping the entire cluster.
For these tests, I varied the master node instance type. Table 1.3 lists the instance type of each core instance group along with the instance type I used for its master node.
Table 1.3 Core Instance Group Type Used With Master Group Type Test
| Core Instance Group Type | Master Instance Group Type |
| Small | Small |
| Large | Large |
| Extra Large | Extra Large |
| Hi-CPU Medium | Hi-CPU Medium |
| Hi-CPU Large | Hi-CPU Medium |
| Quadruple Extra Large | Hi-CPU Medium |
Chart 1.4 Cost Per 100,000,000 Character Groups Checked
Breaking down the data into price per unit gives insight into the most cost efficient means of running a job. In Chart 1.4, I break down the cost by how much it costs to process 100,000,000 character groups. For Chart 1.4, the lower the number the better. This bore out my hunch that the best bang for the monkey buck is a Hi-CPU Medium instance. I was surprised that the Small instance didn’t come in second best; that position was taken by the Large instance.
Once again, we can see the cost benefits of using a Spot instance. Across the board, the Spot instances have a much smaller variance than their On Demand counterparts. The Spot instances went from $0.00128 to $0.00497 and the On Demand instances went from $0.00364 to $0.0142.
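A price-per-unit figure like the one in Chart 1.4 can be derived as follows. This Python sketch is an assumption about the calculation, since the post does not spell out which cost components went into the chart, and the inputs below are hypothetical rather than measurements from these tests.

```python
def cost_per_100m_groups(cost_per_hour, hours, groups_checked):
    """Dollars spent per 100,000,000 character groups checked."""
    total_cost = cost_per_hour * hours
    return total_cost / (groups_checked / 100_000_000)


# Hypothetical inputs chosen for illustration only.
example = cost_per_100m_groups(
    cost_per_hour=0.30, hours=5, groups_checked=50_000_000_000
)
print(f"${example:.5f} per 100,000,000 character groups")
```

Dividing by throughput is what lets a cheap, slower instance beat a fast, expensive one on cost efficiency.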
Scalability Testing
The instance testing above led up to the next phase of the project. In Chart 1.4, we found that the Hi-CPU Medium instances provided the highest cost efficiency per character group. Now, I will take the most cost efficient instance and see how well it scales by adding more nodes to the cluster. For these tests, I created EMR clusters of 1, 2, 3, 4, 5, 10 and 20 nodes. Once again, I ran each cluster size for 5 hours and captured the results.
Chart 2.1 Spot Instance Savings Compared to On Demand Prices

In Chart 2.1, I show the cost savings by comparing Spot and On Demand prices across cluster sizes. The bars with the “All” designations show the entire cost roll up of the cluster size. The core cost is consistent across all cluster sizes; however, having more nodes running at once increased the savings.
Chart 2.2 Cost Per Hour When Running Various Numbers of Nodes

Chart 2.2 shows another cost breakdown by hour of usage and total costs for Spot and On Demand instances.
To help illustrate the total pricing with a multi-instance core group, Table 2.1 details the breakdown of total price per hour for Spot and On Demand instances for a 10 node cluster.
Table 2.1 Spot and On Demand Instance Price Calculation
| Price Description | Spot Instance Price | On Demand Price |
| Master Node | 0 | 0 |
| Master Node EMR | 0 | 0 |
| Core Node | 0 | 0 |
| Core Node EMR | 0 | 0 |
| Total Price Per Hour | 0 | 0 |
As you can see, you can calculate the current cost of a cluster and even project its future cost. A new company, Cloudability (now in beta), has made it its mission not just to simplify cluster, EMR and EC2 price reporting but to look for ways to improve it. Cloudability can even send you a daily or weekly E-mail showing the charges for that period. You can check out their website and sign up for a free account. Although I was unable to use Cloudability for this project, I look forward to using it in my next projects.
Chart 2.3 Cost To Run 100,000,000 Character Groups At Various Numbers of Nodes

In Chart 2.3, I break down the cost by how much it costs to process 100,000,000 character groups. For Chart 2.3, the lower the number the better. Once again, the Spot instance pricing shines. In this case, the Spot instance prices are quite flat across cluster sizes while the On Demand prices vary much more.
Chart 2.4 Total Character Groups At Various Numbers of Nodes Pro-rated to 5 Hours

Chart 2.4 shows the power of creating a multi-node cluster. With 20 nodes in the cluster, 477,987,913,067 character groups can be run in a 5 hour period.
I want to reiterate that there are no code changes necessary for creating a large cluster like this. I only needed to make EMR configuration changes when creating the cluster. Also, cluster configuration changes can be made to a live or running cluster. You can add or remove core instances to increase or decrease the performance of a cluster.
Chart 2.5 Percent Of Linear Scalability From Actual Growth At Various Numbers of Nodes

Now let’s get into the scalability of EMR and Hadoop. A 1 node cluster is assumed to be the most efficient possible in an EMR cluster, so Chart 2.5 shows it at 100% efficiency. For every other cluster size, the linear projection is the number of character groups for 1 node times the number of nodes in the cluster, and the efficiency is the actual number of character groups as a percentage of that projection. Clusters of 2-5 nodes have a very similar loss of efficiency at about 5%. The 10 and 20 node clusters have a loss of efficiency of 13% and 16% respectively.
Anyone who has created a distributed system will recognize 84% as a phenomenal level of scalability. This really shows that EMR and Hadoop are living up to the hype as revolutionary technologies. With no code changes and simple configuration changes, you can easily scale an application.
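The efficiency calculation can be expressed as a small Python sketch. The function name is made up, and the character group counts are hypothetical round numbers picked so the result lands on the 20 node figure quoted here; they are not the measured values.

```python
def scaling_efficiency(single_node_groups, nodes, actual_groups):
    """Actual throughput as a fraction of perfect linear scaling."""
    linear_projection = single_node_groups * nodes
    return actual_groups / linear_projection


# Hypothetical counts chosen for illustration only.
single_node = 1_000_000
efficiency = scaling_efficiency(single_node, nodes=20, actual_groups=16_800_000)
print(f"{efficiency:.0%} of linear, {1 - efficiency:.0%} lost")
# prints: 84% of linear, 16% lost
```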
Chart 2.6 Actual Scalability With Projected Linear Growth Pro-rated to 5 Hours

Chart 2.6 presents another breakdown of the scalability showing the absolute or actual values and the calculated values at 100% efficiency. Once again, we see a very gradual decline in cluster node sizes 2-5. There is a much more obvious decline on 10 and 20 nodes.
Million Monkeys On EMR
In my original run of the Million Monkeys Project, I tried to use an EC2 micro instance to run the project. The project needed more RAM than was available on the micro instance and I had to move it to my home computer. Many reporters and commenters asked me how long the project would take if I ran it to completion on EMR. This time, thanks to Amazon, I have the resources to run the project on a multi-node EMR cluster.
The instance testing and scalability testing really led up to this test. In the instance testing, I wanted to find the EC2 instance type with the best bang for the buck. Next, I took that best EC2 instance (Hi-CPU Medium) and wanted to see how much efficiency I was losing when running a 20 node cluster. From there, I created a 20 node Hi-CPU Medium cluster that ran the Million Monkeys code for a prolonged period of time. I wanted to see how long it would take a 20 node cluster to recreate the original project.
For a little perspective, the original Million Monkeys project recreated every work of Shakespeare after running 7.5 trillion character groups and ran for 46 days. For these prolonged tests, I actually ran the 20 node cluster twice. The first time ran 12 trillion character groups in 5 days 17 hours. The second time ran 25.7 trillion character groups in 11 days 15 hours. Each one ran about 2.2 trillion character groups per day. Given the random nature of the problem, we can only extrapolate how long the original project would have taken. With these performance numbers, it would have taken 3 days 9 hours to complete the original project.
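The extrapolation is a simple division, sketched below in Python using the rounded figures from this paragraph (7.5 trillion character groups at about 2.2 trillion per day). The helper names are made up for illustration, and because the inputs are rounded, the result only approximates the figure quoted above.

```python
def days_to_complete(total_groups, groups_per_day):
    """Days needed to check `total_groups` at a steady daily rate."""
    return total_groups / groups_per_day


def split_days_hours(days):
    """Split fractional days into whole days and remaining hours."""
    whole_days = int(days)
    hours = (days - whole_days) * 24
    return whole_days, hours


days = days_to_complete(7.5e12, 2.2e12)
whole_days, hours = split_days_hours(days)
print(f"about {whole_days} days and {hours:.1f} hours")
```

The result comes out at a little over 3 days and 9 hours, in line with the estimate in the text.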
The cluster cost about $45.44 per day to run. I ran the cluster with the same configuration as the 20 node scalability test above, with the master instance group as one Hi-CPU Medium instance running On Demand. The other 20 nodes were Hi-CPU Medium instances running with a Spot price of $0.09 per hour. The 5 day run cost $317.96. The 11 day run total cost was $528.25. If I hadn’t used Spot instances, the 11 day total cost would have been $1,514.87. Once again, Spot pricing really shines because I achieved the same goal with almost $1,000 in savings.
Thoughts and Caveats
Previously, I mentioned that the Million Monkeys code is a good metric of CPU and memory. There is less I/O than might be run in other MapReduce tasks. I spent some time and effort to reduce the amount of I/O in the code. To reduce the amount of I/O, I used a Bloom Filter in the Map task. The Bloom Filter is created once and saved in S3. All future Map tasks simply read the Bloom Filter file and run all processing against it. Once the Reduce task is run, a 3.5 MB text file is loaded into memory for the final existence checks. Depending on the MapReduce task, a Map task may need to read in gigabytes or even terabytes of data for processing. Another key difference between EC2 instances is their I/O performance. For MapReduce tasks that require high I/O performance, a High-CPU Medium instance with moderate I/O performance may not have the best cost to performance ratio.
Earlier in the project, I used the AWS web user interface to create EMR jobs. It was a bit of a pain to set up the command line interface’s (CLI) various keys. Once I set up the CLI, it made the testing much easier and I wish I had used the CLI sooner. It was much easier to repeat a job. The EMR API can be used to spawn your cluster programmatically. Here is the command line that I used to spawn the job:
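To show why a Bloom Filter cuts down on I/O, here is a minimal Python sketch of one. It is not the filter used in the Million Monkeys code: the bit array size, the number of hashes, the SHA-256 based double hashing, and the sample character groups are all assumptions for illustration. The key property is that a Bloom Filter never gives a false negative, so the Map task can cheaply discard almost every random group and leave the exact check against the full text to the Reduce task.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical character groups standing in for groups from Shakespeare.
bloom = BloomFilter(size_bits=1 << 16, num_hashes=4)
for group in ("tobeornot", "obeornott", "beornotto"):
    bloom.add(group)

print("tobeornot" in bloom)  # a group that was added is always reported present
```

Because the filter is just a byte array, it can be built once, written to a file, and read back by every Map task, which matches the create-once, read-many usage described above.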
./elastic-mapreduce --create --name "Monkeys Scalability 5 Hour Test 20 Node" --instance-group master --instance-type c1.medium --instance-count 1 --instance-group core --instance-type c1.medium --instance-count 20 --bid-price 0.08 --jar s3://monkeys2/monkeys.jar --arg timelimit=5h --arg iterationsize=1 --arg memory=-Xmx1024m --arg -Dmapred.max.split.size=12000 --arg -Dmapred.min.split.size=10000
I would like to break down what this command is doing. It is creating a new EMR cluster with the job name “Monkeys Scalability 5 Hour Test 20 Node.” The master instance group will be made up of one High-CPU Medium instance. The core instance group will be made up of 20 nodes with a bid price of $0.08 per hour per instance. It will be running a custom jar located in S3 at s3://monkeys2/monkeys.jar. The rest of the arguments are for the Million Monkeys code itself.
The performance could probably be improved by spending some time on tuning and configuration changes. For the duration of this project, I did not spend time optimizing the jobs and used only default settings, except for the maximum Java heap memory (-Xmx).
Although I kept my bid prices in nice round cents, you can bid in fractions of a cent like $0.085.
For the curious, I used JFreeChart for the charts and graphs on this page with a customized color scheme.
Problems
Hadoop and EMR jobs are usually geared towards very large input files. In the case of the Million Monkeys project, the input files are very small, usually a few KB. This presented an issue when I started running the EMR cluster with multiple nodes. When I compared the multi-node results to the single node results, there was barely any improvement in total character groups. In some cases, a multi-node cluster did worse than a single node cluster. After a LOT of Googling and guesses, I finally found a small input file workaround by specifying the max and min split sizes for the input file. Here is the command so as to save a future person lots of Googling:
./elastic-mapreduce --create --name "Monkeys Scalability 5 Hour Test 20 Node" --instance-group master --instance-type c1.medium --instance-count 1 --instance-group core --instance-type c1.medium --instance-count 20 --bid-price 0.08 --jar s3://monkeys2/monkeys.jar --arg timelimit=5h --arg iterationsize=1 --arg memory=-Xmx1024m --arg -Dmapred.max.split.size=12000 --arg -Dmapred.min.split.size=10000
The split workaround stopped working (I never figured out why). I looked around for a better solution and found the NLineInputFormat class. Had I known about this class when I first wrote the code, I would have used it. It is a much better fit for the type of input I am using for the Million Monkeys project.
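Hadoop's NLineInputFormat gives each map task a fixed number of input lines instead of splitting the input by byte size. The Python sketch below only imitates that grouping to show why it suits a tiny input file; it is not Hadoop code, and the function name and line contents are hypothetical.

```python
def n_line_splits(lines, lines_per_split):
    """Group input lines into splits of at most `lines_per_split` lines."""
    return [
        lines[i:i + lines_per_split]
        for i in range(0, len(lines), lines_per_split)
    ]


# Hypothetical input: one short line per unit of work.
input_lines = [f"task-{n}" for n in range(1, 8)]
splits = n_line_splits(input_lines, lines_per_split=2)
for number, split in enumerate(splits, start=1):
    print(f"map task {number}: {split}")
```

With byte-based splitting, a file of a few KB usually ends up in a single split and therefore a single map task. Splitting by line count spreads the work across the cluster regardless of how small the file is.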
When you are budgeting for your AWS project, make sure you bake in some time and money for running down issues. You may run into some issues with multiple nodes that did not happen on a single development computer.
Conclusion
EMR is a great, cost effective way to get an enterprise Hadoop cluster going. It is also easier to get an enterprise Hadoop cluster up and running with EMR than with a Do-It-Yourself method. An EMR cluster solves the many problems of creating an enterprise cluster like hardware specs, uptime and configuration. Until you have dealt with the pain of redundancy and enterprise hardware requirements, you don’t know how much time and effort EC2 and S3 save. With EMR, you simply have to start the cluster and all of these issues are taken care of.
I also showed how EMR and Hadoop make scaling easy. You do not have to convince your boss to buy $2,000 to $4,000 servers; you can simply add more EC2 instances to the core instance group or change the instance type to one with more ECUs. This can be done on a temporary basis to accommodate higher usage or a gradual increase in capacity. Without changing the code, I was able to scale the cluster to 20 nodes.
EMR clusters can be run at Amazon Web Services’ various locations around the world. AWS has 3 in the United States, 1 in Ireland, 1 in Singapore, 1 in Tokyo and 1 in Brazil. Separate EMR clusters could be used in conjunction with geographic sharding or simply choosing the nearest location to the client.
Spot instances also show great promise in further reducing the price per hour of an EMR cluster. For the 20 node tests, I reduced total cost per hour from $2.20 to $1.30, a 41% decrease. During one of the 20 node speed runs, I saved $1,000 by using Spot instances. If you decide to use Spot instances, make sure your code can handle its instances being taken away as the market price increases.
I think this project shows there is true substance to the hype and buzz around Hadoop and EMR. Anyone who has created their own distributed system knows that achieving 84% efficiency is an impressive feat. There are a great number of use cases that can make efficient use of Hadoop and EMR. By pairing Hadoop with EMR, you can easily run a cost-efficient, enterprise-level cluster that can run around the world.
Full Disclosure: Amazon supported this project with AWS credit. I would like to thank Jeff Barr and Alan Mock from Amazon for their help in making this project possible.
Copyright © Jesse Anderson 2012. All Rights Reserved. All text, graphs and charts on this page are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Million Monkeys Visualization
At last weekend’s Hack4Reno, I created a new visualization of the Million Monkeys’ data. It allows you to choose your favorite work of Shakespeare and find out how a particular character was found. You simply place or hover the mouse over a character and the box to the right will show the number of times that character was found.
For more information on the Million Monkeys Project, go here.
To make this visualization possible, I took the ~3GB of raw monkey data and generated a JSON output. This was tricky because I had to break the works of Shakespeare down into individual works. Once I had the JSON data, I wrote some JavaScript that used jQuery to show the data and allow the interactions.
NOTE: It was a 24 hour Hackathon and there are a few bugs.
A Few Million Monkeys Randomly Recreate Every Work Of Shakespeare
All the world’s a stage,
And all the monkeys merely players;
They have their typos and their hits,
And one monkey in his time plays many parts,
His acts being 38 works of Shakespeare.
- Monkey As You Like It
Update: I created a new visualization of the monkeys’ data.
The monkeys accomplished their goal of recreating all 38 works of Shakespeare. The last work, The Taming Of The Shrew, was completed at 2 AM PST on October 6, 2011. This is the first time every work of Shakespeare has actually been randomly reproduced. Furthermore, this is the largest work ever randomly reproduced. It is one small step for a monkey, one giant leap for virtual primates everywhere. This page shows what day each work of Shakespeare was completed on.
The Million Monkeys project went viral, but not in the cool, apocalyptic way. The Million Monkeys project went viral starting on September 25, 2011 and went into full swing on September 26, 2011. On September 26, 2011, over 25,000 unique visitors viewed the Million Monkeys project, 300 sites referred traffic, and people viewed it from 119 countries. This post will contain some of my thoughts and reactions on going viral. If this article about going viral goes viral, it will create an infinite loop that will bring about the destruction of the world.
NOTE: I apologize in advance for having to use the term “go viral” so much, but that really explains the phenomenon.
I am proud to announce that I have open sourced the Million Monkeys project. The source code is available here.
This project originally started on August 21, 2011. Over the course of the project, over 7.5 trillion character groups have been randomly generated and checked, out of the 5.4 trillion (5,429,503,678,976) possible combinations.
Update: The monkeys are not RFC 2795 compliant. A Slashdot user pointed out that I forgot to talk about the similarities between this project and Richard Dawkins’ Weasel Experiment.
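The number of possible combinations matches 26 raised to the 9th power, which is consistent with 9-character groups drawn from a 26-letter alphabet. The quick Python check below assumes that alphabet size; the post itself does not state it.

```python
ALPHABET_SIZE = 26  # assumption: lowercase letters a-z only
GROUP_SIZE = 9      # characters per group

possible_groups = ALPHABET_SIZE ** GROUP_SIZE
print(f"{possible_groups:,}")  # prints 5,429,503,678,976
```

More groups were generated than there are possible combinations because random generation repeats itself, so covering every needed group takes more draws than the size of the space.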
If you would like to do a story, please contact me via the Contact page.
Thoughts on Going Viral
As I mentioned before, the Million Monkeys project went viral on September 26, 2011. This was partly due to me spending a few hours E-mailing every news outlet I could think of. Another part was people using Twitter and Facebook to promote the project. On that day alone, over 2,300 visitors came to the site through Facebook and Twitter.
The first round of the project had no recognition, even among my friends. I thought the concept was cool and I kept with it. During a conversation with a friend of mine, we came up with a new concept for the project.
I went back to the drawing board for the second round of the project with the ideas from the new concept. I started using a smaller group size, 9 character groups instead of 24 character groups. This would allow the project to complete without an infinite amount of resources. I added near real-time updates of the site so people could see the progress of the monkeys. I wanted people to be able to come back to the site to watch their favorite work being recreated. This round received some recognition and landed on the front pages of Fox News and Engadget.
I knew I was on the right track. I was getting some media attention and people were starting to see the site. My goal was to do another media blitz once the monkeys completed their first work. My goal was to get an Associated Press article and, if I was lucky enough, get on the front page of Slashdot. I thought I had a good idea, but I had no delusions of the project going viral.
On Sunday night September 25, 2011, I was reading through my RSS feeds on Google Reader. Some new Slashdot stories appeared and I dutifully started reading them. When I started reading about myself and my project, I started to think I had clicked on the wrong feed or I had erred in some fashion. I could not believe I was reading about myself on Slashdot after many years of reading it. My wife was next to me at the time and I tried to explain why I was so ecstatic to be on Slashdot. Explaining to a non-geek about Slashdot is difficult, but I think she could see it was important to me. If the media blitz had died at that point, I would have been happy. It didn’t. Over the course of the next day, the story kept on gaining momentum, getting more news stories, and more hits on the website.
All glory may be fleeting, but not everyone liked the project. I received my share of hate mail, hate comments, and hate blog posts. I was informed that I didn’t understand Infinite Monkey Theorem (I do), that I was conning people (I’m not, the source code and data are available), and that the project was boring (beauty is in the eye of the beholder). Before you decide to create a project on the Internet, you had better have a thick skin to put up with people’s comments. I responded to the people I thought were genuinely asking a question or those that seemed to be open to a discussion about the project. Most people responded and most people were nice.
Pre-Viral Checklist
You should create as many social objects as possible. I have several YouTube videos where I explain the project in various levels of detail. These YouTube videos, in turn, were posted by the various sites on their postings. The blog postings themselves were great social objects. I could see by the direct traffic that people were E-mailing the link around to their friends. My Twitter feed allowed me to converse with people who had questions about the project. It also allowed me to tweet the URLs of interviews, articles and radio shows about the project.
To gain the most media attention, make your project and/or post as media friendly as possible. Many of the sites wrote their articles only using the posts as source material. I put a lot of effort into making the site as straightforward as possible and as quotable as possible. When doing a technical project like this, not all of your readers will be technically minded people. I recommend creating sections for technical and non-technical people. The non-technical people may glaze over at a very technical explanation of your project and a technical person will want more technical detail.
The site itself needs to be technically ready for a huge increase in traffic. Many sites go down during a Slashdotting. Fortunately for me, DreamHost kept my site going without stoppage. It’s usually too late to change your site once it goes viral. Make sure you have some metrics for your site to track the usage. In my case, I use Google Analytics for WordPress. Having a decent looking site also helps. If you are not a designer, use your good taste and find a good theme for your site. I used ElegantThemes’ Minimal theme for this site. To handle a Slashdotting, your site needs to be optimized. From the beginning of this project, I tried to optimize the site. The images showing the progress through Shakespeare were indexed PNGs. They provided the smallest file size and therefore the best scalability. Much to my lament, the comments are not working on this site. One of the CAPTCHA plugins I installed messed things up and it is still not working even after I uninstalled all of them.
Make sure your site makes it as easy as possible for your users to connect socially. The previous posts did not have the Facebook likes and Tweets when they were on Engadget and Fox News. I made it more difficult than it should have been for people to tell their friends about the project. From the start of this round, I have had the “like” buttons for the major social players. The site’s traffic and the number of people “liking” show how much better the story made its rounds.
Was It A Success?
I always do a postmortem at the end of every project. This is the Million Monkeys project postmortem. I think the project was a resounding success. It achieved its primary goal of recreating every work of Shakespeare. People saw my work. While I might have received over 25,000 unique visitors to my site, millions and millions of people read about my work on mainstream news, blogs, print and radio. My personal branding (which is what this website is) went through the roof. On Google, the search term “jesse anderson” used to appear as the 45th link. Now, I have links 4-6. The top 3 spots belong to an anime character named Jesse Anderson (Andersen). The project also brought me recognition within my own company, Intuit.
This success was not the result of luck or random chance, but the result of countless hours of hard work. Even though the Million Monkeys project took 40-60 hours of my time to write, it took countless hours before that to become a better programmer and learn new technologies like Hadoop. A lot of time was also spent submitting the story and working with reporters on stories.
In a way, the Million Monkeys is the current culmination of this time spent.
Miscellaneous Thoughts
A lot of reporters asked me what I wanted to accomplish with this project. For me it is performance art with monkeys and computers. I wanted to make it engaging and have people coming back to check the monkeys’ progress, so I did near real-time updates of the site. People did just that as was reflected through the usage logs. People were coming back and they were E-mailing it around to their friends. They were tweeting it and liking it on Facebook. I consider that the most gratifying part of the project; people enjoyed it.
As time went on, I began to anthropomorphise the monkeys more and more. Instead of thinking of them as a PRNG (pseudo random number generator) and a computer program, I was talking about them as if they were really monkeys. I began to identify with them and think of them like a pet. Maybe I spent too much time curating their work.
Going back to thick skin, I have a list of people to contact to get approval of projects. If anyone wants this list before they start their project, please E-mail me so we can get their approval. It’s of utmost importance that any project contact them before starting any work.
Reading about yourself in the news is one of the craziest things that can happen to you. There is kind of a disembodied realization that it is you, but it does not feel like you did it. That first week seemed like it was a month long. I was doing a lot of interviews and every moment seemed like an eternity.
I could not get the local media in Reno to do any stories on the project. It was incredibly funny because I would E-mail them saying the project has been on BBC, CNN, etc and I never even got a reply. I will take international coverage over local coverage any day, but it was funny that local didn’t follow international. Update: I finally got some local press.
Some More Numbers
The monkeys ran 180,000,000,000 character groups a day. An average iteration lasted 30 minutes 33 seconds and ran 5,000,000,000 character groups. The monkeys found 1,982,507 distinct character groups and those character groups were found 3,788,175 times for a ratio of 1.8718555. The monkeys ran 7,445,912,000,000 total character groups out of the 5,429,503,678,976 possible combinations for a ratio of 1.3713.
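The number of possible combinations quoted above can be checked with a few lines of arithmetic. This is a small sketch in Python, not the project's own code; it assumes an alphabet of 26 letters and groups of 9 characters, as described in the technical explanation, and takes the total number of groups run directly from the paragraph above. The resulting ratio comes out at roughly 1.37.

```python
# Assumptions: 26 lowercase letters and 9-character groups, as the post
# describes. The total below is the figure quoted in the text.
ALPHABET_SIZE = 26
GROUP_LENGTH = 9

possible_groups = ALPHABET_SIZE ** GROUP_LENGTH
total_groups_run = 7_445_912_000_000

# How many random groups were generated per possible combination.
coverage_ratio = total_groups_run / possible_groups

print(f"possible groups: {possible_groups:,}")
print(f"groups run / possible groups: {coverage_ratio:.4f}")
```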
There are 2 technologies I think set the Monkeys Project apart from previous endeavors. The first is Hadoop, which scales well and can handle exponential problems like the Infinite Monkey Theorem. The second is a Bloom Filter. I ran a test last night comparing the Bloom Filter speed to a String.indexOf. The Bloom Filter took 25 seconds to run every work of Shakespeare and I stopped the String.indexOf after 2 hours. If not for the Bloom Filter, the project would be nowhere near the number of character groups it has checked and would be far from complete. I think the same would be true with Lucene or Sphinx, though the slowdown would not be as bad.
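To illustrate why the Bloom Filter helps, here is a minimal sketch in Python. It is not the implementation used in the project: the class name, the bit-array size, the number of hashes and the SHA-256 based hashing are all assumptions made for the example. The idea is that almost every random group fails the cheap membership test, so the expensive search through the full text only runs for the rare groups that pass.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest using double
        # hashing. The hash choice is an assumption of this sketch.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filter(text, group_length=9, num_bits=1 << 20, num_hashes=5):
    """Add every window of group_length characters in text to a filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    for start in range(len(text) - group_length + 1):
        bloom.add(text[start:start + group_length])
    return bloom


# Hypothetical stand-in for the Shakespeare source text.
sample_text = "tobeornottobethatisthequestion"
bloom = build_filter(sample_text)

candidate = "ornottobe"
if bloom.might_contain(candidate):
    # Only groups that pass the filter pay for the full string search.
    print(candidate, "found:", sample_text.find(candidate) != -1)
```

A Bloom filter can report a group as present when it is not, which is why the string search is still needed as a confirmation step.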
The Inspiration
This project comes from one of my favorite Simpsons episodes which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens “It was the best of times, it was the blurst of times.” The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare. And that is what I did. I created millions of monkeys on Amazon EC2 (then my home computer) and put them at virtual typewriters (aka Infinite Monkey Theorem).
Less Technical Explanation
Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.
Technical Explanation
For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux. Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys. The Map Monkeys create random data in ASCII between a and z. They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys. Once the monkey’s output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test. If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison. If that passes, a genius monkey has written 9 characters of Shakespeare. The source material is all of Shakespeare’s works as taken from Project Gutenberg.
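The flow described above can be sketched in a single process as follows. This is an illustration in Python rather than the actual Hadoop job: the function names are made up for the example, a plain set of text windows stands in for the membership test, and the group length is a parameter so that the demo can find matches quickly. Python's `random` module happens to use the Mersenne Twister as well.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def random_group(rng, group_length):
    """One burst of monkey typing: random letters between a and z."""
    return "".join(rng.choice(ALPHABET) for _ in range(group_length))


def known_groups(text, group_length):
    """Every window of the source text (stand-in for the membership test)."""
    return {
        text[i:i + group_length]
        for i in range(len(text) - group_length + 1)
    }


def run_monkeys(text, attempts, group_length=9, seed=0):
    """Generate random groups and return the ones that occur in text."""
    rng = random.Random(seed)  # CPython's random uses the Mersenne Twister
    prefilter = known_groups(text, group_length)
    found = set()
    for _ in range(attempts):
        group = random_group(rng, group_length)  # "map": type gibberish
        if group in prefilter:                   # cheap membership test
            if text.find(group) != -1:           # confirming string check
                found.add(group)
    return found


# With 1-letter groups every attempt matches, so the demo finishes at once.
demo = run_monkeys("abcdefghijklmnopqrstuvwxyz", attempts=50, group_length=1)
print(sorted(demo))
```

With the real group length of 9, a match is rare, which is why the project needed trillions of attempts.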

This chart shows the total number of character groups as more and more iterations of the checks are run.

This chart shows percent complete as more and more iterations are run for each story.
For the curious, the computer I ran the monkeys on is a Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit.
A Few Words To Try and Prevent The Usual Comments
I realize there are different interpretations to this saying/theorem and I have done 2 different ones already. I understand the definition of infinite and infinite monkey theorem and I realize that this project does not have infinite resources. This project was funded and written by myself and was not supported by any grant money or federal money. No monkeys were harmed during the making of this code. This project is my attempt to find a creative way to attain an answer without infinite resources. It is a fun side project. If you still feel angry or slighted or feel the need to set me straight, please read this sign:
Thanks to my wife Sara, daughter Ashley, David Weinberg, Ryan Polk, and Tim Dailey.
A Few Million Monkeys Randomly Recreate Shakespeare
Friends, Romans, countrymen, lend me your ears;
I come to recreate Shakespeare, not to praise him.
- Monkey Julius Caesar
Update 1: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project.
Update 2: I created a new visualization of the monkeys’ data.
Today (2011-09-23) at 2:30 PST the monkeys successfully randomly recreated A Lover’s Complaint, The Tempest (2011-09-26), As You Like It (2011-09-28), Loves Labours Lost (2011-09-29), Much Ado About Nothing (2011-09-29), The Merchant Of Venice (2011-09-29), The Sonnets (2011-09-29), The Third Part Of King Henry The Sixth (2011-09-29), The Two Gentlemen Of Verona (2011-09-29), A Midsummer Nights Dream (2011-09-30), As You Like It (2011-09-30), The Life Of King Henry The Fifth (2011-09-30), The First Part Of Henry The Sixth (2011-09-30), The Tragedy Of Titus Andronicus (2011-09-30), The Winters Tale (2011-09-30), Measure for Measure (2011-10-01), The First Part Of King Henry The Fourth (2011-10-01), The History Of Troilus And Cressida (2011-10-01), Cymbeline (2011-10-02), King Richard The Second (2011-10-02), The Comedy Of Errors (2011-10-02), The Life Of Timon Of Athens (2011-10-02), The Tragedy Of Macbeth (2011-10-02), The Tragedy Of Othello Moor Of Venice (2011-10-02), Twelfth Night Or What You Will (2011-10-02), Alls Well That Ends Well (2011-10-03), King Henry The Eighth (2011-10-03), The Second Part Of King Henry The Sixth (2011-10-03), The Tragedy Of Hamlet Prince Of Denmark (2011-10-03), The Tragedy Of Julius Caesar (2011-10-03), The Tragedy Of Romeo And Juliet (2011-10-03), King John (2011-10-04), King Richard III (2011-10-04), Second Part Of King Henry IV (2011-10-04), The Tragedy Of Antony And Cleopatra (2011-10-04), The Tragedy Of Coriolanus (2011-10-04), The Tragedy Of King Lear (2011-10-04), and The Taming Of The Shrew (2011-10-06). This is the first time a work of Shakespeare has actually been randomly reproduced. Furthermore, this is the largest work ever randomly reproduced. It is one small step for a monkey, one giant leap for virtual primates everywhere.
The monkeys will continue typing away until every work of Shakespeare is randomly created. Until then, you can continue to view the monkeys’ progress on that page. I am making the raw data available to anyone who wants it. Please use the Contact page to ask for the URL. If you have a Hadoop cluster that I could run the monkeys project on, please contact me as well.
This project originally started on August 21, 2011. Over the course of the project, over 6.5 trillion character groups have been randomly generated and checked out of the 5.5 trillion possible combinations.
So far, the project has appeared on Slashdot, Fox News, Engadget, Japanese Engadget, and Solidot. The radio interviews were with the Australian Broadcasting Corporation, Little Tommy, Jeff and Jer in San Diego and Radio New Zealand. If you would like to do a story, please contact me via the Contact page.
The Inspiration
This project comes from one of my favorite Simpsons episodes which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens “It was the best of times, it was the blurst of times.” The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare. And that is what I did. I created millions of monkeys on Amazon EC2 (then my home computer) and put them at virtual typewriters (aka Infinite Monkey Theorem).
Less Technical Explanation
Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.
Technical Explanation
For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux. Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys. The Map Monkeys create random data in ASCII between a and z. They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys. Once the monkey’s output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test. If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison. If that passes, a genius monkey has written 9 characters of Shakespeare. The source material is all of Shakespeare’s works as taken from Project Gutenberg.
The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes. The images below show green for every character group that was found and white for those that are still missing. The images output is kind of like the animations for defrag utilities. As the monkeys progress through the works, more and more character groups will be found and show green.
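The green and white marking can be thought of as a per-character coverage mask over the text. The sketch below is in Python and is only one possible way to compute it. It assumes that every character covered by at least one found group counts as found, and the function names are invented for the example. A `#` stands in for green and a `.` for white.

```python
def coverage_mask(text, found_groups):
    """Return one boolean per character: True where a found group covers it."""
    covered = [False] * len(text)
    for group in found_groups:
        start = text.find(group)
        while start != -1:
            for offset in range(len(group)):
                covered[start + offset] = True
            # Keep searching so repeated and overlapping hits are marked too.
            start = text.find(group, start + 1)
    return covered


def percent_complete(mask):
    """Percentage of characters covered by at least one found group."""
    return 100.0 * sum(mask) / len(mask) if mask else 0.0


# Hypothetical tiny text with one found group that occurs twice.
text = "tobeornottobe"
mask = coverage_mask(text, {"tobe"})
print("".join("#" if hit else "." for hit in mask))  # ####.....####
print(f"{percent_complete(mask):.1f}% found")        # 61.5% found
```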
This chart shows the total number of character groups as more and more iterations of the checks are run.
This chart shows percent complete as more and more iterations are run for each story.
For the curious, the computer I ran the monkeys on is a Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit.
A Few Words To Try and Prevent The Usual Comments
I realize there are different interpretations to this saying/theorem and I have done 2 different ones already. I understand the definition of infinite and infinite monkey theorem and I realize that this project does not have infinite resources. This project was funded and written by myself and was not supported by any grant money or federal money. No monkeys were harmed during the making of this code. This project is my attempt to find a creative way to attain an answer without infinite resources. It is a fun side project. If you still feel angry or slighted or feel the need to set me straight, please read this sign:
A Few More Million Amazonian Monkeys
Update 5: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project.
Update 6: I created a new visualization of the monkeys’ data.
Update 4: The monkeys recreated “A Lover’s Complaint”. Check out the write up.
Update 3: Welcome Slashdot, Fox News, Engadget and Japanese Engadget. So far, the monkeys have run through 7.5 trillion character groups. Previous totals: 6.5 trillion, 5 trillion (2011-09-22), 4 trillion (2011-09-16), 3 trillion (2011-09-10), 2.5 trillion (2011-09-07), 2 trillion (2011-09-05), 1.5 trillion (2011-09-01), 1 trillion (2011-08-28) and 515,912,000,000 (2011-08-25).
In a recent post, I described a recent project to recreate Shakespeare using Hadoop and Amazon EC2. This time, I am going to recreate every work of Shakespeare randomly.
This project comes from one of my favorite Simpsons episodes which has a scene where Mr. Burns brings Homer to his mansion (YouTube Video). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens ‘It was the best of times, it was the blurst of times.’ The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare. And that is what I did (am doing). I created millions of monkeys on Amazon and put them at virtual typewriters (aka Infinite Monkey Theorem).
Less Technical Explanation
Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.
Technical Explanation
For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux. Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys. The Map Monkeys create random data in ASCII between a and z. They use Sean Luke’s Mersenne Twister to make sure I have fast, random, well-behaved monkeys. Once the monkey’s output is mapped, it is passed to the reducer which runs the characters through a Bloom Filter membership test. If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison. If that passes, a genius monkey has written 9 characters of Shakespeare. The source material is all of Shakespeare’s works as taken from Project Gutenberg.
The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes. The images below show green for every character group that was found and white for those that are still missing. The images output is kind of like the animations for defrag utilities. As the monkeys progress through the works, more and more character groups will be found and show green.
The Tabular Output Of What Has Been Found
Every Work Of Shakespeare

All Works of Shakespeare
Progress Through Individual Works Of Shakespeare

A Lovers Complaint

Loves Labours Lost

The Merchant Of Venice

The Tragedy Of Julius Caesar

A Midsummer Nights Dream

Measure For Measure

The Merry Wives Of Windsor

The Tragedy Of King Lear

Much Ado About Nothing

The Tragedy Of Macbeth

Alls Well That Ends Well

The Sonnets

The Tragedy Of Othello Moor Of Venice

As You Like It

The Comedy Of Errors

The Taming Of The Shrew

The Tragedy Of Romeo And Juliet

Cymbeline

The Tempest

The Tragedy Of Titus Andronicus

King Henry The Eighth

The First Part Of King Henry The Fourth

Second Part Of King Henry IV

The First Part Of Henry The Sixth

The Second Part Of King Henry The Sixth

The Third Part Of King Henry The Sixth

The Two Gentlemen Of Verona

King John

The History Of Troilus And Cressida

The Tragedy Of Antony And Cleopatra

The Winters Tale

King Richard III

The Life Of King Henry The Fifth

The Tragedy Of Coriolanus

Twelfth Night Or What You Will

King Richard The Second

The Life Of Timon Of Athens

The Tragedy Of Hamlet Prince Of Denmark
Update: I was running this on a free micro instance (600 MB RAM) from Amazon. Alas, the monkeys needed more RAM than the free micro instance had and the processes got out of memory errors. I have moved the Hadoop server to my home computer which is much faster and has more memory.
Update 2: I updated the Hadoop configuration to have less idle CPU time. This will significantly increase the monkey power and find more character groups.
Update 4: I made a small change to how memory is allocated for the random character groups. It should help speed things up again.
Pitching Agile
Once you have decided to implement Agile Software Development methodology at your company, there is some groundwork you should do beforehand. You need to get as many people as possible to buy in or support moving to Agile. This presentation outlines how to formulate an argument for Agile depending on the person’s department or position.
Part 1
Part 2
Part 3






