<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jesse Anderson</title>
	<atom:link href="http://www.jesse-anderson.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jesse-anderson.com</link>
	<description>Online</description>
	<lastBuildDate>Sat, 05 May 2012 02:04:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>EC2 Performance, Spot Instance ROI and EMR Scalability</title>
		<link>http://www.jesse-anderson.com/2012/02/ec2-performance-spot-instance-roi-and-emr-scalability/</link>
		<comments>http://www.jesse-anderson.com/2012/02/ec2-performance-spot-instance-roi-and-emr-scalability/#comments</comments>
		<pubDate>Tue, 14 Feb 2012 16:00:02 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Magnum Opus]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[ec2 performance]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[infinite monkey theorem]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[mapreduce scalability]]></category>
		<category><![CDATA[million monkeys project]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=157</guid>
		<description><![CDATA[Note: This is a very long, technical and detailed discussion of Amazon Web Services.  You can watch the YouTube video below for a less technical explanation or skip to the conclusion to get the results. Introduction In 2006, Amazon introduced Elastic Compute Cloud (EC2) to Amazon Web Services (AWS).  In 2009, Amazon introduced Elastic MapReduce (EMR).  EMR uses [...]]]></description>
			<content:encoded><![CDATA[<p>Note: This is a very long, technical and detailed discussion of Amazon Web Services.  You can watch the YouTube video below for a less technical explanation or skip to the <a href="#conclusion">conclusion</a> to get the results.</p>
<p><iframe src="http://www.youtube.com/embed/FyvW-dpskZs" frameborder="0" width="420" height="315"></iframe></p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Introduction</h2>
<p>In 2006, <a href="http://aws.typepad.com/aws/2006/08/amazon_ec2_beta.html">Amazon introduced</a> <a href="http://aws.amazon.com/ec2/">Elastic Compute Cloud (EC2)</a> to <a href="http://aws.amazon.com/">Amazon Web Services (AWS)</a>.  In 2009, <a href="http://aws.typepad.com/aws/2009/04/announcing-amazon-elastic-mapreduce.html">Amazon introduced</a> <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce (EMR)</a>.  EMR uses <a href="http://hadoop.apache.org/">Hadoop</a> to run <a href="http://en.wikipedia.org/wiki/Mapreduce">MapReduce</a> jobs on EC2 instances, with <a href="http://aws.amazon.com/s3/">Simple Storage Service (S3)</a> as the permanent storage mechanism.  In 2011, Amazon added <a href="http://aws.typepad.com/aws/2011/08/run-amazon-elastic-mapreduce-on-ec2-spot-instances.html">Spot Instance support</a> for EMR jobs.  <a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?UsingEMR_SpotInstances.html">Spot Instances</a> allow you to bid on EMR or EC2 instances that are not in use.  The <a href="http://aws.amazon.com/ec2/pricing/">pricing page</a> lists up-to-date EMR and EC2 instance prices under the Spot Instances heading.</p>
<p>In 2011, I created the <a href="http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/">Million Monkeys Project</a> (<a href="http://code.google.com/p/million-monkeys-project">source code</a>).  It is a good metric for CPU and memory speed in a Hadoop cluster because its character group testing is both computationally and memory intensive.  This project uses the Million Monkeys code to profile the various EC2 instances and the scalability of EMR and Hadoop.  I will also talk about the cost savings when running EMR jobs as Spot Instances (bid price) instead of On Demand instances (full price).  This post will help engineers choose the right EC2 instance types based on the amount of work or computation needed.</p>
<p>When I originally ran the Million Monkeys Project to recreate every work of Shakespeare, I lacked the resources to run it entirely on EMR.  I started the project on an EC2 micro instance, but the instance lacked enough RAM to run everything I needed.  This time, I have the resources to run the entire project and recreate every work of Shakespeare on EMR using a 20 node EMR cluster.</p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Setting Up An EMR Cluster</h2>
<p>To run a Hadoop cluster, various Hadoop services like the Task Tracker and DFS service need to be running, in addition to the Map and Reduce tasks that do the actual work.  In an EMR cluster, the Hadoop services run on a master instance group and the Map and Reduce tasks run on a core instance group.  The core instance group is made up of 1 or more EC2 instances.  When creating the EMR cluster, you can choose different instance types for the master and core nodes.  You can use the information in this post to decide which instance type to use for a given task.</p>
<p>An EMR cluster is built on EC2 instances, and these instances run the various parts of the Hadoop cluster.  The data can reside in S3 and be loaded from S3 into the <a href="http://hadoop.apache.org/common/docs/current/">Hadoop Distributed File System (DFS)</a>.  The compiled code or JAR and any input files are stored in an S3 bucket.  When the master instance group terminates at the end of a job, any files that are not in S3 are lost.  Therefore, you should make sure that the code places any important output in S3.  In the Million Monkeys code, I created a prefix that could be added to a file’s path to place the file directly on S3.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Table 1.1 The Breakdown of Various EC2 Instances Specifications</h5>
<div style="text-align: left;" dir="ltr">
<table style="border-width: 1px; border-color: #80807f; border-style: solid;" border="1" cellspacing="0" cellpadding="3">
<thead>
<tr style="background-color: #a3b8c9;">
<td><strong>Instance Name</strong></td>
<td><strong>Memory</strong></td>
<td><strong>EC2 Compute Units and Cores</strong></td>
<td><strong>Platform</strong></td>
<td><strong>I/O Performance</strong></td>
</tr>
</thead>
<colgroup></colgroup>
<tbody>
<tr style="background-color: #f7fbfd;">
<td style="background-color: #f7fbfd;">Small</td>
<td style="background-color: #f7fbfd;">1.7 GB</td>
<td style="background-color: #f7fbfd;">1 EC2 on 1 Core</td>
<td style="background-color: #f7fbfd;">32-bit</td>
<td style="background-color: #f7fbfd;">Moderate</td>
</tr>
<tr style="background-color: #f7fbfd;">
<td style="background-color: #f7fbfd;">Large</td>
<td style="background-color: #f7fbfd;">7.5 GB</td>
<td style="background-color: #f7fbfd;">4 EC2 on 2 Cores</td>
<td style="background-color: #f7fbfd;">64-bit</td>
<td style="background-color: #f7fbfd;">High</td>
</tr>
<tr style="background-color: #f7fbfd;">
<td style="background-color: #f7fbfd;">Extra Large</td>
<td style="background-color: #f7fbfd;">15 GB</td>
<td style="background-color: #f7fbfd;">8 EC2 on 4 Cores</td>
<td style="background-color: #f7fbfd;">64-bit</td>
<td style="background-color: #f7fbfd;">High</td>
</tr>
<tr style="background-color: #f7fbfd;">
<td style="background-color: #f7fbfd;">High-CPU Medium</td>
<td style="background-color: #f7fbfd;">1.7 GB</td>
<td style="background-color: #f7fbfd;">5 EC2 on 2 Cores</td>
<td style="background-color: #f7fbfd;">32-bit</td>
<td style="background-color: #f7fbfd;">Moderate</td>
</tr>
<tr style="background-color: #f7fbfd;">
<td style="background-color: #f7fbfd;">High-CPU Large</td>
<td style="background-color: #f7fbfd;">7 GB</td>
<td style="background-color: #f7fbfd;">20 EC2 on 8 Cores</td>
<td style="background-color: #f7fbfd;">64-bit</td>
<td style="background-color: #f7fbfd;">High</td>
</tr>
<tr style="background-color: #f7fbfd;">
<td style="background-color: #f7fbfd;">Quadruple Extra Large</td>
<td style="background-color: #f7fbfd;">23 GB</td>
<td style="background-color: #f7fbfd;">33.5 EC2 on 8 Cores</td>
<td style="background-color: #f7fbfd;">64-bit</td>
<td style="background-color: #f7fbfd;">Very High</td>
</tr>
</tbody>
</table>
</div>
<p style="text-align: center;"><a href="http://aws.amazon.com/ec2/instance-types/">Source</a>  Note: High-Memory Extra Large, High-Memory Double Extra Large and High-Memory Quadruple Extra Large instances were not tested and are not included in this table.</p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Instance Testing</h2>
<p>EC2 has <a href="http://aws.amazon.com/ec2/instance-types/">various instances</a> and performance specifications for those instances.  These EC2 instances are analogous to running a virtual machine in the cloud.  As shown in Table 1.1, each instance type varies in the number of EC2 Compute Units (ECU), the number of virtual cores, the amount of RAM, 32- or 64-bit platform, the amount of disk space, and network or I/O performance.  Some of these descriptions are quite nebulous.  For example, this is Amazon’s definition of an ECU (<a href="http://aws.amazon.com/ec2/">Source</a>):</p>
<pre>EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent
CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.</pre>
<p>That description of CPU capacity does not help much in making capacity decisions or even in guessing how to scale an application.  To this end, I ran various tests to give a concrete idea of how the instances compare with one another when running the same workload.</p>
<p>For these tests, I ran the Million Monkeys program for 5 continuous hours.  During this time, the Million Monkeys code runs in a loop and the total number of character groups is counted.  A character group is a group of 9 characters that is randomly generated using a <a href="http://www.cs.gmu.edu/~sean/research/">Mersenne Twister</a> and checked against every work of Shakespeare.  Each run lasted slightly over 5 hours, so the number of character groups is pro-rated to a 5 hour period.</p>
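<p>The pro-rating step can be sketched as follows.  This is a minimal illustration and not code from the Million Monkeys project; the function name <code>proRateGroups</code> and the example run (51,000,000,000 groups in 5.1 hours) are hypothetical and are not figures from the tests.</p>
<pre>
```javascript
// Pro-rate a character-group count from the actual run length to a target window.
// All names and inputs here are hypothetical; they are not taken from the project.
function proRateGroups(groupsChecked, actualHours, targetHours) {
  if (!(actualHours > 0)) {
    throw new RangeError("actualHours must be positive");
  }
  // Scale the count by the ratio of the target window to the actual run time.
  return Math.round(groupsChecked * (targetHours / actualHours));
}

// A made-up run that checked 51,000,000,000 groups in 5.1 hours.
console.log(proRateGroups(51000000000, 5.1, 5)); // prints 50000000000
```
</pre>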
<h5 style="margin-bottom: 5px; text-align: center;">Chart 1.1 Total Character Groups Checked In a 5 Hour Pro-rated Period</h5>
<p style="text-align: center;"><a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_totalgroups.png"><img class="aligncenter size-medium wp-image-161 colorbox-157" style="padding: 7px;" title="baseline_totalgroups" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_totalgroups-300x150.png" alt="" width="300" height="150" /></a></p>
<p>Chart 1.1 does not present any surprises.  The Small instance obviously has the fewest character groups, followed by Hi-CPU Medium.  Large comes in third and Extra Large and Hi-CPU Large are a virtual tie, with Extra Large coming out slightly higher.  Quadruple Extra Large is the obvious winner with the highest total character groups.  Chart 1.1 gives an idea of the raw computing power of each instance.  It is not until we start looking at price per unit that we get a handle on cost efficiency of a particular instance.</p>
<p>In the original Million Monkeys project, I ran the entire Hadoop cluster on my home computer, an Intel Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit.  In 5 hours, my home computer checked 50,000,000,000 character groups.  One of the main differences between my home computer and the EC2 instances is that my home computer was not running in a virtualized environment.  I have seen a 10-30% decrease in efficiency when using virtualization.  Also, all processing was done locally with the Hadoop services and MapReduce tasks running on the same computer.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Chart 1.2 Spot Instance Savings Per Hour When Compared to On Demand</h5>
<p style="text-align: left;"><a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_costpergrouppercent.png"><img class="aligncenter size-medium wp-image-158 colorbox-157" style="padding: 7px;" title="baseline_costpergrouppercent" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_costpergrouppercent-300x100.png" alt="" width="300" height="100" /></a>Spot Instances help reduce the cost of running an EMR cluster.  Spot Instance prices fluctuate as the market price changes.  Chart 1.2 shows the Spot Instance (bid) prices relative to the On Demand (full) prices at the time I ran the tests.  The savings were very even across all instance types, at about 65% off the On Demand prices.  With a little bit of forward planning, an EMR cluster can save a lot of money by using Spot Instances.</p>
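<p>The savings percentage is simply one minus the ratio of the spot price to the On Demand price.  The short sketch below uses the Hi-CPU Medium prices quoted in this post ($0.06 spot, $0.17 On Demand); the function name <code>spotSavingsPercent</code> is my own and only covers the instance price, not the EMR fee.</p>
<pre>
```javascript
// Percentage saved by paying the spot price instead of the On Demand price.
// This compares instance prices only; the per-instance EMR fee is not included.
function spotSavingsPercent(spotPrice, onDemandPrice) {
  return (1 - spotPrice / onDemandPrice) * 100;
}

// Hi-CPU Medium prices quoted in this post.
console.log(spotSavingsPercent(0.06, 0.17).toFixed(1) + "%"); // prints 64.7%
```
</pre>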
<p>I should point out that running on a Spot Instance does not require a code change per se.  However, an EMR job flow’s Spot Instances can be taken away because of market price fluctuations.  A MapReduce job flow may need to be changed to accommodate an unplanned stoppage.  This might include saving the job state and adding the ability to start back up where it left off at the last save point.  The Million Monkeys code already did this and could take advantage of the Spot Instances without any code changes.</p>
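<p>The save-and-resume idea can be illustrated with a small sketch.  This is not the Million Monkeys code, and it is not Hadoop or EMR API code: the <code>checkpointStore</code> map is a hypothetical in-memory stand-in for durable storage such as S3, and the <code>interruptAt</code> parameter only simulates a Spot Instance being taken away.</p>
<pre>
```javascript
// Hypothetical in-memory stand-in for durable storage such as S3.
const checkpointStore = new Map();

function saveCheckpoint(jobId, state) {
  // Serialize so later changes to `state` cannot alter the saved copy.
  checkpointStore.set(jobId, JSON.stringify(state));
}

function loadCheckpoint(jobId) {
  const saved = checkpointStore.get(jobId);
  // With no checkpoint, start from the beginning.
  return saved === undefined ? { nextBatch: 0, groupsChecked: 0 } : JSON.parse(saved);
}

// Runs batches starting at the last save point.
// `interruptAt` simulates losing the Spot Instance before that batch runs.
function runJob(jobId, totalBatches, groupsPerBatch, interruptAt) {
  const state = loadCheckpoint(jobId);
  while (state.nextBatch !== totalBatches) {
    if (state.nextBatch === interruptAt) {
      return { finished: false, state: state };
    }
    state.groupsChecked += groupsPerBatch; // placeholder for the real work
    state.nextBatch += 1;
    saveCheckpoint(jobId, state); // save after every completed batch
  }
  return { finished: true, state: state };
}

const first = runJob("demo", 5, 1000, 3);   // stops before batch 3
const second = runJob("demo", 5, 1000, -1); // resumes at batch 3 and finishes
console.log(first.finished, second.finished, second.state.groupsChecked); // prints false true 5000
```
</pre>
<p>Saving after every batch keeps the sketch simple; a real job would weigh how often to save against the cost of writing the state out.</p>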
<h5 style="margin-bottom: 5px; text-align: center;">Chart 1.3 Cost Per Hour For On Demand and Spot Instances<br />
<a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_priceperhourabsolute.png"><img class="aligncenter size-medium wp-image-160 colorbox-157" style="padding: 7px;" title="baseline_priceperhourabsolute" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_priceperhourabsolute-300x150.png" alt="" width="300" height="150" /></a></h5>
<p>Chart 1.3 shows another cost breakdown by hour of usage and total costs.  The total cost for a single node cluster with EMR can be calculated using the interactive Table 1.2.  For an On Demand cluster, the total cost per hour is the master instance group plus the core instance group, plus the EMR fee for every instance.  For a Spot cluster, the total cost per hour is the On Demand master instance plus the core instance group at the spot price, plus the EMR fee for every instance.</p>
<p>For example, when I ran the Hi-CPU Medium instance testing, I paid a spot price of $0.06 per hour for the core instance group ($0.17 On Demand).  I also had to pay for the EMR cluster’s master node ($0.17 per hour), which was a Hi-CPU Medium instance as well.  On top of that, I had to pay the EMR price per hour ($0.03) for each of the master and core nodes.  That comes to $0.29 per hour, compared with $0.40 per hour if the core node had been On Demand.</p>
<p>To help illustrate the total pricing, Table 1.2 details the breakdown of total price per hour for Spot and On Demand instances.</p>
<h5 style="margin-bottom: 5px; text-align: center;" dir="ltr">Table 1.2 Spot Instance and On Demand Price Calculation</h5>
<div style="text-align: center;" dir="ltr">
<table style="border-width: 1px; border-color: #80807f; border-style: solid;" border="0" cellspacing="0" cellpadding="3">
<thead>
<tr style="background-color: #a3b8c9;">
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Price Description </strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Spot Instance Price </strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>On Demand Price </strong></td>
</tr>
</thead>
<colgroup>
<col width="*" />
<col width="*" />
<col width="*" /></colgroup>
<tbody>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Master Node</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotSinglePrice1">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandSinglePrice1">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Master Node EMR</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotSinglePrice2">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandSinglePrice2">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Core Node</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotSinglePrice3">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandSinglePrice3">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Core Node EMR</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotSinglePrice4">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandSinglePrice4">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Total Price Per Hour</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotSinglePrice5">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandSinglePrice5">0</span></td>
</tr>
</tbody>
</table>
</div>
<div style="text-align: center;" dir="ltr"></div>
<table class="aligncenter" style="margin-top: 20px;" border="0" cellspacing="0" cellpadding="3">
<tbody>
<tr>
<td style="padding-right: 20px; background-color: #a3b8c9; text-align: left; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Spot Instance Price Per Hour</strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>On Demand Instance Price Per Hour</strong></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">
<input id="spotSingle1" type="text" value="0.06" /></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">
<input id="onDemandSingle1" type="text" value="0.17" /></td>
</tr>
</tbody>
</table>
<p style="text-align: left;">
<input id="singleCalculate" type="button" value="Recalculate" /> (Try out your own prices)</p>
<p><script type="text/javascript">// <![CDATA[
       var $j = jQuery.noConflict();  $j(function(){   $j("#singleCalculate").click(function(){     calculatePrice();   });   $j().ready(function(){     calculatePrice();   }); }); function calculatePrice() {   var emrPrice = 0.03;  var onDemandPrice = parseFloat($j('#onDemandSingle1').val());  var spotPrice = parseFloat($j('#spotSingle1').val());  $j('#onDemandSinglePrice1').html(formatCurrency(onDemandPrice));   $j('#spotSinglePrice1').html(formatCurrency(onDemandPrice));   $j('#onDemandSinglePrice2').html(formatCurrency(emrPrice));    $j('#spotSinglePrice2').html(formatCurrency(emrPrice));   $j('#onDemandSinglePrice3').html(formatCurrency(onDemandPrice));   $j('#spotSinglePrice3').html(formatCurrency(spotPrice));    $j('#onDemandSinglePrice4').html(formatCurrency(emrPrice));   $j('#spotSinglePrice4').html(formatCurrency(emrPrice));   $j('#onDemandSinglePrice5').html(formatCurrency(onDemandPrice + emrPrice + onDemandPrice + emrPrice));   $j('#spotSinglePrice5').html(formatCurrency(onDemandPrice + emrPrice + spotPrice + emrPrice)); } function formatCurrency(num) { num = isNaN(num) || num === '' || num === null ? 0.00 : num;  return "$" +   parseFloat(num).toFixed(2); }
// ]]&gt;</script></p>
<p>It is possible to run the master node as a Spot instance instead of an On Demand instance.  Amazon recommends running the master node as an On Demand instance to prevent market price from taking out your master node and stopping the entire cluster.</p>
<p>For these tests, I varied the master node instance type.  Table 1.3 lists the instance type used for the core instance group in each test alongside the instance type used for the master node.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Table 1.3 Core Instance Group Type Used With Master Group Type Test</h5>
<div dir="ltr">
<table class="aligncenter" style="margin-bottom: 20px;" border="0" cellspacing="0" cellpadding="3">
<colgroup>
<col width="*" />
<col width="*" /></colgroup>
<tbody>
<tr>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Core Instance Group Type </strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Master Instance Group Type </strong></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Small</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Small</td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Large</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Large</td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Extra Large</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Extra Large</td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Hi-CPU Medium</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Hi-CPU Medium</td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Hi-CPU Large</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Hi-CPU Medium</td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Quadruple Extra Large</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Hi-CPU Medium</td>
</tr>
</tbody>
</table>
<h5 style="margin-bottom: 5px; text-align: center;">Chart 1.4 Cost Per 100,000,000 Character Groups Checked</h5>
</div>
<h5><a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_pricepergroup.png"><img class="aligncenter size-medium wp-image-159 colorbox-157" style="padding: 7px;" title="baseline_pricepergroup" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/baseline_pricepergroup-300x150.png" alt="" width="300" height="150" /></a></h5>
<p>Breaking down the data into price per unit gives insight into the most cost efficient means of running a job.  In Chart 1.4, I break down the cost by how much it costs to process 100,000,000 character groups.  For Chart 1.4, the lower the number the better.  This bore out my hunch that the best bang for the monkey buck is a Hi-CPU Medium instance.  I was surprised that the Small instance didn’t come in second best; that position was taken by the Large instance.</p>
<p>Once again, we can see the cost benefits of using a Spot instance.  Across the board, the Spot instances have a much smaller variance than their On Demand counterparts.  The Spot instances ranged from $0.00128 to $0.00497 per 100,000,000 character groups and the On Demand instances ranged from $0.00364 to $0.0142.</p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Scalability Testing</h2>
<p>The instance testing above led up to the next phase of the project.  In Chart 1.4, we found that the Hi-CPU Medium instances provided the highest cost efficiency per character group.  Now, I will take the most cost efficient instance and see how well it scales by adding more nodes to the cluster.  For these tests, I created EMR clusters of 1, 2, 3, 4, 5, 10 and 20 nodes.  Once again, I ran each cluster size for 5 hours and captured the results.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Chart 2.1 Spot Instance Savings Compared to On Demand Prices<br />
<a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_costpergrouppercent.png"><img class="aligncenter size-medium wp-image-163 colorbox-157" style="padding: 7px;" title="scalability_costpergrouppercent" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_costpergrouppercent-300x100.png" alt="" width="300" height="100" /></a></h5>
<p>In Chart 2.1, I show the cost savings by comparing Spot and On Demand prices across cluster sizes.  The bars with the “All” designations show the entire cost roll up of the cluster size.  The core cost is consistent across all cluster sizes; however, having more nodes running at once increased the savings.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Chart 2.2 Cost Per Hour When Running Various Numbers of Nodes<br />
<a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_priceperhourabsolute.png"><img class="aligncenter size-medium wp-image-166 colorbox-157" style="padding: 7px;" title="scalability_priceperhourabsolute" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_priceperhourabsolute-300x150.png" alt="" width="300" height="150" /></a></h5>
<p>Chart 2.2 shows another cost breakdown by hour of usage and total costs for Spot and On Demand instances.</p>
<p>To help illustrate the total pricing with a multi instance core group, Table 2.1 details the breakdown of total price per hour for Spot and On Demand instances for a 10 node cluster.</p>
<h5 style="margin-bottom: 5px; text-align: center;" dir="ltr">Table 2.1 Spot and On Demand Instance Price Calculation</h5>
<div style="text-align: left;" dir="ltr">
<table border="0" cellspacing="0" cellpadding="3">
<colgroup>
<col width="*" />
<col width="*" />
<col width="*" /></colgroup>
<tbody>
<tr>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Price Description </strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Spot Instance Price </strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>On Demand Price </strong></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Master Node</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotClusterPrice1">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandClusterPrice1">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Master Node EMR</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotClusterPrice2">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandClusterPrice2">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Core Node</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotClusterPrice3">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandClusterPrice3">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Core Node EMR</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotClusterPrice4">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandClusterPrice4">0</span></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">Total Price Per Hour</td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="spotClusterPrice5">0</span></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;"><span id="onDemandClusterPrice5">0</span></td>
</tr>
</tbody>
</table>
</div>
<div dir="ltr"></div>
<table style="margin-top: 20px;" border="0" cellspacing="0" cellpadding="3">
<tbody>
<tr>
<td style="background-color: #a3b8c9; text-align: center; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Spot Instance Price Per Hour</strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>On Demand Instance Price Per Hour</strong></td>
<td style="background-color: #a3b8c9; border-width: 1px; border-color: #80807f; border-style: solid;"><strong>Number Of Nodes</strong></td>
</tr>
<tr>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">
<input id="spotCluster1" type="text" value="0.08" /></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">
<input id="onDemandCluster1" type="text" value="0.17" /></td>
<td style="background-color: #f7fbfd; border-width: 1px; border-color: #80807f; border-style: solid;">
<input id="numberOfNodesCluster1" type="text" value="10" /></td>
</tr>
</tbody>
</table>
<p style="text-align: left;">
<input id="clusterCalculate" type="button" value="Recalculate" /> (Try out your own prices)</p>
<p><script type="text/javascript">// <![CDATA[
       var $j = jQuery.noConflict();  $j(function(){   $j("#clusterCalculate").click(function(){     calculateClusterPrice();   });   $j().ready(function(){     calculateClusterPrice();   }); }); function calculateClusterPrice() {   var emrPrice = 0.03;  var onDemandPrice = parseFloat($j('#onDemandCluster1').val());  var spotPrice = parseFloat($j('#spotCluster1').val());  var nodes = parseInt($j('#numberOfNodesCluster1').val(), 10);  $j('#onDemandClusterPrice1').html(formatCurrency(onDemandPrice));   $j('#spotClusterPrice1').html(formatCurrency(onDemandPrice));   $j('#onDemandClusterPrice2').html(formatCurrency(emrPrice));    $j('#spotClusterPrice2').html(formatCurrency(emrPrice));   $j('#onDemandClusterPrice3').html(formatCurrency(onDemandPrice * nodes) + " (" + nodes + " nodes * " + formatCurrency(onDemandPrice) + " On Demand Price)");   $j('#spotClusterPrice3').html(formatCurrency(spotPrice * nodes) + " (" + nodes + " nodes * " + formatCurrency(spotPrice) + " Spot Price)");    $j('#onDemandClusterPrice4').html(formatCurrency(emrPrice * nodes) + " (" + nodes + " nodes * " + formatCurrency(emrPrice) + ")");   $j('#spotClusterPrice4').html(formatCurrency(emrPrice * nodes) + " (" + nodes + " nodes * " + formatCurrency(emrPrice) + ")");   $j('#onDemandClusterPrice5').html(formatCurrency(onDemandPrice + emrPrice + (onDemandPrice * nodes) + (emrPrice * nodes)));   $j('#spotClusterPrice5').html(formatCurrency(onDemandPrice + emrPrice + (spotPrice * nodes) + (emrPrice * nodes))); } function formatCurrency(num) { num = isNaN(num) || num === '' || num === null ? 0.00 : num;  return "$" +   parseFloat(num).toFixed(2); }
// ]]&gt;</script><br />
As you can see, you can calculate the current cost of a cluster and even project its future cost.  A new company, <a href="http://www.cloudability.com/">Cloudability</a>, has made it its mission not only to simplify cluster, EMR and EC2 price reporting but also to look for ways to improve those costs (now in beta).  Cloudability can even send you a daily or weekly E-mail showing the charges for that period.  You can check out their website and sign up for a free account.  Although I was unable to use Cloudability for this project, I look forward to using it in my next projects.</p>
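<p>The arithmetic behind the calculator can also be written as a small standalone function.  This is only a sketch: the name clusterHourlyCost and its parameters are made up for illustration, the $0.17 On Demand price, $0.03 EMR fee and 10 nodes are the calculator defaults, the $0.08 Spot price is just an example value, and the master instance is assumed to run On Demand.</p>

```javascript
// Hourly cost of an EMR cluster: one On Demand master plus `nodes` core instances.
// All names are illustrative; prices are example inputs in dollars per hour.
function clusterHourlyCost(corePrice, nodes, onDemandPrice, emrPrice) {
  const master = onDemandPrice + emrPrice;      // master instance plus its EMR fee
  const core = nodes * (corePrice + emrPrice);  // each core node plus its EMR fee
  return master + core;
}

const onDemand = clusterHourlyCost(0.17, 10, 0.17, 0.03);
const spot = clusterHourlyCost(0.08, 10, 0.17, 0.03); // 0.08 is an example Spot price
const savings = (onDemand - spot) / onDemand;

console.log(onDemand.toFixed(2));                    // 2.20
console.log(spot.toFixed(2));                        // 1.30
console.log((savings * 100).toFixed(0) + "% saved"); // 41% saved
```

<p>Changing the node count or either price in the calls shows how quickly the per-node difference adds up.</p>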
<h5 style="margin-bottom: 5px; text-align: center;">Chart 2.3 Cost To Run 100,000,000 Character Groups At Various Numbers of Nodes<br />
<a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_pricepergroup.png"><img class="aligncenter size-medium wp-image-165 colorbox-157" style="padding: 7px;" title="scalability_pricepergroup" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_pricepergroup-300x150.png" alt="" width="300" height="150" /></a></h5>
<p>Chart 2.3 breaks the cost down into the price of processing 100,000,000 character groups, so the lower the number the better.  Once again, the Spot instance pricing shines.  In this case, the Spot instance price stays quite flat, while the On Demand price varies much more.</p>
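<p>The metric in Chart 2.3 is a simple ratio, sketched below.  The function name is hypothetical and the sample run (a $6.50 bill for 240,000,000,000 character groups) is invented purely to show the calculation; it is not one of the measured results.</p>

```javascript
// Price of processing 100,000,000 character groups, given what a run cost
// and how many groups it processed. The function name is illustrative.
function costPer100MGroups(totalCost, totalGroups) {
  return totalCost / (totalGroups / 100000000);
}

// Hypothetical run: $6.50 total for 240,000,000,000 character groups.
console.log(costPer100MGroups(6.5, 240000000000).toFixed(4)); // 0.0027
```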
<h5 style="margin-bottom: 5px; text-align: center;">Chart 2.4 Total Character Groups At Various Numbers of Nodes Pro-rated to 5 Hours</h5>
<p style="text-align: center;"><a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_totalgroups.png"><img class="aligncenter size-medium wp-image-167 colorbox-157" style="padding: 7px;" title="scalability_totalgroups" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_totalgroups-300x150.png" alt="" width="300" height="150" /></a><br />
Chart 2.4 shows the power of creating a multi-node cluster.  With 20 nodes in the cluster, 477,987,913,067 character groups can be run in a 5 hour period.</p>
<p>I want to reiterate that there are no code changes necessary for creating a large cluster like this.  I only needed to make EMR configuration changes when creating the cluster.  Also, cluster configuration changes can be made to a live or running cluster.  You can add or remove core instances to increase or decrease the performance of a cluster.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Chart 2.5 Percent Of Linear Scalability From Actual Growth At Various Numbers of Nodes<br />
<a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_percent.png"><img class="aligncenter size-medium wp-image-164 colorbox-157" style="padding: 7px;" title="scalability_percent" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_percent-300x150.png" alt="" width="300" height="150" /></a></h5>
<p>Now let’s get into the scalability of EMR and Hadoop.  A 1 node cluster is assumed to be the most efficient possible EMR cluster, so Chart 2.5 shows it at 100% efficiency.  For every other cluster size, the ideal (linear) output is the number of character groups for 1 node times the number of nodes in the cluster, and the efficiency is the actual output as a percentage of that ideal.  Clusters of 2-5 nodes have a very similar loss of efficiency, at about 5%.  The 10 and 20 node clusters lose 13% and 16% respectively.</p>
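<p>As a rough sketch of that efficiency calculation: the function name and the sample measurements below are hypothetical, chosen only so the result lands on a round figure, and are not read off the charts.</p>

```javascript
// Percent of linear scalability: actual output divided by the ideal output,
// where the ideal is the 1 node result multiplied by the number of nodes.
function scalingEfficiency(singleNodeGroups, nodes, actualGroups) {
  const ideal = singleNodeGroups * nodes;
  return actualGroups / ideal;
}

// Hypothetical measurements, not taken from the charts.
const oneNode = 1000000000;       // groups processed by a 1 node cluster
const twentyNodes = 16800000000;  // groups processed by a 20 node cluster
console.log((scalingEfficiency(oneNode, 20, twentyNodes) * 100).toFixed(0) + "%"); // 84%
```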
<p>Anyone who has created a distributed system will recognize 84% as a phenomenal level of scalability. This really shows that EMR and Hadoop are living up to the hype as revolutionary technologies.  With no code changes and simple configuration changes, you can easily scale an application.</p>
<h5 style="margin-bottom: 5px; text-align: center;">Chart 2.6 Actual Scalability With Projected Linear Growth Pro-rated to 5 Hours<br />
<a href="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_absolute.png"><img class="aligncenter size-medium wp-image-162 colorbox-157" style="padding: 7px;" title="scalability_absolute" src="http://www.jesse-anderson.com/wp-content/uploads/2012/02/scalability_absolute-300x150.png" alt="" width="300" height="150" /></a></h5>
<p>Chart 2.6 presents another breakdown of the scalability, showing the actual values next to the values calculated at 100% efficiency.  Once again, we see a very gradual decline for clusters of 2-5 nodes.  There is a much more obvious decline at 10 and 20 nodes.</p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Million Monkeys On EMR</h2>
<p>In my original run of the Million Monkeys Project, I tried to run the project on an EC2 Micro instance.  The project needed more RAM than was available on the Micro instance and I had to move it to my home computer.  Many reporters and commenters asked me how long the project would take if I ran it to completion on EMR.  This time, thanks to Amazon, I have the resources to run the project on a multi-node EMR cluster.</p>
<p>The instance testing and scalability testing really led up to this test.  In the instance testing, I wanted to find the EC2 instance type with the best bang for the buck.  Next, I took that best EC2 instance (Hi-CPU Medium) and wanted to see how much efficiency I was losing when running a 20 node cluster.  From there, I created a 20 node Hi-CPU Medium cluster that ran the Million Monkeys code for a prolonged period of time.  I wanted to see how long it would take a 20 node cluster to recreate the original project.</p>
<p>For a little perspective, the <a href="http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/">original Million Monkeys project</a> recreated every work of Shakespeare after running 7.5 trillion character groups over 46 days.  For these prolonged tests, I actually ran the 20 node cluster twice.  The first run processed 12 trillion character groups in 5 days 17 hours.  The second processed 25.7 trillion character groups in 11 days 15 hours.  Each one ran about 2.2 trillion character groups per day.  Given the random nature of the problem, we can only extrapolate how long the original project would have taken.  With these performance numbers, it would have taken about 3 days 9 hours to complete the original project.</p>
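<p>The extrapolation can be reproduced with a few lines of arithmetic.  This is a sketch using the rounded figures quoted in this post (7.5 trillion character groups, about 2.2 trillion per day); the function name is made up, and the result is rounded down to whole hours.</p>

```javascript
// Estimate how long a workload would take at a measured daily rate,
// rounded down to whole hours.
function daysAndHours(totalGroups, groupsPerDay) {
  const totalHours = Math.floor((totalGroups / groupsPerDay) * 24);
  return { days: Math.floor(totalHours / 24), hours: totalHours % 24 };
}

// 7.5 trillion groups at roughly 2.2 trillion groups per day.
const estimate = daysAndHours(7.5e12, 2.2e12);
console.log(estimate.days + " days " + estimate.hours + " hours"); // 3 days 9 hours
```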
<p>The cluster cost about $45.44 per day to run.  I ran the cluster with the configuration shown in the 20 node scalability testing above, with the master instance group as one Hi-CPU Medium instance running On Demand.  The other 20 nodes are Hi-CPU Medium instances running with a Spot price of $0.09 per hour.  The 5 day run cost $317.96.  The 11 day run total cost was $528.25.  If I hadn’t used Spot instances, the 11 day total cost would have been $1,514.87.  Once again, Spot pricing really shines because I achieved the same goal with almost $1,000 in savings.</p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Thoughts and Caveats</h2>
<p>Previously, I mentioned that the Million Monkeys code is a good metric of CPU and memory.  It performs less I/O than many other MapReduce tasks.  I spent some time and effort reducing the amount of I/O in the code by using a Bloom Filter in the Map task.  The <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom Filter</a> is created once and saved in S3.  All future Map tasks simply read the Bloom Filter file and run all processing against it.  Once the Reduce task runs, a 3.5 MB text file is loaded into memory for the final existence checks.  In other workloads, a Map task may need to read in gigabytes or even terabytes of data for processing.  Another key difference between <a href="http://aws.amazon.com/ec2/instance-types/">EC2 instances</a> is their I/O performance.  For MapReduce tasks that require high I/O performance, a High-CPU Medium instance with moderate I/O performance may not have the best cost to performance ratio.</p>
<p>Earlier in the project, I used the AWS web user interface to create EMR jobs.  It was a bit of a pain to set up the various keys for the command line interface (CLI).  Once I set up the CLI, it made the testing much easier and I wish I had used the CLI sooner.  It was much easier to repeat a job.  The EMR API can also be used to spawn your cluster programmatically.  Here is the command line that I used to spawn the job:</p>
<pre>./elastic-mapreduce --create --name "Monkeys Scalability 5 Hour Test 20 Node" \
--instance-group master --instance-type c1.medium --instance-count 1 \
--instance-group core --instance-type c1.medium --instance-count 20 --bid-price 0.08 \
--jar s3://monkeys2/monkeys.jar --arg timelimit=5h --arg iterationsize=1 \
--arg memory=-Xmx1024m --arg -Dmapred.max.split.size=12000 \
--arg -Dmapred.min.split.size=10000</pre>
<p>I would like to break down what this command is doing.  It is creating a new EMR cluster with the job name &#8220;Monkeys Scalability 5 Hour Test 20 Node.&#8221;  The master instance group will be made up of one High-CPU Medium instance.  The core instance group will be made up of 20 High-CPU Medium nodes with a bid price of $0.08 per hour per instance.  It will be running a custom jar located in S3 at s3://monkeys2/monkeys.jar.  The rest of the arguments are for the Million Monkeys code itself.</p>
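<p>Because repeating a job mostly means changing one or two values, the argument list can be assembled from a small configuration object.  The sketch below is hypothetical: buildCreateArgs and the shape of the config object are invented for illustration, and it only emits flags that already appear in the command above.  It builds the arguments and does not launch anything.</p>

```javascript
// Assemble elastic-mapreduce arguments from a config object.
// `buildCreateArgs` and the config shape are hypothetical helpers; only flags
// that appear in the command above are used.
function buildCreateArgs(config) {
  const args = ["--create", "--name", config.name];
  for (const group of config.groups) {
    args.push("--instance-group", group.role,
              "--instance-type", group.type,
              "--instance-count", String(group.count));
    if (group.bidPrice !== undefined) {
      args.push("--bid-price", String(group.bidPrice)); // Spot bid, dollars per hour
    }
  }
  args.push("--jar", config.jar);
  for (const jarArg of config.jarArgs) {
    args.push("--arg", jarArg);
  }
  return args;
}

const args = buildCreateArgs({
  name: "Monkeys Scalability 5 Hour Test 20 Node",
  groups: [
    { role: "master", type: "c1.medium", count: 1 },
    { role: "core", type: "c1.medium", count: 20, bidPrice: 0.08 }
  ],
  jar: "s3://monkeys2/monkeys.jar",
  jarArgs: ["timelimit=5h", "iterationsize=1", "memory=-Xmx1024m"]
});
console.log(args.join(" "));
```

<p>Changing count or bidPrice in the config is then enough to describe the next test run.</p>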
<p>Performance could likely be improved with some tuning.  For the duration of this project, I did not spend time optimizing the jobs and used only default settings, except for the maximum Java heap size (-Xmx).</p>
<p>Although I kept my bid prices in nice round cents, you can bid in fractions of a cent like $0.085.</p>
<p>For the curious, I used <a href="http://www.jfree.org/jfreechart/">JFreeChart</a> for the charts and graphs on this page with a customized color scheme.</p>
<h2 style="margin-top: 10px; margin-bottom: 5px;" dir="ltr">Problems</h2>
<p>Hadoop and EMR jobs are usually geared towards very large input files.  In the case of the Million Monkeys project, the input files are very small, usually a few KB.  This presented an issue when I started running the EMR cluster with multiple nodes.  When I compared the multi-node results to the single node results, there was barely any improvement in total character groups.  In some cases, a multi-node cluster did worse than a single node cluster.  After a LOT of Googling and guessing, I finally found a workaround for small input files: specifying the max and min split sizes for the input file.  Here is the command, with the two split size arguments at the end, to save a future person lots of Googling:</p>
<pre>./elastic-mapreduce --create --name "Monkeys Scalability 5 Hour Test 20 Node" \
--instance-group master --instance-type c1.medium --instance-count 1 \
--instance-group core --instance-type c1.medium --instance-count 20 --bid-price 0.08 \
--jar s3://monkeys2/monkeys.jar --arg timelimit=5h --arg iterationsize=1 \
--arg memory=-Xmx1024m --arg -Dmapred.max.split.size=12000 \
--arg -Dmapred.min.split.size=10000</pre>
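<p>To see why the split sizes matter for tiny files, here is a rough model of how a file-based input format sizes its splits.  To my understanding, Hadoop clamps the block size between the minimum and maximum split sizes; the sketch below follows that rule, but it is an approximation rather than Hadoop code, and the 60,000 byte file, the 64 MB block size and the function names are assumptions made for illustration.</p>

```javascript
// Rough model of split sizing (an approximation: real Hadoop also lets the
// final split run slightly over the computed size).
function splitSize(blockSize, minSize, maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

function approxSplitCount(fileSize, blockSize, minSize, maxSize) {
  return Math.ceil(fileSize / splitSize(blockSize, minSize, maxSize));
}

const MB = 1024 * 1024;
// Hypothetical 60,000 byte input file with a 64 MB block size.
// With effectively no limits, the whole file is a single split (one map task).
console.log(approxSplitCount(60000, 64 * MB, 1, Number.MAX_SAFE_INTEGER)); // 1
// With the min and max split sizes from the command above.
console.log(approxSplitCount(60000, 64 * MB, 10000, 12000)); // 5
```

<p>With a single split there is only one map task, so extra nodes have nothing to do.  Smaller splits give the cluster more tasks to spread around.</p>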
<p>The split workaround later stopped working (I never figured out why).  I looked around for a better solution and found the <a href="http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html">NLineInputFormat</a> class.  Had I known about this class when I first wrote the code, I would have used it.  It is a much better fit for the type of input I am using for the Million Monkeys project.</p>
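<p>The idea behind NLineInputFormat is that each map task receives a fixed number of input lines rather than a fixed number of bytes.  The sketch below only mimics that idea in plain JavaScript; it is not Hadoop code, and the function name and sample input are made up.</p>

```javascript
// Group input lines into chunks of n lines, one chunk per map task.
// This mimics the idea behind NLineInputFormat; it is not Hadoop code.
function splitByLines(text, n) {
  const lines = text.split("\n").filter((line) => line.length > 0);
  const splits = [];
  let start = 0;
  while (lines.length > start) {
    splits.push(lines.slice(start, start + n));
    start += n;
  }
  return splits;
}

// Hypothetical input: five short lines, two lines per map task.
const input = "monkey-1\nmonkey-2\nmonkey-3\nmonkey-4\nmonkey-5\n";
console.log(splitByLines(input, 2).length); // 3
```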
<p>When you are budgeting for your AWS project, make sure you bake in some time and money for running down issues.  You may run into issues with multiple nodes that did not happen on a single development computer.</p>
<h2 dir="ltr"><a name="conclusion"></a>Conclusion</h2>
<p>EMR is a great, cost-effective way to get an enterprise Hadoop cluster going.  It is also easier to get an enterprise Hadoop cluster up and running this way than with a do-it-yourself approach.  An EMR cluster solves many of the problems of creating an enterprise cluster, like hardware specs, uptime and configuration.  Until you have dealt with the pain of redundancy and enterprise hardware requirements, you don’t know how much time and effort EC2 and S3 save.  With EMR, you simply have to start the cluster and all of these issues are taken care of.</p>
<p>I also showed how EMR and Hadoop make scaling easy.  You do not have to convince your boss to buy one or more $2,000 to $4,000 servers; you can simply add more EC2 instances to the core instance group or change the instance type to one with more ECUs.  This can be done on a temporary basis to accommodate higher usage or as a gradual increase in capacity.  Without changing the code, I was able to scale the cluster to 20 nodes.</p>
<p>EMR clusters can be run at Amazon Web Services’ various locations around the world.  AWS has 3 in the United States, 1 in Ireland, 1 in Singapore, 1 in Tokyo and 1 in Brazil.  Separate EMR clusters could be used in conjunction with geographic sharding or simply by choosing the nearest location to the client.</p>
<p>Spot instances also show great promise in further reducing the price per hour of an EMR cluster.  For the 20 node tests, I reduced total cost per hour from $2.20 to $1.30, a 41% decrease.  During one of the 20 node speed runs, I saved $1,000 by using Spot instances.  If you decide to use Spot instances, make sure your code can handle its instances being taken away as the market price increases.</p>
<p>I think this project shows there is true substance to the hype and buzz around Hadoop and EMR.  Anyone who has created their own distributed system knows that achieving 84% efficiency is an impressive feat.  There are a great number of use cases that can make efficient use of Hadoop and EMR.  With Hadoop paired with EMR, you can easily run a cost efficient, enterprise level cluster that can run around the world.</p>
<p>Full Disclosure: Amazon supported this project with AWS credit.  I would like to thank Jeff Barr and Alan Mock from Amazon for their help in making this project possible.</p>
<p>Copyright © Jesse Anderson 2012.  All Rights Reserved.  All text, graphs and charts on this page are licensed under a <a href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike 3.0 Unported License</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2012/02/ec2-performance-spot-instance-roi-and-emr-scalability/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Million Monkeys Visualization</title>
		<link>http://www.jesse-anderson.com/2011/10/million-monkeys-visualization/</link>
		<comments>http://www.jesse-anderson.com/2011/10/million-monkeys-visualization/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 01:42:45 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[infinite monkey theorem]]></category>
		<category><![CDATA[million monkeys]]></category>
		<category><![CDATA[shakespeare]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=152</guid>
		<description><![CDATA[At last weekend&#8217;s Hack4Reno, I created a new visualization of the Million Monkeys&#8217; data.  It allows you to choose your favorite work of Shakespeare and find out how a particular character was found.  You simply place or hover the mouse over a character and the box to the right will show the number of times [...]]]></description>
			<content:encoded><![CDATA[<p>At last weekend&#8217;s <a href="http://hack4reno.com/">Hack4Reno</a>, I created a <a href="http://www.jesse-anderson.com/monkeysvis/monkeys.htm">new visualization</a> of the Million Monkeys&#8217; data.  It allows you to choose your favorite work of Shakespeare and find out how a particular character was found.  You simply place or hover the mouse over a character and the box to the right will show the number of times that character was found.</p>
<p>For more information on the Million Monkeys Project, <a href="http://www.jesse-anderson.com/2011/10/a-few-million-monkeys-randomly-recreate-every-work-of-shakespeare/" title="A Few Million Monkeys Randomly Recreate Every Work Of Shakespeare">go here</a>.</p>
<p>To make this visualization possible, I took the ~3GB of raw monkey data and generated a JSON output.  This was tricky because I had to break the works of Shakespeare down into individual works.  Once I had the JSON data, I wrote some JavaScript that used jQuery to show the data and allow the interactions.</p>
<p>NOTE: It was a 24 hour Hackathon and there are a few bugs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/10/million-monkeys-visualization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Few Million Monkeys Randomly Recreate Every Work Of Shakespeare</title>
		<link>http://www.jesse-anderson.com/2011/10/a-few-million-monkeys-randomly-recreate-every-work-of-shakespeare/</link>
		<comments>http://www.jesse-anderson.com/2011/10/a-few-million-monkeys-randomly-recreate-every-work-of-shakespeare/#comments</comments>
		<pubDate>Thu, 06 Oct 2011 21:48:33 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Magnum Opus]]></category>
		<category><![CDATA[going viral]]></category>
		<category><![CDATA[infinite monkey theorem]]></category>
		<category><![CDATA[million mo]]></category>
		<category><![CDATA[million monkeys project]]></category>
		<category><![CDATA[shakespeare]]></category>
		<category><![CDATA[social objects]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=147</guid>
		<description><![CDATA[All the world&#8217;s a stage, And all the monkeys merely players; They have their typos and their hits, And one monkey in his time plays many parts, His acts being 38 works of Shakespeare. - Monkey As You Like It Update: I created a new visualization of the monkeys&#8217; data. The monkeys accomplished their goal [...]]]></description>
			<content:encoded><![CDATA[<p>All the world&#8217;s a stage,<br />
And all the monkeys merely players;<br />
They have their typos and their hits,<br />
And one monkey in his time plays many parts,<br />
His acts being 38 works of Shakespeare.<br />
- Monkey <em>As You Like It</em></p>
<p><strong>Update</strong>: I created a <a href="http://www.jesse-anderson.com/2011/10/million-monkeys-visualization/" title="Million Monkeys Visualization">new visualization</a> of the monkeys&#8217; data.</p>
<p>The monkeys accomplished their goal of recreating all 38 works of Shakespeare.  The last work, <em>The Taming Of The Shrew</em>, was completed at 2 AM PST on October 6, 2011.  This is the first time <strong>every</strong> work of Shakespeare has actually been randomly reproduced.  Furthermore, this is the largest work ever randomly reproduced. It is one small step for a monkey, one giant leap for virtual primates everywhere.  <a href="http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/" title="A Few Million Monkeys Randomly Recreate Shakespeare">This page</a> shows what day each work of Shakespeare was completed on.</p>
<p>The Million Monkeys project went viral, but not in the cool, apocalyptic way.  It started to go viral on September 25, 2011 and went into full swing on September 26, 2011.  On September 26, 2011, over 25,000 unique visitors viewed the Million Monkeys project, 300 sites referred traffic, and people viewed it from 119 countries.  This post will contain some of my thoughts and reactions on going viral.  If this article about going viral goes viral, it will create an infinite loop that will bring about the destruction of the world.</p>
<p><strong>NOTE:</strong> I apologize in advance for having to use the term &#8220;go viral&#8221; so much, but that really explains the phenomenon.</p>
<p>I am proud to announce that I have open sourced the Million Monkeys project.  The source code is <a href="http://code.google.com/p/million-monkeys-project">available here</a>.</p>
<p>This project originally started on August 21, 2011.  Over the course of the project, over 7.5 trillion character groups have been randomly generated and checked, out of the 5.4 trillion (5,429,503,678,976) possible combinations.</p>
<p><strong>Update</strong>: The monkeys are not <a href="http://www.apps.ietf.org/rfc/rfc2795.html">RFC 2795 compliant</a>.  A Slashdot user pointed out that I forgot to talk about the similarities between this project and <a href="http://en.wikipedia.org/wiki/Weasel_program">Richard Dawkins&#8217;s Weasel Experiment</a>.</p>
<p>If you would like to do a story, please contact me via the Contact page.</p>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/sZZBgxfN87o" frameborder="0" allowfullscreen></iframe></p>
<h4>Thoughts on Going Viral</h4>
<p>As I mentioned before, the Million Monkeys project went viral on September 26, 2011.  This was partly due to me spending a few hours E-mailing every news outlet I could think of.  Another part was people using Twitter and Facebook to promote the project.  On that day alone, over 2,300 visitors came to the site through Facebook and Twitter.</p>
<p>The <a href="http://www.jesse-anderson.com/2011/06/a-million-amazonian-monkeys/" title="A Million Amazonian Monkeys">first round</a> of the project had no recognition, even among my friends.  I thought the concept was cool and I kept with it.  During a conversation with a friend of mine, we came up with a new concept for the project.</p>
<p>I went back to the drawing board for the second round of the project with the ideas from the new concept.  I started using a smaller group size, 9 character groups instead of 24 character groups.  This would allow the project to complete without an infinite amount of resources.  I added near real-time updates of the site so people could see the progress of the monkeys.  I wanted people to be able to come back to the site to watch their favorite work being recreated.  This <a href="http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/" title="A Few More Million Amazonian Monkeys">round</a> received some recognition and landed on the front pages of <a href="http://www.foxnews.com/scitech/2011/09/07/works-shakespeare-produced-by-millions-monkeys/">Fox News</a> and <a href="http://www.engadget.com/2011/08/23/simulated-monkey-typing-project-is-the-best-blurst-of-times">Engadget</a>.</p>
<p>I knew I was on the right track.  I was getting some media attention and people were starting to see the site.  My goal was to do another media blitz once the monkeys completed their first work.  My goal was to get an Associated Press article and, if I was lucky enough, get on the front page of Slashdot.  I thought I had a good idea, but I had no delusions of the project going viral.</p>
<p>On Sunday night September 25, 2011, I was reading through my RSS feeds on Google Reader.  Some new Slashdot stories appeared and I dutifully started reading them.  When I started reading about myself and my project, I started to think I had clicked on the wrong feed or I had erred in some fashion.  I could not believe I was reading about myself on Slashdot after many years of reading it.  My wife was next to me at the time and I tried to explain why I was so ecstatic to be on Slashdot.  Explaining to a non-geek about Slashdot is difficult, but I think she could see it was important to me.  If the media blitz had died at that point, I would have been happy.  It didn&#8217;t.  Over the course of the next day, the story kept on gaining momentum, getting more news stories, and more hits on the website.</p>
<p>All glory may be fleeting, but not everyone liked the project.  I received my share of hate mail, hate comments, and hate blog posts.  I was informed that I didn&#8217;t understand Infinite Monkey Theorem (I do), that I was conning people (I&#8217;m not, the source code and data are available), and that the project was boring (beauty is in the eye of the beholder).  Before you decide to create a project on the Internet, you had better have a thick skin to put up with people&#8217;s comments.  I responded to the people I thought were genuinely asking a question or those that seemed to be open to a discussion about the project.  Most people responded and most people were nice.</p>
<h4>Pre-Viral Checklist</h4>
<p>You should create as many social objects as possible.  I have several YouTube videos where I explain the project in various levels of detail.  These YouTube videos, in turn, were posted by the various sites in their postings.  The blog postings themselves were great social objects.  I could see by the direct traffic that people were E-mailing the link around to their friends.  My Twitter feed allowed me to converse with people who had questions about the project.  It also allowed me to tweet the URLs of interviews, articles and radio shows about the project.</p>
<p>To gain the most media attention, make your project and/or post as media friendly as possible.  Many of the sites wrote their articles using only the posts as source material.  I put a lot of effort into making the site as straightforward as possible and as quotable as possible.  When doing a technical project like this, not all of your readers will be technically minded people.  I recommend creating sections for technical and non-technical people.  The non-technical people may glaze over at a very technical explanation of your project and a technical person will want more technical detail.</p>
<p>The site itself needs to be technically ready for a huge increase in traffic.  Many sites go down during a Slashdotting.  Fortunately for me, <a href="http://www.dreamhost.com/">DreamHost</a> kept my site going without stoppage.  It&#8217;s usually too late to change your site once it goes viral.  Make sure you have some metrics for your site to track the usage.  In my case, I use Google Analytics for WordPress.  Having a decent looking site also helps.  If you are not a designer, use your good taste and find a good theme for your site.  I used <a href="http://www.elegantthemes.com">ElegantTheme&#8217;s Minimal theme</a> for this site.  To handle a Slashdotting, your site needs to be optimized.  From the beginning of this project, I tried to optimize the site.  The images showing the progress through Shakespeare were indexed PNGs.  They provided the smallest file size and therefore the best scalability.  Much to my lament, the comments are not working on this site.  One of the CAPTCHA plugins I installed messed things up and it is still not working even after I uninstalled all of them.</p>
<p>Make sure your site makes it as easy as possible to connect with your users socially.  The previous posts did not have the Facebook likes and Tweets when they were on Engadget and Fox News.  I made it more difficult than it should have been for people to tell their friends about the project.  From the start of this round, I have had &#8220;like&#8221; buttons for the major social players.  The site&#8217;s traffic and the number of people &#8220;liking&#8221; it show how much better the story made its rounds.</p>
<h4>Was It A Success?</h4>
<p>I always do a postmortem at the end of every project.  This is the Million Monkeys project postmortem.  I think the project was a resounding success.  It achieved its primary goal of recreating every work of Shakespeare.  People saw my work.  While I might have received over 25,000 unique visitors to my site, millions and millions of people read about my work on mainstream news, blogs, print and radio.  My personal branding (which is what this website is) went through the roof.  On Google, the search term &#8220;jesse anderson&#8221; used to appear as the 45th link.  Now, I have links 4-6.  The top 3 spots belong to an anime character named Jesse Anderson (Andersen).  The project also brought me recognition within my own company, Intuit.</p>
<p>This success was not the result of luck or random chance, but of countless hours of hard work.  Even though the Million Monkeys project took 40-60 hours of my time to write, it took countless hours before that to become a better programmer and learn new technologies like Hadoop.  A lot of time was spent submitting the story and working with reporters on stories.</p>
<p>In a way, the Million Monkeys is the current culmination of this time spent.</p>
<h4>Miscellaneous Thoughts</h4>
<p>A lot of reporters asked me what I wanted to accomplish with this project.  For me it is performance art with monkeys and computers.  I wanted to make it engaging and have people coming back to check the monkeys’ progress, so I did near real-time updates of the site.  People did just that as was reflected through the usage logs.  People were coming back and they were E-mailing it around to their friends.  They were tweeting it and liking it on Facebook.  I consider that the most gratifying part of the project; people enjoyed it.</p>
<p>As time went on, I began to anthropomorphise the monkeys more and more.  Instead of thinking of them as a PRNG (pseudo random number generator) and a computer program, I was talking about them as if they were really monkeys.  I began to identify with them and think of them like a pet.  Maybe I spent too much time curating their work.</p>
<p>Going back to thick skin, I have a list of people to contact to get approval of projects.  If anyone wants this list before they start their project, please E-mail me so we can get their approval.  It&#8217;s of utmost importance that any project contact them before starting any work.</p>
<p>Reading about yourself in the news is one of the craziest things that can happen to you.  There is kind of a disembodied realization that it is you, but it does not feel like you did it.  That first week seemed like it was a month long.  I was doing a lot of interviews and every moment seemed like an eternity.</p>
<p>I could not get the local media in Reno to do any stories on the project.  It was incredibly funny because I would E-mail them saying the project has been on BBC, CNN, etc and I never even got a reply.  I will take international coverage over local coverage any day, but it was funny that local didn&#8217;t follow international.  Update: I finally got some <a href="http://www.rgj.com/article/20111016/BIZ15/110160324/Reno-software-engineer-goes-ape-Shakespeare">local press</a>.</p>
<h4>Some More Numbers</h4>
<p>The monkeys ran 180,000,000,000 character groups a day.  An average iteration lasted 30 minutes 33 seconds and ran 5,000,000,000 character groups.  The monkeys found 1,982,507 distinct character groups and those character groups were found 3,788,175 times for a ratio of 1.8718555.  The monkeys ran 7,445,912,000,000 total character groups out of the 5,429,503,678,976 possible combinations for a ratio of 1.3713.</p>
<p>There are 2 technologies I think set the Monkeys Project apart from previous endeavors.  The first is Hadoop, which scales well and can handle exponential problems like the Infinite Monkey Theorem.  The second is a Bloom Filter.  I ran a test last night comparing the Bloom Filter speed to a String.indexOf.  The Bloom Filter took 25 seconds to run every work of Shakespeare and I stopped the String.indexOf after 2 hours.  The monkeys project would not be close to the number of character groups it is at now if not for the Bloom Filter.  I think this would be true even with Lucene or Sphinx, though not to the same degree.</p>
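<p>For readers who have not met a Bloom Filter before, here is a minimal sketch of the idea: a cheap probabilistic membership test first, followed by an exact search only when the filter reports a possible hit.  This is not the project code (which is written for Hadoop); the filter size, the number of hash functions, the hash scheme and the short sample text are all illustrative assumptions, and one byte is used per bit to keep the code readable.</p>

```javascript
// Minimal Bloom filter sketch. Sizes and hash scheme are illustrative choices,
// not the ones used by the Million Monkeys code.
class BloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(bitCount); // one byte per bit, for readability
  }

  // FNV-1a style string hash, varied by a seed to act as several hash functions.
  hash(text, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (const ch of text) {
      h ^= ch.codePointAt(0);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.bitCount;
  }

  add(text) {
    for (let seed = 0; seed !== this.hashCount; seed++) {
      this.bits[this.hash(text, seed)] = 1;
    }
  }

  // false means "definitely absent"; true means "possibly present".
  mightContain(text) {
    for (let seed = 0; seed !== this.hashCount; seed++) {
      if (this.bits[this.hash(text, seed)] === 0) return false;
    }
    return true;
  }
}

const GROUP = 9;
const source = "tobeornottobethatisthequestion"; // stand-in for the real text
const filter = new BloomFilter(4096, 3);
for (let i = 0; source.length >= i + GROUP; i++) {
  filter.add(source.slice(i, i + GROUP)); // every 9 character window
}

function isInSource(group) {
  // Cheap check first; confirm hits exactly, since false positives are possible.
  if (!filter.mightContain(group)) return false;
  return source.indexOf(group) !== -1;
}

console.log(isInSource("tobeornot")); // true
console.log(isInSource("zzzzzzzzz")); // false
```

<p>Almost all random character groups are rejected by the filter, so the expensive exact search runs only rarely.</p>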
<h4>The Inspiration</h4>
<p>This project comes from one of my <a href="http://www.snpp.com/episodes/9F15.html">favorite Simpsons episodes</a> which has a scene where Mr. Burns brings Homer to his mansion (<a href="http://www.youtube.com/watch?v=JcSUWP0QNeY">YouTube Video</a>). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: &#8220;It was the best of times, it was the blurst of times.&#8221;  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did.  I created millions of monkeys on Amazon EC2 (then my home computer) and put them at virtual typewriters (aka <a href="http://en.wikipedia.org/wiki/Infinite_monkey_theorem">Infinite Monkey Theorem</a>).</p>
<p><iframe src="http://www.youtube.com/embed/8MCHJGNmSts" frameborder="0" width="560" height="345"></iframe></p>
<h4>Less Technical Explanation</h4>
<p>Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.</p>
<h4>Technical Explanation</h4>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/JZpM_MlZFqE" frameborder="0" allowfullscreen></iframe></p>
<p>For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in <a href="http://en.wikipedia.org/wiki/ASCII">ASCII</a> between a and z.  They use <a href="http://www.cs.gmu.edu/~sean/research/">Sean Luke’s Mersenne Twister</a> to make sure I have fast, random, well-behaved monkeys.  Once the monkey’s output is mapped, it is passed to the reducer, which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from <a href="http://www.gutenberg.org/wiki/Main_Page">Project Gutenberg</a>.</p>
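<p>Stripped of Hadoop, the steps above can be sketched roughly like this.  Treat it as an illustration only: <code>java.util.Random</code> stands in for the Mersenne Twister, the Bloom Filter membership test is left out for brevity, and the <code>normalize</code> step (lowercasing the text and dropping everything except a to z) is my assumption about how the source text has to be prepared.  The class and method names are made up for this example.</p>

```java
import java.util.Locale;
import java.util.Random;

// Hypothetical single-machine sketch of one monkey; not the Hadoop job itself.
class MapMonkey {
    static final int GROUP_LENGTH = 9;

    // One monkey mashing keys: a random group of letters a..z.
    static String typeGroup(Random rng) {
        char[] keys = new char[GROUP_LENGTH];
        for (int i = 0; i < GROUP_LENGTH; i++) {
            keys[i] = (char) ('a' + rng.nextInt(26));
        }
        return new String(keys);
    }

    // Assumption: the source text is reduced to lowercase a..z before matching.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') sb.append(c);
        }
        return sb.toString();
    }

    // The exact (slow) string comparison that confirms a candidate group.
    static boolean isShakespeare(String group, String normalizedText) {
        return normalizedText.contains(group);
    }

    public static void main(String[] args) {
        String text = normalize("To be, or not to be: that is the question.");
        Random rng = new Random(); // stand-in for the Mersenne Twister
        for (int i = 0; i < 5; i++) {
            String group = typeGroup(rng);
            System.out.println(group + " matched: " + isShakespeare(group, text));
        }
    }
}
```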
<p><a href="http://www.jesse-anderson.com/wp-content/uploads/2011/10/stories_final.png"><img src="http://www.jesse-anderson.com/wp-content/uploads/2011/10/stories_final-300x100.png" alt="" title="stories_final" width="300" height="100" class="aligncenter size-medium wp-image-149 colorbox-147" /></a><br />
This chart shows percent complete as more and more iterations are run for each story.</p>
<p><a href="http://www.jesse-anderson.com/wp-content/uploads/2011/10/totalchars_final.png"><img src="http://www.jesse-anderson.com/wp-content/uploads/2011/10/totalchars_final-300x100.png" alt="" title="totalchars_final" width="300" height="100" class="aligncenter size-medium wp-image-150 colorbox-147" /></a><br />
This chart shows the total number of character groups as more and more iterations of the checks are run.</p>
<p>For the curious, the computer I ran the monkeys on is a Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit.</p>
<h4>A Few Words To Try and Prevent The Usual Comments</h4>
<p>I realize there are different interpretations of this saying/theorem, and I have done two <a title="A Million Amazonian Monkeys" href="http://www.jesse-anderson.com/2011/06/a-million-amazonian-monkeys/">different</a> <a title="A Few More Million Amazonian Monkeys" href="http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/">ones</a> already.  I understand the definition of infinite and the infinite monkey theorem, and I realize that this project does not have infinite resources.  This project was funded and written by me and was not supported by any grant or federal money.  No monkeys were harmed during the making of this code.  This project is my attempt to find a creative way to attain an answer without infinite resources.  It is a fun side project.  If you still feel angry or slighted or feel the need to set me straight, please read this sign:</p>
<p><a href="http://www.jesse-anderson.com/wp-content/uploads/2011/09/keepcalm.jpg"><img class="aligncenter size-full wp-image-142 colorbox-147" title="keepcalm" src="http://www.jesse-anderson.com/wp-content/uploads/2011/09/keepcalm.jpg" alt="" width="189" height="267" /></a></p>
<p>Thanks to my wife Sara, daughter Ashley, David Weinberg, Ryan Polk, and Tim Dailey.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/10/a-few-million-monkeys-randomly-recreate-every-work-of-shakespeare/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Few Million Monkeys Randomly Recreate Shakespeare</title>
		<link>http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/</link>
		<comments>http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 21:35:50 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Magnum Opus]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[infinite monkey theorem]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[monkeys]]></category>
		<category><![CDATA[shakespeare]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=141</guid>
		<description><![CDATA[Friends, Romans, countrymen, lend me your ears; I come to recreate Shakespeare, not to praise him. - Monkey Julius Caesar Update 1: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project. Update 2: I created a [...]]]></description>
			<content:encoded><![CDATA[<p>Friends, Romans, countrymen, lend me your ears;<br />
I come to recreate Shakespeare, not to praise him.<br />
- <em>Monkey Julius Caesar</em></p>
<p><strong>Update 1</strong>: The monkeys recreated every work of Shakespeare and went viral.  See the <a href="http://www.jesse-anderson.com/2011/10/a-few-million-monkeys-randomly-recreate-every-work-of-shakespeare/" title="A Few Million Monkeys Randomly Recreate Every Work Of Shakespeare">project postmortem</a> for my thoughts on going viral and what I learned during the project.</p>
<p><strong>Update 2</strong>: I created a <a href="http://www.jesse-anderson.com/2011/10/million-monkeys-visualization/" title="Million Monkeys Visualization">new visualization</a> of the monkeys&#8217; data.</p>
<p>Today (2011-09-23) at <strong>2:30 PST</strong> the monkeys successfully randomly recreated <a href="http://www.gutenberg.org/ebooks/1137">A Lover&#8217;s Complaint</a>.  Since then, they have also recreated The Tempest (2011-09-26), As You Like It (2011-09-28), Loves Labours Lost (2011-09-29), Much Ado About Nothing (2011-09-29), The Merchant Of Venice (2011-09-29), The Sonnets (2011-09-29), The Third Part Of King Henry The Sixth (2011-09-29), The Two Gentlemen Of Verona (2011-09-29), A Midsummer Nights Dream (2011-09-30), As You Like It (2011-09-30), The Life Of King Henry The Fifth (2011-09-30), The First Part Of Henry The Sixth (2011-09-30), The Tragedy Of Titus Andronicus (2011-09-30), The Winters Tale (2011-09-30), Measure for Measure (2011-10-01), The First Part Of King Henry The Fourth (2011-10-01), The History Of Troilus And Cressida (2011-10-01), Cymbeline (2011-10-02), King Richard The Second (2011-10-02), The Comedy Of Errors (2011-10-02), The Life Of Timon Of Athens (2011-10-02), The Tragedy Of Macbeth (2011-10-02), The Tragedy Of Othello Moor Of Venice (2011-10-02), Twelfth Night Or What You Will (2011-10-02), Alls Well That Ends Well (2011-10-03), King Henry The Eighth (2011-10-03), The Second Part Of King Henry The Sixth (2011-10-03), The Tragedy Of Hamlet Prince Of Denmark (2011-10-03), The Tragedy Of Julius Caesar (2011-10-03), The Tragedy Of Romeo And Juliet (2011-10-03), King John (2011-10-04), King Richard III (2011-10-04), Second Part Of King Henry IV (2011-10-04), The Tragedy Of Antony And Cleopatra (2011-10-04), The Tragedy Of Coriolanus (2011-10-04), The Tragedy Of King Lear (2011-10-04), and The Taming Of The Shrew (2011-10-06). This is the first time a work of Shakespeare has actually been randomly reproduced.  Furthermore, this is the largest work ever randomly reproduced.  It is one small step for a monkey, one giant leap for virtual primates everywhere.</p>
<p>The monkeys will continue typing away until every work of Shakespeare is randomly created.  Until then, you can continue to view the monkeys&#8217; progress on <a title="A Few More Million Amazonian Monkeys" href="http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/">that page</a>.  I am making the raw data available to anyone who wants it.  Please use the Contact page to ask for the URL.  If you have a Hadoop cluster that I could run the monkeys project on, please contact me as well.</p>
<p>This project originally started on August 21, 2011.  Over the course of the project, over 6.5 trillion character groups have been randomly generated and checked out of the 5.5 trillion possible combinations.</p>
<p>So far, the project has appeared on <a href="http://idle.slashdot.org/story/11/09/26/0139253/a-few-million-virtual-monkeys-randomly-recreate-shakespeare">Slashdot</a>, <a href="http://www.foxnews.com/scitech/2011/09/07/works-shakespeare-produced-by-millions-monkeys/">Fox News</a>, <a href="http://www.engadget.com/2011/08/23/simulated-monkey-typing-project-is-the-best-blurst-of-times">Engadget</a>, <a href="http://japanese.engadget.com/2011/08/28/infinite-monkey-theorem/">Japanese Engadget</a>, and <a href="http://developers.solidot.org/developers/11/08/26/117201.shtml">Solidot</a>.  The radio interviews were with the <a href="http://blogs.abc.net.au/wa/2011/09/a-few-million-monkeys-randomly-re-create-shakespeare.html?site=perth&#038;program=720_afternoons">Australian Broadcasting Corporation</a>, <a href="http://kyxy.radio.com/author/littletommykyxyy/">Little Tommy, Jeff and Jer in San Diego</a> and <a href="http://www.radionz.co.nz/national/programmes/checkpoint/audio/2499182/virtual-monkeys-begin-to-recreate-the-works-of-shakespeare">Radio New Zealand</a>.  If you would like to do a story, please contact me via the Contact page.</p>
<h4>The Inspiration</h4>
<p>This project comes from one of my <a href="http://www.snpp.com/episodes/9F15.html">favorite Simpsons episodes</a>, which has a scene where Mr. Burns brings Homer to his mansion (<a href="http://www.youtube.com/watch?v=JcSUWP0QNeY">YouTube Video</a>). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: &#8220;It was the best of times, it was the blurst of times.&#8221;  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did.  I created millions of monkeys on Amazon EC2 (and later on my home computer) and put them at virtual typewriters (aka the <a href="http://en.wikipedia.org/wiki/Infinite_monkey_theorem">Infinite Monkey Theorem</a>).</p>
<p><iframe src="http://www.youtube.com/embed/8MCHJGNmSts" frameborder="0" width="560" height="345"></iframe></p>
<h4>Less Technical Explanation</h4>
<p>Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey’s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.</p>
<h4>Technical Explanation</h4>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/JZpM_MlZFqE" frameborder="0" allowfullscreen></iframe></p>
<p>For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in <a href="http://en.wikipedia.org/wiki/ASCII">ASCII</a> between a and z.  They use <a href="http://www.cs.gmu.edu/~sean/research/">Sean Luke’s Mersenne Twister</a> to make sure I have fast, random, well-behaved monkeys.  Once the monkey’s output is mapped, it is passed to the reducer, which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from <a href="http://www.gutenberg.org/wiki/Main_Page">Project Gutenberg</a>.</p>
<p>The monkeys’ data from Amazon’s cloud is updated on this site every 30 minutes.  The images below show green for every character group that was found and white for those that are still missing.  The image output is kind of like the animations in defrag utilities.  As the monkeys progress through the works, more and more character groups will be found and show green.</p>
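<p>For the curious, a green and white progress image of this kind could be drawn along these lines.  This is only a sketch: drawing one pixel per character group, the row-by-row layout and the class name are assumptions for the example, not a description of the real rendering code.</p>

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical renderer: one pixel per character group, filled row by row.
class ProgressImage {
    static BufferedImage render(boolean[] found, int width) {
        int height = Math.max(1, (found.length + width - 1) / width);
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int i = 0; i < width * height; i++) {
            // Found groups are green; missing groups and trailing padding are white.
            boolean hit = i < found.length && found[i];
            img.setRGB(i % width, i / width, (hit ? Color.GREEN : Color.WHITE).getRGB());
        }
        return img;
    }

    public static void main(String[] args) throws IOException {
        boolean[] found = new boolean[1000];
        for (int i = 0; i < 250; i++) found[i] = true; // pretend the first 250 groups were found
        ImageIO.write(render(found, 100), "png", new File("progress.png"));
    }
}
```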
<p><a href="http://www.jesse-anderson.com/wp-content/uploads/2011/09/totalchars.png"><img class="aligncenter size-medium wp-image-144 colorbox-141" title="totalchars" src="http://www.jesse-anderson.com/wp-content/uploads/2011/09/totalchars-300x100.png" alt="" width="300" height="100" /></a></p>
<p>This chart shows the total number of character groups as more and more iterations of the checks are run.</p>
<p><a href="http://www.jesse-anderson.com/wp-content/uploads/2011/09/stories.png"><img class="aligncenter size-medium wp-image-143 colorbox-141" title="stories" src="http://www.jesse-anderson.com/wp-content/uploads/2011/09/stories-300x100.png" alt="" width="300" height="100" /></a></p>
<p>This chart shows percent complete as more and more iterations are run for each story.</p>
<p>For the curious, the computer I ran the monkeys on is a Core 2 Duo 2.66 GHz with 4 GB RAM running Ubuntu 10.10 64-bit.</p>
<h4>A Few Words To Try and Prevent The Usual Comments</h4>
<p>I realize there are different interpretations of this saying/theorem, and I have done two <a title="A Million Amazonian Monkeys" href="http://www.jesse-anderson.com/2011/06/a-million-amazonian-monkeys/">different</a> <a title="A Few More Million Amazonian Monkeys" href="http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/">ones</a> already.  I understand the definition of infinite and the infinite monkey theorem, and I realize that this project does not have infinite resources.  This project was funded and written by me and was not supported by any grant or federal money.  No monkeys were harmed during the making of this code.  This project is my attempt to find a creative way to attain an answer without infinite resources.  It is a fun side project.  If you still feel angry or slighted or feel the need to set me straight, please read this sign:</p>
<p><a href="http://www.jesse-anderson.com/wp-content/uploads/2011/09/keepcalm.jpg"><img class="aligncenter size-full wp-image-142 colorbox-141" title="keepcalm" src="http://www.jesse-anderson.com/wp-content/uploads/2011/09/keepcalm.jpg" alt="" width="189" height="267" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Few More Million Amazonian Monkeys</title>
		<link>http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/</link>
		<comments>http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/#comments</comments>
		<pubDate>Mon, 22 Aug 2011 00:27:24 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[infinite monkey theorem]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[million monkeys]]></category>
		<category><![CDATA[shakespeare]]></category>
		<category><![CDATA[simpsons]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=138</guid>
		<description><![CDATA[Update 5: The monkeys recreated every work of Shakespeare and went viral. See the project postmortem for my thoughts on going viral and what I learned during the project. Update 6: I created a new visualization of the monkeys&#8217; data. Update 4: The monkeys recreated &#8220;A Lover&#8217;s Complaint&#8221;. Check out the write up. Update [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Update 5</strong>: The monkeys recreated every work of Shakespeare and went viral. See the <a title="A Few Million Monkeys Randomly Recreate Every Work Of Shakespeare" href="http://www.jesse-anderson.com/2011/10/a-few-million-monkeys-randomly-recreate-every-work-of-shakespeare/">project postmortem</a> for my thoughts on going viral and what I learned during the project.</p>
<p><strong>Update 6</strong>: I created a <a href="http://www.jesse-anderson.com/2011/10/million-monkeys-visualization/" title="Million Monkeys Visualization">new visualization</a> of the monkeys&#8217; data.</p>
<p><strong>Update 4</strong>: The monkeys recreated &#8220;A Lover&#8217;s Complaint&#8221;. <a href="http://www.jesse-anderson.com/2011/09/a-few-million-monkeys-randomly-recreate-shakespeare/">Check out the write up</a>.</p>
<p><strong>Update 3</strong>: Welcome <a href="http://idle.slashdot.org/story/11/09/26/0139253/a-few-million-virtual-monkeys-randomly-recreate-shakespeare">Slashdot</a>, <a href="http://www.foxnews.com/scitech/2011/09/07/works-shakespeare-produced-by-millions-monkeys/">Fox News</a>, <a href="http://www.engadget.com/2011/08/23/simulated-monkey-typing-project-is-the-best-blurst-of-times">Engadget</a> and <a href="http://japanese.engadget.com/2011/08/28/infinite-monkey-theorem/">Japanese Engadget</a>. So far, the monkeys have run through 7.5 trillion <del datetime="2011-10-20T01:43:50+00:00">6.5 trillion</del> <del datetime="2011-09-29T16:28:12+00:00">5 trillion (2011-09-22)</del> <del datetime="2011-09-23T05:15:10+00:00">4 trillion (2011-09-16)</del> <del datetime="2011-09-17T18:52:01+00:00">3 trillion (2011-09-10)</del> <del datetime="2011-09-11T16:06:14+00:00">2.5 trillion (2011-09-07)</del> <del datetime="2011-09-07T15:18:01+00:00">2 trillion (2011-09-05)</del> <del datetime="2011-09-05T17:56:45+00:00">1.5 trillion (2011-09-01)</del> <del datetime="2011-09-02T04:25:08+00:00">1 trillion (2011-08-28)</del> <del datetime="2011-08-29T03:10:01+00:00">515,912,000,000 (2011-08-25)</del> character groups.</p>
<p>In a <a title="A Million Amazonian Monkeys" href="http://www.jesse-anderson.com/2011/06/a-million-amazonian-monkeys/">recent post</a>, I described a project to recreate Shakespeare using Hadoop and Amazon EC2.  This time, I am going to recreate <strong>every work</strong> of Shakespeare randomly.</p>
<p><iframe src="http://www.youtube.com/embed/8MCHJGNmSts" frameborder="0" width="560" height="345"></iframe></p>
<p><iframe src="http://www.youtube.com/embed/JZpM_MlZFqE" frameborder="0" width="560" height="315"></iframe></p>
<p>This project comes from one of my <a href="http://www.snpp.com/episodes/9F15.html">favorite Simpsons episodes</a>, which has a scene where Mr. Burns brings Homer to his mansion (<a href="http://www.youtube.com/watch?v=JcSUWP0QNeY">YouTube Video</a>). One of his rooms has a thousand monkeys at a thousand typewriters. One of the monkeys writes a slightly incorrect line from Charles Dickens: ‘It was the best of times, it was the blurst of times.’  The joke is a play on the theory that a million monkeys sitting at a million typewriters will eventually produce Shakespeare.  And that is what I did (am doing).  I created millions of monkeys on Amazon and put them at virtual typewriters (aka the <a href="http://en.wikipedia.org/wiki/Infinite_monkey_theorem">Infinite Monkey Theorem</a>).</p>
<h4>Less Technical Explanation</h4>
<p>Instead of having real monkeys typing on keyboards, I have virtual, computerized monkeys that output random gibberish. This is supposed to mimic a monkey randomly mashing the keys on a keyboard. The computer program I wrote compares that monkey&#8217;s gibberish to every work of Shakespeare to see if it actually matches a small portion of what Shakespeare wrote. If it does match, the portion of gibberish that matched Shakespeare is marked with green in the images below to show it was found by a monkey. The table below shows the exact number of characters and percentage the monkeys have found in Shakespeare. The parts of Shakespeare that have not been found are colored white. This process is repeated over and over until the monkeys have created every work of Shakespeare through random gibberish.</p>
<h4>Technical Explanation</h4>
<p>For this project, I used Hadoop, Amazon EC2, and Ubuntu Linux.  Since I don’t have real monkeys, I have to create fake Amazonian Map Monkeys.  The Map Monkeys create random data in <a href="http://en.wikipedia.org/wiki/ASCII">ASCII</a> between a and z.  They use <a href="http://www.cs.gmu.edu/~sean/research/">Sean Luke’s Mersenne Twister</a> to make sure I have fast, random, well-behaved monkeys.  Once the monkey’s output is mapped, it is passed to the reducer, which runs the characters through a Bloom Filter membership test.  If the monkey output passes the membership test, the Shakespearean works are checked using a string comparison.  If that passes, a genius monkey has written 9 characters of Shakespeare.  The source material is all of Shakespeare’s works as taken from <a href="http://www.gutenberg.org/wiki/Main_Page">Project Gutenberg</a>.</p>
<p>The monkeys&#8217; data from Amazon&#8217;s cloud is updated on this site every 30 minutes.  The images below show green for every character group that was found and white for those that are still missing.  The image output is kind of like the animations in defrag utilities.  As the monkeys progress through the works, more and more character groups will be found and show green.</p>
<h4>The Tabular Output Of What Has Been Found</h4>
<div id="result">Loading Results&#8230; (Will only work on jesse-anderson.com due to browser security restrictions, <a href="http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/">go here</a>)</div>
<h4>Every Work Of Shakespeare</h4>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="All Works of Shakespeare" src="http://www.jesse-anderson.com/currentstories/All Works of Shakespeare.png" alt="All Works of Shakespeare" width="720" height="5133" /><p class="wp-caption-text">All Works of Shakespeare</p></div>
<h4>Progress Through Individual Works Of Shakespeare</h4>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="A Lovers Complaint" src="http://www.jesse-anderson.com/currentstories/A Lovers Complaint.png" alt="A Lovers Complaint" width="720" /><p class="wp-caption-text">A Lovers Complaint</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Loves Labours Lost" src="http://www.jesse-anderson.com/currentstories/Loves Labours Lost.png" alt="Loves Labours Lost" width="720" /><p class="wp-caption-text">Loves Labours Lost</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Merchant Of Venice" src="http://www.jesse-anderson.com/currentstories/The Merchant Of Venice.png" alt="The Merchant Of Venice" width="720" /><p class="wp-caption-text">The Merchant Of Venice</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Julius Caesar" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Julius Caesar.png" alt="The Tragedy Of Julius Caesar" width="720" /><p class="wp-caption-text">The Tragedy Of Julius Caesar</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="A Midsummer Nights Dream" src="http://www.jesse-anderson.com/currentstories/A Midsummer Nights Dream.png" alt="A Midsummer Nights Dream" width="720" /><p class="wp-caption-text">A Midsummer Nights Dream</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Measure For Measure" src="http://www.jesse-anderson.com/currentstories/Measure For Measure.png" alt="Measure For Measure" width="720" /><p class="wp-caption-text">Measure For Measure</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Merry Wives Of Windsor" src="http://www.jesse-anderson.com/currentstories/The Merry Wives Of Windsor.png" alt="The Merry Wives Of Windsor" width="720" /><p class="wp-caption-text">The Merry Wives Of Windsor</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of King Lear" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of King Lear.png" alt="The Tragedy Of King Lear" width="720" /><p class="wp-caption-text">The Tragedy Of King Lear</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Much Ado About Nothing" src="http://www.jesse-anderson.com/currentstories/Much Ado About Nothing.png" alt="Much Ado About Nothing" width="720" /><p class="wp-caption-text">Much Ado About Nothing</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Macbeth" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Macbeth.png" alt="The Tragedy Of Macbeth" width="720" /><p class="wp-caption-text">The Tragedy Of Macbeth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Alls Well That Ends Well" src="http://www.jesse-anderson.com/currentstories/Alls Well That Ends Well.png" alt="Alls Well That Ends Well" width="720" /><p class="wp-caption-text">Alls Well That Ends Well</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Sonnets" src="http://www.jesse-anderson.com/currentstories/The Sonnets.png" alt="The Sonnets" width="720" /><p class="wp-caption-text">The Sonnets</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Othello Moor Of Venice" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Othello Moor Of Venice.png" alt="The Tragedy Of Othello Moor Of Venice" width="720" /><p class="wp-caption-text">The Tragedy Of Othello Moor Of Venice</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="As You Like It" src="http://www.jesse-anderson.com/currentstories/As You Like It.png" alt="As You Like It" width="720" /><p class="wp-caption-text">As You Like It</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Comedy Of Errors" src="http://www.jesse-anderson.com/currentstories/The Comedy Of Errors.png" alt="The Comedy Of Errors" width="720" /><p class="wp-caption-text">The Comedy Of Errors</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Taming Of The Shrew" src="http://www.jesse-anderson.com/currentstories/The Taming Of The Shrew.png" alt="The Taming Of The Shrew" width="720" /><p class="wp-caption-text">The Taming Of The Shrew</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Romeo And Juliet" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Romeo And Juliet.png" alt="The Tragedy Of Romeo And Juliet" width="720" /><p class="wp-caption-text">The Tragedy Of Romeo And Juliet</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Cymbeline" src="http://www.jesse-anderson.com/currentstories/Cymbeline.png" alt="Cymbeline" width="720" /><p class="wp-caption-text">Cymbeline</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tempest" src="http://www.jesse-anderson.com/currentstories/The Tempest.png" alt="The Tempest" width="720" /><p class="wp-caption-text">The Tempest</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Titus Andronicus" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Titus Andronicus.png" alt="The Tragedy Of Titus Andronicus" width="720" /><p class="wp-caption-text">The Tragedy Of Titus Andronicus</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="King Henry The Eighth" src="http://www.jesse-anderson.com/currentstories/King Henry The Eighth.png" alt="King Henry The Eighth" width="720" /><p class="wp-caption-text">King Henry The Eighth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The First Part Of King Henry The Fourth" src="http://www.jesse-anderson.com/currentstories/The First Part Of King Henry The Fourth.png" alt="The First Part Of King Henry The Fourth" width="720" /><p class="wp-caption-text">The First Part Of King Henry The Fourth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Second Part Of King Henry IV" src="http://www.jesse-anderson.com/currentstories/Second Part Of King Henry IV.png" alt="Second Part Of King Henry IV" width="720" /><p class="wp-caption-text">Second Part Of King Henry IV</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The First Part Of Henry The Sixth" src="http://www.jesse-anderson.com/currentstories/The First Part Of Henry The Sixth.png" alt="The First Part Of Henry The Sixth" width="720" /><p class="wp-caption-text">The First Part Of Henry The Sixth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Second Part Of King Henry The Sixth" src="http://www.jesse-anderson.com/currentstories/The Second Part Of King Henry The Sixth.png" alt="The Second Part Of King Henry The Sixth" width="720" /><p class="wp-caption-text">The Second Part Of King Henry The Sixth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Third Part Of King Henry The Sixth" src="http://www.jesse-anderson.com/currentstories/The Third Part Of King Henry The Sixth.png" alt="The Third Part Of King Henry The Sixth" width="720" /><p class="wp-caption-text">The Third Part Of King Henry The Sixth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Two Gentlemen Of Verona" src="http://www.jesse-anderson.com/currentstories/The Two Gentlemen Of Verona.png" alt="The Two Gentlemen Of Verona" width="720" /><p class="wp-caption-text">The Two Gentlemen Of Verona</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="King John" src="http://www.jesse-anderson.com/currentstories/King John.png" alt="King John" width="720" /><p class="wp-caption-text">King John</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The History Of Troilus And Cressida" src="http://www.jesse-anderson.com/currentstories/The History Of Troilus And Cressida.png" alt="The History Of Troilus And Cressida" width="720" /><p class="wp-caption-text">The History Of Troilus And Cressida</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Antony And Cleopatra" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Antony And Cleopatra.png" alt="The Tragedy Of Antony And Cleopatra" width="720" /><p class="wp-caption-text">The Tragedy Of Antony And Cleopatra</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Winters Tale" src="http://www.jesse-anderson.com/currentstories/The Winters Tale.png" alt="The Winters Tale" width="720" /><p class="wp-caption-text">The Winters Tale</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="King Richard III" src="http://www.jesse-anderson.com/currentstories/King Richard III.png" alt="King Richard III" width="720" /><p class="wp-caption-text">King Richard III</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Life Of King Henry The Fifth" src="http://www.jesse-anderson.com/currentstories/The Life Of King Henry The Fifth.png" alt="The Life Of King Henry The Fifth" width="720" /><p class="wp-caption-text">The Life Of King Henry The Fifth</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Coriolanus" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Coriolanus.png" alt="The Tragedy Of Coriolanus" width="720" /><p class="wp-caption-text">The Tragedy Of Coriolanus</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="Twelfth Night Or What You Will" src="http://www.jesse-anderson.com/currentstories/Twelfth Night Or What You Will.png" alt="Twelfth Night Or What You Will" width="720" /><p class="wp-caption-text">Twelfth Night Or What You Will</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="King Richard The Second" src="http://www.jesse-anderson.com/currentstories/King Richard The Second.png" alt="King Richard The Second" width="720" /><p class="wp-caption-text">King Richard The Second</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Life Of Timon Of Athens" src="http://www.jesse-anderson.com/currentstories/The Life Of Timon Of Athens.png" alt="The Life Of Timon Of Athens" width="720" /><p class="wp-caption-text">The Life Of Timon Of Athens</p></div>
<div class="wp-caption aligncenter" style="width: 730px"><img class="colorbox-138"  title="The Tragedy Of Hamlet Prince Of Denmark" src="http://www.jesse-anderson.com/currentstories/The Tragedy Of Hamlet Prince Of Denmark.png" alt="The Tragedy Of Hamlet Prince Of Denmark" width="720" /><p class="wp-caption-text">The Tragedy Of Hamlet Prince Of Denmark</p></div>
<p><strong>Update</strong>: I was running this on a free micro instance (600 MB RAM) from Amazon. Alas, the monkeys needed more RAM than the free micro instance had, and the processes hit out-of-memory errors. I have moved the Hadoop server to my home computer, which is much faster and has more memory.</p>
<p><strong>Update 2</strong>: I updated the Hadoop configuration to have less idle CPU time. This should significantly increase the monkey power and help find more character groups.</p>
<p><strong>Update 4</strong>: I made a small change to how memory is allocated for the random character groups. It should help speed things up again.</p>
<p><script type="text/javascript">// <![CDATA[
      jQuery.noConflict();
      // Run once the DOM is ready; $ is a local alias for jQuery inside this callback
      jQuery(document).ready(function($){
        // Load the current totals into the #result element
        $('#result').load('http://www.jesse-anderson.com/currentstories/totals.xml');
      });
// ]]&gt;</script></p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/08/a-few-more-million-amazonian-monkeys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pitching Agile</title>
		<link>http://www.jesse-anderson.com/2011/07/pitching-agile/</link>
		<comments>http://www.jesse-anderson.com/2011/07/pitching-agile/#comments</comments>
		<pubDate>Mon, 25 Jul 2011 01:43:29 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=134</guid>
		<description><![CDATA[Once you have decided to implement the Agile Software Development methodology at your company, there is some groundwork you should do beforehand.  You need to get as many people as possible to buy in to or support the move to Agile.  This presentation outlines how to formulate an argument for Agile tailored to each person’s department or position. [...]]]></description>
			<content:encoded><![CDATA[<p>Once you have decided to implement the Agile Software Development methodology at your company, there is some groundwork you should do beforehand.  You need to get as many people as possible to buy in to or support the move to Agile.  This presentation outlines how to formulate an argument for Agile tailored to each person’s department or position.</p>
<p>Part 1<br />
<iframe src="http://www.youtube.com/embed/-Jubzd_duVQ?hl=en&amp;fs=1" frameborder="0" width="425" height="349"></iframe></p>
<p>Part 2<br />
<iframe src="http://www.youtube.com/embed/NcJRPwCeBR4?hl=en&amp;fs=1" frameborder="0" width="425" height="349"></iframe></p>
<p>Part 3<br />
<iframe src="http://www.youtube.com/embed/AJQi_yQgWj0?hl=en&amp;fs=1" frameborder="0" width="425" height="349"></iframe></p>
<div id="__ss_8678962" style="width: 425px;">
<p><strong style="display: block; margin: 12px 0 4px;"><a title="Pitching agile" href="http://www.slideshare.net/JesseAnderson/pitching-agile" target="_blank">Pitching agile</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8678962" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="425" height="355"></iframe></p>
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/JesseAnderson" target="_blank">Jesse Anderson</a></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/07/pitching-agile/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Post Agile Checklist</title>
		<link>http://www.jesse-anderson.com/2011/07/post-agile-checklist/</link>
		<comments>http://www.jesse-anderson.com/2011/07/post-agile-checklist/#comments</comments>
		<pubDate>Mon, 25 Jul 2011 01:07:58 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=131</guid>
		<description><![CDATA[You have put a lot of time and effort into implementing the Agile Software Development methodology in your team or company, but are you using Agile to its fullest extent?  In the following presentation, I go over the practices that teams often forget to use and, for teams that are already using them, [...]]]></description>
			<content:encoded><![CDATA[<p>You have put a lot of time and effort into implementing the Agile Software Development methodology in your team or company, but are you using Agile to its fullest extent?  In the following presentation, I go over the practices that teams often forget to use and, for teams that are already using them, the methods to improve them.</p>
<p>Part 1<br />
<iframe src="http://www.youtube.com/embed/hitnPE2jawQ?hl=en&amp;fs=1" frameborder="0" width="425" height="349"></iframe></p>
<p>Part 2<br />
<iframe src="http://www.youtube.com/embed/MpIBMEStuEo?hl=en&amp;fs=1" frameborder="0" width="425" height="349"></iframe></p>
<div style="width: 425px;"><strong style="display: block; margin: 12px 0 4px;"><a title="Post agile checklist" href="http://www.slideshare.net/JesseAnderson/post-agile-checklist" target="_blank">Post agile checklist</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/8678948" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="425" height="355"></iframe></div>
<div id="__ss_8678948" style="width: 425px;">
<div style="padding: 5px 0 12px;">View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/JesseAnderson" target="_blank">Jesse Anderson</a></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/07/post-agile-checklist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Oracle Database 9i, 10g, and 11g Programming Techniques and Solutions Review</title>
		<link>http://www.jesse-anderson.com/2011/07/oracle-database-9i-10g-and-11g-programming-techniques-and-solutions-review/</link>
		<comments>http://www.jesse-anderson.com/2011/07/oracle-database-9i-10g-and-11g-programming-techniques-and-solutions-review/#comments</comments>
		<pubDate>Mon, 04 Jul 2011 17:06:45 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[10g]]></category>
		<category><![CDATA[and 11g Programming Techniques and Solutions]]></category>
		<category><![CDATA[book review]]></category>
		<category><![CDATA[Oracle Database 9i]]></category>
		<category><![CDATA[tom kyte]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=124</guid>
		<description><![CDATA[This is a review of Expert Oracle Database Architecture: Oracle Database 9i, 10g, and 11g Programming Techniques and Solutions, Second Edition by Tom Kyte.  Overall, I highly recommend this book to the target audience.  The target audience is not a new programmer or someone who wants to learn SQL.  It is someone who wants to [...]]]></description>
			<content:encoded><![CDATA[<p>This is a review of <em>Expert Oracle Database Architecture: Oracle Database 9i, 10g, and 11g Programming Techniques and Solutions, Second Edition</em> by Tom Kyte.  Overall, I highly recommend this book to the target audience.  The target audience is not a new programmer or someone who wants to learn SQL.  It is someone who wants to learn the deep inner workings of Oracle and get copious information on the topics.  If this sounds like what you are looking for, this book is exactly what you need and Tom will guide you through it.</p>
<p>I enjoyed Tom&#8217;s writing style and his many examples.  He often gives the terminal or command prompt output while showing how to run something or why something will not work.</p>
<p>His early treatise on Database Engineers changed my mind.  He strongly believes that a team should have a Database Engineer.  My previous opinion was that software could abstract the database away and deal with it.  Tom shows through concrete examples that, even when using Hibernate, a database cannot be completely abstracted away.</p>
<p>The level of detail in the book&#8217;s chapters is great.  He seems to give a true insider&#8217;s (he is an Oracle employee) view of the Oracle Database.  It seems like he spent a lot of time digging through code or talking with the developers about the exact behavior of a feature.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/07/oracle-database-9i-10g-and-11g-programming-techniques-and-solutions-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RIM Employee&#8217;s E-Mail To CEO</title>
		<link>http://www.jesse-anderson.com/2011/06/rim-employees-e-mail-to-ceo/</link>
		<comments>http://www.jesse-anderson.com/2011/06/rim-employees-e-mail-to-ceo/#comments</comments>
		<pubDate>Thu, 30 Jun 2011 19:24:46 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=122</guid>
		<description><![CDATA[I haven&#8217;t read a more poignant, thoughtful, and truthful E-mail to a CEO than this one.  RIM has been on a downward spiral for years now and this Research in Motion employee really nails why it is happening to RIM.  Although it addresses the problems at RIM, the E-mail could equally apply to a lot [...]]]></description>
			<content:encoded><![CDATA[<p>I haven&#8217;t read a more poignant, thoughtful, and truthful E-mail to a CEO than <a href="http://www.mobilecrunch.com/2011/06/30/rim-employee-to-ceos-i-have-lost-confidence/">this one</a>.  RIM has been on a downward spiral for years now and this Research in Motion employee really nails why it is happening to RIM.  Although it addresses the problems at RIM, the E-mail could equally apply to a lot of technology companies and maybe even yours.</p>
<p>Update: <a href="http://www.bgr.com/2011/06/30/open-letter-to-blackberry-bosses-senior-rim-exec-tells-all-as-company-crumbles-around-him/">RIM responded</a> with a swing and a miss, and more employees <a href="http://www.bgr.com/2011/07/01/more-letters-to-rim-employees-rally-alongside-anonymous-exec/">chimed in</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/06/rim-employees-e-mail-to-ceo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Microsoft Kinect SDK</title>
		<link>http://www.jesse-anderson.com/2011/06/microsoft-kinect-sdk/</link>
		<comments>http://www.jesse-anderson.com/2011/06/microsoft-kinect-sdk/#comments</comments>
		<pubDate>Fri, 24 Jun 2011 03:55:03 +0000</pubDate>
		<dc:creator>Jesse</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[kinect]]></category>
		<category><![CDATA[kinect sdk]]></category>

		<guid isPermaLink="false">http://www.jesse-anderson.com/?p=118</guid>
		<description><![CDATA[Microsoft has released an SDK for the Kinect.  If you were waiting to check out programming with the Kinect, now is the time.  There is good documentation and better yet, a video quickstart.  It was much easier than trudging through countless forums to figure things out. &#160;]]></description>
			<content:encoded><![CDATA[<p>Microsoft has released an <a href="http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/default.aspx">SDK for the Kinect</a>.  If you were waiting to check out programming with the Kinect, now is the time.  There is <a href="http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/guides.aspx">good documentation</a> and better yet, a <a href="http://channel9.msdn.com/series/KinectSDKQuickstarts/">video quickstart</a>.  It was much easier than trudging through countless forums to figure things out.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jesse-anderson.com/2011/06/microsoft-kinect-sdk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

