Hadoop Cheat Sheet

Jesse Anderson
March 24, 2016
Blog, Business, Data Engineering
No Comments

Blog Summary: (AI Summaries by Summarizes)

Hadoop has a large developer community with projects that have names that don't correlate to their function.
This cheat sheet helps keep track of the different projects in the Hadoop ecosystem and their respective functions.
The projects are broken up into three categories: Distributed Systems, Processing Data, and Getting Data In/Out.
Distributed Systems:
Processing Data:

Hadoop has a vast and vibrant developer community. Following the lead of Hadoop’s name, the projects in the Hadoop ecosystem all have names that don’t correlate to their function. This makes it really hard to figure out what each piece does or is used for. This is a cheat sheet to help you keep track of things. It is broken up into their respective general functions.

Distributed Systems
Name	What It Is	What It Does	How It Helps
HDFS	A Distributed Filesystem for Hadoop	Acts as the filesystem or storage for Hadoop.	Improves the data input performance of MapReduce jobs with data locality. Creates a replicated, scalable file system.
Cassandra	A NoSQL database	A highly scalable database.	Allows you to scale a database in linear fashion. Can handle huge databases without bogging down.
HBase	A NoSQL database	Uses HDFS to create a highly scalable database.	Allows high scalability. Allows you to do random reads and writes with HDFS.
Zookeeper	A Distributed Synchronization Service	Provides synchronization of data amongst distributed nodes.	Allows a cluster to maintain consistent, distributed data across all nodes in a cluster.
Processing Data
Name	What It Is	What It Does	How It Helps
MapReduce	Distributed Programming Model and Software Framework	Breaks up a job into multiple tasks and processes them simultaneously.	Framework abstracts the difficult pieces of distributed systems. Allows vast quantities of data to be processed simultaneously.
Spark	General Purpose Processing Framework	Breaks up a job into multiple tasks and processes them simultaneously.	Framework abstracts the difficult pieces of distributed systems. Has more built-in functionality than MapReduce, like SQL.
Hive	Data Warehouse System	Allows use of query language to process data.	Helps SQL programmers harness MapReduce by creating SQL-like queries.
Pig	Data Analysis Platform	Processes data using a scripting language	Helps programmers use a scripting language to harness MapReduce power.
Mahout	Machine Learning Library	Use a prewritten library to run machine learning algorithms on MapReduce.	Prevents you from having to rewrite machine learning algorithms to use MapReduce. Speeds up development time by using existing code.
Giraph	Graph Processing Library	Use a prewritten library to run graph algorithms on MapReduce.	Prevents you from having to rewrite graph algorithms to use MapReduce. Speeds up development time by using existing code.
MRUnit	Unit Test Framework for MapReduce	Run tests to verify your MapReduce job functions correctly.	Run programmatic tests to verify that a MapReduce program acts correctly. Has objects that allow you to mock up inputs and assertions to verify the results.
Getting Data In/Out
Name	What It Is	What It Does	How It Helps
Avro	Data Serialization System	Gives an easy method to input and output data from MapReduce or Spark jobs.	Creates domain objects to store data. Makes easier data serialization and deserialization for MapReduce jobs.
Sqoop	Bulk Data Transfer	Moves data between Relational Databases and Hadoop.	Allows data dumps from the Relational Database to be placed in Hadoop for processing later. Moves data output from a MapReduce job to be placed back in a Relational Database.
Flume	Data Aggregator	Handles large amounts of log data in a scalable fashion.	Moves large amounts of log data into HDFS. Since Flume scales so well, it can handle a lot of incoming data.
Kafka	Distributed Publish/Subscribe	Handles very high throughput and low latency message passing in a scalable fashion.	Decouples systems to allow many subscribers of published data.
Administration
Name	What It Is	What It Does	How It Helps
Hue	Browser Based Interface for Hadoop	Allows users to interact with the Hadoop cluster over a web browser.	Makes it easier for users to interact with the Hadoop cluster. Granular permissions allow administrators to configure users’
Oozie	Workflow Engine for Hadoop	Makes creating complex workflows in Hadoop easier to create.	Allows you to create a complex workflow that leverages other projects like Hive, Pig, and MapReduce. Built-in logic allows users to handle failures of steps gracefully.
Cloudera Manager	Browser Based Manager for Hadoop	Allows easy configuration and configuration of Hadoop cluster.	Eases the burden of dealing with and monitoring a large Hadoop Cluster. Helps install and configuration the Hadoop software.

The easiest way to install all of these programs is via CDH or Cloudera Distribution for Hadoop. The free edition of Cloudera Manager makes it even easier to create your cluster.

Hadoop Cheat Sheet

Distributed Systems

Processing Data

Getting Data In/Out

Administration

Related Posts

Gemini Batch API for Java

Unapologetically Technical Episode 20 – Shane Murray

Unapologetically Technical Episode 19 – Jacopo Tagliabue

Unapologetically Technical Episode 18 – Adrian Woodhead

Unapologetically Technical Episode 17 – Semih Salihoglu

Unapologetically Technical Episode 16 – David Jayatillake

Unapologetically Technical Episode 15 – Frances Perry

Unapologetically Technical Episode 14 – Cliff Crosland

Data Teams Survey 2020-2024 Analysis

Join the Newsletter