Hadoop Cheat Sheet

Hadoop has a vast and vibrant developer community. Following the lead of Hadoop’s name, the projects in the Hadoop ecosystem all have names that don’t correlate to their function, which makes it hard to figure out what each piece does or is used for. This cheat sheet helps you keep track of them. The projects are grouped by general function.

Distributed Systems

| Name | What It Is | What It Does | How It Helps |
|---|---|---|---|
| HDFS | A Distributed Filesystem for Hadoop | Acts as the filesystem or storage for Hadoop. | Improves the data input performance of MapReduce jobs with data locality. Creates a replicated, scalable file system. |
| Cassandra | A NoSQL database | A highly scalable database. | Allows you to scale a database in linear fashion. Can handle huge databases without bogging down. |
| HBase | A NoSQL database | Uses HDFS to create a highly scalable database. | Allows high scalability. Allows you to do random reads and writes with HDFS. |
| ZooKeeper | A Distributed Synchronization Service | Provides synchronization of data amongst distributed nodes. | Allows a cluster to maintain consistent, distributed data across all of its nodes. |

Processing Data

| Name | What It Is | What It Does | How It Helps |
|---|---|---|---|
| MapReduce | Distributed Programming Model and Software Framework | Breaks up a job into multiple tasks and processes them simultaneously. | Framework abstracts the difficult pieces of distributed systems. Allows vast quantities of data to be processed simultaneously. |
| Spark | General Purpose Processing Framework | Breaks up a job into multiple tasks and processes them simultaneously. | Framework abstracts the difficult pieces of distributed systems. Has more built-in functionality than MapReduce, like SQL. |
| Hive | Data Warehouse System | Allows use of query language to process data. | Helps SQL programmers harness MapReduce by creating SQL-like queries. |
| Pig | Data Analysis Platform | Processes data using a scripting language. | Helps programmers use a scripting language to harness MapReduce power. |
| Mahout | Machine Learning Library | Use a prewritten library to run machine learning algorithms on MapReduce. | Prevents you from having to rewrite machine learning algorithms to use MapReduce. Speeds up development time by using existing code. |
| Giraph | Graph Processing Library | Use a prewritten library to run graph algorithms on MapReduce. | Prevents you from having to rewrite graph algorithms to use MapReduce. Speeds up development time by using existing code. |
| MRUnit | Unit Test Framework for MapReduce | Run tests to verify your MapReduce job functions correctly. | Run programmatic tests to verify that a MapReduce program acts correctly. Has objects that allow you to mock up inputs and assertions to verify the results. |
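
To make "breaks up a job into multiple tasks and processes them simultaneously" concrete, here is a minimal sketch of the MapReduce programming model in plain Python. It does not use Hadoop or any Hadoop API. The function names (`map_task`, `shuffle`, `reduce_task`, `word_count`) are made up for illustration, and a thread pool on one machine stands in for the cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle phase: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce phase: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines, workers=4):
    # Each line is an independent map task, so the tasks can run concurrently.
    # A real cluster would spread them across machines; this is an assumption-
    # laden stand-in that only uses local threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_task, lines))
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "spark processes data"]))
```

The part a framework hides from you is everything around these three functions: splitting the input, scheduling the tasks near the data, moving the shuffled pairs between machines, and retrying failures.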

Getting Data In/Out

| Name | What It Is | What It Does | How It Helps |
|---|---|---|---|
| Avro | Data Serialization System | Gives an easy method to input and output data from MapReduce or Spark jobs. | Creates domain objects to store data. Makes data serialization and deserialization easier for MapReduce jobs. |
| Sqoop | Bulk Data Transfer | Moves data between Relational Databases and Hadoop. | Allows data dumps from the Relational Database to be placed in Hadoop for processing later. Moves data output from a MapReduce job to be placed back in a Relational Database. |
| Flume | Data Aggregator | Handles large amounts of log data in a scalable fashion. | Moves large amounts of log data into HDFS. Since Flume scales so well, it can handle a lot of incoming data. |
| Kafka | Distributed Publish/Subscribe | Handles very high throughput and low latency message passing in a scalable fashion. | Decouples systems to allow many subscribers of published data. |
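
The "decouples systems to allow many subscribers" idea can be sketched with a toy in-memory broker. This is not the Kafka client API. `ToyBroker` and its methods are hypothetical names, and the design (an append-only log per topic plus a read position per subscriber) is only a rough approximation of how publish/subscribe systems of this kind let each subscriber read independently.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory publish/subscribe broker (not the Kafka API)."""

    def __init__(self):
        self._log = defaultdict(list)  # topic -> append-only list of messages
        self._offsets = {}             # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        # The publisher only appends; it never needs to know who is reading.
        self._log[topic].append(message)

    def subscribe(self, topic, subscriber):
        # A new subscriber starts reading from the beginning of the topic.
        self._offsets.setdefault((topic, subscriber), 0)

    def poll(self, topic, subscriber):
        # Return everything this subscriber has not seen yet, then advance
        # only this subscriber's position.
        start = self._offsets[(topic, subscriber)]
        messages = self._log[topic][start:]
        self._offsets[(topic, subscriber)] = start + len(messages)
        return messages


if __name__ == "__main__":
    broker = ToyBroker()
    broker.subscribe("logs", "archiver")
    broker.subscribe("logs", "alerting")
    broker.publish("logs", "disk almost full")
    print(broker.poll("logs", "archiver"))
    print(broker.poll("logs", "alerting"))
```

Because each subscriber tracks its own position, adding another reader does not change the publisher or slow down the existing readers.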

Administration

| Name | What It Is | What It Does | How It Helps |
|---|---|---|---|
| Hue | Browser Based Interface for Hadoop | Allows users to interact with the Hadoop cluster over a web browser. | Makes it easier for users to interact with the Hadoop cluster. Granular permissions allow administrators to configure users’ |
| Oozie | Workflow Engine for Hadoop | Makes complex workflows in Hadoop easier to create. | Allows you to create a complex workflow that leverages other projects like Hive, Pig, and MapReduce. Built-in logic allows users to handle failures of steps gracefully. |
| Cloudera Manager | Browser Based Manager for Hadoop | Allows easy configuration of a Hadoop cluster. | Eases the burden of dealing with and monitoring a large Hadoop cluster. Helps install and configure the Hadoop software. |

The easiest way to install all of these programs is via CDH, the Cloudera Distribution for Hadoop. The free edition of Cloudera Manager makes it even easier to create your cluster.
