Hadoop The Definitive Guide 3rd Edition Review

My original review of Hadoop The Definitive Guide (TDG) was for the 2nd edition. The 3rd edition was released recently, and I reread the book in its entirety.

The new edition covers the latest changes to the 1.x (0.20) and 2.x (0.23) APIs. The book’s examples now use the 2.x API throughout. Those still using the 1.x API won’t be left in the lurch, because it is still discussed in the book.

There is extensive discussion of the runtime changes that come with Hadoop 2.0. The biggest of these is the new distributed resource management system named YARN (Yet Another Resource Negotiator). While YARN is not yet recommended for production clusters, it is the future and very important to keep an eye on. TDG shows the new programming model for acquiring resources for a job. It also shows how YARN will make Hadoop more extensible for running other types of jobs.

Another addition is coverage of the new features of HDFS. The first is high availability (HA). HA addresses the single point of failure that comes with running a single NameNode daemon. With the new HA feature, an active NameNode and a standby NameNode run side by side. TDG shows how this new failover mechanism works and the necessary settings. The second feature is federation. This allows a filesystem to have multiple NameNodes, each managing a different part of the filesystem. Once again, TDG tells you how to set these things up and how federation improves scalability.
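To make the federation idea concrete, here is a minimal sketch of routing a path to the NameNode that owns that part of the filesystem. It is a toy model and not Hadoop's code: the class `FederationSketch`, its methods, and the host names are all hypothetical, and the longest-prefix matching is my assumption about how a client-side mount table might behave.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of federation: each NameNode owns one subtree of the filesystem.
// All names here (class, methods, hosts) are hypothetical illustrations.
final class FederationSketch {
    // Mount point (no trailing slash) -> NameNode address.
    private final Map<String, String> mounts = new LinkedHashMap<>();

    void mount(String prefix, String nameNode) {
        mounts.put(prefix, nameNode);
    }

    // Returns the NameNode whose mount point is the longest prefix of the path.
    String resolve(String path) {
        String bestPrefix = "";
        String bestNameNode = null;
        for (Map.Entry<String, String> entry : mounts.entrySet()) {
            String prefix = entry.getKey();
            // Match whole path components only, so /user does not match /username.
            boolean matches = path.equals(prefix) || path.startsWith(prefix + "/");
            if (matches && prefix.length() > bestPrefix.length()) {
                bestPrefix = prefix;
                bestNameNode = entry.getValue();
            }
        }
        if (bestNameNode == null) {
            throw new IllegalArgumentException("No NameNode mounted for " + path);
        }
        return bestNameNode;
    }

    public static void main(String[] args) {
        FederationSketch fs = new FederationSketch();
        fs.mount("/user", "nn1.example.com:8020");
        fs.mount("/data", "nn2.example.com:8020");
        System.out.println(fs.resolve("/user/alice/report.txt"));
        System.out.println(fs.resolve("/data/logs/2013"));
    }
}
```

Because each NameNode only tracks the metadata for its own subtree, adding NameNodes spreads the namespace load, which is where the scalability gain comes from.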

The word “Definitive” in the book’s title is well founded. You can find virtually anything you want to know about Hadoop in this book. If you need to find that elusive parameter for changing spill size, TDG has it. A quick search will give you the parameter name, default value, and what it changes. If you need to know how an HDFS client reads a file and makes the relevant remote procedure calls (RPC), TDG has it. With a distributed system, these sorts of calls aren’t as straightforward to track.
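As an illustration of the spill-size example, the arithmetic can be sketched in a few lines. The review does not name the parameters; I am assuming the Hadoop 1.x properties `io.sort.mb` (default 100) and `io.sort.spill.percent` (default 0.80). The class name `SpillThreshold` is hypothetical, and the calculation is simplified: it ignores the portion of the buffer reserved for record metadata.

```java
// Simplified sketch of when the map-side sort buffer starts spilling to disk.
// Assumed (not from the review): io.sort.mb = 100, io.sort.spill.percent = 0.80.
final class SpillThreshold {
    static final int DEFAULT_IO_SORT_MB = 100;
    static final double DEFAULT_SPILL_PERCENT = 0.80;

    // Returns the buffer fill level, in bytes, at which a spill would begin.
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        if (ioSortMb <= 0) {
            throw new IllegalArgumentException("ioSortMb must be positive");
        }
        if (spillPercent <= 0.0 || spillPercent > 1.0) {
            throw new IllegalArgumentException("spillPercent must be in (0, 1]");
        }
        long bufferBytes = (long) ioSortMb * 1024L * 1024L;
        return Math.round(bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        long threshold = spillThresholdBytes(DEFAULT_IO_SORT_MB, DEFAULT_SPILL_PERCENT);
        System.out.println("Spill begins at roughly "
                + threshold / (1024L * 1024L) + " MB of buffered map output");
    }
}
```

Raising the buffer size or the percentage delays the first spill, at the cost of more memory per map task.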

A cover-to-cover read may not be for everyone, but TDG serves as a great reference. I recommend getting the PDF version because it makes searching much quicker. The smaller chapters on Hadoop’s ecosystem projects are very handy. If you don’t use Hive or Pig on a daily basis, TDG can refresh your memory.

I highly recommend TDG; it is an essential part of your Hadoop bookshelf. As an Instructor and Curriculum Developer for Cloudera, I refer to the book extensively.
