Big Data’s Required and Recommended Technical Skills

Blog Summary: (AI Summaries by Summarizes)
  • Hadoop requires technical skills to get started with big data.
  • For developers, required skills include intermediate to advanced Java knowledge and a general understanding of Linux.
  • Recommended skills for developers include knowledge of Scala and SQL, as well as a background in distributed systems.
  • For administrators, required skills include very good knowledge of Linux, especially from the command line.
  • Recommended skills for administrators include knowledge of security topics such as authentication with Kerberos and encryption.

A common question beginners ask about Hadoop is what technical skills are needed to get started. Answering it helps level set the skills you need before you embark on a big data journey.

For both developers and administrators, I divide the skills into those that are required and those that are nice to have or recommended.

Developer Skills

Required

The majority of Hadoop and the Hadoop ecosystem is written in Java. You should have an intermediate to advanced level of Java knowledge. You should understand concepts like generics, inheritance, and abstract classes.
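
To give a feel for that level of Java, here is a small, self-contained sketch that uses generics, an abstract class, and inheritance, the kinds of constructs you will see throughout Hadoop's APIs. The class names are made up for illustration and are not part of Hadoop.

    // Illustrative only: a tiny generic, abstract record-parser hierarchy.
    import java.util.ArrayList;
    import java.util.List;

    abstract class RecordParser<T> {
        // Subclasses decide how to turn one line of text into a value of type T.
        protected abstract T parse(String line);

        // Shared logic lives in the abstract base class.
        public List<T> parseAll(List<String> lines) {
            List<T> results = new ArrayList<>();
            for (String line : lines) {
                results.add(parse(line));
            }
            return results;
        }
    }

    class IntParser extends RecordParser<Integer> {
        @Override
        protected Integer parse(String line) {
            return Integer.parseInt(line.trim());
        }
    }

    public class ParserDemo {
        public static void main(String[] args) {
            RecordParser<Integer> parser = new IntParser();
            System.out.println(parser.parseAll(List.of("1", "2", "3"))); // prints [1, 2, 3]
        }
    }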

You should have a general understanding of Linux. You'll need to be relatively familiar with issuing commands from a Linux command line. Most day-to-day development work can be done in a GUI, but some Hadoop-specific interactions happen on the command line, such as working with the hdfs and yarn command-line tools.

Recommended

Hadoop and its ecosystem support other languages to varying extents. Since Scala is a JVM language, you can use it throughout the ecosystem, and it is often the language of choice for technologies like Apache Spark. Other languages will work with varying degrees of effectiveness and their own gotchas.

Another helpful language is SQL. Apache Hive uses a SQL-like language (HiveQL) to process data, and Spark has SQL support built in.
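
As a rough sketch of what that looks like in practice, the Java snippet below registers a dataset as a temporary view in Spark and queries it with SQL; the input file name is a placeholder. HiveQL statements read much the same way.

    // Minimal Spark SQL sketch in Java; people.json is a hypothetical input file.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("SparkSqlExample")
                .master("local[*]")   // run locally for the example
                .getOrCreate();

            // Load a JSON file into a DataFrame and expose it to SQL as a view.
            Dataset<Row> people = spark.read().json("people.json");
            people.createOrReplaceTempView("people");

            // Query the data with plain SQL.
            spark.sql("SELECT name, age FROM people WHERE age >= 18").show();

            spark.stop();
        }
    }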

Helpful, but not required, is a background in distributed systems. Big Data frameworks like Hadoop are distributed systems. These frameworks make distributed systems easier to work with, but they don't completely mask all of the complexity. At some point in your journey, you will need to learn distributed systems concepts to really master Big Data frameworks.

Administrator Skills

Required

Hadoop runs on Linux, and the majority of clusters run on RHEL/CentOS or Ubuntu. The scale of Big Data will stress your systems in ways you might not have seen before and expose weird problems that only appear at scale. To diagnose and fix these issues, you'll need to be very good with Linux, especially from the command line. Most of the computers in your Hadoop cluster will be sitting in a data center rack or in the cloud, so you'll be working with them remotely rather than from a desktop.

Hadoop vendors like Cloudera and Hortonworks are making cluster administration easier with web-based GUIs. These help with monitoring Hadoop clusters and detecting problems. Despite these tools, you'll still need to know how to troubleshoot a computer from the Linux command line.

Recommended

If you're planning on administering an enterprise cluster, you'll probably be dealing with security. This is everything from authentication with Kerberos to wire (in-transit) encryption to encryption at rest. It's the administrator's job to set all of this up and keep things secure. Having some knowledge of these topics will give you an advantage when job seeking.
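
Kerberos setup is mostly configuration rather than code, but as a flavor of what it touches, this sketch shows how a Java client might log in to a secured cluster with a keytab using Hadoop's UserGroupInformation class. The principal and keytab path are placeholders, not values from this post.

    // Sketch of keytab-based Kerberos login from a Java client.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Tell the Hadoop client libraries that the cluster uses Kerberos.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Log in with a service principal and its keytab file (placeholder values).
            UserGroupInformation.loginUserFromKeytab(
                "svc-etl@EXAMPLE.COM", "/etc/security/keytabs/svc-etl.keytab");

            System.out.println("Logged in as: "
                + UserGroupInformation.getCurrentUser().getUserName());
        }
    }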
