Big Data’s Required and Recommended Technical Skills

Jesse Anderson
June 29, 2016

Blog Summary:

Getting started with Hadoop and big data requires certain technical skills.
For developers, required skills include intermediate to advanced Java knowledge and a general understanding of Linux.
Recommended skills for developers include knowledge of Scala and SQL, as well as a background in distributed systems.
For administrators, required skills include very good knowledge of Linux, especially from the command line.
Recommended skills for administrators include knowledge of security topics such as authentication with Kerberos and encryption.

A common question beginners ask about Hadoop is what technical skills they need to get started. Answering it helps level set which skills you should have before you embark on a big data journey.

For developers and administrators, I divide the skills into those that are required and those that are nice to have or recommended.

Developer Skills

Required

The majority of Hadoop and the Hadoop ecosystem is written in Java. You should have an intermediate to advanced level of Java knowledge. You should understand things like generics, inheritance, and abstract classes.
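
As a rough gauge of that level, here is a small self-contained sketch that uses all three features together. It is plain Java and does not use Hadoop's API. The names RecordMapper and WordLengthMapper are made up for illustration, loosely in the spirit of a map step. If this code reads comfortably to you, you are probably at the level this section describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordMapperDemo {

    // Hypothetical abstract class with two generic type parameters:
    // IN is the input record type, OUT is the type produced per record.
    abstract static class RecordMapper<IN, OUT> {
        // Each subclass supplies the per-record logic.
        abstract OUT map(IN record);

        // Shared behavior that every subclass inherits.
        List<OUT> mapAll(List<IN> records) {
            List<OUT> results = new ArrayList<>();
            for (IN record : records) {
                results.add(map(record));
            }
            return results;
        }
    }

    // Concrete subclass: binds IN to String and OUT to Integer.
    static class WordLengthMapper extends RecordMapper<String, Integer> {
        @Override
        Integer map(String record) {
            return record.length();
        }
    }

    public static void main(String[] args) {
        RecordMapper<String, Integer> mapper = new WordLengthMapper();
        // Prints [6, 4, 5]
        System.out.println(mapper.mapAll(Arrays.asList("hadoop", "java", "linux")));
    }
}
```

The caller only depends on the abstract type, so a different subclass can be swapped in without changing the calling code. Frameworks in the Hadoop ecosystem lean on that same pattern.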

You should have a general understanding of Linux. You’ll need to be reasonably comfortable issuing commands from a Linux command line. Most work can be done in the GUI, but some Hadoop-specific interactions happen on the command line.

Recommended

A background in distributed systems is helpful, but not required. Big Data frameworks like Hadoop are distributed systems. These frameworks make it easier to work with distributed systems, but they don’t completely mask the complexity. At some point in your journey, you’ll need to learn these concepts to really master Big Data frameworks.

Administrator Skills

Required

Hadoop runs on Linux. The majority of clusters run on RHEL/CentOS or Ubuntu. Big Data workloads will stress your systems in ways you might not have seen before and expose weird problems you only see at scale. To diagnose and fix these issues, you’ll need to be very good with Linux, especially from the command line. Most of the computers in your Hadoop cluster will be sitting in a data center’s rack or in the cloud, so you’ll usually be working on them remotely.

Some Hadoop companies, like Cloudera and Hortonworks, are making cluster administration easier with web-based GUIs. These help with monitoring Hadoop clusters and detecting problems. Even with these tools, you’ll still need to know how to troubleshoot a computer from the Linux command line.

Recommended

If you’re planning on administering an enterprise cluster, you’ll probably be dealing with security. This covers everything from authentication with Kerberos to wire encryption to at-rest encryption. It’s the administrator’s job to set all of this up and keep things secure. Having some knowledge of these topics will give you an advantage when job seeking.
