Q and A: Ingesting into Hadoop

Jesse Anderson
January 25, 2017
Blog, Business, Data Engineering, Data Engineering is hard

Blog Summary:

Apache Sqoop is a tool that can move data from an RDBMS and put it into HDFS or HBase, and vice versa.
There are several ways to do simple file transfers into HDFS, including using Apache Oozie, Hue's REST interface, Hadoop's WebHDFS REST or FUSE interfaces, or writing a custom program that implements FTP, HTTP, etc. and puts the files into HDFS with the HDFS API.
Getting data into HBase is a more difficult problem that requires writing custom code and depends on the use case.
Qualified Data Engineers are important in helping the team understand the use case and how the data pipeline should be created.

Today’s blog post comes from a question from a subscriber on my mailing list. The question comes from Guruprasad B.R.:

What are the best ways to ingest data into Big Data (HBase/HDFS) from different sources like FTP, Web, Email, RDBMS, etc.?

There are a couple of parts to this question and they’re technical:

How do I get data into HDFS?
How do I get data into HBase?
How does the source of data dictate how it’s ingested?

Sqoop

I’ll start off with the easy one. How do you get data from an RDBMS into HDFS and HBase? You’d use Apache Sqoop. It can take data from an RDBMS and put it into HDFS or HBase.

It can go the other way around too. Sqoop can move data from HDFS or HBase and put it back into the RDBMS.
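As a rough illustration of both directions, here is a minimal sketch that builds Sqoop command lines in Python. The flags shown (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--hbase-table`, and so on) are standard Sqoop 1 options, but the JDBC URL, user name, password file, table names, and paths are all made-up placeholders, and the helper functions are hypothetical names of my own.

```python
import shlex


def sqoop_import_command(jdbc_url, username, password_file, table,
                         target_dir, mappers=4):
    """Build the argument list for importing one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the files
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]


def sqoop_hbase_import_command(jdbc_url, username, password_file, table,
                               hbase_table, column_family, row_key_column):
    """Build the argument list for importing one RDBMS table into HBase."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--hbase-table", hbase_table,
        "--column-family", column_family,
        "--hbase-row-key", row_key_column,  # source column used as the row key
    ]


def sqoop_export_command(jdbc_url, username, password_file, table, export_dir):
    """Build the argument list for exporting HDFS files into an RDBMS table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,        # HDFS directory to read from
    ]


if __name__ == "__main__":
    # Placeholder connection details; replace with your own.
    command = sqoop_import_command(
        "jdbc:mysql://db.example.com/sales", "etl", "/user/etl/.db-password",
        "orders", "/data/raw/orders",
    )
    print(shlex.join(command))
```

On a machine with Sqoop installed, you could pass one of these lists to `subprocess.run(command, check=True)` to actually run the job.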

Simple File Transfer

There are a few ways to do simple file transfers into HDFS. You could use:

Apache Oozie to move files as part of a workflow
Hue’s REST interface
Hadoop’s WebHDFS REST or FUSE interfaces
A custom program that implements FTP, HTTP, etc. and puts the files into HDFS with the HDFS API

The right tool for the job depends on your use case.
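To make the WebHDFS option a little more concrete, here is a minimal sketch of a file upload using only Python's standard library. It follows WebHDFS's two-step CREATE: the first PUT goes to the NameNode, which replies with a redirect, and the second PUT sends the bytes to the DataNode named in that redirect. The host, port, user, and paths are assumptions you would replace (the NameNode's HTTP port is typically 50070 on Hadoop 2 and 9870 on Hadoop 3), the function names are hypothetical, and the sketch assumes an unsecured cluster and a file small enough to read into memory.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_create_path(hdfs_path, user, overwrite=False):
    """Return the request path for a WebHDFS CREATE call on hdfs_path."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def upload_file(namenode_host, namenode_port, local_path, hdfs_path, user):
    """Upload a small local file into HDFS through WebHDFS."""
    # Step 1: ask the NameNode where to write. It answers with a redirect
    # and does not accept file data itself.
    conn = http.client.HTTPConnection(namenode_host, namenode_port)
    conn.request("PUT", webhdfs_create_path(hdfs_path, user))
    response = conn.getresponse()
    response.read()
    location = response.getheader("Location")
    conn.close()
    if response.status != 307 or location is None:
        raise RuntimeError(f"expected a 307 redirect, got HTTP {response.status}")

    # Step 2: send the file contents to the DataNode URL from the redirect.
    target = urlsplit(location)
    with open(local_path, "rb") as source:
        data = source.read()
    data_conn = http.client.HTTPConnection(target.hostname, target.port)
    data_conn.request("PUT", target.path + "?" + target.query, body=data)
    result = data_conn.getresponse()
    result.read()
    data_conn.close()
    if result.status != 201:
        raise RuntimeError(f"upload failed with HTTP {result.status}")


if __name__ == "__main__":
    # Only builds and prints the request path; no cluster is contacted.
    print(webhdfs_create_path("/data/raw/report.csv", "etl"))
```

A custom FTP or HTTP ingester could call something like `upload_file` after it downloads each file, though for large files or secured clusters you would more likely reach for the HDFS API or an existing client library.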

Getting Data In

The far more difficult problem is how to use the data or get it into HBase. For that, you’ll need to write custom code. The suggestions above only get you to the point where you’re using HDFS as a backup. The real value is working with the data.
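What that custom code looks like varies a lot, but a large part of it is usually deciding how each record maps onto an HBase row key and columns. The sketch below shows one possible mapping and nothing more: the row key layout (source name plus a zero-padded timestamp), the column family `d`, and the sample record are all assumptions for illustration, and a real design would be driven by how the data is read.

```python
from datetime import datetime, timezone


def build_row_key(source, event_time):
    """Build a row key of the form b'<source>|<epoch millis, zero padded>'.

    Zero padding keeps keys for one source sorted by time, because HBase
    orders rows by the raw bytes of the key.
    """
    millis = int(event_time.timestamp() * 1000)
    return f"{source}|{millis:013d}".encode("utf-8")


def to_hbase_cells(record, column_family="d"):
    """Map a flat dict to {b'family:qualifier': b'value'} cells."""
    return {
        f"{column_family}:{name}".encode("utf-8"): str(value).encode("utf-8")
        for name, value in record.items()
    }


if __name__ == "__main__":
    # A made-up record, as if parsed from a file fetched over FTP.
    received = datetime(2017, 1, 25, 12, 0, tzinfo=timezone.utc)
    record = {"filename": "orders.csv", "bytes": 2048}
    print(build_row_key("ftp-server-1", received))
    print(to_hbase_cells(record))
```

The resulting key and cell dictionary are in the shape that a Python HBase client such as happybase expects for `table.put(row_key, cells)`; the Java HBase API takes the same pieces through a `Put` object.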

The programs you need to write and the right tools for the job depend on your use case. This is where qualified Data Engineers are important. They’ll help the team understand the use case and how the data pipeline should be created.
