Q and A: Ingesting into Hadoop

Blog Summary: (AI Summaries by Summarizes)
  • Apache Sqoop is a tool that can move data from an RDBMS into HDFS or HBase, and vice versa.
  • There are several ways to do simple file transfers into HDFS, including using Apache Oozie, Hue's REST interface, Hadoop's WebHDFS REST or FUSE interfaces, or writing a custom program that implements FTP, HTTP, etc. and puts the files into HDFS with the HDFS API.
  • Getting data into HBase is a more difficult problem that requires writing custom code and depends on the use case.
  • Qualified Data Engineers are important in helping the team understand the use case and how the data pipeline should be created.

Today’s blog post comes from a question asked by a subscriber on my mailing list, Guruprasad B.R.:

What are the best ways to ingest data into Big Data (HBase/HDFS) from different sources like FTP, web, email, RDBMS, etc.?

There are a few parts to this question, and they’re all technical:

  • How do I get data into HDFS?
  • How do I get data into HBase?
  • How does the source of data dictate how it’s ingested?

Sqoop

I’ll start off with the easy one. How do you get data from an RDBMS into HDFS and HBase? You’d use Apache Sqoop. It can take data from an RDBMS and put it into either HDFS or HBase.

It can go the other way around too. Sqoop can move data from HDFS or HBase and put it back into the RDBMS.
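
To make that concrete, here’s a minimal sketch of driving a Sqoop 1.x import from Java, first landing a table in HDFS and then loading the same table into HBase. The connection string, credentials, table name, paths, and column family are placeholders for illustration, and it assumes Sqoop’s org.apache.sqoop.Sqoop.runTool entry point and jars are on the classpath.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportSketch {
        public static void main(String[] args) {
            // Hypothetical connection details and paths; replace with your own.
            String[] importToHdfs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",
                "--username", "etl",
                "--password-file", "/user/etl/.dbpass",
                "--table", "orders",
                "--target-dir", "/data/raw/orders"
            };

            // The same source table, landing directly in an HBase table instead of HDFS files.
            String[] importToHbase = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",
                "--username", "etl",
                "--password-file", "/user/etl/.dbpass",
                "--table", "orders",
                "--hbase-table", "orders",
                "--column-family", "d",
                "--hbase-row-key", "order_id"
            };

            int rc = Sqoop.runTool(importToHdfs);   // 0 means success
            if (rc == 0) {
                rc = Sqoop.runTool(importToHbase);
            }
            System.exit(rc);
        }
    }

Going back the other way is the same idea with Sqoop’s export tool and an --export-dir pointing at the HDFS files to push into the RDBMS.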

Simple File Transfer

There are a few ways to do simple file transfers into HDFS. You could use:

  • Apache Oozie to move files as part of a workflow
  • Hue’s REST interface
  • Hadoop’s WebHDFS REST or FUSE interfaces
  • A custom program that implements FTP, HTTP, etc. and puts the files into HDFS with the HDFS API (a sketch follows below)

The right tool for the job depends on your use case.
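
To give a feel for the last option, here’s a minimal sketch of a custom program that pulls a file over HTTP and streams it into HDFS with the FileSystem API. The source URL and HDFS path are made up for illustration; the same pattern works for an FTP client or anything else that hands you an InputStream.

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HttpToHdfs {
        public static void main(String[] args) throws Exception {
            // Hypothetical source and destination; swap in your own.
            URL source = new URL("http://example.com/exports/daily.csv");
            Path dest = new Path("/data/incoming/daily.csv");

            // Picks up fs.defaultFS from the core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            try (InputStream in = source.openStream();
                 FSDataOutputStream out = fs.create(dest, true)) {
                // Stream the HTTP response straight into an HDFS file.
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }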

Getting Data In

The far more difficult problem is actually using the data or getting it into HBase. For that, you’ll need to write custom code. The suggestions above only get you to the point of using HDFS as a backup; the real value is in working with the data.
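
To show what that custom code can look like, here’s a minimal sketch that takes a parsed record and writes it into HBase with a Put, assuming the hbase-client Connection/Table API. The table name, column family, and row key scheme are made-up choices; in practice they come straight out of your use case.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseIngestSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical record: orderId, customer, amount.
            String[] fields = "1001,alice,42.50".split(",");

            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("orders"))) {

                // The row key and column layout are design decisions; this is just one choice.
                Put put = new Put(Bytes.toBytes(fields[0]));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("customer"), Bytes.toBytes(fields[1]));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes(fields[2]));

                table.put(put);
            }
        }
    }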

The programs you need to write and the right tools for the job depend on your use case. This is where qualified Data Engineers are important. They’ll help the team understand the use case and how the data pipeline should be created.
