Doing Big Data ASAP

Blog Summary: (AI Summaries by Summarizes)
  • Apache Hive is a Big Data technology that uses a SQL-like language for its queries, reducing programming overhead to process data.
  • To run Hive, you can turn to the cloud and use services like Amazon Web Services or Google Dataproc to spin up a cluster.
  • Uploading data to the cloud provider is necessary, and then you can write and run your SQL query.
  • Hive has two ways to add new functionality with user-defined functions and transforms.
  • Creating a data pipeline requires more effort and understanding than just spinning up a cluster with Hive.

I had an interesting question at TDWI Boston that I haven’t been asked before:

If you absolutely had to do something with Hadoop and Big Data tomorrow, how would you do it?

I’ll answer this from a technical and then a management point of view.

Technical

I call Apache Hive the Big Data technology you already know, because most people already know SQL. Hive uses a SQL-like language (HiveQL) for its queries, which reduces the programming overhead of processing the data.

That works around the programming side, but it doesn't give you a cluster to run on. I'd turn to the cloud. Both Amazon Web Services (with EMR) and Google Cloud (with Dataproc) support spinning up a cluster with Hive.
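As a sketch, spinning up a Dataproc cluster and running a Hive query from the command line might look like the following. The cluster name, region, and query are placeholders, and flags change between gcloud releases, so check the current documentation before relying on them:

```shell
# Create a small Dataproc cluster (Hive is included in Dataproc by default)
gcloud dataproc clusters create quickstart-cluster --region=us-central1

# Submit a Hive query directly to the cluster
gcloud dataproc jobs submit hive \
  --cluster=quickstart-cluster \
  --region=us-central1 \
  --execute="SHOW DATABASES;"

# Tear the cluster down when finished so you stop paying for it
gcloud dataproc clusters delete quickstart-cluster --region=us-central1
```

EMR has an equivalent flow through the `aws emr` commands or the AWS console.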

Depending on which cloud provider you choose, you'll need to upload the data to its storage service. From there, it's a question of writing and running your SQL query. You'd be limited to whatever SQL can express, but you'd at least be processing data.
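To make that concrete, here is a minimal HiveQL sketch that points an external table at files already uploaded to cloud storage and runs a query over them. The table, bucket, and column names are all made up for illustration:

```sql
-- Define a table over tab-delimited files already uploaded to storage
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://your-bucket/page_views/';

-- Run an ordinary aggregate query over the data
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

Because the table is external, Hive reads the files in place; nothing is copied into Hive-managed storage.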

Hive has two ways to add new functionality: user-defined functions (UDFs) and transforms. I cover both of these programming interfaces in my Professional Data Engineering course.
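As an example of the transform interface, Hive's TRANSFORM clause streams rows to an external script as tab-separated lines on stdin and reads the rewritten rows back from stdout. A minimal Python transform script might look like this (the column logic is a made-up illustration):

```python
import sys


def normalize(line):
    """Uppercase the first tab-separated field; pass the rest through unchanged."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = fields[0].upper()
    return "\t".join(fields)


if __name__ == "__main__":
    # Hive streams one row per line; emit one transformed row per line
    for line in sys.stdin:
        print(normalize(line))
```

You would register the script with ADD FILE and invoke it with something like `SELECT TRANSFORM(user_id, url) USING 'python normalize.py' AS (user_id, url) FROM some_table;`, where the table and column names are placeholders.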

Management

Getting a cluster up and running can give the impression that Big Data is easy when it isn't. The method I described above doesn't represent the work it takes to create a data pipeline. It's more of a one-off and isn't automated, and it skips all of the planning and understanding necessary.

To do a data pipeline right, you will need to put in more effort. The team will also need training on the technologies. Without proper training, the team will be very limited in what it can achieve.
