Doing Big Data ASAP

Jesse Anderson
April 5, 2017

I had an interesting question at TDWI Boston that I haven’t been asked before:

If you absolutely had to do something with Hadoop and Big Data tomorrow, how would you do it?

I’ll answer this from a technical and then a management point of view.

Technical

I call Apache Hive the Big Data technology you already know, because most people already know SQL. Hive uses a SQL-like language for its queries, which reduces the programming overhead of processing the data.

That takes care of the programming side, but doesn’t give you a cluster to run on. For that, I’d turn to the cloud. Both Amazon Web Services (through Amazon EMR) and Google Cloud (through Dataproc) support spinning up a cluster with Hive.

Whichever cloud provider you choose, you’ll need to upload your data to it. From there, it’s a question of writing and running your SQL query. You’d be limited to whatever processing SQL can express, but you’d at least be processing data.
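
As a rough sketch of what "writing and running your SQL query" could look like, the Python below only assembles two HiveQL statements as text: one that points an external table at the uploaded files, and one that queries it. Everything specific here is an assumption made for illustration: the page_views table, its columns, the comma-delimited file format, and the s3://example-bucket/page_views/ location are all hypothetical, and the helper function is not part of Hive.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL statement that maps delimited text files to a table.

    All arguments are illustrative; nothing here talks to a real cluster.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table layout and storage location for the uploaded data.
ddl = external_table_ddl(
    "page_views",
    [("user_id", "BIGINT"), ("url", "STRING"), ("view_time", "STRING")],
    "s3://example-bucket/page_views/",
)

# An ordinary SQL-style aggregation: the ten most viewed URLs.
query = (
    "SELECT url, COUNT(*) AS views\n"
    "FROM page_views\n"
    "GROUP BY url\n"
    "ORDER BY views DESC\n"
    "LIMIT 10"
)

# Print both statements so they can be saved to a .hql file.
print(ddl + ";\n\n" + query + ";")
```

The printed statements could be saved to a file and run on the cluster with Hive's command-line client, for example with hive -f. The point is how little programming is involved: the only real logic is the SQL itself.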

Hive has two ways to add new functionality: user-defined functions and transforms. I cover both of these programming interfaces in my Professional Data Engineering course.
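
To give a feel for the transform side, here is a minimal sketch of a script that Hive could stream rows through. By default Hive sends each row to the script as one tab-separated line on standard input, writes NULL as \N, and reads tab-separated lines back. The input columns (user_id, url), the output column (domain), and the --stream flag are assumptions made for this example; the flag is only my own convention so the file can be imported and tested without waiting on standard input. User-defined functions are usually written in Java and are not shown here.

```python
import sys
from urllib.parse import urlparse

# Hive's default text representation of NULL when streaming rows to a script.
HIVE_NULL = r"\N"


def transform_row(line):
    """Turn one 'user_id<TAB>url' input line into 'user_id<TAB>domain'."""
    user_id, url = line.rstrip("\n").split("\t")
    if url == HIVE_NULL:
        return f"{user_id}\t{HIVE_NULL}"
    # hostname is None when the value is not a parseable URL; emit NULL then.
    domain = urlparse(url).hostname or HIVE_NULL
    return f"{user_id}\t{domain}"


def transform_rows(lines):
    """Lazily transform an iterable of input lines, one row at a time."""
    for line in lines:
        yield transform_row(line)


# Hypothetical entry point: only read stdin when started with --stream.
if __name__ == "__main__" and "--stream" in sys.argv[1:]:
    for row in transform_rows(sys.stdin):
        print(row)
```

In a query, a script like this would typically be shipped to the cluster with ADD FILE and invoked with SELECT TRANSFORM (user_id, url) USING 'python3 extract_domain.py --stream' AS (user_id, domain), where extract_domain.py is a made-up file name. Because the script sees one row at a time, it works best for per-row logic that is awkward to express in SQL.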

Management

Getting a cluster up and running could give the impression that Big Data is easy when it’s not. The method I described above doesn’t represent the work it takes to create a data pipeline. It’s more of a one-off and isn’t automated. It skips all of the planning and understanding necessary.

To do a data pipeline right, you will need to put in more effort. The team will also need training on the technologies. Without proper training, the team will be very limited in what they can achieve.
