What is the next Big Data trend

and what should you be doing about it?

When people ask me what they should be learning next, I tell them to start learning real-time Big Data systems. Real-time Big Data is something I’ve been focusing on for the past 5+ years. This is because I saw it as the next trend in Big Data and I was right.

Companies are doing massive rollouts of real-time systems. They’re taking their existing batch Big Data systems and upgrading to real-time Big Data systems. Companies of all sizes and throughout the world are making this leap.

They’re doing these rollouts because batch systems were inherently limiting. Often, the company wanted to do things in real-time, but couldn’t due to the technical limitations. Now, they’re looking at and reacting to their data as it happens instead of hours later.

But there’s a consistent problem.

There aren’t enough people with the skills in real-time Big Data systems. The companies can barely find and hire these people. There is a high demand for people with real-time Big Data skills and a low supply, due to the field’s relative newness and complexity.

That’s where you come in. You already have the Big Data skills with batch systems. With the right training and skills you can fill those open positions that need real-time Big Data.

Learning real-time Big Data is difficult because of the sharp increase in complexity.

You already know about this complexity increase because you’ve been working with batch Big Data. My experience is that batch Big Data is 10 times more complex than small data systems. Real-time Big Data systems are another 5 to 10 times more complex than batch Big Data systems.

There are common reasons for this increase in complexity. With real-time, you’ll be using even more of the Big Data ecosystem. You’ll also need to learn, understand, and implement systems with brand-new technologies. These technologies have new concepts that you haven’t seen in batch. You need to understand the various failure scenarios and what they mean in real-time.

Then, there are the tradeoffs between each system. In order to achieve real-time or near real-time, each system will need to strike a balance between throughput and latency. Each system brings new concepts and implementations that are different from batch.

These tradeoffs and differences are why companies need people who understand real-time technologies.

Creating real-time systems is more than just learning the API calls. You’ve been dealing with batch Big Data and know that you need to understand the architecture of the underlying system. Otherwise, you’ll have a program that compiles but never works in production. You need a deeper understanding of the systems to create a solution or pass an interview.

There are four general types of technologies in real-time Big Data systems:

  • Processors
  • Analytics
  • Ingestion and dissemination
  • Storage

A processor is the part that processes the incoming data. As data comes into a system, it needs to be changed and transformed. The processor is responsible for getting the data ready for subsequent usage.

Analytics is the part that creates some kind of value out of the data. This is the most important part of the pipeline for the business. This is where you take the data and show what’s happening. On the simple side, this could be counting interactions in real-time. On the complex side, this could be a real-time data science or machine learning model.
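To make the processor and analytics parts a little more concrete, here is a minimal sketch in plain Java of the simple case: counting interactions as they arrive. It is only an illustration, not code from the class. The class name, the method names, and the sample events are all hypothetical, and a real pipeline would read events from an ingestion system rather than an in-memory list.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names throughout; this is an illustration, not course code.
class InteractionCounter {
    // Running count per interaction type; TreeMap keeps the output order stable.
    private final Map<String, Long> counts = new TreeMap<>();

    // "Processor" step: clean up a raw event so later stages see consistent data.
    static String process(String rawEvent) {
        return rawEvent.trim().toLowerCase(Locale.ROOT);
    }

    // "Analytics" step: update the running count as each event arrives.
    void onEvent(String rawEvent) {
        counts.merge(process(rawEvent), 1L, Long::sum);
    }

    // Returns a copy so callers cannot change the live counts.
    Map<String, Long> snapshot() {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        InteractionCounter counter = new InteractionCounter();
        for (String event : Arrays.asList("Click", " view", "CLICK ", "purchase")) {
            counter.onEvent(event);
        }
        System.out.println(counter.snapshot()); // prints {click=2, purchase=1, view=1}
    }
}
```

The point of the split is that the processor only gets the data ready, while the analytics step is where the business value (the counts) comes from.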

In order to move data around and save it, you will need a system for ingestion and dissemination. When you’re moving at Big Data scale and in real-time, the system needs to be able to scale. It needs to provide the data at a fast speed to many different systems doing processing and analytics.

Storage is another issue for real-time systems. Storing many small files leads to issues on many Big Data systems. Not all processing and analytics should be done in real-time. You will still need to go back and process in batch. A good storage mechanism is crucial to a real-time data pipeline.

Some technologies may be a mix of 2 or more of these types. This is where things get really cloudy. You need to deeply understand each technology and the pieces that are required to create a real-time data pipeline.

Introducing Real-time Systems with Spark Streaming and Kafka

This is the class that I’ve used to teach Data Engineers, Software Engineers, Data Scientists, Data Analysts, and managers the skills to create real-time Big Data systems.

This class covers the technologies and concepts you need to know when creating real-time data pipelines. I use my extensive knowledge and experience to teach you what you need to know. I only focus on the technologies I’m seeing in use at companies.

The class is entirely virtual and you can go at your own pace. The course comes with everything you need to get started creating your own real-time data pipelines:

  • 6 hours of video lecture and explanations of code
  • More than 6 hours of exercises designed to reinforce and practice what you’ve learned
  • All slides that you see in the videos
  • The exercise guide to help you through the exercises
  • A virtual machine (VM) that has Kafka and Spark Streaming loaded and configured
  • Sample solutions that I’ve written to give you an example of what the code should look like. During the videos, I go through these sample solutions to help you understand what the code should look like and do.
  • Maven project files so you aren’t dealing with IDE and classpath issues

I don’t just cover a few technologies. I show you the open source and cloud ecosystem of real-time products. This will give you the well-rounded skills that companies want.

What does this class cover?

Let me share what each chapter covers and teaches:

Chapter 1 – Real-time Data Pipelines

  • Introduces what a real-time data pipeline is and the parts that make up a full-fledged real-time data pipeline such as: processors, analytics, ingestion, and storage
  • Shows the technologies that are commonly used in real-time data pipelines
  • Recommends the ways a team should break down a real-time data pipeline into smaller and more accomplishable pieces
  • Considers the important pros and cons to creating real-time data pipelines

Chapter 2 – Using the Cloud

  • Introduces the main cloud providers and their distinguishing characteristics
  • Shows the ecosystem of real-time technologies that are available as open source and as managed services in the cloud

Chapter 3 – Ingesting Data

  • Introduces the problems associated with ingesting real-time Big Data with two concepts I call the First Mile and Last Mile problems
  • Shows how the ecosystem of real-time ingestion technologies works to solve the First Mile and Last Mile problems
  • Considers the benefits of doing ETL in real-time and contrasts that with batch ETL systems

Chapter 4 – Kafka

  • Goes deeply into Kafka, how it works, and its architecture
  • Shows why Kafka is such a common technology used by companies for their real-time data pipelines
  • Teaches how to write your own producers and consumers with Kafka

Chapter 5 – Processing Data

  • Introduces the problems associated with processing large amounts of data in real-time
  • Teaches the advanced concepts you need to know for processing like delivery guarantees, backpressure, idempotent systems, and failovers
  • Shows the ecosystem of real-time processing technologies that are available as open source and managed services in the cloud
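As a small taste of why delivery guarantees and idempotent systems go together, here is a hedged sketch in plain Java. It assumes at-least-once delivery, where the same event can show up twice after a failure, and it assumes every event carries a unique ID. The class name, method names, and event IDs are hypothetical and are not taken from the class materials.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: under at-least-once delivery the same event can arrive twice,
// so the consumer remembers which event IDs it has already applied.
class IdempotentTotals {
    private final Set<String> seenEventIds = new HashSet<>();
    private long total = 0;

    // Returns true if the event was applied, false if it was a duplicate.
    boolean apply(String eventId, long amount) {
        if (!seenEventIds.add(eventId)) {
            return false; // redelivered event: applying it again would double count
        }
        total += amount;
        return true;
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        IdempotentTotals totals = new IdempotentTotals();
        totals.apply("order-1", 10);
        totals.apply("order-2", 5);
        totals.apply("order-1", 10); // simulated redelivery after a failure
        System.out.println(totals.total()); // prints 15
    }
}
```

In a production system the set of seen IDs would have to be bounded and survive restarts. This sketch keeps it in memory only to show the idea.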

Chapter 6 – Spark Streaming

  • Goes deeply into Spark Streaming, how it works in real-time, and its architecture
  • Teaches how to write your own Spark Streaming code that receives data in real-time from network sockets and Kafka
  • Shows the considerations you need to take when using Spark Streaming such as: micro-batch sizes, failures of drivers, failures of workers, and how to deal with failures in Spark Streaming
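To give a feel for what a micro-batch size means, here is a rough sketch in plain Java. It deliberately does not use the Spark Streaming API. It only mimics the idea of grouping incoming events into fixed time intervals. The class name, the method name, and the timestamps are hypothetical, and it assumes non-negative timestamps in milliseconds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of micro-batching; not Spark Streaming code.
class MicroBatcher {
    // Groups event timestamps (in ms) into batches keyed by each batch's start time.
    static Map<Long, List<Long>> batch(List<Long> eventTimesMs, long intervalMs) {
        Map<Long, List<Long>> batches = new TreeMap<>();
        for (long t : eventTimesMs) {
            long batchStart = (t / intervalMs) * intervalMs; // round down to the interval
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(t);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> times = Arrays.asList(100L, 400L, 900L, 1200L, 2500L);
        // Larger interval: fewer, bigger batches, but each event waits longer.
        System.out.println(batch(times, 1000L)); // prints {0=[100, 400, 900], 1000=[1200], 2000=[2500]}
        // Smaller interval: more, smaller batches, and results arrive sooner.
        System.out.println(batch(times, 500L).size()); // prints 4
    }
}
```

The same five events become three batches at a 1,000 ms interval and four batches at 500 ms. That is the throughput versus latency balance in miniature: bigger batches spread the per-batch overhead over more events, while smaller batches get results out sooner.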

Chapter 7 – Data Products

  • Introduces the steps to take when creating a real-time data product
  • Shows the architectures and tricks that make data pipeline projects successful
  • Teaches you how to create one of the most common real-time use cases, a real-time dashboard that is powered by Kafka, Spark Streaming, and D3.js.

Who is this class designed for?

This class isn’t designed for everyone. To be successful with this class you should:

  • Be familiar with batch Big Data
  • Be familiar with batch processing with Apache Spark
  • Have an intermediate-level knowledge of Java

This class does not:

  • Require previous familiarity with Apache Kafka
  • Require previous familiarity with Apache Spark Streaming
  • Require previous knowledge or experience with cloud providers or their technologies

Where else has this class been taught?

I’ve been teaching this class extensively at O’Reilly’s Strata conferences and at companies around the world. This is because I’m a recognized expert in the field and I was one of the first people teaching real-time Big Data technologies like Apache Kafka and Spark Streaming.

How do you know if this course really works? This course already runs at companies. It has taken teams of developers and made them teams of Data Engineers. This course already runs at training facilities. It has already taken students who were Software Developers and made them Data Engineers who got their Dream Jobs.

Big Data is changing constantly. How do I know this course is up-to-date? This course already runs at companies and those companies expect that their students are learning from up-to-date materials. The materials and code are updated to the latest versions of CDH. My courses cover current and future technologies. Many of my students are hired because they’ve learned a future technology that the company wants to start using.

Which technologies should you learn? I’ve curated and tested this course to teach the technologies and concepts that companies need and are using in production. Even better are the technologies and concepts it doesn’t cover. This course removes the unnecessary concepts for developers and technologies that don’t make sense or aren’t used. Given my industry expertise, we even cover up-and-coming technologies that will set you apart in your job search.

How will you be productive and start coding? Installing Big Data tools is an ordeal unto itself (trust me). You don’t want to waste hours getting things installed and configured before you can even start being productive. I’ve created a virtual machine that gets you up and running quickly. Everything is already installed and configured for you. It has Hadoop, Spark, many ecosystem projects, and Eclipse installed. You just install VirtualBox, import the VM, and you’re ready to go. No wasting time.

How will you practice the skills that you need to master? The course makes heavy use of exercises to practice the skills that you have just learned. There is a full exercise guide that gives you instructions on what to do. These exercises gradually increase in difficulty as you start to master new skills. Each programming exercise has a full sample solution that you can peek at if you get stuck or want to compare your solution with mine. At the end of most modules there is a final. This final helps you check if you have really mastered the skills you need.

Does this course just cover real-time Big Data technologies? This course focuses only on real-time technologies. It only shows batch processing as a means of comparison between batch and real-time. It does show how to use D3, which is a visualization technology.

Do you have to go in order? I highly recommend you go in order. Advanced programmers can skip around if they feel it’s necessary, but they will miss important concepts. This is something I can’t do in a class.

How long will this class take to complete? This class can be done in 2-3 days of concerted effort. Or it could be done over 1-2 weeks with less time put in.

How does this compare to training from company X? There are various sources for Big Data training, and they differ vastly in quality, veracity, and teaching. The majority of them are on the lower end of quality. Purchasing a low-quality course isn’t just a waste of money; it’s a waste of your time and you won’t get the job. Quality training is the difference between success and failure.

Can I get my company to reimburse me? Yes, other students who have purchased this course have had their purchase reimbursed by their company. Many companies have continuing education budgets or new projects have money allotted for training. This is especially true for new and difficult initiatives like Big Data. I will help you however I can to get your purchase reimbursed by your company. Send this PDF to your boss or Human Resources department to convince them to reimburse you.

What have others said about Real-time Systems with Spark Streaming and Kafka?

Best class I’ve taken in years!!

Richard C.

It is a great introductory class into the world of Kafka and Streaming services. This class is also great to learn of all the different technologies that one can use for streaming and which one you should be using for your use case.

Wesam H.

Jesse is one of the best presenters I have seen. I was extremely impressed not only with his technical knowledge but also with his real world knowledge of people skills (office politics and what makes a person a good candidate for being a good Big Data Engineer).

Shawn J.

Very informative. The instructor clearly stated the expectations. This is the first time i have attended a workshop with all the tools setup properly considering the complexity of the subject matter, and still have that seamless experience. Over-all it was very helpful in understanding the subject matter.

John B.

100% Money Back Guarantee


I stand behind this course 100%. I want you to love this course 100% too. If you don’t love this course, I’ll give you 100% of your money back. That’s right: a 100% money-back guarantee, no matter how deep you are in the course.

Go through the materials. See that they’re the best. Go through the exercises and see yourself becoming the Data Engineer you want to become. I’m confident you’ll be successful.

I’ve built my teaching methods over years of teaching Data Engineering classes. These methods are honed over class after class. No one else is offering classes like these that are so comprehensive. No one else is teaching with such innovative methods. No one else is teaching practical skills.

This course isn’t for everyone as we established before. This course is for people who want to learn real-time Big Data systems. Even within that group, not everyone has the programming skills to create real-time data pipelines and I understand that. I’ll give you your money back.

Here is my simple offer: if you don’t love this course within 60 days, I insist that you get 100% of your money back. Guaranteed. Join at the level that’s right for you and see how you can get the real-time Big Data skills you need to get ahead.

How can you get instant access to Real-time Systems with Spark Streaming and Kafka?

Part of making my materials as accessible as possible is making it easy for people to pay for them. I have two methods of paying for a course: an installment plan and an outright purchase. The installment plan breaks up the course’s cost into monthly payments over 6 months. With the outright purchase, you pay for the entire course at once.

Why Real-time Systems with Spark Streaming and Kafka is right for you

I wrote this class distilling my years of experience in designing, architecting, and programming real-time Big Data systems. I didn’t stop there. I sent this class to other world-class Big Data experts to review and give feedback. I wanted this class to represent the best of real-time Big Data.

To help you show your real-time skills, we end the class by creating one of the most common deliverables in real-time Big Data: a real-time dashboard. The class builds up until you have the skills to create a real-time dashboard that is powered by Kafka, Spark Streaming, and a web browser.

I created a video showing the dashboard in action. Watch it and see the sorts of skills you’re going to come away with in this class:

I invite you to join me and enroll in this class.

Get ahead of the next trend in Big Data. Learn how to create real-time systems.

This course is sold on an individual basis. People sharing access will be removed from the course and no refunds will be given.

For group, team, and company rates, go here.
