What is the next Big Data trend and what should you be doing about it?

When people ask me what they should be learning next, I tell them to start learning real-time Big Data systems. I've been focusing on real-time Big Data for the past 5+ years because I saw it as the next trend in Big Data, and I was right.

Companies are doing massive rollouts of real-time systems. They're taking their existing batch Big Data systems and upgrading them to real-time Big Data systems. Companies of all sizes and throughout the world are making this leap.

They're doing these rollouts because batch systems were inherently limiting. Often, a company wanted to do things in real-time but couldn't due to technical limitations. Now, they're looking at and reacting to their data as it happens instead of hours later.


But there’s a consistent problem.


There aren't enough people with skills in real-time Big Data systems. Companies can barely find and hire these people. Demand for real-time Big Data skills is high, and supply is low, because these systems are relatively new and complex.

That’s where you come in. You already have the Big Data skills with batch systems. With the right training and skills, you can fill those open positions that need real-time Big Data.


Learning real-time Big Data is difficult because of the sharp increase in complexity.


You already know about this complexity increase because you've been working with batch Big Data. In my experience, batch Big Data is 10 times more complex than small data systems. Real-time Big Data systems are another 5 to 10 times more complex than batch Big Data systems.

There are common reasons for this increase in complexity. With real-time, you’ll be using even more of the Big Data ecosystem. You’ll also need to learn, understand, and implement systems with brand-new technologies. These technologies have new concepts that you haven’t seen in batch. You need to understand the various failure scenarios and what they mean in real-time.

Then, there are the tradeoffs between each system. To achieve real-time or near real-time performance, each system needs to strike a balance between throughput and latency. Each system brings new concepts and implementations that are different from batch.


These tradeoffs and differences are why companies need people who understand real-time technologies.


Creating real-time systems is more than just learning the API calls. You’ve been dealing with batch Big Data and know that you need to understand the architecture of the underlying system. Otherwise, you’ll have a program that compiles but never works in production. You need a deeper understanding of the systems to create a solution or pass an interview.

There are four general types of technologies in real-time Big Data systems:


  • Processor
  • Analytics
  • Ingestion and dissemination
  • Storage

A processor is the part that processes incoming data. As data comes into a system, it needs to be transformed. The processor is responsible for getting the data ready for subsequent use.
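As a toy illustration (the class name and event format here are my own, not from the course), a processor stage might normalize raw event strings before handing them downstream:

```java
import java.util.Locale;

// Toy "processor" stage: normalizes a raw comma-separated event
// (e.g. "  USER42 , CLICK , 2021-03-01 ") so that every downstream
// stage sees a consistent, cleaned-up format.
public class EventProcessor {
    public static String normalize(String rawEvent) {
        String[] fields = rawEvent.split(",");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) out.append(',');
            // Trim whitespace and lowercase each field.
            out.append(fields[i].trim().toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  USER42 , CLICK , 2021-03-01 "));
        // prints: user42,click,2021-03-01
    }
}
```

In a real pipeline this logic would run inside a streaming framework rather than a plain class, but the job is the same: get each record into shape before analytics or storage sees it.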

Analytics is the part that creates some kind of value out of the data. This is the most important part of the pipeline for the business. This is where you take the data and show what’s happening. On the simple side, this could be counting interactions in real-time. On the complex side, this could be a real-time data science or machine learning model.
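On the simple end of that spectrum, counting interactions in real-time can be sketched in a few lines (this is my own illustrative example, not course code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy "analytics" stage: keeps a running count of interactions per
// event type as records arrive, the simple end of real-time analytics.
public class InteractionCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per incoming event; returns the running total for
    // that event type, which a live dashboard could display.
    public long record(String eventType) {
        return counts.merge(eventType, 1L, Long::sum);
    }

    public long count(String eventType) {
        return counts.getOrDefault(eventType, 0L);
    }
}
```

At scale, this state would live in a distributed, fault-tolerant store rather than a single in-memory map, which is exactly where the complexity of real-time systems comes from.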

To move data around and save it, you will need a system for ingestion and dissemination. When you're moving data at Big Data scale and in real-time, the system needs to be able to scale. It needs to deliver data quickly to the many different systems doing processing and analytics.

Storage is another issue for real-time systems. Storing many small files leads to issues on many Big Data systems. Not all processing and analytics should be done in real-time. You will still need to go back and process in batch. A good storage mechanism is crucial to a real-time data pipeline.
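A common mitigation for the small-files problem is to buffer records and write them out in larger batches. This sketch (my own illustration; real pipelines use rolling files or compaction) shows the idea:

```java
import java.util.ArrayList;
import java.util.List;

// Toy batching buffer: instead of writing one tiny file per record,
// accumulate records and flush a larger batch once a threshold is hit.
public class BatchingBuffer {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private int flushes = 0;

    public BatchingBuffer(int batchSize) {
        this.batchSize = batchSize;
    }

    public void add(String record) {
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        // In a real system this would write the pending records as one
        // larger file or object; here we just count the flushes.
        flushes++;
        pending.clear();
    }

    public int flushCount() {
        return flushes;
    }
}
```

With a batch size of 5, ten records produce two writes instead of ten, which is the difference between a healthy storage layer and one drowning in small files.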

Some technologies are a mix of two or more of these types. This is where things get cloudy. You need to deeply understand each technology and the pieces that are required to create a real-time data pipeline.

Introducing Real-time Systems with Spark Streaming and Kafka

This is the class that I’ve used to teach Data Engineers, Software Engineers, Data Scientists, Data Analysts, and managers the skills to create real-time Big Data systems.

This class covers the technologies and concepts you need to know when creating real-time data pipelines. I use my extensive knowledge and experience to teach you what you need to know. I only focus on the technologies I’m seeing in use at companies.

The class is entirely virtual and you can go at your own pace. The course comes with everything you need to get started creating your own real-time data pipelines:

  • 6 hours of video lecture and explanations of code
  • More than 6 hours of exercises designed to reinforce and practice what you’ve learned
  • All slides that you see in the videos
  • The exercise guide to help you through the exercises
  • A virtual machine (VM) that has Kafka and Spark Streaming loaded and configured
  • Sample solutions that I’ve written to show you what the code should look like. During the videos, I walk through these sample solutions to help you understand what the code should do.
  • Maven project files so you aren’t dealing with IDE and classpath issues

I don’t just cover a few technologies. I show you the open-source and cloud ecosystem of real-time products. This will give you the well-rounded skills that companies want.


What does this class cover?


Let me share what each chapter covers and teaches:

Chapter 1 – Real-time Data Pipelines

  • Introduces what a real-time data pipeline is and the parts that make up a full-fledged real-time data pipeline such as processors, analytics, ingestion, and storage
  • Shows the technologies that are commonly used in real-time data pipelines
  • Recommends the ways a team should break down a real-time data pipeline into smaller and more achievable pieces
  • Considers the important pros and cons of creating real-time data pipelines

Chapter 2 – Using the Cloud

  • Introduces the main cloud providers and their distinguishing characteristics
  • Shows the ecosystem of real-time technologies that are available as open-source and as managed services in the cloud

Chapter 3 – Ingesting Data

  • Introduces the problems associated with ingesting real-time Big Data with two concepts I call the First Mile and Last Mile problems
  • Shows how the ecosystem of real-time ingestion technologies works to solve the First Mile and Last Mile problems
  • Considers the benefits of doing ETL in real-time and contrasts that with batch ETL systems

Chapter 4 – Kafka

  • Goes deeply into Kafka, how it works, and its architecture
  • Shows why Kafka is such a common technology used by companies for their real-time data pipelines
  • Teaches how to write your producers and consumers with Kafka

Chapter 5 – Processing Data

  • Introduces the problems associated with processing large amounts of data in real-time
  • Teaches the advanced concepts you need to know for processing like delivery guarantees, backpressure, idempotent systems, and failovers
  • Shows the ecosystem of real-time processing technologies that are available as open-source and as managed services in the cloud

Chapter 6 – Spark Streaming

  • Goes deeply into Spark Streaming, how it works in real-time, and its architecture
  • Teaches how to write your own Spark Streaming code that receives data in real-time from network sockets and Kafka
  • Shows the considerations you need to take when using Spark Streaming such as micro-batch sizes, failures of drivers, failures of workers, and how to deal with failures in Spark Streaming

Chapter 7 – Data Products

  • Introduces the steps to take when creating a real-time data product
  • Shows the architectures and tricks that make data pipeline projects successful
  • Teaches you how to create one of the most common real-time use cases: a real-time dashboard powered by Kafka, Spark Streaming, and D3.js

Who is this class designed for?


This class isn’t designed for everyone. To be successful with this class you should:

  • Be familiar with batch Big Data
  • Be familiar with batch processing with Apache Spark
  • Have an intermediate-level knowledge of Java

This class does not:

  • Require previous familiarity with Apache Kafka
  • Require previous familiarity with Apache Spark Streaming
  • Require previous knowledge or experience with cloud providers or their technologies


Where else has this class been taught?


I’ve been teaching this class extensively at O’Reilly’s Strata conferences and companies around the world. This is because I’m a recognized expert in the field and I was one of the first people teaching real-time Big Data technologies like Apache Kafka and Spark Streaming.


How do you know if this course works? This course already runs at companies. It has taken teams of developers and made them teams of Data Engineers. It already runs at training facilities, where it has taken students who were Software Developers and made them Data Engineers who got their dream jobs.

Big Data changes constantly, so how do I know this course is up-to-date? This course already runs at companies, and those companies expect their students to learn from up-to-date materials. The materials and code are updated to the latest versions of CDH. My courses cover current and future technologies. Many of my students are hired because they’ve learned a future technology that the company wants to start using.

Which technologies should you learn? I’ve curated and tested this course to teach the technologies and concepts that companies need and are using in production. Just as important are the technologies and concepts it doesn’t cover. This course leaves out concepts developers don’t need and technologies that don’t make sense or aren’t used. Given my industry expertise, I even cover up-and-coming technologies that will set you apart in your job search.

How will you be productive and start coding? Installing Big Data tools is an ordeal unto itself (trust me). You don’t want to waste hours getting things installed and configured before you can even start being productive. I’ve created a virtual machine that gets you up and running quickly. Everything is already installed and configured for you. It has Hadoop, Spark, many ecosystem projects, and Eclipse installed. You just install VirtualBox, import the VM, and you’re ready to go. No wasting time.

How will you practice the skills that you need to master? The course makes heavy use of exercises to practice the skills that you have just learned. There is a full exercise guide that gives you instructions on what to do. These exercises gradually increase in difficulty as you start to master new skills. Each programming exercise has a full sample solution that you can peek at if you get stuck or want to compare your solution with mine. At the end of most modules, there is a final. This final helps you check if you have mastered the skills you need.

Does this course just cover real-time Big Data technologies? This course focuses only on real-time technologies. It shows batch processing only as a point of comparison between batch and real-time. It does show how to use D3.js, a visualization technology.

Do you have to go in order? I highly recommend you go in order. Advanced programmers can skip around if they feel it’s necessary, but they will miss important concepts. That kind of skipping around is something a live class can’t offer.

How long will this class take to complete? This class can be completed in 2-3 days of concerted effort, or spread over 1-2 weeks with less time put in each day.

How does this compare to training from company X? There are various sources of Big Data training, with vast differences in quality, veracity, and teaching. The majority are on the lower end of quality. Purchasing a low-quality course isn’t just a waste of money; it’s a waste of your time, and you won’t get the job. Quality training is the difference between success and failure.

Can I get my company to reimburse me? Yes, other students who have purchased this course have had their purchase reimbursed by their company. Many companies have continuing education budgets or new projects have money allotted for training. This is especially true for new and difficult initiatives like Big Data. I will help you however I can to get your purchase reimbursed by your company. Send this PDF to your boss or Human Resources department to convince them to reimburse you.


What have others said about Real-time Systems with Spark Streaming and Kafka?

100% Money Back Guarantee

I stand behind this course 100%. I want you to love this course 100% too. If you don’t love this course, I’ll give you 100% of your money back. That’s right: a 100% money-back guarantee, no matter how deep you are in the course.

Go through the materials. See that they’re the best. Go through the exercises and see yourself becoming the Data Engineer you want to become. I’m confident you’ll be successful.

I’ve built my teaching methods over years of teaching Data Engineering classes. These methods are honed over class after class. No one else is offering classes like these that are so comprehensive. No one else is teaching with such innovative methods. No one else is teaching practical skills.

This course isn’t for everyone, as we established before. This course is for people who want to learn real-time Big Data systems. Even within that group, not everyone has the programming skills to create real-time data pipelines, and I understand that. If that’s you, I’ll give you your money back.

Here is my simple offer: if you don’t love this course within 60 days, I insist that you get 100% of your money back. Guaranteed. Join at the level that’s right for you and see how you can get the real-time Big Data skills you need to get ahead.

How can you get instant access to Real-time Systems with Spark Streaming and Kafka?

Developer Level – $890

  • All course videos
  • The Virtual Machine loaded with your IDE and all Big Data programs loaded and configured
  • The exercise guides to help you practice the concepts and code
  • Email questions
  • (Limited Offer) 1 hour remote session
  • Pay in 6 monthly installments of $189 or pay in full for $890 and save $244.

Platinum Level – $1980

  • All course videos
  • The Virtual Machine loaded with your IDE and all Big Data programs loaded and configured
  • The exercise guides to help you practice the concepts and code
  • Unlimited email questions about the course for 2 months
  • (Limited Offer) 1 hour remote session
  • Pay in 6 monthly installments of $359 or pay in full for $1980 and save $174.

This course is sold on an individual basis. People sharing access will be removed from the course and no refunds will be given.

For group, team, and company rates, go here.