In this video, I live code a dedupe algorithm. If you’re not familiar with this algorithm, it takes several data files and removes the duplicates. I show the simple version. Then, I show a more complicated version that adds some custom logic. If you want...
Sometimes companies will start writing code or designing a solution before I train there. This is usually a bad idea. It really shows the difference between Big Data and small data. Making a mistake with small data isn’t costly and doesn’t take long to...
Working with complex and multi-module Maven projects can be a handful. These are a few tips to make that easier. I’m going to use Apache Beam as an example of a multi-module Maven project. The...
In a previous post, I showed how to use Beam’s Regex class to split up a string. In this post, I’m going to show some other features of the Regex class. The Regex class gives you a distributed way to work with strings. I tried to make the...
There’s a friendly game among Big Data frameworks: what’s the fewest lines of code it takes to do WordCount? I’m a committer on Apache Beam and most of my time is dedicated to making things easier for developers to use Beam. I also help...