You don’t have previous Big Data experience, but you want to get hired as a Data Engineer. Don’t worry, you can get hired. You’ll need a well-executed personal project that gets you noticed and shows your skills. I’ve verified this with hiring managers. They will hire someone brand-new if that person has an awesome personal project.
You’ll obviously need the Big Data skills to complete the project and get the job. The next big hangup is coming up with an idea. You can constrain this idea search by looking at available datasets. This helps narrow down your search and keeps you focused.
I’m going to share a few datasets that are both novel and interesting. These are the kinds of personal projects that will get you noticed.
Planet has an API for going through satellite imagery over time, and the API has a free tier. This could be a fascinating way to add imagery to your personal project or to process it.
The GDELT Project
The GDELT Project is a site that monitors the world’s broadcast, print, and web news. All of this is done in real time. They automatically translate from over 100 languages. You could start comparing how news is covered in the same language in the same country.
The project has created and participated in demos and challenges. They visualized the interconnectedness of the media ecosystem. They’re looking at fake news. You can find more of their projects that used BigQuery from Google Cloud.
Depending on where you live, your city may have a municipal data dashboard. I live in Reno, and the local citizens have curated the city’s data.
Although Reno is just one example, many other cities publish their data. Using it could give your personal project a great local feel and local interest.
You can find similar data at the state or province level.
There is a large dataset on Reddit of Jeopardy questions. Could you use the GDELT or Wikipedia datasets to answer the questions?
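As a starting point for that idea, here is a minimal sketch of matching a clue against a set of reference documents by counting shared words. Everything in it is an assumption for illustration: the `documents` dictionary stands in for whatever text you extract from GDELT or Wikipedia, the stopword list is deliberately tiny, and `tokenize` and `best_match` are hypothetical helper names. A real project would swap in a proper search index or ranking function.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; an assumption for this sketch only.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "this", "that",
             "and", "to", "for", "was", "on"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


def best_match(clue, documents):
    """Return the title of the document sharing the most tokens with the clue.

    `documents` maps a title to its body text (hypothetical stand-in for
    articles pulled from a larger dataset). Returns None if nothing overlaps.
    """
    clue_tokens = Counter(tokenize(clue))
    best_title, best_score = None, 0
    for title, body in documents.items():
        doc_tokens = Counter(tokenize(body))
        # Counter intersection keeps the minimum count of each shared token.
        score = sum((clue_tokens & doc_tokens).values())
        if score > best_score:
            best_title, best_score = title, score
    return best_title
```

Calling `best_match(clue, documents)` with a clue string and a small dictionary of article text returns the title with the highest overlap. Scaling this same idea to the full datasets is where the Big Data tooling comes in.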
Google Cloud has the GitHub Dataset. You can run some interesting analyses on it. Felipe Hoffa used it to answer some age-old programming questions.
I’d be remiss if I didn’t point out some of the public datasets in Amazon Web Services.
Rainbow Six Siege released a 20 GB dataset. This could be an interesting project if you like games.
There is an entire subreddit dedicated to datasets.
Springboard has a list of data sources for data science projects.
Update: Here is another list of more machine learning focused datasets.
What to do now?
Get the technical skills and create your project. If you didn’t see an interesting dataset, make one up (but be sure to point that out).
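If you do make up a dataset, a small generator script keeps it reproducible and clearly labeled. The following is only a sketch: the column names, the value ranges, and the `make_synthetic_events` and `to_csv` helpers are all invented for illustration, and the `synthetic` column is one possible way to point out that the data is made up.

```python
import csv
import io
import random
from datetime import datetime, timedelta


def make_synthetic_events(n_rows, seed=42):
    """Generate fake sensor-style events. Every value here is invented."""
    rng = random.Random(seed)  # fixed seed so the same data comes out each run
    start = datetime(2020, 1, 1)
    rows = []
    for i in range(n_rows):
        offset = timedelta(seconds=rng.randint(0, 86_400 * 30))  # within ~30 days
        rows.append({
            "event_id": i,
            "timestamp": (start + offset).isoformat(),
            "device": f"device-{rng.randint(1, 50):03d}",
            "reading": round(rng.gauss(20.0, 5.0), 2),
            "synthetic": True,  # label every row as made up
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Raising `n_rows` lets you grow the dataset to whatever size your project needs, and the fixed seed means anyone reviewing the project can regenerate exactly the same data.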
In my Platinum level of Professional Data Engineering, I share the tips and strategies that made one of my personal projects go viral.