10 GitHub Repositories to Master Data Engineering


Image by Author | DALLE-3 & Canva 

 

Data engineering is growing rapidly, and companies are now hiring more data engineers than data scientists. Operational roles such as data engineering, cloud architecture, and MLOps engineering are in high demand.

As a data engineer, you need to master containerization, infrastructure as code, workflow orchestration, analytics engineering, batch processing, and streaming tools. Beyond these tools, you also need to master cloud infrastructure and managed services like Databricks and Snowflake.

In this post, we will look at 10 GitHub repositories that will help you master these core tools and concepts. They contain courses, hands-on exercises, roadmaps, lists of essential tools, projects, and a handbook. Bookmark them and return to them as you work toward becoming a professional data engineer.

 

1. Awesome Data Engineering

 

The Awesome Data Engineering repository contains a list of tools, frameworks, and libraries for data engineering, making it an excellent starting point for anyone looking to dive into the field.

It covers tools for databases, data ingestion, file systems, streaming, batch processing, data lake management, workflow orchestration, monitoring, testing, and charts and dashboards.

Link: igorbarinov/awesome-data-engineering

 

2. Data Engineering Zoomcamp

 

Data Engineering Zoomcamp is a complete course that provides a hands-on learning experience in data engineering. You learn new concepts and tools using video tutorials, quizzes, projects, homework, and community-driven assessments. 

The Data Engineering Zoomcamp covers:

  1. Containerization and Infrastructure as Code
  2. Workflow Orchestration
  3. Data Ingestion
  4. Data Warehouse
  5. Analytics Engineering
  6. Batch Processing
  7. Streaming

 
Link: DataTalksClub/data-engineering-zoomcamp

 

3. The Data Engineering Cookbook

 

The Data Engineering Cookbook is a collection of articles and tutorials that cover various aspects of data engineering, including data ingestion, data processing, and data warehousing.

The Data Engineering Cookbook includes:

  1. Basic Engineering Skills
  2. Advanced Engineering Skills
  3. Free Hands On Courses / Tutorials
  4. Case Studies
  5. Best Practices Cloud Platforms
  6. 130+ Data Sources Data Science
  7. 1001 Interview Questions
  8. Recommended Books, Courses, and Podcasts

 
Link: andkret/Cookbook

 

4. Data Engineer Roadmap

 

The Data Engineer Roadmap repository provides a step-by-step guide to becoming a data engineer. It covers everything from the basics of data engineering to advanced topics like infrastructure as code and cloud computing.

The Data Engineer Roadmap includes:

  1. CS fundamentals
  2. Learning Python
  3. Testing
  4. Database
  5. Data Warehouse
  6. Cluster Computing
  7. Data Processing
  8. Messaging
  9. Workflow Scheduling
  10. Network
  11. Infrastructure as Code
  12. CI/CD
  13. Data Security and Privacy

 
Link: datastacktv/data-engineer-roadmap

 

5. Data Engineering HowTo

 

Data Engineering HowTo is a beginner-friendly resource for learning data engineering from scratch. It contains a list of tutorials, courses, books, and other resources to help you build a solid foundation in data engineering concepts and best practices. If you’re new to the field, this repository will help you navigate the vast landscape of data engineering with ease.

Data Engineering HowTo includes:

  1. Useful articles and blogs
  2. Talks
  3. Algorithms & Data Structures
  4. SQL
  5. Programming
  6. Databases
  7. Distributed Systems
  8. Books
  9. Courses
  10. Tools
  11. Cloud Platforms
  12. Communities
  13. Jobs
  14. Newsletters

 
Link: adilkhash/Data-Engineering-HowTo

 

6. Awesome Open Source Data Engineering

 

Awesome Open Source Data Engineering is a curated list of open-source data engineering tools. It is a goldmine for anyone who wants to contribute to these tools or use them to build real-world data engineering projects, and an excellent resource for exploring alternative data engineering solutions.

The repository includes open-source tools on:

  1. Analytics
  2. Business Intelligence
  3. Data Lakehouse
  4. Change Data Capture
  5. Datastores
  6. Data Governance and Registries
  7. Data Virtualization
  8. Data Orchestration
  9. Formats
  10. Integration
  11. Messaging Infrastructure
  12. Specifications and Standards
  13. Stream Processing
  14. Testing
  15. Monitoring and Logging
  16. Versioning
  17. Workflow Management

 
Link: gunnarmorling/awesome-opensource-data-engineering

 

7. PySpark Example Project

 

The PySpark Example Project repository provides a practical example of implementing best practices for PySpark ETL jobs and applications.

PySpark is a popular tool for data processing, and this repository will help you master it. You will learn how to structure your code, handle data transformations, and optimize your PySpark workflows.

The project covers:

  1. Structure of an ETL Job
  2. Passing Configuration Parameters to the ETL Job
  3. Packaging ETL Job Dependencies
  4. Running the ETL Job
  5. Debugging Spark Jobs
  6. Automated Testing
  7. Managing Project Dependencies

 
Link: AlexIoannides/pyspark-example-project
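
To make the "Structure of an ETL Job" item more concrete, here is a minimal sketch of one common way to lay out an ETL job: keep extract, transform, and load in separate functions, and keep the transform step free of I/O so it can be tested with small in-memory inputs. This is not code from the repository, and it uses plain Python with the standard `csv` module as a stand-in for Spark DataFrames. The function names, the `steps_per_floor` parameter, and the sample data are all hypothetical.

```python
import csv
import io


def extract(source):
    """Read rows from a CSV text stream into a list of dicts."""
    return list(csv.DictReader(source))


def transform(rows, steps_per_floor):
    """Derive a new column from the input rows.

    No I/O happens here, so this function can be unit-tested
    without touching files or a cluster.
    """
    return [
        {"id": row["id"], "steps_to_desk": int(row["floor"]) * steps_per_floor}
        for row in rows
    ]


def load(rows, sink):
    """Write the transformed rows to a CSV text stream."""
    writer = csv.DictWriter(sink, fieldnames=["id", "steps_to_desk"])
    writer.writeheader()
    writer.writerows(rows)


def run_job(source, sink, config):
    """Wire the three steps together; config is passed in, not hard-coded."""
    load(transform(extract(source), config["steps_per_floor"]), sink)


if __name__ == "__main__":
    # Hypothetical sample input, held in memory instead of a real file.
    source = io.StringIO("id,floor\n1,2\n2,5\n")
    sink = io.StringIO()
    run_job(source, sink, {"steps_per_floor": 21})
    print(sink.getvalue())
```

In a real PySpark job, the same split would apply, with the three functions taking and returning DataFrames and the configuration supplied when the job is submitted.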

 

8. Data Engineer Handbook

 

Data Engineer Handbook is a comprehensive collection of resources covering all aspects of data engineering. It includes tutorials, articles, and books on all the topics related to data engineering. Whether you are looking for a quick reference guide or in-depth knowledge, this handbook has something for data engineers of all levels.

The Handbook includes:

  1. Great Books
  2. Communities to Follow
  3. Companies to Keep an Eye On
  4. Blogs to Read
  5. Whitepapers
  6. Great YouTube Channels
  7. Great Podcasts
  8. Newsletters
  9. LinkedIn, Twitter, TikTok, and Instagram Influencers to Follow
  10. Courses
  11. Certifications
  12. Conferences

 
Link: DataExpert-io/data-engineer-handbook

 

9. Data Engineering Wiki

 

The Data Engineering Wiki repository is a community-driven wiki for learning data engineering. It covers a wide range of topics, including data pipelines, data warehousing, and data modeling.

Data Engineering Wiki includes:

  1. Data Engineering Concepts
  2. Frequently Asked Questions about Data Engineering
  3. Guides on How to Make Data Engineering Decisions
  4. Commonly Used Tools for Data Engineering
  5. Step-by-Step Guides for Data Engineering Tasks
  6. Learning Resources

 
Link: data-engineering-community/data-engineering-wiki

 

10. Data Engineering Practice

 

Data Engineering Practice offers a hands-on approach to learning data engineering. It provides practice projects and exercises to help you apply your knowledge and skills in real-world scenarios. By working through these projects, you will gain practical experience and build a portfolio that showcases your data engineering capabilities.

Data Engineering Practice Problems include exercises on:

  1. Downloading Files
  2. Web Scraping + Downloading + Pandas
  3. Boto3 AWS + S3 + Python
  4. Convert JSON to CSV + Ragged Directories
  5. Data Modeling for Postgres + Python
  6. Ingestion and Aggregation with PySpark
  7. Using Various PySpark Functions
  8. Using DuckDB for Analytics and Transforms
  9. Using Polars Lazy Computation

 
Link: danielbeach/data-engineering-practice
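
To give a feel for this kind of exercise, here is a minimal sketch of the JSON-to-CSV conversion at the heart of exercise 4 above. It is not the repository's solution, and it leaves out the directory-walking part of that exercise. It assumes the input is a JSON array of objects, flattens nested objects by joining keys with an underscore, and uses only the standard library. The function names and the sample records are hypothetical.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Lists and scalar values are kept as they are.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def json_records_to_csv(json_text, sink):
    """Write a JSON array of objects to `sink` as CSV."""
    rows = [flatten(record) for record in json.loads(json_text)]
    # Union of all keys in first-seen order, so records with
    # missing fields still fit under a single header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(sink, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical sample data with one nested field and one missing field.
    sample = '[{"id": 1, "geo": {"lat": 1.5, "lon": 2.5}}, {"id": 2, "name": "b"}]'
    sink = io.StringIO()
    json_records_to_csv(sample, sink)
    print(sink.getvalue())
```

Building the header from the union of keys is a deliberate choice: real-world JSON records rarely share exactly the same fields, and `csv.DictWriter` fills the gaps with `restval`.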

 

Final Words

 

Mastering data engineering requires dedication, persistence, and a passion for learning new concepts and tools. These 10 GitHub repositories provide a wealth of information and resources to help you become a professional data engineer and keep you updated on current trends. 

Whether you are just starting out or are already an experienced data engineer, I encourage you to explore these resources, contribute to open-source projects, and stay engaged with the vibrant data engineering community on GitHub.
 
 

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
