Data Engineering Projects
1. Data pipelines with Apache Airflow
2. Data Lakes with Apache Spark
1. Data pipelines with Apache Airflow
Automate the Data Warehouse ETL process with Apache Airflow: Automation is at the heart of data engineering, and Apache Airflow makes it possible to build reusable, production-grade data pipelines that cater to the needs of Data Scientists. In this project, I took on the role of a Data Engineer to:
Develop a data pipeline that automates the data warehouse ETL by building Airflow operators that handle the extraction, transformation, validation, and loading of data from S3 -> Redshift -> S3.
Build a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills (a sketch of such a quality-check operator follows below).
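Below is a minimal sketch of what such a data-quality operator can look like. The connection ID, check format, and class name are illustrative assumptions, not necessarily the exact ones used in the project:

```python
# Sketch of a custom Airflow operator that runs SQL-based data quality
# checks against Redshift. Names ("redshift" conn ID, check tuples) are
# illustrative assumptions, not the project's exact implementation.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if any SQL check does not return its expected value."""

    def __init__(self, redshift_conn_id="redshift", checks=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.checks = checks or []  # list of (sql, expected_value) pairs

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for sql, expected in self.checks:
            records = hook.get_records(sql)
            if not records or not records[0] or records[0][0] != expected:
                raise ValueError(f"Data quality check failed: {sql}")
            self.log.info("Data quality check passed: %s", sql)
```

An operator like this runs after the load tasks in the DAG, so a failed check stops the run before downstream consumers see bad data.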
Keywords: Apache Airflow, AWS Redshift, Python, ETL, Data Engineering
2. Data Lakes with Apache Spark
Develop an ETL pipeline for a Data Lake: As a data engineer, I was tasked with building an ETL pipeline that extracts data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables, allowing Data Scientists to continue finding insights from the data stored in the Data Lake.
Developed Python scripts that use PySpark to wrangle the data loaded from S3.
Designed a star schema and stored the transformed data back in S3 as partitioned Parquet files (see the sketch below).
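Below is a minimal sketch of the extract-transform-load flow such a job follows. The bucket paths, column names, and table are illustrative assumptions, not the project's actual schema:

```python
# Sketch of a Spark data-lake ETL step: read raw JSON from S3, build a
# deduplicated dimension table, and write it back to S3 as partitioned
# Parquet. Paths and columns are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-etl").getOrCreate()

# Extract: load raw JSON files from the input bucket.
df = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")

# Transform: project the dimension columns and drop duplicate keys.
songs = (
    df.select("song_id", "title", "artist_id", "year", "duration")
      .dropDuplicates(["song_id"])
)

# Load: write back to S3 as Parquet, partitioned for efficient pruning.
(songs.write
      .mode("overwrite")
      .partitionBy("year", "artist_id")
      .parquet("s3a://output-bucket/songs/"))
```

Partitioning by query-relevant columns lets downstream Spark jobs prune whole directories instead of scanning the full table.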
Keywords: AWS EMR, Data Lakes, PySpark, Python, Data Wrangling, Data Engineering