
Data Engineering Project - Data Pipeline + Data Sharing Platform

Project Description

A fun project that designs a data pipeline and a data sharing platform for medical data. The platform receives monthly data from hospitals in .csv format, containing structured medical test results. The total data size exceeds 500 TB, and new data arrives monthly. The platform needs to support queries for data filtering and aggregation.

Software Architecture

(Architecture diagram)

Solution Analysis

  1. Monthly new data arrives and is stored in an S3 bucket.
  2. An EMR cluster is triggered by an event to process the data and runs two automated steps:
    • Step 1: Spark App: an ETL job processes the data and stores it in an S3 bucket in Parquet format (see the Spark sketch after this list)
    • Step 2: a crawler crawls the new data and updates the schema in the Glue Data Catalog (the metadata Athena uses for SQL queries on the partitioned Parquet data in S3)
  3. Athena is used to query the Parquet data in the S3 bucket; there are two options:
    • Athena query engine: for ad-hoc queries
    • API Gateway + Lambda: for predefined queries
  4. Amazon QuickSight can be used to visualize the data and surface insights; it works on top of Athena.
  5. A front-end website to interact with the data, using API Gateway + Lambda to query it.
  6. A few tricks are designed in for consistency and automation.
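
A minimal sketch of the Step 1 ETL job, assuming hypothetical bucket paths and that the CSV rows already carry (or can be given) year/month columns to partition by:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monthly-medical-etl").getOrCreate()

# Read the monthly CSV drop from the raw bucket (path is a placeholder).
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("s3://raw-medical-data/2024-01/*.csv")
)

# Write partitioned, compressed Parquet that Athena can query efficiently.
(
    df.write
    .mode("append")
    .partitionBy("year", "month")          # assumes these columns exist in the data
    .option("compression", "snappy")
    .parquet("s3://processed-medical-data/medical_results/")
)
```

Partitioning by year/month is what later lets Athena prune partitions instead of scanning the full dataset.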

Cost Management Strategy

  1. Traffic control
  2. Caching (Redis?) of frequent SQL requests
  3. Parquet files with gzip/Snappy compression: more efficient for both storage and querying
    • gzip: 1.5x compression, Snappy: 2x compression
    • querying gzip is slightly slower than Snappy; choose the compression method based on the use case
  4. Athena:
    • ANSI SQL query cost checks
    • avoid full scans of the database, as they come with higher cost
    • use partitioning in the S3 bucket to reduce the amount of data scanned (see the query sketch after this list)
  5. Transient Elastic MapReduce cluster: triggered by specific events, terminated after job completion
  6. Pay-per-use services (Athena, API Gateway, Lambda) are prioritized in this system.
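
A sketch of a partition-pruned Athena query issued through boto3; the database, table, column, and bucket names are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Filtering on the partition columns (year/month) keeps the scan to one month of data;
# Athena bills per byte scanned, so partition pruning directly reduces cost.
query = """
    SELECT hospital_id, test_name, avg(result_value) AS avg_result
    FROM medical_results
    WHERE year = '2024' AND month = '01'
    GROUP BY hospital_id, test_name
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "medical_db"},                      # placeholder database
    ResultConfiguration={"OutputLocation": "s3://athena-query-results/"},  # placeholder bucket
)
print(response["QueryExecutionId"])
```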

Adaptability

  1. The system is designed to be scalable, as the data size is > 500TB and new data comes in monthly.
  2. The system is designed to be cost-effective, with pay-per-use services prioritized.
  3. The system is designed to be secure, with strict IAM Role Permissions and API Gateway traffic control.
  4. The system is designed to be user-friendly, with a front-end website for data interaction.
  5. Automation among the services, so less maintenance and human intervention is required.
  6. Any part of the system can be moved to local servers or other cloud providers with minimal changes.
  7. Different data sources can be integrated with the system with minimal changes.

Future Optimizations:

  1. SQS connection between the crawler and the EMR cluster to automate crawling after data ingestion
  2. Cost management:
    • traffic control
    • caching (Redis?) of frequent SQL requests
    • Parquet files with gzip compression; Athena can read compressed data directly, which is more efficient and cheaper
    • SQL query cost checks; avoid full scans of the database, as they come with higher cost
  3. Security:
    • API Gateway traffic control: only allow-listed IPs can interact with the query endpoint
    • even stricter IAM Role permissions
  4. Website:
    • user login + authentication
    • aggregation and other filtering features
  5. Download file feature.
  6. Optimize EMR cluster configurations
  7. Airflow for scheduling and monitoring (see the DAG sketch after this list)
  8. Terraform for infrastructure as code
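
A minimal Airflow DAG sketch for the scheduling/monitoring idea, assuming hypothetical names for the EMR cluster, Spark script, and Glue crawler:

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def submit_spark_step(**_):
    # Add the monthly ETL step to an EMR cluster (cluster id and script path are placeholders).
    boto3.client("emr").add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "monthly-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://code-bucket/etl_job.py"],
            },
        }],
    )


def start_crawler(**_):
    # Refresh the Glue Data Catalog schema after the new Parquet lands in S3.
    boto3.client("glue").start_crawler(Name="medical-parquet-crawler")  # placeholder name


with DAG(
    dag_id="monthly_medical_pipeline",
    schedule_interval="@monthly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="submit_spark_step", python_callable=submit_spark_step)
    crawl = PythonOperator(task_id="start_glue_crawler", python_callable=start_crawler)
    # In a real DAG a step sensor or waiter would sit between these tasks so the crawler
    # only runs after the Spark step has finished.
    etl >> crawl
```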

Full-stack Next.js App:

Deployed on Amplify; the full code is in the repo (private at the moment). It is not the main priority of the project, but it is a good alternative way to interact with the data. The feature flow:

  1. Generate SQL query based on user input
  2. Query the data using API Gateway + Lambda
  3. Display the data in a table format
  4. Download the data in .csv format (see the Lambda sketch below)
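
A rough sketch of how the query/download endpoint behind API Gateway could look as a Python Lambda; the inline polling is only for illustration, and the database and results bucket names are placeholders:

```python
import time

import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

RESULTS_BUCKET = "athena-query-results"  # placeholder bucket


def handler(event, context):
    """Run the SQL built by the front end and return a temporary CSV download link."""
    qid = athena.start_query_execution(
        QueryString=event["query"],
        QueryExecutionContext={"Database": "medical_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": f"s3://{RESULTS_BUCKET}/"},
    )["QueryExecutionId"]

    # Wait for the query to finish (a Step Function or async polling is nicer in production).
    while True:
        state = athena.get_query_execution(
            QueryExecutionId=qid
        )["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        return {"statusCode": 500, "body": state}

    # Athena writes the result set to S3 as a CSV named after the query execution id,
    # so the front end can offer a presigned URL as the .csv download.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": RESULTS_BUCKET, "Key": f"{qid}.csv"},
        ExpiresIn=3600,
    )
    return {"statusCode": 200, "body": url}
```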
