A fun project: design a data pipeline and sharing platform for medical data. The platform receives monthly data from hospitals in .csv format, containing structured medical test results. The data size exceeds 500 TB, with new data arriving monthly, and the platform must support filtering and aggregation queries.
- new data arrives monthly and is stored in an S3 bucket
- an event triggers the EMR cluster to process the data, running two automated steps:
- Step 1: Spark app: an ETL job processes the data and writes it to an S3 bucket in Parquet format
- Step 2: a Glue crawler crawls the new data and updates the schema in the Glue Data Catalog (metadata used by Athena for SQL queries on the partitioned Parquet data in S3)
- Athena is used to query the Parquet data in the S3 bucket; there are two options:
- Athena query engine: for ad-hoc queries
- API Gateway + Lambda: for predefined queries
- Amazon QuickSight can be used to visualize the data and surface insights; it works on top of Athena.
- a front-end website to interact with the data, using API Gateway + Lambda to run the queries.
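The API Gateway + Lambda path for predefined queries could look like the sketch below. The table name, query id, database, and output bucket are all hypothetical placeholders, not taken from the repo; the point is that callers pick from a whitelist of templates and can never submit raw SQL.

```python
"""Sketch of a Lambda behind API Gateway serving predefined Athena queries.

Hypothetical names: test_results table, avg_by_hospital query id,
medical_data database, results-bucket output location.
"""

# Whitelisted query templates; only validated parameters are interpolated.
PREDEFINED_QUERIES = {
    "avg_by_hospital": (
        "SELECT hospital_id, AVG(result_value) AS avg_result "
        "FROM test_results "
        "WHERE year = {year} AND month = {month} "
        "GROUP BY hospital_id"
    ),
}

def build_query(query_id: str, year: int, month: int) -> str:
    """Return the SQL for a predefined query with validated parameters."""
    if query_id not in PREDEFINED_QUERIES:
        raise ValueError(f"unknown query id: {query_id}")
    if not (2000 <= year <= 2100 and 1 <= month <= 12):
        raise ValueError("year/month out of range")
    return PREDEFINED_QUERIES[query_id].format(year=year, month=month)

def lambda_handler(event, context):
    """Entry point wired to API Gateway (sketch)."""
    import boto3  # imported lazily so build_query stays testable offline
    params = event["queryStringParameters"]
    sql = build_query(params["query_id"], int(params["year"]), int(params["month"]))
    athena = boto3.client("athena")
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "medical_data"},
        ResultConfiguration={"OutputLocation": "s3://results-bucket/athena/"},
    )
    return {"statusCode": 200, "body": execution["QueryExecutionId"]}
```

Because the templates filter on the partition columns, every predefined query also benefits from partition pruning automatically.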
- Designed a few tricks for consistency and automation:
- traffic control
- caching (Redis?) for frequent SQL requests
- Parquet with gzip/snappy compression: more efficient for both storage and querying
- gzip: 1.5x compression, snappy: 2x compression
- querying gzip is slightly slower than snappy; choose the compression codec based on the use case
- Athena:
- ANSI SQL query cost checks
- avoid full scans of the dataset, as they come with higher cost
- use partitioning in the S3 bucket to reduce the amount of data scanned
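One of the cost checks above — rejecting ad-hoc queries that don't filter on the partition columns — can be sketched as a cheap pre-flight lint. The `year`/`month` partition columns are an assumption about the S3 layout, and this is a heuristic string check, not a SQL parser:

```python
import re

# Partition columns assumed for the S3 layout, e.g.
# s3://bucket/data/year=2024/month=07/part-0000.parquet
PARTITION_COLUMNS = ("year", "month")

def prunes_partitions(sql: str) -> bool:
    """Cheap lint: does the query filter on every partition column?

    A heuristic check to reject obvious full scans before they reach
    Athena and incur scan cost on the whole dataset.
    """
    parts = sql.lower().split("where", 1)
    if len(parts) < 2:
        return False  # no WHERE clause at all -> full scan
    predicate = parts[1]
    return all(
        re.search(rf"\b{col}\b\s*(=|in|between)", predicate)
        for col in PARTITION_COLUMNS
    )
```

A workgroup-level bytes-scanned limit in Athena would be a stricter backstop; this lint just gives callers a fast, free error message.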
- Transient Elastic MapReduce (EMR) cluster: triggered by specific events, terminates after job completion
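The transient cluster above maps to a single `run_job_flow` call. Below is a sketch of the request it would be given — instance types, bucket paths, and the `etl_job.py` script name are placeholders, and it also folds in the "optimize cluster configurations" idea via Spark's `maximizeResourceAllocation`:

```python
def transient_emr_job_flow(raw_prefix: str, out_prefix: str) -> dict:
    """Build the run_job_flow request for a transient ETL cluster.

    The cluster shuts itself down once the step finishes
    (KeepJobFlowAliveWhenNoSteps=False). Names and sizes are placeholders.
    """
    return {
        "Name": "monthly-etl",
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.2xlarge", "InstanceCount": 4},
            ],
            # transient: terminate when no steps remain
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # let Spark size executors to the cluster automatically
        "Configurations": [
            {"Classification": "spark",
             "Properties": {"maximizeResourceAllocation": "true"}},
        ],
        "Steps": [{
            "Name": "csv-to-parquet",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://code-bucket/etl_job.py",
                         raw_prefix, out_prefix],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# boto3.client("emr").run_job_flow(**transient_emr_job_flow(...)) would launch it.
```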
- Pay-per-use services (Athena, API Gateway, Lambda) are prioritized in this system.
- The system is designed to be scalable, as the data size is > 500TB and new data comes in monthly.
- The system is designed to be cost-effective, with pay-per-use services prioritized.
- The system is designed to be secure, with strict IAM Role Permissions and API Gateway traffic control.
- The system is designed to be user-friendly, with a front-end website for data interaction.
- Automation across the services: less maintenance and human intervention required.
- Any part of the system can be moved to local servers or other cloud providers with minimal changes.
- Different data sources can be integrated with the system with minimal changes.
- SQS connection between the EMR cluster and the crawler to automate crawling after data ingestion
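The SQS link above could be consumed by a small Lambda that starts the Glue crawler once the ETL run reports completion. The message shape (`{"status": "DONE", "prefix": ...}`) and crawler name are assumptions, not the actual contract:

```python
import json

CRAWLER_NAME = "medical-parquet-crawler"  # placeholder crawler name

def parse_completion_messages(sqs_event: dict) -> list:
    """Extract the S3 prefixes announced by finished ETL runs.

    Assumes the ETL step publishes {"status": "DONE", "prefix": ...}
    to the queue; the real message shape may differ.
    """
    prefixes = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        if body.get("status") == "DONE":
            prefixes.append(body["prefix"])
    return prefixes

def lambda_handler(event, context):
    """SQS-triggered entry point: kick off the crawler on completion."""
    if parse_completion_messages(event):
        import boto3  # lazy import keeps the parser testable offline
        boto3.client("glue").start_crawler(Name=CRAWLER_NAME)
```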
- Cost management:
- traffic control
- caching (Redis?) for frequent SQL requests
- Parquet files gzip-compressed; Athena can read compressed data directly, which is more storage-efficient and cheaper
- SQL query cost checks: avoid full scans of the dataset, as they come with higher cost
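A back-of-envelope estimate of the storage saving, using the compression ratios quoted earlier (gzip 1.5x, snappy 2x) and an assumed S3 Standard price of about $0.023 per GB-month (varies by region and storage tier):

```python
# Storage-cost estimate from the compression ratios in the notes.
# The S3 price is an assumption, not a quoted figure from this project.
RAW_TB = 500
S3_PRICE_PER_GB_MONTH = 0.023

def monthly_storage_cost(ratio: float) -> float:
    """Monthly S3 cost in USD for the dataset compressed at `ratio`."""
    compressed_gb = RAW_TB * 1024 / ratio
    return compressed_gb * S3_PRICE_PER_GB_MONTH

uncompressed = monthly_storage_cost(1.0)  # ~$11,776 / month
gzip_cost = monthly_storage_cost(1.5)     # ~$7,851 / month
snappy_cost = monthly_storage_cost(2.0)   # ~$5,888 / month
```

Scan cost scales the same way, since Athena bills per byte scanned, so compression cuts both line items at once.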
- Security:
- API Gateway traffic control: only allow-listed IPs can interact with the query endpoint
- even stricter IAM role permissions
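The IP allow-listing above is typically done with an API Gateway resource policy. A sketch of building one as a plain dict — the API ARN and CIDR range are placeholders (the CIDR uses the reserved documentation range):

```python
# Sketch of an API Gateway resource policy that allow-lists caller IPs.
# The API ARN and CIDR below are placeholders, not real values.
ALLOWED_CIDRS = ["203.0.113.0/24"]
API_ARN = "arn:aws:execute-api:us-east-1:123456789012:abcdef123/*"

def ip_allowlist_policy(cidrs: list) -> dict:
    """Allow invocation in general, then deny callers outside the list."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": API_ARN,
            },
            {
                "Effect": "Deny",
                "Principal": "*",
                "Action": "execute-api:Invoke",
                "Resource": API_ARN,
                "Condition": {"NotIpAddress": {"aws:SourceIp": cidrs}},
            },
        ],
    }
```

The explicit Deny wins over the Allow for any source IP outside the allow-list, which is the standard pattern for this kind of policy.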
- website:
- User Login + Authentication
- aggregation and other filtering features
- Download file feature.
- optimize EMR cluster configurations
- Airflow for scheduling and monitoring
- Terraform for infrastructure as code
Deployed on Amplify; full code is in the repo (private at the moment). The website is not the main priority of the project, but it's a good alternative way to interact with the data. Here's a screenshot of the feature: