# Microtypo - Data Engineer Project
This is an end-to-end data engineering project to monitor my keystrokes, with a data flow designed to maximize automation and availability.
Related services are deployed on bare-metal Kubernetes, built largely from modern open-source data tools and frameworks.
## Repositories

### Services
- This is the Python package installed on a user's machine.
- Its job is to capture keystrokes and mouse clicks using pynput, running as a Click CLI application.
- After 100 records (keystrokes) have been received, their timestamps are shuffled and the batch is written to a .csv file under ~/microtypo/records/[timestamp].csv (see the capture sketch after this list).
- Every 10 minutes (configurable), these .csv files are uploaded to Amazon S3 using Boto3 under a prefix such as records/2024-06/U000002/ (also sketched below).
- This design lets Typorio upload records whenever an internet connection is available, leveraging S3's strong durability and availability SLAs as the data lake for the raw .csv files.
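
A minimal sketch of the capture loop described above: buffer keystrokes with pynput, shuffle the timestamps once 100 records accumulate, and write them out as a .csv batch. File layout and column names here are illustrative assumptions, not the exact Typorio internals:

```python
import csv
import random
import time
from pathlib import Path

from pynput import keyboard

RECORDS_DIR = Path.home() / "microtypo" / "records"
BATCH_SIZE = 100  # flush after 100 keystrokes

buffer: list[tuple[float, str]] = []

def flush(records: list[tuple[float, str]]) -> None:
    """Shuffle timestamps across the batch, then write one CSV file."""
    timestamps = [ts for ts, _ in records]
    random.shuffle(timestamps)
    shuffled = [(ts, key) for ts, (_, key) in zip(timestamps, records)]
    RECORDS_DIR.mkdir(parents=True, exist_ok=True)
    out_path = RECORDS_DIR / f"{int(time.time())}.csv"  # [timestamp].csv
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "key"])  # hypothetical schema
        writer.writerows(shuffled)

def on_press(key) -> None:
    buffer.append((time.time(), str(key)))
    if len(buffer) >= BATCH_SIZE:
        flush(buffer.copy())
        buffer.clear()

with keyboard.Listener(on_press=on_press) as listener:
    listener.join()  # block and keep capturing
```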
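
And a sketch of the periodic uploader: every 10 minutes it pushes any pending .csv files to S3 under the month/user prefix, and simply retries on the next interval if the machine is offline. The bucket name and user ID are placeholders:

```python
import time
from datetime import datetime, timezone
from pathlib import Path

import boto3

RECORDS_DIR = Path.home() / "microtypo" / "records"
BUCKET = "microtypo-records"  # hypothetical bucket name
USER_ID = "U000002"
UPLOAD_INTERVAL_S = 10 * 60   # 10 minutes, configurable

s3 = boto3.client("s3")

def upload_pending() -> None:
    month = datetime.now(timezone.utc).strftime("%Y-%m")
    for csv_file in RECORDS_DIR.glob("*.csv"):
        key = f"records/{month}/{USER_ID}/{csv_file.name}"
        s3.upload_file(str(csv_file), BUCKET, key)
        csv_file.unlink()  # remove the local copy after a successful upload

while True:
    try:
        upload_pending()
    except Exception:
        pass  # likely offline: keep files locally and retry next interval
    time.sleep(UPLOAD_INTERVAL_S)
```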
- This is an implementation of Dagster data pipeline orchestration; it can be visited at https://dagster.microtypo.com/
- A Dagster sensor subscribes to new files arriving in a particular S3 bucket (dev/stage/prod), using a month-partitioned prefix so each listing stays under the 1,000-object page limit of the list_objects_v2 API (see the sensor sketch after this list).
- When a new S3 key is detected, a Dagster run is spun up to download the file, shuffle the timestamps of all records (as above), and then:
  - Overwrite a .parquet file stored in local MinIO storage with the new records, using Polars (sketched below)
  - Append the new rows to the SQL data warehouse, implemented with StackGres running two nodes: a primary and a replica
- Additionally, a Dagster schedule runs dbt every hour (via a cron expression), materializing the defined models for data visualization in Lightdash (also sketched below).
- Note: there are some unrelated pipelines in the Dagster UI, which I built while researching and practicing more complex pipelines, for example:
  - A minimal dbt implementation following the staging, intermediate, and mart structuring approach
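
A sketch of the S3 sensor described above. The bucket, job, and op names are hypothetical, and the cursor handling is simplified compared to production code:

```python
import json
from datetime import datetime, timezone

import boto3
from dagster import Config, RunRequest, SensorEvaluationContext, job, op, sensor

BUCKET = "microtypo-records-dev"  # hypothetical; one bucket per dev/stage/prod

class IngestConfig(Config):
    s3_key: str

@op
def ingest(config: IngestConfig) -> None:
    # Placeholder: download config.s3_key, shuffle timestamps, write outputs.
    ...

@job
def ingest_records_job():
    ingest()

@sensor(job=ingest_records_job, minimum_interval_seconds=60)
def new_records_sensor(context: SensorEvaluationContext):
    # The month-partitioned prefix keeps each listing well under the
    # 1,000-object page limit of list_objects_v2.
    month = datetime.now(timezone.utc).strftime("%Y-%m")
    prefix = f"records/{month}/"
    seen = set(json.loads(context.cursor)) if context.cursor else set()

    response = boto3.client("s3").list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    for obj in response.get("Contents", []):
        key = obj["Key"]
        if key not in seen:
            seen.add(key)
            yield RunRequest(
                run_key=key,  # Dagster de-duplicates runs by run_key
                run_config={"ops": {"ingest": {"config": {"s3_key": key}}}},
            )
    context.update_cursor(json.dumps(sorted(seen)))
```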
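
The two write paths inside a run might look roughly like this with Polars, assuming a MinIO S3 endpoint and a Postgres (StackGres) connection string; all credentials and URIs below are placeholders:

```python
import polars as pl

MINIO_OPTS = {  # hypothetical MinIO endpoint and credentials
    "aws_access_key_id": "minio",
    "aws_secret_access_key": "minio123",
    "aws_endpoint_url": "http://minio.local:9000",
}
PARQUET_URI = "s3://lake/records.parquet"
WAREHOUSE_URI = "postgresql://user:pass@stackgres-primary:5432/microtypo"

def write_records(new_records: pl.DataFrame) -> None:
    # 1) Merge new rows into the existing Parquet file and overwrite it.
    #    (Cloud reads/writes via storage_options need a recent Polars.)
    existing = pl.read_parquet(PARQUET_URI, storage_options=MINIO_OPTS)
    merged = pl.concat([existing, new_records])
    merged.write_parquet(PARQUET_URI, storage_options=MINIO_OPTS)

    # 2) Append the same rows to the SQL warehouse
    #    (requires SQLAlchemy as the database engine).
    new_records.write_database(
        table_name="records",
        connection=WAREHOUSE_URI,
        if_table_exists="append",
    )
```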
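
Finally, a sketch of the hourly dbt schedule. Here the op simply shells out to the dbt CLI; the real pipeline may use the dagster-dbt integration instead:

```python
import subprocess

from dagster import Definitions, ScheduleDefinition, job, op

@op
def run_dbt_build() -> None:
    # Materialize all defined dbt models (and run their tests).
    subprocess.run(["dbt", "build"], check=True)

@job
def dbt_job():
    run_dbt_build()

hourly_dbt_schedule = ScheduleDefinition(
    job=dbt_job,
    cron_schedule="0 * * * *",  # top of every hour
)

defs = Definitions(jobs=[dbt_job], schedules=[hourly_dbt_schedule])
```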
### Others
- Kube: Kubernetes, Docker, Registry, Helm, Helmfile, Kubeadm, K9s
- Infra: S3, IAM, Cloudflare, Terraform, Terragrunt, Taskfile
## Demo

## Contact