
# Microtypo - Data Engineer Project

This is an end-to-end data engineering project to monitor my keystrokes, with the data flow designed to maximize automation and availability.

Related services are deployed on bare-metal Kubernetes, built with modern open-source (OSS) data tools and frameworks.

## Repositories


## Services

### Typorio

  • This is the Python package installed on a user's machine
  • Its job is to capture keystrokes and mouse clicks using pynput, running as a Click CLI application
  • After receiving 100 records (keystrokes), the timestamps of these records are shuffled and the batch is written as a .csv file under ~/microtypo/records/[timestamp].csv
  • At a configurable interval (10 minutes by default), the accumulated .csv files are uploaded to Amazon S3 using Boto3 under a prefix such as records/2024-06/U000002/
  • This design lets Typorio upload records whenever an internet connection is available, while leveraging S3's strong durability and availability guarantees as the data lake for raw .csv files (see the sketch after this list)
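
The capture-and-upload loop could look roughly like the sketch below, assuming pynput and boto3 as described. `BATCH_SIZE`, `RECORDS_DIR`, `BUCKET`, and the `U000002` default are illustrative placeholders, not Typorio's actual identifiers, and mouse-click handling is omitted for brevity.

```python
import csv
import random
import time
from pathlib import Path

import boto3
from pynput import keyboard

BATCH_SIZE = 100                                    # flush after 100 keystrokes
RECORDS_DIR = Path.home() / "microtypo" / "records"
BUCKET = "microtypo-records"                        # placeholder bucket name

buffer = []

def flush(records):
    """Shuffle timestamps across the batch, then write it as a .csv file."""
    timestamps = [r["ts"] for r in records]
    random.shuffle(timestamps)
    for record, ts in zip(records, timestamps):
        record["ts"] = ts
    RECORDS_DIR.mkdir(parents=True, exist_ok=True)
    path = RECORDS_DIR / f"{int(time.time())}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ts", "key"])
        writer.writeheader()
        writer.writerows(records)

def on_press(key):
    buffer.append({"ts": time.time(), "key": str(key)})
    if len(buffer) >= BATCH_SIZE:
        flush(buffer.copy())
        buffer.clear()

def upload_pending(user_id="U000002"):
    """Run periodically: push local CSVs to S3, deleting them once uploaded."""
    s3 = boto3.client("s3")
    month = time.strftime("%Y-%m")
    for path in sorted(RECORDS_DIR.glob("*.csv")):
        s3.upload_file(str(path), BUCKET, f"records/{month}/{user_id}/{path.name}")
        path.unlink()  # a failed upload raises first, so the file is kept

with keyboard.Listener(on_press=on_press) as listener:
    listener.join()
```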

### Pipelines

  • Data pipeline orchestration implemented with Dagster, which can be visited at: https://dagster.microtypo.com/
  • A Dagster Sensor subscribes to new files arriving under a particular S3 bucket (dev/stage/prod), using a month-partitioned prefix to stay under the list_objects_v2 API's 1,000-object page limit
  • When a new S3 key is detected, a Dagster run is spun up to download the file, shuffle the timestamps of all records (as above), and then:
    • Rewrite a .parquet file in local MinIO storage, merging in the new records with Polars
    • Append the new rows to the SQL data warehouse, implemented with StackGres as a two-node cluster: one primary and one replica
  • Additionally, a Dagster Schedule runs dbt on an hourly cron expression, materializing the defined models for data visualization in Lightdash (sketches of the sensor and schedule follow this list)
  • Note: the Dagster UI also contains some unrelated pipelines that I built while researching and practicing more complex pipeline patterns
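
A minimal sketch of the sensor pattern, assuming Dagster and boto3; the bucket, job, and op names are placeholders, and the op body only logs where the real download/shuffle/merge work would go.

```python
import time

import boto3
from dagster import RunRequest, job, op, sensor

BUCKET = "microtypo-records"  # placeholder bucket name

@op(config_schema={"key": str})
def process_key(context):
    # The real op would download the object, shuffle timestamps, merge the
    # records into the MinIO parquet file with Polars, and append rows to
    # the StackGres warehouse.
    context.log.info(f"processing {context.op_config['key']}")

@job
def ingest_records():
    process_key()

@sensor(job=ingest_records, minimum_interval_seconds=60)
def new_records_sensor(context):
    s3 = boto3.client("s3")
    # A month-partitioned prefix keeps each listing comfortably below the
    # 1,000-key page that list_objects_v2 returns.
    prefix = f"records/{time.strftime('%Y-%m')}/"
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    last_key = context.cursor or ""
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key <= last_key:
            continue  # keys are listed in ascending order; skip seen ones
        # run_key also de-duplicates: Dagster won't launch the same key twice
        yield RunRequest(
            run_key=key,
            run_config={"ops": {"process_key": {"config": {"key": key}}}},
        )
        last_key = key
    context.update_cursor(last_key)
```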
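
And a hedged sketch of the hourly dbt schedule. Shelling out to `dbt run` is one simple pattern (the dagster-dbt integration is the richer alternative); the job and schedule names are again illustrative.

```python
import subprocess

from dagster import job, op, schedule

@op
def run_dbt(context):
    # Materialize the staging -> intermediate -> mart models.
    result = subprocess.run(["dbt", "run"], capture_output=True, text=True)
    context.log.info(result.stdout)
    result.check_returncode()  # fail the Dagster run if dbt failed

@job
def dbt_job():
    run_dbt()

@schedule(cron_schedule="0 * * * *", job=dbt_job)  # top of every hour
def hourly_dbt_schedule():
    return {}  # no extra run_config needed
```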

### dbt

  • A minimal implementation of dbt, following the standard structuring approach with staging, intermediate, and mart models (a sketch of the layout follows)
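
Roughly, the project layout looks like this (the model file names here are hypothetical, shown only to illustrate the three layers):

```
models/
├── staging/        # one model per source table; renaming and type casting
│   └── stg_keystrokes.sql
├── intermediate/   # reusable joins and business-logic building blocks
│   └── int_keystroke_sessions.sql
└── marts/          # final analytics-ready models consumed by Lightdash
    └── fct_typing_speed.sql
```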

### Others

  • Kube: Kubernetes, Docker, Registry, Helm, Helmfile, Kubeadm, K9s
  • Infra: S3, IAM, Cloudflare, Terraform, Terragrunt, Taskfile

## Demo

  • Dagster UI
    • View-only
  • Apache Superset dashboard with prebuilt charts
    • username: admin
    • password: password
  • Lightdash dashboard for custom metrics
    • username: admin
    • password: password$

## Contact