Data and ML engineering covers everything from raw data to deployed models. This roadmap is for builders, not researchers: the goal is to ship reliable systems, not to write papers.
The data triad
Pandas (or Polars) for in-memory wrangling, SQL for everything in a warehouse, Python as the glue. You'll use all three every day.
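A minimal sketch of the triad in action, using the stdlib's sqlite3 in place of a real warehouse (and skipping pandas) so it runs anywhere; the table and rows are illustrative:

```python
import sqlite3

# Toy orders table standing in for warehouse data — illustrative only.
rows = [("alice", 30.0), ("bob", 12.5), ("alice", 7.5), ("carol", 99.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# SQL does the aggregation; Python is the glue that consumes the result.
totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
)
print(totals["alice"])  # 37.5
```

In practice pandas or Polars would take over once the result set is in memory; the division of labor (aggregate in SQL, wrangle in Python) is the point.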
Just enough to be dangerous
Probability, distributions, regression, classification, evaluation metrics. You don't need to derive calculus from scratch; you need to read sklearn and Hugging Face docs without confusion.
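"Reading the docs without confusion" mostly means knowing what the metrics actually compute. A hand-rolled precision/recall/F1 on toy labels (these match what `sklearn.metrics` would report for the same inputs):

```python
# Binary classification: compare predicted labels against ground truth.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)             # of predicted positives, how many are real
recall = tp / (tp + fn)                # of real positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.75 0.75 0.75
```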
dbt, Airflow, Dagster, Prefect
Modern data eng is built around dbt for transformations and an orchestrator (Airflow / Dagster / Prefect) for scheduling. Lineage, tests, and observability matter as much as the SQL.
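At its core an orchestrator runs tasks in dependency order. A toy sketch with the stdlib's graphlib (Python 3.9+); the task names are hypothetical, and a real Airflow or Dagster run would add scheduling, retries, and observability on top:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline DAG: each task maps to its upstream dependencies.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "test": {"transform"},
    "load": {"transform"},
}

log: list[str] = []

def run(task: str) -> None:
    # A real orchestrator would shell out to dbt, Spark, etc. here.
    log.append(task)

# static_order() yields tasks only after all their dependencies.
for task in TopologicalSorter(dag).static_order():
    run(task)

print(log)
```

The lineage the text mentions is exactly this graph: dbt builds it from `ref()` calls between models, and the orchestrator executes it.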
BigQuery, Snowflake, DuckDB, Iceberg
BigQuery, Snowflake, and Redshift dominate enterprise. DuckDB and ClickHouse are excellent for cheaper analytics. Iceberg / Delta tables let you separate storage from compute properly.
Training, serving, monitoring
Pick a stack: PyTorch + Hugging Face for deep learning, sklearn + XGBoost for tabular. Track experiments with W&B or MLflow. Serve with vLLM, Triton, or BentoML. Monitor for drift.
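Drift monitoring reduces to comparing live feature distributions against the training distribution. A crude stand-in for the PSI or KS tests that real monitoring tools use, with made-up numbers:

```python
import statistics

def drift_score(train: list[float], live: list[float]) -> float:
    """Shift of the live mean, measured in training standard deviations.
    A simplistic proxy — production systems compare full distributions."""
    mu = statistics.mean(train)
    sigma = statistics.pstdev(train)
    return abs(statistics.mean(live) - mu) / sigma

train = [10.0, 11.0, 9.0, 10.5, 9.5]     # feature values seen at training time
stable = [10.2, 9.8, 10.1]               # live traffic, same regime
shifted = [14.0, 15.0, 13.5]             # live traffic after an upstream change

print(drift_score(train, stable) < 1.0)   # True — within normal range
print(drift_score(train, shifted) > 3.0)  # True — alert-worthy drift
```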
RAG, fine-tuning, evals
Even classical ML teams now have LLM workloads. Learn fine-tuning (LoRA / QLoRA), RAG architecture, and how to evaluate generative outputs systematically.
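The retrieval half of RAG, stripped to its skeleton: score documents against the query and keep the top-k. This sketch uses bag-of-words cosine similarity where a real system would use embedding vectors and a vector store; the corpus is invented:

```python
from collections import Counter
from math import sqrt

# Tiny illustrative corpus — a real system holds chunked, embedded documents.
docs = [
    "LoRA fine-tunes a small set of adapter weights",
    "Iceberg tables separate storage from compute",
    "RAG retrieves documents to ground the model's answer",
]

def vec(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = vec(query)
    return sorted(docs, key=lambda d: cosine(q, vec(d)), reverse=True)[:k]

top = retrieve("how does RAG ground an answer in documents")
print(top[0])
```

The retrieved text is then stuffed into the prompt; evaluating whether the generated answer actually used it is the "evals" part, and it deserves the same rigor as classical metrics.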
Kafka, Flink, Materialize
When batch isn't enough. Kafka for the substrate, Flink or Materialize for stateful streaming. Most products don't need this, but when they do, nothing else works.
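"Stateful streaming" means keeping running aggregates as events arrive rather than recomputing from scratch. A toy tumbling-window sum keyed by user, with invented events standing in for a Kafka topic:

```python
from collections import defaultdict

# Events: (epoch_seconds, user, amount) — illustrative data, not a real feed.
events = [
    (0, "alice", 5.0), (12, "bob", 3.0), (61, "alice", 2.0),
    (75, "alice", 1.0), (130, "bob", 4.0),
]

WINDOW = 60  # one-minute tumbling windows

# The operator state a Flink job would checkpoint: per (window, key) sums.
state: defaultdict = defaultdict(float)
for ts, user, amount in events:
    state[(ts // WINDOW, user)] += amount

print(dict(state))
```

A real engine adds the hard parts this sketch ignores: out-of-order events, watermarks, fault-tolerant checkpoints, and emitting results when windows close.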
We pair these roadmaps with hands-on engagements: pair-programming, code review, and architecture support.