This cutting-edge training course on End-to-End Data Engineering on the Lakehouse Architecture is designed to equip data professionals with the practical skills and modern tools needed to build scalable, unified, and efficient data pipelines using the Lakehouse paradigm. By combining the strengths of data lakes and data warehouses, the Lakehouse architecture enables real-time analytics, machine learning, and BI directly on lake data without complex data movement. Participants will master data ingestion, transformation, governance, and orchestration using technologies such as Apache Spark, Delta Lake, Databricks, and open table formats like Apache Iceberg and Apache Hudi, positioning themselves at the forefront of data engineering innovation.
Duration: 10 Days
Target Audience
- Data Engineers
- Cloud Data Architects
- BI and Analytics Engineers
- Data Platform Engineers
- ETL Developers
- Data Lake Administrators
- ML Engineers
- Software Engineers transitioning to data roles
Course Objectives
- Understand the core principles of Lakehouse architecture
- Learn how to implement scalable ELT pipelines on data lakes
- Integrate Delta Lake, Apache Iceberg, or Apache Hudi for data reliability
- Design efficient data lake schemas and partitioning strategies
- Use Apache Spark and SQL for large-scale data transformation
- Manage metadata, transactions, and schema evolution
- Enable real-time and batch data access on a unified platform
- Orchestrate workflows with tools like Airflow and dbt
- Ensure data quality, governance, and lineage
- Optimize performance, cost, and data retrieval speeds
- Build end-to-end pipelines for BI, analytics, and ML workloads
Module 1: Introduction to Lakehouse Architecture
- Lakehouse vs. traditional data warehouse and data lake
- Benefits of unifying storage and analytics
- Open formats: Delta Lake, Iceberg, Hudi
- Key components and ecosystem overview
- Use cases in analytics, BI, and machine learning
Module 2: Ingestion Strategies for the Lakehouse
- ELT vs. ETL in Lakehouse environments
- Streaming vs. batch ingestion
- Tools for ingestion: Apache NiFi, Airbyte, Kafka, Spark
- Handling schema drift and data validation
- Data partitioning and file formats
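To give a flavour of what this module covers, here is a minimal batch-ingestion sketch in PySpark, assuming a Spark session configured with Delta Lake support; the bucket paths and column names are illustrative only:

    # Minimal batch-ingestion sketch: land raw CSV files in a bronze Delta table.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("s3://landing-zone/orders/"))

    # Tag each row with ingestion metadata, then append to the bronze layer.
    # mergeSchema lets new upstream columns land without failing the job,
    # which is one common answer to schema drift.
    (raw.withColumn("_ingested_at", F.current_timestamp())
        .write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .partitionBy("order_date")
        .save("s3://lakehouse/bronze/orders"))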
Module 3: Working with Apache Spark on the Lakehouse
- Introduction to Spark DataFrames and SQL
- Optimizing transformations and joins
- Structured Streaming for real-time pipelines
- Performance tuning and job optimization
- Working with large-scale JSON, Parquet, and ORC files
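A short illustrative example of the DataFrame transformations and join optimizations discussed in this module; the table paths and columns are hypothetical:

    # An aggregation plus a broadcast join, a common Spark optimization.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("spark-transforms").getOrCreate()

    orders = spark.read.format("delta").load("s3://lakehouse/silver/orders")
    countries = spark.read.format("delta").load("s3://lakehouse/silver/countries")

    # Broadcasting the small dimension table avoids shuffling the large side.
    revenue = (orders
               .join(broadcast(countries), "country_code")
               .groupBy("country_name")
               .agg(F.sum("amount").alias("total_revenue"))
               .orderBy(F.desc("total_revenue")))

    revenue.show(10)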
Module 4: Delta Lake Fundamentals
- ACID transactions and time travel
- Schema enforcement and evolution
- Managing Delta logs and versions
- Vacuuming and data retention policies
- Delta Lake vs. traditional lake formats
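The following minimal sketch, assuming a Delta-enabled Spark session and an illustrative table path, touches on time travel, the transaction log, and vacuuming:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-basics").getOrCreate()
    path = "s3://lakehouse/silver/orders"

    # Read the table as of an earlier version (time travel).
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Inspect the transaction log: one row per commit.
    DeltaTable.forPath(spark, path).history().select("version", "operation").show()

    # Remove files no longer referenced within the retention window
    # (168 hours = 7 days); this trades time-travel depth for storage cost.
    DeltaTable.forPath(spark, path).vacuum(retentionHours=168)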
Module 5: Apache Iceberg and Hudi Deep Dive
- Table structure and metadata layers
- Data compaction and clustering
- Querying with Spark, Trino, and Presto
- Use cases for versioning and rollback
- Performance comparisons and trade-offs
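As a small taste of Iceberg's metadata layers, the sketch below assumes a Spark session already configured with an Iceberg catalog named lake; the catalog and table names are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

    # Create an Iceberg table with a hidden day-level partition transform.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.db.events (
            event_id BIGINT,
            event_ts TIMESTAMP,
            payload  STRING
        ) USING iceberg
        PARTITIONED BY (days(event_ts))
    """)

    # Iceberg exposes its metadata layers as queryable tables; snapshots
    # underpin the versioning and rollback use cases covered above.
    spark.sql("SELECT snapshot_id, committed_at FROM lake.db.events.snapshots").show()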
Module 6: Unified Data Modeling and Storage Design
- Bronze, Silver, and Gold layer modeling
- Medallion architecture design principles
- Data normalization and denormalization
- Partitioning, bucketing, and clustering strategies
- Handling slowly changing dimensions (SCDs)
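A compact medallion-architecture sketch in PySpark, with invented paths and columns, showing how data is refined from bronze to silver to gold:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("medallion").getOrCreate()

    bronze = spark.read.format("delta").load("s3://lakehouse/bronze/orders")

    # Silver: deduplicate, enforce types, drop obviously bad rows.
    silver = (bronze
              .dropDuplicates(["order_id"])
              .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
              .filter(F.col("amount") > 0))
    silver.write.format("delta").mode("overwrite").save("s3://lakehouse/silver/orders")

    # Gold: a business-level aggregate ready for BI tools.
    gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
    gold.write.format("delta").mode("overwrite").save("s3://lakehouse/gold/customer_ltv")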
Module 7: Data Transformation with dbt on the Lakehouse
- Introduction to dbt core and dbt Cloud
- Building modular SQL models
- Testing and documenting transformations
- Orchestrating dbt with Airflow
- Lineage visualization and deployment best practices
Module 8: Metadata Management and Governance
- Cataloging data with Unity Catalog, Hive Metastore, AWS Glue
- Implementing data classifications and tags
- Role-based access control and fine-grained permissions
- Managing table versions and audit logs
- Lineage tracking with OpenMetadata or DataHub
Module 9: Workflow Orchestration and Automation
- Creating DAGs in Airflow for Lakehouse pipelines
- Managing dependencies and retries
- Scheduling workflows and integrating alerts
- Parameterization and configuration
- Using Prefect or Dagster as alternatives
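For orientation, here is a minimal Airflow DAG of the kind built in this module, assuming Airflow 2.4 or newer; the task callables are placeholders:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def ingest():
        print("load raw files into the bronze layer")

    def transform():
        print("refine bronze data into silver and gold tables")

    def validate():
        print("run data-quality checks on the outputs")

    with DAG(
        dag_id="lakehouse_daily",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        t1 = PythonOperator(task_id="ingest", python_callable=ingest)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="validate", python_callable=validate)
        t1 >> t2 >> t3  # dependencies: ingest before transform before validate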
Module 10: Data Quality and Observability
- Writing validation rules with Great Expectations
- Detecting anomalies and outliers in data
- Monitoring freshness, volume, and distribution
- Building dashboards with Superset or Grafana
- Incident management and resolution
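The hand-rolled PySpark checks below illustrate the kinds of rules this module formalizes with Great Expectations; the table path and thresholds are illustrative, and this is not a Great Expectations API example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-checks").getOrCreate()
    df = spark.read.format("delta").load("s3://lakehouse/silver/orders")

    total = df.count()
    null_keys = df.filter(F.col("order_id").isNull()).count()
    bad_amounts = df.filter(F.col("amount") <= 0).count()

    # Fail loudly so the orchestrator can alert and halt downstream tasks.
    assert total > 0, "freshness check failed: table is empty"
    assert null_keys == 0, f"{null_keys} rows have a null order_id"
    assert bad_amounts / total < 0.01, "more than 1% of amounts are non-positive"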
Module 11: Real-Time Analytics and Streaming
- Streaming ingestion with Kafka and Spark Structured Streaming
- Building real-time dashboards and microservices
- Event-time windowing and watermarking
- Use cases in IoT, finance, and e-commerce
- Trade-offs of stream vs. micro-batch
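A minimal streaming sketch, assuming a Spark session with the Kafka connector and Delta support available; the broker address, topic, and paths are invented:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "clickstream")
              .load()
              .selectExpr("CAST(value AS STRING) AS value",
                          "timestamp AS event_time"))

    # A 10-minute watermark bounds streaming state: events arriving later
    # than that are dropped, keeping aggregations from growing unbounded.
    counts = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(F.window("event_time", "5 minutes"))
              .count())

    query = (counts.writeStream
             .format("delta")
             .outputMode("append")
             .option("checkpointLocation", "s3://lakehouse/_chk/clicks")
             .start("s3://lakehouse/gold/click_counts"))
    query.awaitTermination()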
Module 12: Machine Learning on the Lakehouse
- Integrating ML models into pipelines
- Feature engineering and feature stores
- Using MLflow for tracking and deployment
- Batch inference vs. real-time scoring
- Lakehouse as a foundation for MLOps
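A small MLflow tracking sketch using a toy scikit-learn model; the run name, parameters, and dataset are placeholders:

    import mlflow
    import mlflow.sklearn
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, random_state=42)

    # Log parameters, metrics, and the model itself so the run is
    # reproducible and the artifact can later be deployed for scoring.
    with mlflow.start_run(run_name="demo"):
        model = LogisticRegression(max_iter=200).fit(X, y)
        mlflow.log_param("max_iter", 200)
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")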
Module 13: Performance Optimization and Cost Management
- Query optimization with Z-Ordering and caching
- Choosing the right file size and format
- Auto-scaling compute resources
- Storage tiering and archival
- Tracking cost per query and job
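Two of the optimizations covered in this module, sketched via Spark SQL and assuming a runtime with Delta Lake's OPTIMIZE support; the table names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("optimize").getOrCreate()

    # Compact small files and Z-order by a frequently filtered column so
    # queries on customer_id can skip unrelated files entirely.
    spark.sql("OPTIMIZE lakehouse.silver_orders ZORDER BY (customer_id)")

    # Caching a hot table avoids repeated scans within a session.
    spark.sql("CACHE TABLE lakehouse.gold_customer_ltv")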
Module 14: Security and Compliance in the Lakehouse
- Implementing encryption at rest and in transit
- Fine-grained access with row- and column-level security
- Data masking and tokenization techniques
- GDPR and HIPAA compliance measures
- Audit trails and anomaly detection
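An illustrative masking sketch in PySpark; the column names are invented, and a real deployment would pair this with key management and access-control policies:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("masking").getOrCreate()
    df = spark.read.format("delta").load("s3://lakehouse/silver/customers")

    # Hash a direct identifier (pseudonymization) and redact all but the
    # last four digits of a card number, then drop the raw columns.
    masked = (df
              .withColumn("email_token", F.sha2(F.col("email"), 256))
              .withColumn("card_masked",
                          F.concat(F.lit("****-****-****-"),
                                   F.substring("card_number", -4, 4)))
              .drop("email", "card_number"))

    masked.write.format("delta").mode("overwrite").save("s3://lakehouse/gold/customers_masked")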
Module 15: Building Production-Ready Lakehouse Pipelines
- CI/CD and version control integration
- Deploying and monitoring end-to-end workflows
- Handling failures, retries, and rollbacks
- Documentation and handover practices
- Case study: From ingestion to BI using the Lakehouse stack
Training Approach
This course will be delivered by our skilled trainers, who have vast knowledge and experience as expert professionals in their fields. The course is taught in English through a mix of theory, practical activities, group discussions, and case studies. Participants will be provided with course manuals and additional training materials upon completion of the training.
Tailor-Made Course
This course can also be tailor-made to meet an organization's requirements. For further inquiries, please contact us at: Email: info@skillsforafrica.org, training@skillsforafrica.org Tel: +254 702 249 449
Training Venue
The training will be held at our Skills for Africa Training Institute Training Centre. We also offer group training at a requested location anywhere in the world. The course fee covers tuition, training materials, two break refreshments, and a buffet lunch.
Visa applications, travel expenses, airport transfers, dinners, accommodation, insurance, and other personal expenses are the responsibility of the participant.
Certification
Participants will be issued with a Skills for Africa Training Institute certificate upon completion of this course.
Airport Pickup and Accommodation
Airport pickup and accommodation are arranged upon request. For bookings, contact our Training Coordinator via Email: info@skillsforafrica.org, training@skillsforafrica.org Tel: +254 702 249 449
Terms of Payment: Unless otherwise agreed between the two parties, payment of the course fee should be made 7 working days before commencement of the training.