Sameh Sharaf

Senior Data Engineer

15 years building large-scale data infrastructure and exploring AI/ML frontiers.

About

I'm a Senior Data Engineer with over 15 years of experience building and delivering large-scale data solutions across multiple industries and continents. Currently based in Berlin, Germany, I specialise in building robust ETL pipelines, data lakehouses, and cloud-native infrastructure that powers data-driven decision-making at scale.

I've led the migration of 100+ production pipelines to a modern data lakehouse architecture, implementing comprehensive monitoring, fault handling, and data quality governance. My career spans work with enterprise clients across retail, finance, healthcare, and telecommunications, delivering solutions on AWS, GCP, and Microsoft Azure that handle petabytes of data efficiently.

Currently, I am expanding into AI/ML through Georgia Tech's Machine Learning program and hands-on projects involving LLMs, deep learning, and agentic engineering. I'm passionate about bridging the gap between data engineering and machine learning operations, building the infrastructure that enables AI systems to thrive in production environments.

Skills & Technologies

Big Data & Databases

  • Apache Spark
  • Databricks
  • AWS Redshift
  • PostgreSQL
  • Oracle PL/SQL
  • DynamoDB
  • Google BigQuery
  • Elastic Stack

Cloud & Orchestration

  • AWS (S3, Lambda, EC2)
  • Google Cloud Platform
  • Microsoft Azure
  • Apache Airflow
  • MLflow
  • Docker & Kubernetes

AI/ML & Programming

  • AWS SageMaker & Bedrock
  • LLMs & Agentic Engineering
  • Deep Learning (RNN, CNN)
  • PyTorch & Scikit-Learn
  • Python
  • Go
  • FastAPI

DevOps & Monitoring

  • GitLab CI/CD
  • Datadog
  • Grafana
  • Bugsnag
  • Sentry

Featured Projects

Data Lakehouse Migration

Led the migration of 100+ production ETL pipelines from legacy Oracle and Exasol systems to a modern data lakehouse architecture on AWS and Databricks. Implemented comprehensive monitoring, fault handling, and data quality governance, delivering significant cost savings through infrastructure modernisation and optimised cloud resource utilisation.

Apache Airflow • AWS • Databricks • Python • Data Governance

Podcast Transcript & Emotion Recognition Pipeline

Built an end-to-end subtitle generation pipeline integrating HuggingFace speech-to-text and emotion recognition models to automatically transcribe podcast audio across multiple European languages. Generated emotionally aware subtitles by mapping recognised emotional tone to text-to-speech output for localised, expressive delivery.

NLP • Speech-to-Text • Emotion Recognition • Python

DQA: Data Quality Assessment Service

Designed and built an internal data quality service adopted by analysts and integrated across data pipelines at Sertis. Provided automated summary statistics and configurable validation rules via a publish-subscribe architecture, exposing REST APIs and a web UI for self-serve data quality monitoring across the organisation.

Python • REST APIs • Data Validation • Microservices

Petabyte-Scale Infrastructure

Sole data engineer responsible for end-to-end data infrastructure handling petabytes of data across real-time streaming and batch pipelines. Designed and administered an AWS Redshift data warehouse and managed the Qubole platform, including Hive, Spark, and Airflow clusters, while maintaining data governance and quality standards at scale.

AWS Redshift • Apache Spark • Airflow • Qubole • Data Warehousing

Demand Forecasting ML Pipeline

Delivered a two-phase engagement for a multinational consumer goods company. Phase one: designed an ETL pipeline to Azure SQL DB enabling downstream analytics. Phase two: architected cloud infrastructure on Azure and Databricks to host and serve a demand forecasting ML model in production.

Microsoft Azure • Databricks • Azure SQL DB • ML Infrastructure • ETL

Experience

Sep 2020 - Present
Senior Data Engineer
Zalando • Berlin, Germany

Leading data lakehouse migration of 100+ production pipelines from Oracle and Exasol to AWS and Databricks. Architecting robust ETL infrastructure with comprehensive monitoring, fault handling, and data quality governance. Driving cost optimisation through infrastructure modernisation.

Jun 2018 - Aug 2020
Data Engineer
Sertis • Bangkok, Thailand

Architected cloud and on-premise data solutions for 10+ enterprise clients across retail, finance, healthcare, and telco on AWS, GCP, and Azure. Developed internal self-serve tooling for data scientists covering infrastructure provisioning and data wrangling, containerised with Docker and Kubernetes.

Dec 2016 - Jun 2018
Data Engineer
iflix • Kuala Lumpur, Malaysia

Sole data engineer responsible for end-to-end infrastructure handling petabytes of data. Designed and administered AWS Redshift data warehouse with significant improvements in query performance. Managed Qubole platform including Hive, Spark, and Airflow clusters at scale.

Aug 2015 - Dec 2016
Software Engineer (Data / Full Stack)
Jirnexu • Kuala Lumpur, Malaysia

Sole data and software engineer driving full development of SORA, an in-house CRM system. Engineered near real-time data synchronisation from DynamoDB to Redshift using DynamoDB Streams. Built dynamic query builder for non-technical teams to access customer data.

Nov 2010 - Jun 2015
BI Engineer
MTN Syria • Damascus, Syria

Contributed to revenue assurance initiatives that identified and recovered millions in lost revenue. Automated reconciliation processes and built monitoring systems to detect revenue leakage. Improved billing accuracy through Oracle PL/SQL development and BSCS iX administration.

Get In Touch

I'm open to new opportunities, collaborations, and conversations about data engineering and AI/ML. Feel free to reach out through any of the following channels: