Sameh Sharaf

Data & AI Consultant

15 years architecting data infrastructure and AI systems across enterprise, startups, and emerging markets — available for consulting engagements and select full-time roles.

Available for engagements

About

With over 15 years of experience delivering large-scale data and AI solutions across multiple industries and continents, I help companies architect, build, and modernise the data infrastructure that powers their decisions and AI ambitions. I am currently based in Berlin, Germany, and focus on robust ETL pipelines, data lakehouses, and cloud-native systems that scale.

I bring both strategic clarity and hands-on execution, from migrating 100+ production pipelines to a modern lakehouse architecture to implementing monitoring, fault handling, and data quality governance. My career spans enterprise clients across retail, finance, healthcare, and telecommunications, delivering solutions on AWS, GCP, and Microsoft Azure at petabyte scale.

I'm also deepening my AI/ML practice through Georgia Tech's Machine Learning program and hands-on work with LLMs, deep learning, and agentic systems — with a focus on the infrastructure that makes AI reliable in production. I work as a consultant and am open to select full-time roles where I can drive meaningful impact.

Skills & Technologies

Big Data & Databases

  • Apache Spark
  • Databricks
  • AWS Redshift
  • PostgreSQL
  • Oracle PL/SQL
  • DynamoDB
  • Google BigQuery
  • Elastic Stack

Cloud & Orchestration

  • AWS (S3, Lambda, EC2)
  • Google Cloud Platform
  • Microsoft Azure
  • Apache Airflow
  • MLflow
  • Docker & Kubernetes

AI/ML & Programming

  • AWS SageMaker & Bedrock
  • LLMs & Agentic Engineering
  • Deep Learning (RNN, CNN)
  • PyTorch & Scikit-Learn
  • Python
  • Go
  • FastAPI

DevOps & Monitoring

  • GitLab CI/CD
  • Datadog
  • Grafana
  • Bugsnag
  • Sentry

Featured Projects

Data Lakehouse Migration — Retail / E-commerce

Led the migration of 100+ production ETL pipelines from legacy Oracle and Exasol systems to a modern data lakehouse architecture on AWS and Databricks. Introduced right-sized cluster management — replacing a one-size-fits-all approach — achieving up to 35% faster pipeline execution and direct reductions in cloud infrastructure costs. Implemented comprehensive monitoring, fault handling, and data quality governance throughout.

Apache Airflow • AWS • Databricks • Python • Data Governance

Unified AI & Data Infrastructure — Multinational Consumer Goods

Proposed and architected a unified cloud infrastructure on Azure and Databricks to consolidate data processing and AI model development across a multinational consumer goods company. The solution was adopted globally, with the company's engineering teams implementing it in their respective tech departments worldwide. Collaborated with data scientists to productionise demand forecasting models, making them reliable and scalable.

Microsoft Azure • Databricks • Azure SQL DB • ML Infrastructure • ETL

Real-Time Event Streaming & Fraud Detection — Telecommunications

Built a near real-time data streaming pipeline processing events emitted by mobile devices, enabling near-instant monitoring of activity by demographics and region using attributes encoded in each event. The processed stream fed directly into an AI clustering model to detect and track suspicious activity patterns, giving the client operational visibility it previously lacked.

Real-Time Streaming • Event Processing • Python • AI/ML • Cloud Infrastructure

Legacy Data Platform Migration — Retail

Fully migrated a retail client's data infrastructure from a legacy OLAP data cube model to a modern data lake architecture on Microsoft Azure. The new platform replaced rigid, hard-to-maintain cube structures with a flexible, scalable foundation — enabling faster analytics iteration and unlocking new data use cases for the business.

Microsoft Azure • Data Lake • ETL • Data Architecture

Petabyte-Scale Streaming & Batch Infrastructure — Media

Led end-to-end data infrastructure handling petabytes of data across real-time streaming and batch pipelines for a high-growth media streaming platform. Designed and administered an AWS Redshift data warehouse with significant query performance improvements, and managed the full Qubole platform — including Hive, Spark, and Airflow clusters — while maintaining data governance and quality standards at scale.

AWS Redshift • Apache Spark • Airflow • Qubole • Data Warehousing

Podcast Transcript & Emotion Recognition Pipeline

Built an end-to-end subtitle generation pipeline integrating HuggingFace speech-to-text and emotion recognition models to automatically transcribe podcast audio across multiple European languages. Generated emotionally aware subtitles by mapping recognised emotional tone to text-to-speech output for localised, expressive delivery.

NLP • Speech-to-Text • Emotion Recognition • Python

DQA: Data Quality Assessment Service

Designed and built an internal data quality service adopted by analysts and integrated across data pipelines. Provided automated summary statistics and configurable validation rules via a publish-subscribe architecture, exposing REST APIs and a Web UI for self-serve data quality monitoring across the organisation.

Python • REST APIs • Data Validation • Microservices

Experience

Sep 2020 - Present
Senior Data Engineer
Zalando • Berlin, Germany

Leading the data lakehouse migration of 100+ production pipelines from Oracle and Exasol to AWS and Databricks. Introduced right-sized cluster management achieving up to 35% faster pipeline execution and measurable cloud cost reduction. Architecting robust ETL infrastructure with comprehensive monitoring, fault handling, and data quality governance.

Jun 2018 - Aug 2020
Data Engineer
Sertis • Bangkok, Thailand

Architected cloud and on-premise data solutions for 10+ enterprise clients across retail, finance, healthcare, and telco on AWS, GCP, and Azure — delivering industry-specific outcomes including legacy platform migrations, real-time streaming pipelines, and AI infrastructure adopted globally. Collaborated with data science teams to build self-serve tooling covering infrastructure provisioning and data wrangling, containerised with Docker and Kubernetes.

Dec 2016 - Jun 2018
Data Engineer
iflix • Kuala Lumpur, Malaysia

Led end-to-end data infrastructure handling petabytes of data across real-time streaming and batch pipelines. Designed and administered an AWS Redshift data warehouse with significant improvements in query performance. Managed the Qubole platform, including Hive, Spark, and Airflow clusters, at scale.

Aug 2015 - Dec 2016
Software Engineer (Data / Full Stack)
Jirnexu • Kuala Lumpur, Malaysia

Led full development of SORA, an in-house CRM system built to serve non-technical teams. Engineered near real-time data synchronisation from DynamoDB to Redshift using DynamoDB Streams, and built a dynamic query builder enabling business teams to access customer data without engineering support.

Nov 2010 - Jun 2015
BI Engineer
MTN Syria • Damascus, Syria

Contributed to revenue assurance initiatives that identified and recovered millions in lost revenue. Automated reconciliation processes and built monitoring systems to detect revenue leakage. Improved billing accuracy through Oracle PL/SQL development and BSCS iX administration.

Work With Me

Available for consulting engagements — whether you need to modernise your data infrastructure, stand up an AI/ML platform, or get a senior hand on a critical project. Also open to select full-time roles. Let's talk.