More than 1M ETL jobs for one of our customers

Our customers want to increase profitability through data-driven marketing and are far less interested in the IT side of it. Keeping operations running smoothly is hard, and our tooling takes care of that so the customer can focus on what matters most.

Last week we reached a milestone: our automated daily workflow processed its one millionth job for one of our customers. At HDA, we automatically run thousands of ETL jobs every day for all our customers, connecting to many different data sources and retrieving, cleaning and transforming data to store for reporting and visualization. The biggest challenge is keeping track of all jobs, whether they succeeded or failed, and, in case of failure, retrying individual jobs automatically (or sometimes manually) so that all data is available for that day.
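To make the tracking and retrying concrete, here is a minimal sketch in Python. It is not the code of our tool; the names `Job`, `JobState`, `run_with_retry` and `daily_report`, as well as the default of three attempts, are assumptions made up for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class JobState(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Job:
    # Hypothetical job record; a real system would persist this in a database.
    name: str
    run: Callable[[], None]  # the ETL step itself; raises an exception on failure
    max_attempts: int = 3    # assumed default, chosen for the example
    attempts: int = 0
    state: JobState = JobState.PENDING
    errors: List[str] = field(default_factory=list)


def run_with_retry(job: Job) -> JobState:
    """Run a job until it succeeds or its attempts are exhausted."""
    while job.attempts < job.max_attempts:
        job.attempts += 1
        try:
            job.run()
        except Exception as exc:
            # Keep the error so a failure can be reported and retried manually.
            job.errors.append(f"attempt {job.attempts}: {exc}")
            job.state = JobState.FAILED
        else:
            job.state = JobState.SUCCEEDED
            break
    return job.state


def daily_report(jobs: List[Job]) -> Dict[str, List[str]]:
    """Group job names by their final state for the day."""
    report: Dict[str, List[str]] = {state.value: [] for state in JobState}
    for job in jobs:
        report[job.state.value].append(job.name)
    return report
```

A production setup would typically also wait between attempts, so that a data source that is briefly unavailable has time to come back.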

The tool was developed in-house because none of the commercially available data orchestration tools met our requirements:

  • Schedule recurring and ongoing jobs: streaming, hourly, daily, weekly, or more intricate ones such as only on the second Monday of the month before 9:15.
  • Keep track of job states, retry if data sources are temporarily offline, report success and alert when there is a failure.
  • Connect to all major data sources (Google Analytics, Facebook, Salesforce, etc.), and offer customizable connectors for other systems.
  • Handle not only standard ETL jobs, but also BI, AI and other processing in the same orchestration tool.
  • Elastically scale up and down according to workload. It should be easy to speed up processing by just adding more compute power, without creating bottlenecks in the system.
  • Secure! Especially in light of the new GDPR (the EU General Data Protection Regulation, which applies from 2018), the tool needs to be fully compliant with the strictest regulations.
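
As an illustration of the more intricate schedules in the first requirement, the "second Monday of the month before 9:15" rule could be checked as in the sketch below. The function names are hypothetical and this is not how our tool defines schedules; it only shows that such a rule comes down to a small date calculation.

```python
from datetime import datetime


def is_second_monday(moment: datetime) -> bool:
    # Monday is weekday 0. The first Monday falls on day 1..7,
    # so the second Monday always falls on day 8..14.
    return moment.weekday() == 0 and 8 <= moment.day <= 14


def should_run(moment: datetime) -> bool:
    """True on the second Monday of the month, strictly before 09:15."""
    return is_second_monday(moment) and (moment.hour, moment.minute) < (9, 15)
```

A scheduler could call `should_run(datetime.now())` on every tick and start the job only when it returns `True`. Time zones are ignored here; a real schedule would have to state which local time "9:15" refers to.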

The tooling we use is our “elastic Big Data Box”, or eBDB. It is highly scalable and runs in parallel on a few dozen machines in Amazon Web Services (AWS), automatically scaling up and down with the current workload. Because AWS EC2 is pay-per-use, this setup keeps the solution affordable for our customers.
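
The idea behind scaling with the workload can be sketched as a simple sizing rule: derive the number of machines from the number of queued jobs, within fixed bounds. The function `desired_workers` and all of its numbers (50 jobs per worker, between 1 and 36 workers) are assumptions for this example, not eBDB's actual settings.

```python
import math


def desired_workers(
    queued_jobs: int,
    jobs_per_worker: int = 50,  # assumed capacity of one machine
    min_workers: int = 1,       # assumed floor, keeps the scheduler responsive
    max_workers: int = 36,      # assumed ceiling, caps the pay-per-use cost
) -> int:
    """Return how many workers to run for the current queue length."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queued_jobs / jobs_per_worker)
    # Clamp to the allowed range so the cluster neither vanishes nor explodes.
    return max(min_workers, min(max_workers, needed))
```

With these assumed numbers, an empty queue keeps a single worker running, 120 queued jobs ask for three workers, and a very long queue is capped at 36.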