As businesses grow, they tend to expand the range of external apps and systems they use to operate efficiently. Marketing, CRM, web traffic, advertising, recruitment and employee engagement platforms, just to name a few. Then there’s the multitude of tools used for software development – everything from project management software and code hosting platforms to communication apps.
At some point in a business’s development, various teams might want to make use of the data gathered by some or all of these systems, either separately or in combination. Since each app and system offers a different API, every data source requires its own extraction workflow built from scratch, and it is typically at this point that data teams start looking for a data integration tool.
How to make a decision
There are many factors that need to be taken into account before selecting a data integration tool. The most important ones would generally be:
- The pricing model
- How fast a new data extraction workflow can be set up
- How many prebuilt connections are available
- General ease of use of the tool
- Extensibility (whether the connections are entirely predefined or if custom code can be added)
- Change Data Capture (CDC) functionality
- Portability (whether connections can be smoothly moved between on-premise and cloud or between various cloud platforms)
- Tech support
- Access to logs and alerts in case of failures
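Of these factors, Change Data Capture deserves a quick illustration. The idea is that instead of re-extracting every row on each run, a workflow keeps a bookmark (for example, the latest `updated_at` timestamp it has seen) and pulls only the rows changed since the previous run. The sketch below is illustrative only; the table rows and column names are hypothetical:

```python
# Illustrative sketch of Change Data Capture via a bookmark, as opposed
# to a full reload: each run extracts only rows updated after the
# bookmark from the previous run. Row shapes here are hypothetical.

def extract_changes(rows, bookmark):
    """Return rows updated after `bookmark`, plus the new bookmark."""
    changed = [r for r in rows if r["updated_at"] > bookmark]
    # Advance the bookmark to the newest change seen this run.
    new_bookmark = max((r["updated_at"] for r in changed), default=bookmark)
    return changed, new_bookmark

rows = [
    {"id": 1, "updated_at": "2024-01-01T10:00:00Z"},
    {"id": 2, "updated_at": "2024-01-03T09:30:00Z"},
]
# Only row 2 changed since the bookmark; the next run resumes from its timestamp.
changed, bookmark = extract_changes(rows, "2024-01-02T00:00:00Z")
```

Note that ISO 8601 timestamps in UTC compare correctly as plain strings, which keeps the sketch free of date parsing.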
Over the last couple of years, more and more software has emerged on the market with different approaches to these factors. Companies therefore ultimately have to make a decision based on which aspects are most crucial for their use case.
Our choice at Intergiro
When we at Intergiro were looking for a data integration tool recently, we had two top priorities: being able to connect to a source application as quickly as possible with predefined, easy-to-set-up connections, and at the same time having the option to write our own code if needed. Tracking of data changes was also a must-have. As all of our other infrastructure sits in the Google Cloud Platform, we also wanted a tool that could be set up on that infrastructure. Last but not least, we wanted to integrate this tool with Slack or email so that we get alerted when a workflow breaks.
After thorough research and multiple trials, we decided on not one, but two technologies that we now use simultaneously.
Stitch
Stitch has over 130 pre-built connections ready to be set up within a couple of minutes. The set-up process is surprisingly fast and straightforward, and once complete, the connection starts the extract and load process right away. Execution logs are easily accessible and, in case of errors, e-mail notifications are set up by default. The only gap we found was in connecting to custom sources that are not covered out of the box. For this reason, one alternative we investigated was writing our own taps with Singer’s open-source framework, which is at the heart of Stitch anyway. However, running custom taps increased the costs associated with Stitch, bringing its price into the range of full ETL tools. Based on our cost considerations, we ultimately opted for the pay-as-you-go pro version, which allows 10 sources and 5 million record inserts/changes per month. This gives us a quick and easy way to acquire data, analyse and model it, and decide whether it is worth continuing to acquire.
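To make the Singer option concrete, here is a minimal sketch of a tap following the Singer specification’s message format: JSON messages on stdout describing a stream’s schema (SCHEMA), its data (RECORD), and a bookmark for incremental runs (STATE). The `users` stream and its fields are hypothetical, and a real tap would pull records from the source API:

```python
import json
import sys

# Singer taps communicate by writing JSON messages to stdout: a SCHEMA
# message per stream, RECORD messages carrying the data, and STATE
# messages that bookmark progress for the next incremental run.
# This is an illustrative tap, not a production Stitch integration.

def emit(message):
    sys.stdout.write(json.dumps(message) + "\n")

def run_tap():
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "email": {"type": "string"},
            },
        },
    })
    # In a real tap these records would come from the source system's API.
    for user in [{"id": 1, "email": "a@example.com"}]:
        emit({"type": "RECORD", "stream": "users", "record": user})
    # The bookmark lets the next run resume where this one stopped.
    emit({"type": "STATE", "value": {"bookmarks": {"users": {"id": 1}}}})

if __name__ == "__main__":
    run_tap()
```

A tap like this is typically piped into a Singer target, which handles the load side into the warehouse.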
Google Cloud Composer
This is a fully managed workflow orchestration service built on Apache Airflow. Not only does it fill in the gap between the connections provided by Stitch and the ones we plan to develop ourselves, but it can also be used to orchestrate all kinds of other workflows. This service requires Python coding knowledge but comes with many predefined modules and functions thanks to the open source community. It also features clean documentation and plenty of learning resources, including books and classroom courses. Google Cloud Composer is popular enough in the data world that you can search for problem solutions on Stack Overflow or other programming communities, which might not be the case with many other data integration frameworks.
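Since alerting on failures was one of our requirements, here is a rough sketch of how a Slack alert can be wired into an Airflow workflow through an `on_failure_callback`, assuming a Slack incoming webhook. The webhook URL and the context keys used below are assumptions for illustration:

```python
import json
import urllib.request

# Sketch of Slack alerting for Airflow task failures. In a real DAG this
# function is passed as `on_failure_callback` in default_args; Airflow
# then invokes it with a context dict when a task fails. The webhook URL
# is a hypothetical placeholder.

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def build_alert_payload(context):
    """Turn Airflow's failure context into a Slack message payload."""
    task = context.get("task_instance_key_str", "unknown task")
    date = context.get("ds", "unknown date")
    return {"text": f":red_circle: Airflow task failed: {task} ({date})"}

def notify_slack_on_failure(context):
    """Post a failure alert to Slack; intended as an on_failure_callback."""
    payload = build_alert_payload(context)
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# In the DAG definition the callback would be attached along the lines of:
# dag = DAG("my_pipeline",
#           default_args={"on_failure_callback": notify_slack_on_failure},
#           ...)
```

Keeping the payload construction separate from the HTTP call makes the alert text easy to test without hitting the webhook.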
With these two tools we can now tailor every workflow to the nature of the data source and bring some sources into the warehouse near-instantly. This allows us to transfer data into our warehouse quickly while also working on longer-term solutions through Airflow, which we have also leveraged to build many of the other data pipelines needed for our data platform and DWH.