Reverse ETL: Overview & Use Cases
- Chapter 1: Data Integration 101
- Chapter 2: Data Ingestion: Implementation Methods
- Chapter 3: Data Transformation Tools
- Chapter 4: Reverse ETL
- Chapter 5: Cloud Data Integration
- Chapter 6: Automated Data Mapping
- Chapter 7: Big Data Integration
- Chapter 8: No Code Data Integration
- Chapter 9: Data Integration Architecture
- Chapter 10: Enterprise Data Integration
- Chapter 11: Reinventing The Modern Data Stack
- Chapter 12: Data Audit
Reverse ETL (ETL being short for extract, transform, and load), also known as data activation, is the process of loading analytics-ready data into operational systems to unlock its full potential. It is often the last step in the modern data stack, where business insights are turned into actions. It can help organizations improve customer satisfaction, automate operational workloads, and increase sales by empowering line-of-business users with the data they need.
In this article, we will discuss reverse ETL’s main benefits and limitations. We will then work through a hands-on example, building a reverse ETL pipeline that extracts data from a data warehouse and loads it into a software-as-a-service (SaaS) application using SQL and Python. Finally, we will talk about tools and technologies that can help you build these pipelines easily and quickly, allowing you to focus on your core business.
Summary of key reverse ETL concepts
The table below briefly outlines what a reverse ETL pipeline is, the main benefits of building one, sample use cases where it is most suitable, downsides and limitations of reverse ETL, and tools and technologies.
| Concept | Summary |
|---|---|
| Definition | Reverse ETL means loading data from analytical to operational systems, the opposite direction of traditional ETL and ELT pipelines. |
| Main benefits | Reverse ETL operationalizes data and unlocks its value for business users. |
| Use cases | Reverse ETL’s ideal use cases include customer analytics such as customer lifetime value. |
| Downsides and limitations | Too many update requests can overload operational systems, disrupting the business. Sync issues can be time-consuming. |
| Tools and technologies | Custom solutions are arduous to build, costly to maintain, and brittle. A data platform with reverse ETL capabilities is recommended. |
Understanding ETL
To better understand and evaluate reverse ETL, we need to understand ETL and its relatively newer variant, ELT (note that the L comes before the T). As mentioned earlier, ETL stands for extract, transform, and load. Traditionally, after data is extracted from a source system, it is transformed on dedicated processing servers before being loaded into a data warehouse. The data sources are usually transactional systems, and the data warehouses are usually used for analytics and insights.
Now, let’s discuss ELT. With the rise of powerful cloud data warehouses such as Redshift, Snowflake, and BigQuery, the trend has shifted towards ELT. These data warehouses can scale up and scale out to process large data sets on demand, so companies no longer have to wait for data transformations and are no longer bound to the limitations of a certain model. They can build different data models and perform the relevant transformations as the business demands.
The main difference between ETL and ELT is that in ELT, all processing and analysis are performed in the data warehouse, enabling data centralization and flexible data models. Regardless of whether the transformations were done before or after the data was loaded into the data warehouse, their outcome is analytics-ready data that provides value. Reverse ETL closes the loop by taking this high-quality, valuable data from the data warehouse, transforming it as needed, and loading it into operational systems. Hence, the word “reverse” in the name refers to the reversal of source and target systems, not necessarily the order of the steps. The data warehouse becomes the source of the data, and the targets are operational systems such as those related to CRM, finance, and marketing.
A simple ETL/ELT diagram complemented with Reverse ETL
Main benefits of reverse ETL
Reverse ETL has many benefits, including increased return on investment (ROI) on the data analytics platform, improved targetability, and enhanced business user insight. While departments like sales, marketing, and finance reap the most benefits of reverse ETL, in reality, any department can take advantage of the insights generated in the data warehouse. For data teams, the main benefit is that building integrations with customer-facing SaaS applications is easier and quicker. In addition, the data models can be more flexible, allowing even more insights to be delivered to the right teams.
Another benefit, and often the primary reason for building a reverse ETL solution, is greater flexibility and richer functionality than off-the-shelf customer data platforms (CDPs) provide. CDPs are software systems that enable organizations to unify customer data from multiple sources and provide it to various customer-facing applications in a consistent format. Some of the biggest players in this highly segmented market are Segment, Emarsys, and Exponea. CDPs can support various customer-facing applications, including customer relationship management (CRM), marketing automation, and e-commerce. However, CDPs have very limited transformation capabilities, and their data structures are rigid. Reverse ETL opens the possibility of delivering customer insights to the whole organization instead of only select departments, such as sales and marketing.
Use cases of Reverse ETL
Reverse ETL is best suited to use cases like increasing customer lifetime value (CLV), reducing churn rate, and enhancing customer engagement. Let’s take the CLV use case as an example.
Companies often have a hard time evaluating and quantifying how valuable their customers are. Knowing more about customers, such as their purchase patterns, demographic data, communication frequency, and website or store visits, helps produce a more accurate valuation. Most organizations already collect customer data through regular ETL/ELT pipelines from many touch points, such as website visit logs, sales transactions, and telephone calls. This data is merged and enriched with other data sources in the data warehouse, such as customer demographics and aggregated analytics, to generate highly sought-after customer insights. Reverse ETL puts this data into the hands of the business users who can make the best use of it.
Example Reverse ETL pipeline
Let’s take the CLV use case further and build a simple reverse ETL pipeline to better internalize the concepts we have discussed. We will assume that the data has already been loaded into the data warehouse from multiple sources, then cleansed, transformed, aggregated, and analyzed using advanced machine learning algorithms if needed. The outcome of the analysis is saved to a customer insights table. Now, we want to bring these insights to our customer success team via a platform they are the most comfortable using. For this use case, we will use the following:
- Snowflake as the data warehouse
- Salesforce as the CRM application
- Python and SQL scripts to build a custom integration between these platforms.
It is worth noting that the scripts below make moderate use of Python for tasks such as HTTP API calls and database interactions. Familiarity with Python is therefore helpful, but don’t worry if you are not a Python user: the scripts are split into steps, each step is explained in detail, and the action taken at each step can easily be translated into other programming or scripting languages.
In three steps, we will do the following:
- Extract the data from the customer insights table in Snowflake
- Transform it to JSON
- Update (load) the standard Salesforce Customer object, which has a built-in field for holding the lifetime value.
Setup
This use case is fully reproducible using the scripts below, with the exception of account-specific values and credentials. Before we start the three-step process, we set up a demo environment by creating a demo database in Snowflake, a custom schema to hold our tables, and the table into which we will load the data.
```sql
USE ROLE SYSADMIN;

CREATE DATABASE DEMO;
CREATE SCHEMA NEXLA;

CREATE TABLE DEMO.NEXLA.CUSTOMER (
    email VARCHAR,
    lifetime_value NUMBER
);
```
We also need some sample data to work with; we will use the Python faker package to generate it. The other packages we need are pandas, requests, and snowflake-connector-python. Use the following command to install them all:
```shell
pip install faker pandas requests snowflake-connector-python
```
We start by importing the modules above and other built-in modules that we will need:
```python
import json
import random

import faker
import pandas as pd
import requests
import snowflake.connector
```
Then, we define some constants to be used throughout the script:
```python
NUMBER_OF_CUSTOMERS = 5

SALESFORCE_BASE_URL = "https://example.my.salesforce.com"
CUSTOMER_ENDPOINT = "/services/data/v56.0/sobjects/Customer/"
QUERY_ENDPOINT = "/services/data/v56.0/query/?q="
SALESFORCE_API_TOKEN = "00D5f00..."  # This is requested at runtime using OAuth
SALESFORCE_API_HEADERS = {
    "Authorization": "Bearer " + SALESFORCE_API_TOKEN,
    "Content-Type": "application/json",
}
```
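As the comment above indicates, the API token would normally be requested at runtime rather than hard-coded. Below is a minimal sketch of how that could look using the OAuth 2.0 username-password flow; the connected-app client ID, client secret, username, and password are placeholders, and your Salesforce org may require a different flow:

```python
def get_salesforce_token():
    # Hypothetical OAuth 2.0 username-password flow; all credentials below
    # are placeholders and must come from your own connected app and org.
    token_url = "https://login.salesforce.com/services/oauth2/token"
    payload = {
        "grant_type": "password",
        "client_id": "YOUR_CONNECTED_APP_CLIENT_ID",
        "client_secret": "YOUR_CONNECTED_APP_CLIENT_SECRET",
        "username": "integration.user@example.com",
        "password": "password_plus_security_token",
    }
    response = requests.post(token_url, data=payload)
    response.raise_for_status()
    return response.json()["access_token"]
```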
Let’s create some fake customers using the faker module and save them to a CSV file. For each customer, we generate a random email and a random number between 1 and 100 representing that customer’s value. In a real use case, this value would come from a data analysis method or machine learning model.
```python
fake = faker.Faker()

customers = [
    (fake.email(), random.randint(1, 100))
    for _ in range(NUMBER_OF_CUSTOMERS)
]

df = pd.DataFrame(customers, columns=["email", "lifetime_value"])
df.to_csv("customers.csv", index=False)
```
The next step is connecting to Snowflake and running a few queries. The first query creates an internal stage in Snowflake in our DEMO database and NEXLA schema, which we created before. Stages in Snowflake are schema-level objects that can be used to hold data files.
Note that while we could have used an INSERT statement to load this data into Snowflake, it is not a best practice for large files because loading large amounts of data with single or batch inserts is highly inefficient, more prone to network errors, and costlier because of the longer warehouse usage. Therefore, we go with the recommended PUT + COPY INTO commands.
The second query, the PUT statement, uploads the customer file from our local machine to the Snowflake stage we just created. Finally, the third query copies the data from the staged file into a standard table.
```python
connection = snowflake.connector.connect(
    user="user",
    password="password",
    account="ab12345.north-europe.azure",
)
cursor = connection.cursor()

# 1. Create an internal stage to hold the data file.
cursor.execute("CREATE STAGE DEMO.NEXLA.DEMO_STAGE;")

# 2. Upload the local CSV file to the stage (PUT compresses it to .gz by default).
cursor.execute("PUT 'file:///path/to/customers.csv' @DEMO.NEXLA.DEMO_STAGE")

# 3. Copy the staged file into the CUSTOMER table.
cursor.execute(
    """
    COPY INTO DEMO.NEXLA.CUSTOMER
    FROM @DEMO.NEXLA.DEMO_STAGE
    FILES = ('customers.csv.gz')
    FILE_FORMAT = (TYPE=CSV SKIP_HEADER=1)
    """
)
```
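To verify the load worked, an optional sanity check (reusing the same cursor) can count the rows that were just copied in:

```python
# Quick sanity check: the row count should match NUMBER_OF_CUSTOMERS.
row_count = cursor.execute("SELECT COUNT(*) FROM DEMO.NEXLA.CUSTOMER").fetchone()[0]
print(f"Loaded {row_count} customer rows")
```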
So far, all the steps we have taken were to create a customer table with insights. In this case, it’s only the customer lifetime value metric, but this could have been a very wide table with many other metrics. Regardless of the data size and the number of metrics, we now want to build a reverse ETL pipeline to share these valuable insights with our customer success team.
Extract
The first step in our reverse ETL pipeline is to extract the data we need from our Snowflake data warehouse. We use a SELECT statement with the OBJECT_CONSTRUCT function to format each row as JSON for easier processing in the transformation step.
```python
# Extract the customer insights from Snowflake; each row comes back as a JSON object.
extract = cursor.execute(
    "SELECT OBJECT_CONSTRUCT(*) FROM DEMO.NEXLA.CUSTOMER"
).fetchall()


def create_salesforce_customer_dict():
    # Helper used in the transformation step below: query Salesforce for all
    # customers and map their emails to Salesforce record IDs.
    email_soql = "SELECT+Id,Email+FROM+Customer"
    request_url = SALESFORCE_BASE_URL + QUERY_ENDPOINT + email_soql
    response = requests.get(request_url, headers=SALESFORCE_API_HEADERS)
    customers = response.json()["records"]
    customer_dict = {customer["Email"]: customer["Id"] for customer in customers}
    return customer_dict
```
Transform
Before we can load this data into Salesforce using its REST API, we need a map of customer emails to Salesforce IDs. To accomplish that, we created the helper function above and use its output to look up customer IDs in Salesforce.
Load
At this point, we are ready to update our customer data in Salesforce with the most recent lifetime value numbers. We iterate over our customer data extract one record at a time; in a real use case with thousands or millions of customers, it would be better to use batch API endpoints, as sketched after the script below. At each iteration, we transform the customer data into a JSON payload that can be sent to the Salesforce REST API. Once all the customers are processed, the loading step is complete.
```python
salesforce_customer_dict = create_salesforce_customer_dict()

for customer in extract:
    # Each row is a single-column tuple containing a JSON string.
    data = json.loads(customer[0])
    email = data["EMAIL"]
    # Note: this lookup raises a KeyError if the email has no Salesforce record.
    salesforce_id = salesforce_customer_dict[email]
    request_url = SALESFORCE_BASE_URL + CUSTOMER_ENDPOINT + salesforce_id

    # Transform: map the warehouse column to the Salesforce field name.
    transform = {"TotalLifeTimeValue": data["LIFETIME_VALUE"]}

    # Load: update the Customer record via the REST API.
    load = requests.patch(
        request_url, headers=SALESFORCE_API_HEADERS, data=json.dumps(transform)
    )
```
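For larger volumes, the per-record PATCH calls above could be replaced with batched updates. The sketch below assumes the Salesforce sObject Collections endpoint (`/composite/sobjects`), which accepts multiple records per request; treat the object and field names as carried over from the example rather than verified against your org:

```python
def load_customers_in_batches(extract, salesforce_customer_dict, batch_size=200):
    # Hypothetical batched load; assumes the sObject Collections endpoint.
    url = SALESFORCE_BASE_URL + "/services/data/v56.0/composite/sobjects"
    records = []
    for customer in extract:
        data = json.loads(customer[0])
        salesforce_id = salesforce_customer_dict.get(data["EMAIL"])
        if salesforce_id is None:
            continue  # Skip customers with no matching Salesforce record
        records.append(
            {
                "attributes": {"type": "Customer"},
                "id": salesforce_id,
                "TotalLifeTimeValue": data["LIFETIME_VALUE"],
            }
        )
    # Send the records in chunks instead of one request per customer.
    for i in range(0, len(records), batch_size):
        payload = {"allOrNone": False, "records": records[i : i + batch_size]}
        requests.patch(url, headers=SALESFORCE_API_HEADERS, data=json.dumps(payload))
```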
Challenges and Limitations
As this relatively simple example shows, setting up reverse ETL pipelines with custom connectors and scripts is not easy: it is arduous and time-consuming, and the solution we end up with is brittle. Sync failures are among the most common issues in reverse ETL solutions, so care must be taken to keep the source and target mappings in sync.
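One lightweight safeguard is to detect mapping drift before loading. A minimal sketch, reusing the extract and helper function from the example above, that reports warehouse customers with no matching Salesforce record:

```python
def find_unmapped_customers(extract, salesforce_customer_dict):
    # Report emails present in the warehouse extract but missing in Salesforce.
    warehouse_emails = {json.loads(row[0])["EMAIL"] for row in extract}
    missing = warehouse_emails - set(salesforce_customer_dict)
    for email in sorted(missing):
        print(f"No Salesforce record found for {email}")
    return missing
```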
Another downside of using reverse ETL is the risk of overloading operational systems. If the pipeline updates the data in the operational systems too often, it may hinder system functionality. We must be careful to stay within the API rate limits of the target application.
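A common way to respect rate limits is to throttle requests and retry with backoff. The snippet below is a generic pattern rather than Salesforce-specific behavior; the status codes and retry semantics of your target API may differ:

```python
import time


def patch_with_backoff(url, payload, max_retries=5):
    # Retry a PATCH request with exponential backoff when the API pushes back.
    for attempt in range(max_retries):
        response = requests.patch(
            url, headers=SALESFORCE_API_HEADERS, data=json.dumps(payload)
        )
        if response.status_code not in (429, 503):
            return response
        time.sleep(2 ** attempt)  # Wait 1s, 2s, 4s, ... between retries
    return response
```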
In addition, reverse ETL pipelines must always follow the best practices and recommendations for interacting with the specific system into which they are loading data. Unoptimized and non-standard data loading methods impact performance and may cause unintended side effects. For example, imagine a parent-child relationship between two objects in the target system; using the parent object to update the attributes of a child object might unintentionally update the same attribute on all other child objects.
As demonstrated above, one can use the drivers and APIs provided by data warehouses and SaaS applications to build a custom tool, and some parts of the tool can be replaced with open-source alternatives. In the next section, we discuss the pros and cons of tools and technologies that help solve this problem.
Tools and technologies
Building and maintaining custom solutions to every business problem requires significant resources. A better alternative to building is buying a dedicated reverse ETL solution or a data integration platform with reverse ETL features. This helps the organization focus on its core products and services and stay competitive.
Without a reverse ETL solution, bringing data from one system to another requires point-to-point integrations. A connector that transfers data from system A to system B seems simple to build at first, but these integrations quickly grow in complexity because each new data source requires an integration with every other system; with n systems, up to n × (n − 1) connectors may be needed.
Solutions like CDPs only partially solve the problem because they are built with select teams in mind, usually sales and marketing. Customer insights can be valuable for every team in the organization, including senior leadership, provided that they can interact with this data using their preferred SaaS product.
Reverse ETL tools and technologies arose to address the challenge of handling data transfers from data warehouses to SaaS products. They take care of writing connectors and handling brittle endpoints and rate limits; they also handle mapping fields from the data warehouse to the target system. We don’t have to worry about maintaining the connectors over time because these solutions handle API changes for us. For these reasons, investing early in a data integration tool or platform that supports reverse ETL use cases is a good idea.
| Platform | Data Extraction | Data Warehousing | No-Code Automation | Auto-Generated Connectors | Metadata-driven | Multi-Speed Data Integration |
|---|---|---|---|---|---|---|
| Informatica | + | + | - | - | - | - |
| Fivetran | + | + | + | - | - | - |
| Nexla | + | + | + | + | + | + |
Recap of reverse ETL concepts
Reverse ETL can be simply defined as bringing analytical data from the data warehouse to operational systems. It helps turn business insights into actions. It is widely used to increase customer lifetime value, reduce churn rate, and enhance customer engagement.
When building reverse ETL pipelines, as we did above, remember the challenges associated with reverse ETL, including the difficulty of setting up custom connectors and scripts and the risk of overloading operational systems. The best practice is to focus on building reverse ETL pipelines that bring valuable insights to business users, while the integration tool of choice takes care of maintaining connectors, updating API versions, and optimally using rate limits.