Understanding ETL vs ELT: ETL process example & ETL tools
Sep 27, 2022
Important Note: This article contains core technical terms and concepts in data, deployment, and software engineering that you’ll need prior knowledge of. If you’re unfamiliar with these fundamental terms and concepts, consider getting technical training through Skiplevel.co.
Why is understanding data processing important for product managers?
As a product manager, you’ve likely heard the terms data analytics, business analytics, ETL, and ELT thrown around without having a solid grasp of what they mean or why they matter. It’s important to have a holistic mental map of the world of data analytics and data processing for a few reasons:
- Measure the impact of your product and product changes
- Extract business insights that are used to make future product decisions
- Understand how engineering work around data affects your product timeline and roadmap
- Communicate, understand, and contribute during engineering discussions about setting up or migrating data processing systems
In this article, you’ll get the lay of the land of popular terms and concepts in data processing, especially as they relate to popular cloud technology today. This will go a long way in helping you better understand, collaborate, and communicate with your data and engineering teams about the work needed.
What is ETL and ELT?
ETL and ELT are core concepts in data processing that you should know as a product manager. They describe two different approaches to processing data from various sources to prepare for data analytics.
The purpose of ETL and ELT is the same: to prepare and store data for business analytics. The core difference between ETL and ELT lies in the order of operations. Both acronyms stand for the same three steps:
- Extract: This initial step pulls raw data from different data sources for central storage in a data repository. The raw data can be structured (e.g. text, numbers, JSON) or unstructured (e.g. images, videos, emails).
- Transform: This step takes raw data that can be messy, incomplete, inaccurate, or otherwise unusable/unreliable and cleans and processes it. Other transformations may be business-related, like converting currencies into USD.
- Load: Once the raw data is transformed, it’s loaded into a central data repository like a data lake or data warehouse.
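The three steps can be sketched as plain functions. This is a minimal illustration, not a real pipeline: the data sources, field names, and conversion rates are all made up.

```python
# Minimal sketch of the extract, transform, load steps.
# Sources, fields, and conversion rates below are hypothetical.

def extract(sources):
    """Extract: pull raw records from every source into one list."""
    raw = []
    for records in sources.values():
        raw.extend(records)
    return raw

def transform(raw):
    """Transform: drop incomplete rows, standardize currency to USD."""
    rates = {"USD": 1.0, "EUR": 1.1}  # assumed conversion rates
    cleaned = []
    for rec in raw:
        if rec.get("amount") is None:
            continue  # messy/incomplete row: skip it
        cleaned.append({"amount_usd": round(rec["amount"] * rates[rec["currency"]], 2)})
    return cleaned

def load(rows, warehouse):
    """Load: store transformed rows in the central repository."""
    warehouse.extend(rows)

warehouse = []  # stand-in for a data warehouse
sources = {
    "pos": [{"amount": 10.0, "currency": "USD"}],
    "web": [{"amount": 20.0, "currency": "EUR"}, {"amount": None, "currency": "USD"}],
}
load(transform(extract(sources)), warehouse)  # ETL order: transform before load
print(warehouse)  # [{'amount_usd': 10.0}, {'amount_usd': 22.0}]
```

Note the call order on the last line is what makes this ETL rather than ELT; swapping when `transform` runs relative to `load` is the entire difference between the two approaches.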
ETL transforms your data before loading it into a data repository, while ELT transforms data only after loading it into a data repository.
What is ETL?
ETL (extract, transform, load) has been around for decades and has been the go-to approach for gathering data and transforming it into a standardized format. In ETL, data is extracted from various sources first, then transformed, and finally loaded into a data warehouse.
Real-world example of the ETL process
For example, imagine the ETL process for the sales data of a large multi-national brick-and-mortar retail company. The data comes from various sales sources, including in-store POS systems, online website transactions, Salesforce data, legacy systems, and others. At a pre-determined time, like every Wednesday at 9am, all of the raw data from these sources flows into a staging area (essentially a separate database), where calculations, translations, and other transformations are performed on it. Currencies may be converted into USD, measurements may be standardized to US units, and sensitive data might be encrypted, removed, or hidden.
The data is then moved and finally stored in an organized and formatted way in the data warehouse. The data in the staging area is cleared out in order to receive the next run, or the next batch of raw data that comes in and goes through the same process.
In this example, the transforming and the loading of the data must occur in the same run, even though transforming the data might take days depending on the volume and complexity of the data. This makes ETL a much slower process than ELT.
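The weekly batch described above could be sketched like this. The schedule, source names, and rates are illustrative stand-ins, not the company's actual system:

```python
# Illustrative weekly ETL batch: raw data lands in a staging area, is
# transformed there, then loaded into the warehouse, and the staging
# area is cleared so it can receive the next run.

staging = []     # stand-in for the separate staging database
warehouse = []   # stand-in for the data warehouse

USD_RATES = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}  # assumed conversion rates

def run_weekly_batch(source_batches):
    # Extract: raw rows from every source flow into the staging area.
    for rows in source_batches:
        staging.extend(rows)
    # Transform in the staging area: standardize currencies to USD.
    transformed = [
        {"amount_usd": round(row["amount"] * USD_RATES[row["currency"]], 2)}
        for row in staging
    ]
    # Load into the warehouse, then clear staging for the next batch.
    warehouse.extend(transformed)
    staging.clear()

run_weekly_batch([
    [{"amount": 50.0, "currency": "GBP"}],   # e.g. in-store POS
    [{"amount": 30.0, "currency": "USD"}],   # e.g. online transactions
])
print(warehouse)  # [{'amount_usd': 65.0}, {'amount_usd': 30.0}]
print(staging)    # [] -- cleared, ready for the next run
```

Everything inside `run_weekly_batch` happens in one run, which is exactly the coupling that makes ETL slow when transformation takes days.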
What is ELT?
ELT (extract, load, transform) loads raw data into a data lake before transforming it. A core reason organizations move from ETL to ELT is that the amount and complexity of their data grows to a point where the transformation process becomes very lengthy and infrastructure-intensive. At that point, it’s more optimal to load the data into a data lake and transform it as needed.
Real-world example of the ELT process
Let’s take the same example of sales data from a multi-national retail company and apply the ELT process to it. At a pre-determined time, all the sales data comes in from its various sources and immediately gets stored in the data lake. When this data needs to be analyzed for business reasons, the data already stored in the data lake goes through the final transformation step. For example, business analysts may want to run business analytics reports once a month, at which point they run transformations on the data already stored in the data lake.
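Under ELT, the same flow splits into two independent steps. This is a hedged sketch with made-up names; the monthly report is the "transform when needed" moment:

```python
# Illustrative ELT flow: raw data is loaded immediately; transformation
# happens later, only when analysts actually need the data.

data_lake = []  # stand-in for the data lake holding raw, untransformed data

def load_raw(source_rows):
    """Load step: store raw rows as-is, with no cleanup yet."""
    data_lake.extend(source_rows)

def transform_for_report():
    """Transform step, run on demand (e.g. for a monthly report)."""
    rates = {"USD": 1.0, "EUR": 1.1}  # assumed conversion rates
    return [
        {"amount_usd": round(r["amount"] * rates[r["currency"]], 2)}
        for r in data_lake
        if r["amount"] is not None  # clean incomplete rows at read time
    ]

# Sales data lands in the lake right away...
load_raw([{"amount": 20.0, "currency": "EUR"}, {"amount": None, "currency": "USD"}])
# ...and is only transformed weeks later, when the report is run.
report_rows = transform_for_report()
print(report_rows)  # [{'amount_usd': 22.0}]
```

Note that `load_raw` and `transform_for_report` never have to run in the same batch, which is the decoupling that ETL lacks.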
Latency & Security Compliance Considerations in ETL vs. ELT
Notice from the real-world ETL example above that the transforming and loading of data must occur in the same run, even though transforming the data might take days depending on its volume and complexity. This is not the case with ELT, since the loading and the transforming of the data do not have to occur at the same time or in the same run. ETL is therefore a much slower process than ELT and can be a bottleneck for making quick, data-driven business decisions.
However, ELT comes with more security and compliance concerns. Data privacy regulations such as HIPAA, GDPR, or SOC 2, or specific company data requirements, might require removing, masking, or encrypting sensitive data before storing it in a data repository. In ETL, the data is transformed before it’s stored in a permanent repository, so compliance is straightforward. In ELT, however, raw data is stored before security transformations are performed on it, increasing security risk.
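To make the compliance point concrete, here is a sketch of the kind of pre-load sanitization an ETL pipeline might run before anything reaches permanent storage. The field names and the choice of SHA-256 for pseudonymization are assumptions for illustration, not requirements of any specific regulation:

```python
import hashlib

# Illustrative pre-load compliance transform: sensitive fields are
# removed, masked, or irreversibly hashed BEFORE the row is stored.
# Field names and techniques below are assumptions, not a regulation's rules.

def sanitize(row):
    clean = dict(row)
    clean.pop("ssn", None)  # remove the most sensitive field outright
    # Pseudonymize the email with a one-way hash so it can't be read back.
    clean["email"] = hashlib.sha256(row["email"].encode()).hexdigest()
    # Mask all but the last 4 digits of the card number.
    clean["card_number"] = "****" + row["card_number"][-4:]
    return clean

row = {"ssn": "123-45-6789", "email": "a@example.com",
       "card_number": "4111111111111111", "amount": 9.99}
print(sanitize(row))
```

In ETL this runs before the load step, so the warehouse never sees the raw values; in ELT the raw row sits in the data lake until the transform runs, which is the window of risk described above.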
When should you use ETL versus ELT?
As a product manager, you might find yourself in an engineering meeting where developers are discussing setting up data processing systems, or considering moving from ETL to ELT. To contribute during these discussions, it’s helpful to be able to connect the business and product considerations to the technical ones.
Below are a few product use cases where ELT is a more preferable choice:
- If latency or availability of data is important: If data needs to be accessed quickly in order to make important business decisions, then ELT is a better choice. An example is the stock market, which generates a mountain of data that is consumed in real time, with decisions needed in near real time. By loading the data into its final destination first, it becomes available sooner. The general rule to follow: if there’s so much data that the transformation process would take multiple days, go with ELT.
- If you’re a growing company and data sources and formats change frequently: Often for products and companies that have not achieved stability or maturity, the data sources and formats of the data coming in can change frequently. In this case ELT is a much more flexible option. The transformation code and automations depend heavily on what the data sources are and the format of the data coming in. Therefore, it’s better to load the data first, and run the transformation process at a later time after the code and automations have been updated to reflect changes in the data and data sources.
- If your company is growing fast and scaling is important: Since data in the ELT process gets loaded into a data warehouse living in the cloud, you can use the cloud’s scaling and processing power for transformation (scalable compute) on a large scale, allowing for on-demand flexibility and scalability.
Streamlining the ELT transformation workflow with dbt
A popular tool that’s emerged in the ELT transformation workflow worth mentioning is dbt so let’s quickly cover what it is.
The transformation process essentially runs raw data through code (i.e. SQL) that analyzes, cleans, organizes, and transforms the data into its final format. Historically, making changes to this code was cumbersome and couldn’t be done collaboratively: neither the change itself nor who made it was recorded. Similarly, there wasn’t a reliable way to test changes to the code and quickly set them live once successfully tested.
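To ground the idea that "transformation is SQL run against raw data", here is a minimal sketch using an in-memory SQLite database. The table and column names are made up, and real warehouses run far richer SQL than this:

```python
import sqlite3

# Minimal illustration of a transformation expressed as SQL.
# Table/column names and the conversion rates are hypothetical.

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute("CREATE TABLE raw_sales (amount REAL, currency TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [(10.0, "USD"), (20.0, "EUR"), (None, "USD")])

# The transformation itself is just SQL: drop incomplete rows,
# standardize amounts to USD, and write the cleaned result table.
conn.execute("""
    CREATE TABLE sales AS
    SELECT ROUND(amount * CASE currency WHEN 'EUR' THEN 1.1 ELSE 1.0 END, 2)
           AS amount_usd
    FROM raw_sales
    WHERE amount IS NOT NULL
""")
print(conn.execute("SELECT amount_usd FROM sales ORDER BY amount_usd").fetchall())
# [(10.0,), (22.0,)]
```

It's exactly this kind of SQL, multiplied across hundreds of models, that teams historically struggled to version, test, and deploy.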
Enter dbt. dbt is an ELT tool that allows anyone (not just engineers) to collaborate on, test, document, and deploy the code used in the transformation process, following software engineering best practices.
This is a game changer in the data world for many reasons, but the most compelling is that dbt essentially allows anyone who knows SQL to build production-grade data pipelines. This frees up engineering resources while empowering business and data analysts to own their data processes, democratizing data engineering.
Other notable ETL & ELT tools
Since the extract, transform, and load steps all happen in the same run in ETL, ETL tools do it all: they are software designed to extract data from disparate sources, transform it, and load and consolidate the final output into data warehouses.
Here are some popular ETL tools:
In contrast to ETL, the load and transform processes in ELT are completely independent of one another. You can run a load job at one point in time and run a transform on the loaded raw data at a completely separate point in time. Since the ELT process is split into separate stages, there are many different ELT tools for different parts of the process.
Here are some popular ELT tools used to extract and load data from disparate sources into data warehouses:
- Matillion (built specifically for Amazon Redshift, Azure Synapse, Google BigQuery, and Snowflake)
Final thought for Product Managers
Remember: as a product manager, don’t focus on memorizing, or feeling like you need to know intimately, the nitty-gritty details of how data processing works. Focus on acquiring a general mental map of the technical terms, concepts, and the “whys” of data processing and data analytics. This will give you a good base to start from and allow you to fill in the gaps later as you gain more experience working with data and with engineers.
Positive feedback is feedback too!
We often think of feedback as “critical feedback”, but positive feedback is just as important! Team cohesion and effective teamwork ultimately come from a place of positivity and a sense of forward/upward momentum, and it’s difficult to have these when focusing only on critical feedback. You want to know what you’re doing right along with ways you can improve. So as much as possible, ask for positive feedback, like “What did you like about [x] that you’d like to see me continue doing?” and “What was your favorite part about [x]?”
If you want to level up your technical skills and your ability to communicate and collaborate with engineers, enroll in the Skiplevel program. The Skiplevel program is a comprehensive, on-demand course + community that helps you become more technical without learning how to code.
Become more technical without learning to code with the Skiplevel program.
The Skiplevel program is specially designed for the non-engineering professional to give you the strong technical foundation you need to feel more confident in your technical abilities in your day-to-day role and during interviews.