For decades, ETL (Extract, Transform, and Load) has been synonymous with data movement. It’s a simple idea: you extract the data you want from a source, transform it into the format you need, and load it into your target system. With the rise of cloud computing, however, a shift is happening in how data engineers think about data transformation. While this shift has been underway for years, it’s easy to underestimate how important the change in mindset is, and how ignoring it can lead to higher costs and less efficient data solutions. You may already have noticed the data buzzword shifting from ETL to ELT, but what’s behind this change, and how does it impact the way we design data processes?
What’s in a name?
Speaking for myself, old habits die hard, so it’s sometimes tempting to treat ETL and ELT as interchangeable terms for data movement. However, swapping those two little letters represents an important change in how we think when designing data movement processes.
The old way…
Traditional ETL comes from a time when most, if not all, data processes ran on-premises. ETL reads data from a source system into an ETL tool like SSIS or Informatica, and the transformation logic happens within that tool. SSIS, for example, lets users choose from pre-built steps or define their own logic with C# scripts. Once that logic is applied, the tool loads the resulting data into the target database.
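The classic ETL flow can be sketched in a few lines. This is a minimal illustration, not a real SSIS package: the in-memory SQLite databases stand in for a source system and a target database, and the table and column names are made up for the example.

```python
import sqlite3

# Hypothetical source and target databases (in-memory SQLite stands in
# for a real source system and a real target database).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1050), (2, 2500)])

# Extract: pull rows out of the source system.
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: reshape the data in the tool's own memory, row by row,
# the way an SSIS data flow or script component would.
transformed = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: write the already-final data into the target.
target.execute("CREATE TABLE orders (id INTEGER, amount_dollars REAL)")
target.executemany("INSERT INTO orders VALUES (?, ?)", transformed)
target.commit()
```

The key point is the middle step: the data is held and reshaped inside the ETL tool itself, between the source and the target.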
When your data is flowing through an SSIS process or something similar, perhaps even on the same server the data lives on to begin with, there’s no compelling reason not to change the data during the load itself. Smaller datasets generally meant that holding data in memory for transformations wasn’t much of a burden, so transforming the data in flight made a lot of sense.
Storage was another factor in ETL’s favor. In the days of on-premises architecture, storage couldn’t be expanded easily, so the prospect of sizing up the hard drives on a physical machine just to hold data that wasn’t even in its final form was unappealing. Transforming before loading meant you only ever stored the finished product.
ELT, a new way of thinking.
The exponential growth of data stored, as well as the increasing availability of cloud computing platforms, led to the development of the ELT mindset. Where ETL transforms the data in flight and loads it directly in its intended format/model, ELT flips that process on its head. Data is extracted from source systems and immediately loaded into a staging area on the target system. From there, any transformation is performed directly within the target system. This provides several advantages, particularly within data stacks that include cloud technology.
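The same pipeline restructured as ELT looks like this. Again a minimal sketch: an in-memory SQLite database stands in for a cloud warehouse like Snowflake, and the staging/target table names are hypothetical.

```python
import sqlite3

# One database standing in for the cloud warehouse (SQLite as a stand-in).
warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the source data as-is in a raw/staging table,
# with no reshaping on the way in.
warehouse.execute(
    "CREATE TABLE stg_orders (id INTEGER, amount_cents INTEGER, status TEXT)"
)
warehouse.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [(1, 1050, "complete"), (2, 2500, "canceled"), (3, 800, "complete")],
)

# Transform: all reshaping happens inside the target system, in SQL,
# using the warehouse's own compute rather than the load tool's memory.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM stg_orders
    WHERE status = 'complete'
""")
warehouse.commit()
```

Notice that the load step knows nothing about the business logic; it just moves bytes, and the transformation is a set-based SQL statement running where the data already lives.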
More Efficient Data Movement
Compared to ETL, ELT cuts the fat from the load process. Simply loading the data in raw form rather than transforming it on the way means that data movement tools can specialize in loading bulk data very efficiently. Once data is loaded into a system like Snowflake, it tends to be far more efficient to perform transformations there, rather than as part of the load process itself.
More Scalable
With larger datasets, holding data in memory to perform operations like data cleansing as part of the load process slows down the load, or can even make it impossible to run. Bulk loading first and transforming in the target system eliminates a large amount of the compute time required to move data.
Not bringing an ELT mindset to your cloud computing project can lead to another mistake with scalability: adding unnecessary complexity to your data loads. A company that is just getting its feet wet with cloud computing might be tempted to use ETL tools the same way it did on-premises: creating integration packages to do small batch loads and move data from STAGING to PRODUCTION, for example. However, in cloud computing, loading and unloading data is often one of the most expensive processes, and an ETL process by necessity extracts the data so the compute can happen elsewhere, often row by row.
One of my past clients used tools like ADF to perform data transformations, and the pipeline would extract the data from Snowflake, transform it via ADF, and then load it back into Snowflake. This is not an optimal pattern: it pays for Snowflake compute to extract the data, Azure compute for the Data Factory pipeline to run and process the data, and more Snowflake compute to re-load the data into the target table.
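The in-warehouse alternative is usually a single set-based statement. Here is a sketch of that idea, again with SQLite standing in for Snowflake and hypothetical table names; in Snowflake this would be one CREATE TABLE ... AS SELECT (CTAS) executed by the warehouse’s own compute, so the data never leaves the platform.

```python
import sqlite3

# SQLite stand-in for the warehouse; table and column names are invented.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE stg_sales (region TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO stg_sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# The entire "pipeline": one set-based statement in the warehouse,
# instead of an extract -> external transform -> reload round trip.
warehouse.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM stg_sales
    GROUP BY region
""")
warehouse.commit()
```

No rows cross a network boundary, no second platform bills compute, and there is one fewer moving part to monitor.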
More Stable
Stability is another key benefit of this new paradigm. An ETL process has to complete all the way through: every extraction and transformation step must succeed before the data is loaded, so a failure anywhere in flight means the whole run starts over.
ELT, on the other hand, loads first, which gives us an important checkpoint in the process. Once the data is on the target system, it becomes much easier to validate the individual transformation steps. And if an ELT process fails, the failure is isolated to a specific step, so it’s easy to see exactly where in the transformation the problem occurred.
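One way to get those checkpoints is to have each transformation step write its result to its own table inside the warehouse. A minimal sketch of that pattern, with SQLite standing in for the warehouse and invented step names:

```python
import sqlite3

# Each step materializes a checkpoint table. If step 2 fails, raw_events
# and clean_events still exist, so you can inspect them and rerun from
# the point of failure instead of restarting the whole load.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_events (id INTEGER, payload TEXT)")
warehouse.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, "a"), (2, None), (3, "c")],
)

steps = [
    # Step 1: drop incomplete rows; checkpoint lands in clean_events.
    "CREATE TABLE clean_events AS "
    "SELECT * FROM raw_events WHERE payload IS NOT NULL",
    # Step 2: final shape, built from the previous checkpoint.
    "CREATE TABLE events AS "
    "SELECT id, UPPER(payload) AS payload FROM clean_events",
]

for i, sql in enumerate(steps, start=1):
    try:
        warehouse.execute(sql)
    except sqlite3.Error as exc:
        # The failing step is known precisely, and every earlier
        # checkpoint table remains queryable for debugging.
        raise RuntimeError(f"transform step {i} failed") from exc
warehouse.commit()
```

The raw and intermediate tables double as an audit trail: you can diff any two checkpoints to see exactly what a step changed.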
Conclusion
So, how do you know if you need to shift your thinking to ELT? Working with a data engineer can help answer that question for your individual organization, but some general pointers include the following:
- If you are mixing several cloud technologies in your stack
- If your long ETL processes are causing interruptions to service
- If you have frequent failures in your ETL processes
Thanks for joining me on this breezy discussion of ETL vs ELT. If you’re not sure how to apply this advice to your own organization, I’m always excited to talk to new folks about how to improve their data; check out the contact page for info on how to reach me.