I have a question, though it's outside of batch versus streaming.
At what point should a company start implementing an incremental load?
To decide whether to implement incremental loading in your team or company, it’s helpful to understand the benefits, challenges, and scenarios where it excels.
a) Why Choose Incremental Loading?
- Efficiency: Incremental loading moves only new or updated data, unlike full loading, which copies everything. This minimises system load, speeds up ETL/ELT processes, and preserves historical accuracy.
- Reliability: Incremental loads tend to be faster, reduce the risk of ETL/ELT failures, and stabilise performance over time. They also allow historical data retention, even when records are deleted or modified at the source.
b) Ideal Scenarios for Incremental Loading
Incremental loading is well-suited when:
- Data Volumes Are Large: Incremental loading is especially effective for large datasets where a full load would consume significant resources.
- Source Systems Are Slow: When queries on source systems are time-consuming, incremental loads help avoid repeatedly accessing the same data, improving efficiency.
- Reliable Change Indicators Are Present: Datasets with updated_at fields or other reliable timestamps are ideal candidates, as they simplify change detection (a minimal sketch follows this list).
- Data Retention Requirements: Incremental loads allow retention of historical records even when data is purged from the source, making it useful in cases where long-term data preservation is required.
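As a rough illustration of timestamp-based change detection, here is a minimal sketch using a stored high-water mark. It runs against an in-memory SQLite database purely to stay self-contained; the table names (source_orders, load_state) and the incremental_extract helper are hypothetical, not from any particular tool.

```python
import sqlite3

# Hypothetical source table plus a small state table that remembers the
# last updated_at value we loaded. An in-memory SQLite DB keeps this runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE load_state    (table_name TEXT PRIMARY KEY, high_water_mark TEXT);
    INSERT INTO source_orders VALUES
        (1, 10.0, '2024-01-01T00:00:00'),
        (2, 25.5, '2024-01-02T00:00:00');
""")

def incremental_extract(conn, table="source_orders"):
    # Read the last high-water mark we stored; default to the epoch on the first run.
    row = conn.execute(
        "SELECT high_water_mark FROM load_state WHERE table_name = ?", (table,)
    ).fetchone()
    last_mark = row[0] if row else "1970-01-01T00:00:00"

    # Pull only rows changed since the previous load (ISO strings compare correctly as text).
    changed = conn.execute(
        f"SELECT id, amount, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_mark,),
    ).fetchall()

    if changed:
        # Advance the high-water mark to the newest updated_at we just saw.
        conn.execute(
            "INSERT OR REPLACE INTO load_state (table_name, high_water_mark) VALUES (?, ?)",
            (table, changed[-1][2]),
        )
        conn.commit()
    return changed

print(incremental_extract(conn))  # first run: both rows
print(incremental_extract(conn))  # second run: [] -- nothing new since the last mark
```

A real pipeline would also handle rows sharing the same timestamp (for example, reload with >= and deduplicate downstream), but the high-water-mark pattern itself stays the same.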
c) Good Candidates for Incremental Loading
- Immutable Event Streams: Append-only tables like event logs or "tall + skinny" tables are ideal since they continuously add new records (see the sketch after this list).
- Reliable Tracking Fields: Datasets with an updated_at field or other reliable change markers make incremental updates straightforward.
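For the append-only case, you don't even need a separate state table: the destination's own MAX(id) acts as the cursor. A minimal sketch, again on an in-memory SQLite database with made-up table names (source_events, dest_events):

```python
import sqlite3

# Append-only event log: new events only ever get higher ids, so the
# destination's MAX(id) is itself the load cursor -- no state table needed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_events (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE dest_events   (id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO source_events VALUES (1, 'signup'), (2, 'login'), (3, 'purchase');
""")

def append_new_events(conn):
    # Everything in the source with an id above what we already hold is new.
    conn.execute("""
        INSERT INTO dest_events (id, payload)
        SELECT s.id, s.payload
        FROM source_events AS s
        WHERE s.id > (SELECT COALESCE(MAX(id), 0) FROM dest_events)
    """)
    conn.commit()

append_new_events(conn)  # copies events 1-3
append_new_events(conn)  # no-op: nothing new to append
print(conn.execute("SELECT COUNT(*) FROM dest_events").fetchone()[0])  # -> 3
```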
d) Scenarios Where Full Loading Is Better
Full loading may be preferable when:
- Small or Static Datasets: Small tables may not benefit significantly from incremental efficiencies.
- Frequent Structural Changes: Datasets with frequent column changes or renames are more challenging to manage incrementally.
- Unpredictable Updates: Data with irregular or complex updates may be better suited for full loads to ensure all changes are captured accurately.
- Complex Transformations: If row-by-row comparisons or calculations across multiple rows are needed, full loading may be simpler to implement.
e) How to Implement Incremental Loading
- Source Detection: Identify changes in the source by using timestamps, change-tracking fields, or Change Data Capture (CDC) in relational databases.
- Destination Comparison: Compare new data with the destination using techniques like row hashing, upserts, or selective updates. Fall back to brute-force full comparisons only as a last resort, since they are very resource-intensive (a row-hashing sketch follows this list).
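To make the row-hashing idea concrete, here is a small, self-contained Python sketch: hash every non-key column of a row, then insert or overwrite only the rows whose hash differs from what the destination already holds. The destination is just a dict standing in for a warehouse table, and row_hash / upsert_changed are illustrative names, not a specific library's API.

```python
import hashlib

def row_hash(row: dict, key: str = "id") -> str:
    # Hash every column except the key, in a stable order, so any change
    # to the row's contents changes the digest.
    payload = "|".join(f"{col}={row[col]}" for col in sorted(row) if col != key)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def upsert_changed(incoming: list[dict], destination: dict[int, dict]) -> None:
    # destination maps primary key -> stored row (a stand-in for a warehouse table).
    for row in incoming:
        existing = destination.get(row["id"])
        if existing is None or row_hash(existing) != row_hash(row):
            # New row, or the hash differs -> insert/overwrite; identical rows are skipped.
            destination[row["id"]] = row

# Example: row 1 is unchanged, row 2 has a new amount, row 3 is brand new.
destination = {
    1: {"id": 1, "customer": "acme", "amount": 100},
    2: {"id": 2, "customer": "beta", "amount": 40},
}
incoming = [
    {"id": 1, "customer": "acme", "amount": 100},
    {"id": 2, "customer": "beta", "amount": 55},
    {"id": 3, "customer": "gamma", "amount": 70},
]
upsert_changed(incoming, destination)
print(destination[2]["amount"], len(destination))  # -> 55 3
```

In a warehouse this comparison usually happens in SQL (for example, a MERGE keyed on the hash), but the logic is the same.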
Incremental loading optimises ETL/ELT by saving time and reducing risks, but it is most effective when data is large, changes are easy to track, and historical data retention is a priority. For smaller, simpler datasets or unpredictable updates, full loads may be more efficient.
My experience with incremental loading:
When I worked as an Analytics Engineer at Xero, I managed an invoice source table that contained petabytes of data. During the first week of each month, our data team would dump this data into our data lake, and I had to load it and rebuild the dependent models from scratch. This process took around 6 hours to complete, and that was for just one model! Initially, I attempted to reduce the time by scaling up our Snowflake warehouse cluster, but this had minimal impact on performance and significantly increased costs.
Realising it might be a good candidate for incremental loading, I redesigned the process to load only new or updated records. This reduced the processing time from 6 hours to just 2 hours, without needing to rely on a costly, oversized cluster. The approach not only saved time but also cut down operational expenses.
Hope you find this helpful :)
Thank you for your response. It's very detailed.