Incremental Loading to Destination Tables

Summary

Exploring the option of enabling incremental loading to destination tables to improve sync performance. Concerns about potential drawbacks and whether the feature is well-vetted.


Question

curious – has anyone enabled loading data incrementally to destination tables?

i have a sync that doesn’t support incremental syncing, so the full overwrite takes 2 hours to complete (90+% of the time is spent extracting data). i imagine incrementally loading could be favorable here, but not sure if there a drawbacks from using this option. it’s kinda hidden so not sure if it’s vetted enough



This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want to access the original thread.

Join the conversation on Slack

["incremental-loading", "destination-tables", "sync", "performance"]

just curious, what are you hoping to get out of incremental loading?

braindumping a few thoughts:
• incremental loading with full refresh overwrite probably doesn’t do what you want - in FR/OW mode, we first load to a temporary table instead of the real final table (b/c we want to avoid nuking + slowly rebuilding your live data every sync)
◦ if you’ve ever seen a blahblah_airbyte_tmp table while a sync runs, this is why :slight_smile:
• this option is mostly intended for incremental syncs, or append syncs, where the effects will be visible immediately
• but even then, it’s not super recommended - running the typing+deduping process incrementally will incur additional warehouse charges, so typically this is only a good idea if you actually need it for a specific reason

ahh, that all makes perfect sense, <@U02UC3SNJ72>! thanks for taking the time to write up the response

my thought was that enabling that could’ve maybe “streamed” new data during a sync, but it’s evident to me now that that’s not really feasible, given what you said

i’m definitely happy with how things work right now, so not enabling that checkbox is perfectly fine :ok_hand:

:tu: sounds good! one last question for my own curiosity - does your stream have a primary key? i.e. could we hypothetically “full refresh dedup” it? (not currently an option, just pondering)

… b/c that would make the incremental loading option actually do what you want (though the caveats around warehouse cost would still apply)

yes, it does, so that’d definitely be am option :raised_hands:

agree it’d lead to higher warehouse costs since it’d be up for longer. given our data stack size, we can still get away with jamming all read and write queries into a single warehouse, so this’d remove some spikiness at the expense of more active warehouse time

and i’m here for any other questions :slightly_smiling_face:

nice, makes sense. Like I said, still in the “just pondering” stage of things but I recall having a similar conversation once before.

I guess one final question, what’s the impact to you from spiky warehouse usage?

to be honest, it’s not discernible to end users running analytics. when Airbyte is loading data in Snowflake, some queries do definitely get queued, but only for a few seconds at most, and it hasn’t bothered anyone.

if we did start running into issues with jamming all queries into a single warehouse, the first thing i’d do is move Airbyte to its own warehouse (since it’s already our largest user) before thinking that incremental loads would be the best way to solve this :slightly_smiling_face:

so in summary, i think that incremental loading could certainly be a neat feature that i’d consider enabling, and it’d help at the margin as long as/the end user is aware that it’d lead to higher warehouse costs, but it’s definitely not a necessity in our current setup

now, whether the work you all are doing there could lead to data streaming-style incremental loading… that’d be neat for sure :wink:

cool, thanks for all the details! some conclusions I’m drawing:
• you don’t need the incremental loading thing right now
• honestly, most people don’t and I’m probably going to actually disable it - it’s sort of a footgun at the moment and pretty hard to reason about
• there’s some core features that might help make it more useful
• and those core features are something I should continue thinking about!
again, thanks for talking this through with me :slightly_smiling_face:

of course, thank you as well for taking the time!