Migrating to Airbyte Cloud from a self-hosted instance and a homegrown ELT tool

Summary

Asking whether existing tables and sync states can be reused by new Airbyte Cloud connections during a migration (to avoid re-backfilling large tables), and seeking advice or guides from others who have done a similar migration.


Question

Hello, everyone! My team currently utilizes a self-hosted Airbyte instance, Hevo, and a homegrown ELT tool, and we are looking to migrate all of this to Airbyte Cloud. From what I can tell, it’s not really possible to utilize existing tables/sync states in new Airbyte Cloud connectors (to avoid backfilling large tables). Is that accurate? Is this kind of migration something others have done and might be able to provide advice on and/or point me towards some guides? Thank you!



This topic was created from a Slack thread to give it more visibility; it is read-only here.

["migration", "airbyte-cloud", "existing-tables", "sync-states", "guides"]

Yes, several folks have done this. If you’re connecting to the same destination, align the connector versions to current on both instances, then move the existing connection state over (under the Advanced toggle on the connection).

Here’s a thread from someone who just did this, but you’ll find others around here who have done docker->kubernetes, kubernetes->cloud, cloud->OSS . . . pretty much every path imaginable:
https://airbytehq.slack.com/archives/C021JANJ6TY/p1720698134445559

Another tip: pause the old connection while you’re migrating so both instances aren’t trying to write at the same time!

Okay great, thank you! I’ll give the steps in that thread a shot.

What destination are you pushing to?

Snowflake

Cool. Well, in that case you can use Time Travel if anything goes wrong and restore the pre-sync state for those tables:
https://docs.snowflake.com/en/user-guide/data-time-travel

Just keep in mind that unless you’re on Enterprise, Time Travel retention is limited to one day. But I just wanted you to know that you have a path back.
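For example, a restore might look something like this (the database/table names and the timestamp are placeholders, so adjust for your environment):

```sql
-- Peek at the table as it was an hour ago.
SELECT COUNT(*) FROM analytics.public.orders AT (OFFSET => -3600);

-- Materialize the pre-sync state alongside the current table.
CREATE OR REPLACE TABLE analytics.public.orders_restored
  CLONE analytics.public.orders
  AT (TIMESTAMP => '2024-07-11 08:00:00'::TIMESTAMP_LTZ);

-- Swap it in once you've confirmed it looks right.
ALTER TABLE analytics.public.orders SWAP WITH analytics.public.orders_restored;
```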

(Or if you want to be cautious, just make a copy of both the Airbyte raw/tmp tables AND the destination tables before you start. But what fun is that :joy:)
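Zero-copy clones make those backups cheap, for what it’s worth. Something like this (the raw table names are illustrative and vary by version/config):

```sql
-- Cheap pre-migration backups via zero-copy clone.
CREATE TABLE airbyte_internal.public_raw__stream_orders_bak
  CLONE airbyte_internal.public_raw__stream_orders;

CREATE TABLE analytics.public.orders_bak
  CLONE analytics.public.orders;
```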

The other wrinkle here is that the landing spot in the destination will likely change as well as part of our consolidation and cleanup efforts, from one database to another.

That’s okay . . . Airbyte knows very little about the environment, so if you copy both the raw and destination tables to their appropriate locations and set up the connection to point at them, Airbyte will be none the wiser to the change.

Really, all that matters is that the raw tables end up in the correct location (the airbyte_internal dataset/schema by default) and the destination tables are wherever the config says they should be, all named appropriately.

So you can rename/move all you want (and if you copy rather than move, that also gives you your backup), then just make sure the Airbyte config is correct.
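If it helps, the move vs. copy distinction in Snowflake looks like this (hypothetical names again):

```sql
-- Move in place: RENAME TO with a fully qualified name relocates a table
-- across databases/schemas.
ALTER TABLE old_db.analytics.orders RENAME TO new_db.analytics.orders;

-- Or copy instead, which leaves the original behind as your backup.
CREATE TABLE new_db.analytics.orders CLONE old_db.analytics.orders;
```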

I’d try one smaller source manually first, but then you can always automate it by creating all the sources/connections over the API (translating the old values to the new ones) and scripting the table copies in Snowflake at the same time.

(if it’s a small number, you can obviously just do it manually. But when you’re moving a couple hundred, it’s much nicer to be able to automate it)
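For the Snowflake side of that automation, you can have INFORMATION_SCHEMA generate the statements for you, something along these lines (assuming the default airbyte_internal schema and hypothetical database names):

```sql
-- Generate one CLONE statement per raw table; run the output as a script.
SELECT 'CREATE TABLE new_db.airbyte_internal.' || table_name ||
       ' CLONE old_db.airbyte_internal.' || table_name || ';' AS ddl
FROM old_db.information_schema.tables
WHERE table_schema = 'AIRBYTE_INTERNAL'
  AND table_type = 'BASE TABLE';
```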

Okay, that makes sense. I wasn’t at the company when the OSS version was set up, but it seems like we’ve been using it long enough that the way Airbyte names the raw tables has changed as well.

Yeah, and some of the older versions weren’t very good at cleaning up after themselves.

The most important thing is to make sure you’re on a fairly recent version that uses Destinations V2 for Snowflake, as this was the last big breaking change (ideally it’s best to upgrade all the way to current before switching to Cloud, but Destinations V2 is the big breaking one). Make sure the source and destination connector versions are current too.

(You can find which of the raw tables are actually in use based on modified times using an INFORMATION_SCHEMA query in Snowflake)
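Something like this (again assuming the default airbyte_internal schema):

```sql
-- Most recently written raw tables sort first; stale leftovers sink to the bottom.
SELECT table_schema, table_name, last_altered, row_count
FROM old_db.information_schema.tables
WHERE table_schema = 'AIRBYTE_INTERNAL'
ORDER BY last_altered DESC;
```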

Okay yeah, I was kind of thinking I’d have to go down this path. Will moving to a version with Destinations V2 backfill the newer versions of the raw tables (in airbyte_internal)? If so, does it leave the previous versions of raw tables and SCD tables in place?

There’s a migration path there, but it involves moving up to a version where both were supported and taking some extra actions. This is quite old, but check here to start:
https://docs.airbyte.com/release_notes/upgrading_to_destinations_v2

Note that while it’s a pain to have to migrate through the breaking change, Destinations V2 has improved success and stability significantly for us, and I’ve heard the same from other folks. A big part of this is that Typing/Deduping replaced the prior Normalization process (bundled dbt), which was often slow and fragile.

So stick with it, it’ll be worth it :smile:

(And count the cost . . . depending on the size of the tables, re-syncing them on the new instance may be easier than fiddling with all these moving parts.)

Yes, that’s a good call as well… Our biggest pain point is a database with ~190 million rows in it

Yeah, another trick I’ve used is to run a dummy sync of a small subset, then write a query to convert the old raw table into the format of the new one. We deal with a couple trillion rows from APIs, with email event stream data (send/open/click/bounce) being the biggest offender, and some of those backfills take months to finish.
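A sketch of that conversion might look like this. The column names follow the legacy _airbyte_raw_* layout and the Destinations V2 raw layout, but verify both against a table produced by a dummy sync on your actual versions before running it at scale:

```sql
-- Reshape a legacy raw table into the Destinations V2 layout (illustrative names).
CREATE TABLE airbyte_internal.public_raw__stream_email_events AS
SELECT
  _airbyte_ab_id             AS _airbyte_raw_id,
  _airbyte_emitted_at        AS _airbyte_extracted_at,
  CAST(NULL AS TIMESTAMP_TZ) AS _airbyte_loaded_at,  -- NULL makes Typing/Deduping reprocess these rows
  _airbyte_data
FROM public._airbyte_raw_email_events;
```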

I know they’ve been doing a TON of work on database connector throughput (and have more to come in that area), so you may find it isn’t as painful as it was in the past too

Okay that’s a good suggestion, I will look into that as well. Thank you so much for your insight today, you’ve given me a lot to investigate.

Sure thing, and don’t hesitate to bump this thread if you run into problems or questions. Usually I or one of the other community folks have run into a lot of these cases before, and if not, the Airbyte folks are around too.