Merging and deduplicating messages from multiple Snowflake databases using Airbyte Slack connector

Summary

User is looking to merge and deduplicate messages from multiple Snowflake databases using the Airbyte Slack connector to create a sustainable sync destination.


Question

Airbyte folks, I have a bit of a crazy thing I’m trying to do and would appreciate any input/feedback.

Using the Slack source, I’ve been trying to export all the messages from a very large Slack community I run. (btw, I’ve been following the development of the new low-code Slack source closely and it’s a huge improvement!).

I’ve found myself in a place where I have 3 different syncs in 3 different Snowflake databases, and each one is missing messages, but I believe that combined they cover 99+% and are “good enough”. Particularly with regard to thread messages.

I’d like to merge and deduplicate them all into a single database while making it a sustainable sync destination moving forward. Has this ever been done? (plan in thread)



This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want to access the original thread.

Join the conversation on Slack

["merge-messages", "deduplicate", "snowflake-database", "airbyte-slack-connector", "sync-destination"]

My plan, after studying the SQL queries run to transform airbyte_internal tables to airbyte_schema:

  1. Grab a deduplicated list of all "_airbyte_data" blocks from each of the 3 sources, and join all 3 together into a globally deduplicated list
  2. Throw away "_airbyte_raw_id" and generate a new one from UUID_STRING() (what could possibly go wrong, eh?)
  3. Set "_airbyte_loaded_at" to NULL so that on the next run, the Snowflake destination will write out all rows to the user-facing schema (which will be truncated in advance)
  4. Take all this and shove it back into airbyte_internal, and let a sync run

Uh oh. Well first off thank you for tinkering with Airbyte, and for your PR :wink:

Hey <@U069EMNRPA4> :smile:

Tell me more. Why 3 syncs? Are you syncing to 3 destinations from the same Slack source, and all three miss some records, but different records?

Ideally, we should fix that problem and you won’t have to deal with it all.

3 syncs happened kinda organically from 3 different experiments.

1st sync was on Airbyte cloud, until I discovered it wasn’t getting anything from private channels. but I did a HUGE 365 day lookback for threads, so I felt that data was valueable.

2nd sync was from a dev machine, where I patched the current connector to grab private channels (thus my PR branch).

3rd sync was from another dev machine, where I’ve been testing the new low-code connector branch.

I’m still a bit new to Airbyte and was worried about creating conflicts/issues, so I set each one of these up to a different destination in Snowflake

as you know with the current/python connector, you only get a single timestamp in state for each channel type. So there’s no way to go “whoops, connector, you messed up and didn’t sync all the messages from channel X”

Messing with Airbyte metadata might be adventurous, and I personally didn’t try that yet.

If you’re that deep into the connector, I would:

  1. Wait for lowcode Slack, it’s due soon.
  2. Grab private channels with it, so you’re -1 syncs already.
  3. see if that works.
    If it’s deadass slow, then I would proceed to another adventure called “let’s build a report export based slack source” and if that’s required, I would be happy to help you.

Now I’m in a place where I have much higher confidence with channel_messages as of sync 3. but the threads data seems to be a mess

I think unfortunately I’m gonna need to wait until low-code slack v1.1, when the thread sync gets rewritten, right?

(based on <🎉 Source Slack migration to low code by midavadim · Pull Request #35477 · airbytehq/airbyte · GitHub message>)

hopefully nobody in Slack is reading this :see_no_evil: , but we’re going to stop paying them at the end of this month because they tripled the price on us (for a user community Slack!) - so that’s why I’m getting a bit desperate and resorting to doing terrible things :joy:

> “let’s build a report export based slack source”
Are you referring to the Slack export function, where it gives you a giant zip file of all your data?

Well at the very least threads should sync in full, if they don’t that’s a bug. We’ll fix it before we release lowcode

Yeah, I still found a percentage of missing thread messages when I did a channel-by-channel comparison between sync 2 and sync 3. It’s not huge, but at least several thousand.

I am fully prepared to blame this on the Slack API and how totally weird it is, particularly dealing with threads!

But that’s why I’d need to wait until the N+1 release of lowcode, when the threads sync gets rewritten. I also don’t want to rush y’all on that, but I’m happy to provide my testing services (and become a paying Cloud customer as soon as it’s released)

source-slack in low-code is merged and released. So the first release is out. If it does not yet grab private messages, post an issue or a PR please!

I know you’ve been very active in the PR already — I hope this approach works better than the previous one. If you are still getting throttled and ratelimited in threads becaue they’re using legacy component, ping me here, and make an issue, we will prioritize it.

We’re pretty passionate about removing old python code :slightly_smiling_face:

Woohoo! Yeah I saw that, very excited :slightly_smiling_face:

I have been testing that PR branch (sync #3 as mentioned above) and it’s still definitely got some ratelimit and minor missing message issues (clarify: for thread messages). I’ll definitely be excited to see that get updated to low-code and will be a willing tester on a complex case :slightly_smiling_face:

I’ll try my hand at a PR for the private channel stuff!