Summary
When creating an Airbyte source with multiple levels of nested streams, such as reviewCycles, reviews, question, and questionRevision, there is an issue with duplicate calls to substreams based on the ID. The question and questionRevision streams end up querying the same ID multiple times. A solution is needed to deduplicate these calls.
Question
I’m trying to create an Airbyte source that has multiple levels of nested streams (meaning multiple `SubstreamPartitionRouter`s). In my case, I’ve got a `reviewCycles` stream, which produces `review` records. So I’ve got a `review` stream based on that. Each review has two objects, `question` and `questionRevision`. So I created `question` and `questionRevision` streams using a `SubstreamPartitionRouter`, where the `parent_key` is the `id` in the `question` or `questionRevision` objects respectively.
The thing is, many reviews share the same `question` and `questionRevision` objects. This means that the `question` and `questionRevision` streams end up querying the same `id` over and over again. Here’s an example:
```json
{
  "data": [
    {
      "id": "reviewcycle-1",
      "name": "Review Cycle 1"
    }
  ]
}
```
`/reviewCycle/reviewCycle-1/reviews`
```json
{
  "data": [
    {
      "id": "review-1",
      "name": "Review 1",
      "question": {
        "id": "question-1",
        "text": "What is the capital of France?"
      },
      "questionrevision": {
        "id": "questionrevision-1",
        "version": 1
      }
    },
    {
      "id": "review-2",
      "name": "Review 2",
      "question": {
        "id": "question-1",
        "text": "What is the capital of France?"
      },
      "questionrevision": {
        "id": "questionrevision-1",
        "version": 1
      }
    }
  ]
}
```
So `/question/question-1` will end up getting called twice, as will `/questionRevision/questionrevision-1`.
What would be the best way to dedupe calls to the substreams based on the id?
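
To make the goal concrete, here is a minimal sketch in plain Python of the behaviour I’m after: keep a set of ids already seen and skip any slice whose id repeats. The `dedupe_slices` helper and the slice dict shape are hypothetical, made up for illustration. In a real connector this logic would presumably have to live in a custom partition router that filters what `SubstreamPartitionRouter` produces, but that wiring is an assumption and is not shown here.

```python
from typing import Any, Dict, Iterable, Iterator


def dedupe_slices(
    slices: Iterable[Dict[str, Any]], key: str = "id"
) -> Iterator[Dict[str, Any]]:
    """Yield each slice only the first time its `key` value is seen."""
    seen = set()
    for s in slices:
        value = s[key]
        if value in seen:
            continue  # this id was already requested; skip the duplicate call
        seen.add(value)
        yield s


# Slices as the `question` substream would see them for the two sample reviews.
# The dict shape is an assumption for illustration, not Airbyte's slice type.
question_slices = [
    {"id": "question-1", "parent": "review-1"},
    {"id": "question-1", "parent": "review-2"},
]

unique = list(dedupe_slices(question_slices))
print([s["id"] for s in unique])
```

With this, `/question/question-1` would be requested once instead of twice. The trade-off is that the set of seen ids stays in memory for the whole sync.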
<br>
---
This topic has been created from a Slack thread to give it more visibility.
It is read-only here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1711653296349699) if you want to access the original thread.
[Join the conversation on Slack](https://slack.airbyte.com)
<sub>
["airbyte-source", "nested-streams", "substreampartitionrouter", "deduplication", "id"]
</sub>