Deduplication of calls to substreams based on ID in Airbyte source with nested streams

Summary

When creating an Airbyte source with multiple levels of nested streams (reviewCycles, reviews, question, and questionRevision), each substream is queried once per parent record. Because many reviews share the same question and questionRevision objects, the question and questionRevision streams end up requesting the same ID multiple times. A way to deduplicate these calls is needed.


Question

I’m trying to create an Airbyte source that has multiple levels of nested streams (meaning, multiple SubstreamPartitionRouters). In my case, I’ve got a reviewCycles stream, and a review stream based on it that produces the review records for each review cycle. Each review has two objects, question and questionRevision, so I created question and questionRevision streams using a SubstreamPartitionRouter, where the parent_key is the id in the question or questionRevision object respectively.

The thing is, many reviews share the same question and questionRevision objects. This means that the question and questionRevision streams end up querying the same id over and over again. Here’s an example:


```
{
    "data" : [
        {
            "id" : "reviewcycle-1",
            "name" : "Review Cycle 1"
        }
    ]
}

/reviewCycle/reviewCycle-1/reviews

{
    "data" : [
        {
            "id" : "review-1",
            "name" : "Review 1",
            "question" : {
                "id" : "question-1",
                "text" : "What is the capital of France?"
            },
            "questionrevision" : {
                "id" : "questionrevision-1",
                "version" : 1
            }
        },
        {
            "id" : "review-2",
            "name" : "Review 2",
            "question" : {
                "id" : "question-1",
                "text" : "What is the capital of France?"
            },
            "questionrevision" : {
                "id" : "questionrevision-1",
                "version" : 1
            }
        }
    ]
}
```
So `/question/question-1` will end up getting called twice, as will `/questionRevision/questionrevision-1`.
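
To make the duplication concrete, here is a small standalone sketch (plain Python, no Airbyte imports) that counts how many substream requests each path would receive for the sample reviews above. It assumes one request is issued per parent record, which is the behavior described in this post; the trimmed `reviews` list is just the example payload with the unused fields dropped.

```python
from collections import Counter

# Trimmed copy of the sample reviews response from this post.
reviews = [
    {"id": "review-1",
     "question": {"id": "question-1"},
     "questionrevision": {"id": "questionrevision-1"}},
    {"id": "review-2",
     "question": {"id": "question-1"},
     "questionrevision": {"id": "questionrevision-1"}},
]

# Assumption: each parent record triggers one request to the substream path.
question_calls = Counter(f"/question/{r['question']['id']}" for r in reviews)
revision_calls = Counter(
    f"/questionRevision/{r['questionrevision']['id']}" for r in reviews
)

# Any path with a count above 1 is a redundant call.
for path, count in (question_calls + revision_calls).items():
    print(path, count)
```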

What would be the best way to dedupe calls to the substreams based on the id?


---

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. [Click here](https://airbytehq.slack.com/archives/C027KKE4BCZ/p1711653296349699) if you want to access the original thread.


<sub>
["airbyte-source", "nested-streams", "substreampartitionrouter", "deduplication", "id"]
</sub>

Does this need to be done with a custom stream slicer? Or can this be configured via the low-code CDK?
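
One possible direction, offered as a sketch rather than a confirmed answer from this thread: filter the slices a partition router produces so that each parent id is only emitted once. The helper below is plain Python and does not import the Airbyte CDK. The function name `dedupe_slices` and the partition field name `question_id` are hypothetical, and wiring this into a connector (for example, a custom component that wraps the output of a SubstreamPartitionRouter's `stream_slices()`) is an assumption that has not been tested here.

```python
from typing import Any, Iterable, Iterator, Mapping, Set


def dedupe_slices(
    slices: Iterable[Mapping[str, Any]],
    partition_field: str,
) -> Iterator[Mapping[str, Any]]:
    """Yield only the first slice seen for each value of `partition_field`.

    Hypothetical helper: `slices` stands in for whatever the parent-based
    router emits, and `partition_field` is the key that holds the parent id.
    """
    seen: Set[Any] = set()
    for stream_slice in slices:
        key = stream_slice.get(partition_field)
        if key in seen:
            # This id was already requested during this sync, so skip it.
            continue
        seen.add(key)
        yield stream_slice


# Usage with slices shaped like the example in this post
# (the "question_id" field name is an assumption).
slices = [
    {"question_id": "question-1", "parent_slice": {"review_id": "review-1"}},
    {"question_id": "question-1", "parent_slice": {"review_id": "review-2"}},
    {"question_id": "question-2", "parent_slice": {"review_id": "review-3"}},
]
unique = list(dedupe_slices(slices, "question_id"))
print([s["question_id"] for s in unique])  # ['question-1', 'question-2']
```

Note that the `seen` set lives in memory and only covers a single sync, so it grows with the number of distinct ids and does not carry over between runs.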