Summary
API endpoint query for pre-filtering parent stream based on specific property or incremental sync for updated entities
Question
Another approach to the above problem: I’m sending requests to an API endpoint that has a parent stream (roughly: the parent stream has a list of entities, and this endpoint returns the entity properties). Can the parent stream be pre-filtered in some way?
• filter on a specific property (e.g. if I can only query the child stream for some of the parent entries)?
• or based on incremental sync, e.g. both the parent and child fields have “incremental” sync based on an “updated_at” field, and if the child is updated, then the parent is updated. Thus only the updated entries from the parent list would need to query the child stream (as opposed to currently all of them, even if they haven’t been updated).
So far I haven’t seen either of these cases supported, but I might be missing something.
This topic has been created from a Slack thread to give it more visibility.
["api", "parent-stream", "pre-filtering", "incremental-sync", "updated-at-field"]
> • filter on a specific property (e.g. if I can only query the child stream for some of the parent entries)?
Yep! That thing exists in the low-code spec, record filtering I believe — let me look that up in docs.
There is a caveat — it is not exposed to the Connector Builder UI.
I believe you can add it in your yaml file directly (in Builder yaml mode) and it will stay there, let me try and get a proof of concept.
If you would like me to try the API you’re using, you can link me to a cloud workspace (DM) or give me the connector yaml file and test API credentials, I’ll tinker with it over the weekend.
Also do tell me your desired filtering criteria!
> • or based on incremental sync, e.g. both the parent and child fields have “incremental” sync based on an “updated_at” field, and if the child is updated, then the parent is updated. Thus only the updated entries from the parent list would need to query the child stream (as opposed to currently all of them, even if they haven’t been updated).
That works somehow, and I have very little clue about how. I believe your parent stream can be incremental, and if it’s incremental on updated_at and the parent record has not been updated, no new child records will be fetched.
I believe you could make the child stream incremental using the parent key or parent record value (parent_updated_at) as a factor, but the implementation details elude me; I need to try and learn more.
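For illustration only, here is a hedged sketch of how a child stream's partition router might be wired to an incremental parent. It assumes a CDK version whose ParentStreamConfig supports the incremental_dependency flag; the stream name parent_stream and the field names id and parent_id are hypothetical, and this is a fragment rather than a complete manifest.

```yaml
# Fragment of a child stream's retriever (all names are hypothetical).
# Assumes the parent stream defines incremental_sync with
# cursor_field: updated_at.
partition_router:
  type: SubstreamPartitionRouter
  parent_stream_configs:
    - type: ParentStreamConfig
      stream:
        $ref: "#/definitions/parent_stream"
      parent_key: id              # field read from each parent record
      partition_field: parent_id  # name exposed to the child as stream_slice.parent_id
      # Assumption: only available in CDK versions that support it.
      # Asks the CDK to read the parent incrementally and track its state.
      incremental_dependency: true
```

Whether this skips unchanged parents in practice depends on the platform and CDK version, so it is worth checking with a test read before relying on it.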
Hey <@U069EMNRPA4>, thanks for the pointer! Indeed I could go and add the record_filter (as per the docs: https://docs.airbyte.com/connector-development/config-based/understanding-the-yaml-file/record-selector#filtering-records) to the YAML of the parent stream and get the proper behaviour. I tried it out with one of the simpler APIs that I was working with.
(Here I used a simpler criterion: some of the parent stream items didn’t have the field that the child stream is based on, so just checking for that field was easy.)
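A minimal sketch of what such a field-presence filter might look like in the parent stream's record selector. The field name project_id and the empty field_path are assumptions for illustration, not taken from the actual connector.

```yaml
# Fragment of the parent stream's retriever (hypothetical field names).
record_selector:
  type: RecordSelector
  extractor:
    type: DpathExtractor
    field_path: []   # assumption: records sit at the root of the response
  record_filter:
    type: RecordFilter
    # Keep only records that carry the field the child stream depends on.
    condition: "{{ 'project_id' in record }}"
```

The condition is a Jinja expression evaluated per record, so more specific criteria (for example comparing a field to a value) can be expressed the same way.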
The downside is that I cannot then switch back to the UI, as that will delete the non-compatible parts of the YAML (i.e. the record_filter), so after adding this, it becomes much harder to work with the Connector Builder.
How do you think this will be handled in the future? (i.e. are these fields likely to become part of the Connector Builder later, will incompatible fields be preserved at some point, or is editing bare YAML the way forward for such things?) Just gauging your roadmap, if there’s any.
At least now I can see how this conceptually comes together, and I see more and more use cases where it is useful (maybe I’m working with very pathological APIs), so I will dig a bit more.
<@U04705U9GN4> a couple things here:
• In older versions of the Connector Builder, the parent stream configuration was duplicated in the child stream, but this has since been improved to use $refs to point from the child to the parent stream. So if you aren’t seeing this behavior, I’d recommend updating your Airbyte platform to the latest version.
• From looking at your YAML, it looks like you set the record_filter on the parent stream configuration that was duplicated into your child stream, which is why the error in the second screenshot was shown (the Builder UI requires that these configurations match, and as stated above, recent versions just use $refs to guarantee this).
Currently, the only ways to get this working as you desire are by doing one of the following:
• Use an error handler as you suggested to ignore 404 requests on the child stream, though as you pointed out this will result in more requests than necessary so it likely will not have the best performance.
• Create another stream like Workspace Usage Top (filtered), which looks similar to your existing parent stream but includes a record filter to filter out the non-project records, and use this as the parent stream of the child. Then when you are setting up your connection, just don’t enable this stream.
• Use the YAML you attached above without being able to switch back to UI mode in the Builder.
I understand that none of these solutions are ideal, so I’ll be sure to capture this feature request in our GitHub issues, but I’m not able to provide an estimate right now on when we will be able to prioritize that feature.
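As a rough sketch of the second option, a filtered copy of the parent stream could serve as the child's parent via a $ref. Every name here (stream keys, base_requester, paths, project_id) is hypothetical, and this is a fragment of a manifest rather than a complete one.

```yaml
definitions:
  # Filtered twin of the existing parent stream; left disabled in the connection.
  workspace_usage_top_filtered_stream:
    type: DeclarativeStream
    name: workspace_usage_top_filtered
    retriever:
      type: SimpleRetriever
      requester:
        $ref: "#/definitions/base_requester"   # hypothetical shared requester
        path: /workspace/usage/top             # hypothetical path
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path: []
        record_filter:
          type: RecordFilter
          condition: "{{ 'project_id' in record }}"   # drop non-project records

  project_details_stream:
    type: DeclarativeStream
    name: project_details
    retriever:
      type: SimpleRetriever
      requester:
        $ref: "#/definitions/base_requester"
        path: "/projects/{{ stream_slice.project_id }}"
      record_selector:
        type: RecordSelector
        extractor:
          type: DpathExtractor
          field_path: []
      partition_router:
        type: SubstreamPartitionRouter
        parent_stream_configs:
          - type: ParentStreamConfig
            stream:
              $ref: "#/definitions/workspace_usage_top_filtered_stream"
            parent_key: project_id
            partition_field: project_id
```

The unfiltered parent stream stays untouched, so it still emits all records when enabled on its own.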
> Create another stream like Workspace Usage Top (filtered), which looks similar to your existing parent stream but includes a record filter to filter out the non-project records, and use this as the parent stream of the child. Then when you are setting up your connection, just don’t enable this stream.
I remember a feature request to make certain streams “internal”, i.e. a stream is used internally by the connector as a parent stream, but we should not emit its records as useful output to the platform. That could hypothetically help with this situation too.
Hey <@U02T7NVJ6A3>, thanks for the detailed feedback! Indeed I was making this connector on an earlier Airbyte version (0.51.0, maybe?) originally, and will be keen to try out the newer ones with the $ref behaviour, that sounds very handy indeed!
Makes complete sense on the “internal” streams, somehow I didn’t think of that!
And indeed, <@U069EMNRPA4>, I had some other connectors where that would have been handy (i.e. a parent stream is never really used on its own, just to feed the children), though just not enabling a stream in the UI is fine for now; the difference is mostly usability, not functionality.
I think I’m all set for now on this topic, cheers! (and until next time…!)
Dammit. My impression was that record_filter won’t be deleted when you switch to UI. Eh. Well, we should just add it to the UI even if it’s a noop with custom yaml button. <@U02T7NVJ6A3> would it be wild to shove it in quickly?
I thought record_filter should stick around as well, so if it isn’t then that sounds like a bug. I’ll look into this today and get back here
<@U04705U9GN4> I tested this out and it seems to be behaving as expected: the record_filter that you set in YAML mode should stick around even after you switch to the UI. Even if you edit the record selector for that stream, the record_filter will still be there under the hood and will still apply its logic during test reads in UI mode.
Thanks for checking <@U069EMNRPA4> <@U02T7NVJ6A3>!
So what I had for this particular use case, though you can tell me if it’s pathological:
• stream 1, which will be the parent stream, doesn’t have a filter, as we want to receive all the data for this stream
• stream 2, the child stream, is only available for certain items of the parent stream, so in the child I set a record_filter (as per the first screenshot);
◦ This felt fine to add here, as in the YAML actually it seems all the parent stream config is duplicated (rather than referenced), if I read it correctly
• When I try to go back to the UI, then I get the error message on the second screenshot
Let me know if it’s hard to follow (I’m also adding the YAML here in case it’s useful)
I think I could work around this by not setting the record filter on the second stream, but setting an error handler instead (which would just ignore a bunch of 404 errors from requesting items that do not exist).
This feels very icky though as:
• it runs a lot more queries than it actually needs
• the attached YAML works otherwise…
Happy to hear if I could do something else too (since a solution working technically doesn’t mean one should actually be doing that, it is only a strong signal).
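For reference, the 404-ignoring workaround described above might be sketched like this on the child stream's requester. The URL, path, and project_id are placeholders, not the real API.

```yaml
# Fragment of the child stream's retriever (placeholder URL and path).
requester:
  type: HttpRequester
  url_base: https://api.example.com
  path: "/projects/{{ stream_slice.project_id }}/details"
  http_method: GET
  error_handler:
    type: DefaultErrorHandler
    response_filters:
      - type: HttpResponseFilter
        http_codes: [404]
        action: IGNORE   # skip parents that have no child resource
```

This still issues one request per parent record, which is the extra query cost noted in the bullets above.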