Documentation for old Python CDK and Source Connector for Async HTTP API with Incremental Sync Modes

Summary

User is looking for documentation on the old Python CDK and help with developing a source connector for an async HTTP API with incremental sync modes.


Question

I’m struggling to find good documentation on the old python cdk. Has anyone developed a source connector for an async HTTP API, and correctly set up the streams and sync modes (specifically incremental)?



This topic has been created from a Slack thread to give it more visibility.

["documentation", "python-cdk", "source-connector", "async-http-api", "incremental-sync-modes"]

Yes, there is documentation available on how to develop a source connector for an HTTP API and set up the streams and sync modes, including incremental sync, using the Python CDK.

First, you can use the Airbyte code generator to bootstrap the scaffolding for your connector. Select the Python HTTP API Source template and enter the name of your connector. The generator creates a new directory in airbyte/airbyte-integrations/connectors/ with the name of your new connector. For HTTP APIs, the CDK provides the HttpStream class, which saves developers the hassle of implementing their own stream object.

Once your stream supports incremental sync, change the sync_mode field in sample_files/configured_catalog.json to incremental:

    "sync_mode": "incremental",

You can then test your implementation by running the connector and passing the emitted state back to it:

    # Save the latest state to sample_files/state.json
    python main.py read --config secrets/config.json --catalog sample_files/configured_catalog.json | grep STATE | tail -n 1 | jq .state.data > sample_files/state.json

    # Run a read operation with the latest state message
    python main.py read --config secrets/config.json --catalog sample_files/configured_catalog.json --state sample_files/state.json

You should see that only the record from the last date is synced again, which is acceptable behavior because Airbyte requires at-least-once delivery of records.

Please note that these instructions are based on the new Python CDK and may not be fully applicable if you are using an older version of the CDK.
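To make the incremental behavior described above more concrete, here is a minimal, library-free sketch of the cursor logic an incremental stream typically needs. It deliberately does not import the Airbyte CDK. The `date` cursor field and the function names `get_updated_state` and `read_incremental` are assumptions made for illustration, not the CDK's actual API.

```python
from typing import Any, Dict, Iterable, List, Tuple

CURSOR_FIELD = "date"  # assumed cursor field; use whatever your API exposes


def get_updated_state(
    current_state: Dict[str, Any], latest_record: Dict[str, Any]
) -> Dict[str, Any]:
    """Return the state with the cursor advanced to the newest value seen."""
    latest = latest_record[CURSOR_FIELD]
    current = current_state.get(CURSOR_FIELD)
    if current is None or latest > current:
        return {CURSOR_FIELD: latest}
    return current_state


def read_incremental(
    records: Iterable[Dict[str, Any]], state: Dict[str, Any]
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Emit records at or after the saved cursor and return the new state.

    Records equal to the cursor are emitted again on purpose: this gives
    at-least-once delivery instead of risking a skipped record.
    """
    start = state.get(CURSOR_FIELD)
    emitted: List[Dict[str, Any]] = []
    new_state = dict(state)
    for record in records:
        if start is not None and record[CURSOR_FIELD] < start:
            continue  # already synced in a previous run
        emitted.append(record)
        new_state = get_updated_state(new_state, record)
    return emitted, new_state
```

With ISO-formatted date strings, plain string comparison orders the cursor values correctly. A first run with an empty state emits everything, and a second run with the saved state emits only the records from the last date.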

<@U04SE163WC8> where do I find an example of one custom cdk connector so I can use as reference?

You can find an example of a custom Airbyte source connector in the tutorial on building a connector to extract data from the Webflow API. It walks through the implementation of the Webflow source connector, which was developed with the Python Connector Development Kit (CDK), and covers real-world requirements such as authentication, pagination, dynamic stream generation, dynamic schema generation, connectivity checking, and response parsing. It also demonstrates how to import your connector into the Airbyte UI.

Additionally, there’s a video tutorial on YouTube that introduces Airbyte’s new Low Code Connector CDK and demonstrates how to create a custom connector using the Exchange Rates API. For more resources, you can check out the Airbyte Dotnet CDK, which comes with a guide for developing an Airbyte source.

Well, old is an overstatement. We believe that most connectors should be low-code, but Python CDK is the foundation that will improve along with low-code CDK, and we have several critical connectors written with it.

I’m happy to help if you have specific questions.

The documentation is not great — we used to have sphinx reference docs but they’re outdated, and I’ll try to get them back up soon.

As for examples, source-stripe and source-shopify are good. GitHub, Jira, Salesforce, and HubSpot as well.

Does that help and give you a good starting point?

<@U069EMNRPA4> didn’t mean old in that way :sweat_smile: Thanks for the references!

I’m developing a Twitter Ads connector and found out (based on a GitHub issue in airbytehq/airbyte and the docs) that low-code would not be possible because of the asynchronous nature of the API. So I’m trying to use Twitter’s SDK with your CDK framework.

I’ll gladly accept your offer of more specific answers after I take a look at the examples you provided! I guess right now my main struggle is understanding which abstractions the CDK handles and what is missing in my code.

There’s the amazon-ads connector, which works with async exports: it requests an export, polls for its status, then grabs the export file, uncompresses it, and processes the records.
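The request, poll, download, decompress flow described above can be sketched in plain Python. This is not the amazon-ads connector's code. The function name `fetch_async_report`, the three callables, the `SUCCESS`/`FAILURE` status strings, and the gzip-compressed JSON array payload are all assumptions chosen for illustration, so adapt them to whatever your API or SDK actually returns.

```python
import gzip
import json
import time
from typing import Any, Callable, Dict, Iterator


def fetch_async_report(
    request_report: Callable[[], str],
    get_status: Callable[[str], str],
    download: Callable[[str], bytes],
    poll_interval: float = 1.0,
    max_attempts: int = 30,
) -> Iterator[Dict[str, Any]]:
    """Request a report, poll until it is ready, then yield its records."""
    report_id = request_report()  # step 1: ask the API to build the export

    # Step 2: poll for completion, giving up after max_attempts checks.
    for _ in range(max_attempts):
        status = get_status(report_id)
        if status == "SUCCESS":  # assumed status value
            break
        if status == "FAILURE":  # assumed status value
            raise RuntimeError(f"Report {report_id} failed")
        time.sleep(poll_interval)
    else:
        raise TimeoutError(
            f"Report {report_id} not ready after {max_attempts} polls"
        )

    # Step 3: download the file, decompress it, and emit one record at a time.
    payload = gzip.decompress(download(report_id))
    for record in json.loads(payload):
        yield record
```

Passing the API calls in as callables keeps the polling loop independent of any particular SDK, so the same function can wrap a Twitter Ads client in a stream's record-reading method or a fake client in a unit test.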

I think the pattern of what you’d have to do today would be similar to this. If there’s a Python Twitter SDK that would make it easier for you — great.

We’re looking at implementing async exports support in the low-code CDK, but I’m not yet absolutely sure it will generalize well enough to be in the declarative language of the low-code CDK.

Hey <@U069EMNRPA4>, just passing by to thank you again for the tips. I was finally able to do it.

To anyone who might follow this thread, I ended up using three source connectors as reference: Amazon Ads, Facebook Marketing and Google Ads.

With their async nature and advertising reporting as purpose, combined with their rather different implementations, they allowed me to create my Frankenstein monster lol

Once I have the code better organized I might share the fork repo here.

Cheers!

You’re welcome!

Start a draft PR, I’d be happy to help.

<@U069EMNRPA4> I might have celebrated too early? :sweat_smile: I’m unable to import the Docker image into my Compute Engine deployment (first screenshot)… This is happening even though:

  1. I can successfully run spec locally
  2. I can successfully pull the image via SSH in the VM (Compute Engine)
    a. It has the admin role on Google Artifact Registry
    b. gcloud auth is using the VM service account
    c. gcloud auth configure-docker is set up with the correct domain
  3. I can successfully import the connector on my local deployment, and connection tests pass (last screenshot)

What am I missing?

Not sure if it’s worth noting, but I built the image with poetry and airbyte-ci

Triple-check interpolations — should they be {{tag}} and such instead?
Although I might be tripping

My gut instinct is that if you can pull manually / locally, but can’t pull into Airbyte — check those variables. Can you hardcode full path without interpolations?
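As a quick way to act on the suggestion above, here is a small sketch that checks the repository and tag values for the kind of mistakes being discussed before pasting them into Airbyte. The function name `image_reference_problems` and the specific checks are assumptions for illustration. It does not contact any registry and cannot confirm that the image exists.

```python
import re
from typing import List

# Matches common unresolved template markers such as {{tag}}, ${TAG} or {tag}.
_PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\$\{.*?\}|\{.*?\}")


def image_reference_problems(repository: str, tag: str) -> List[str]:
    """Return human-readable problems found in a repository/tag pair."""
    problems: List[str] = []
    for name, value in (("repository", repository), ("tag", tag)):
        if not value:
            problems.append(f"{name} is empty")
        if value != value.strip():
            problems.append(f"{name} has leading or trailing whitespace")
        if _PLACEHOLDER.search(value):
            problems.append(f"{name} contains an unresolved placeholder")
    # A colon in the last path segment means a tag was appended to the
    # repository name; a registry port (host:5000/...) is not flagged.
    if ":" in repository.rsplit("/", 1)[-1]:
        problems.append("repository already includes a tag")
    return problems
```

An empty list means none of these particular mistakes were found, which narrows the search to credentials or network access on the machine that actually performs the pull.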

I only masked the repository and tag info here, but in both cases (local deployment and VM) I used the same valid values.

I’ll ask how you can get some logs on those :wink:

would be great, thanks!

Alright, here’s what I got out of our folks:

> If it’s on prem — I’d advise them look in their local temporal dashboard with that same WorkflowType="SpecWorkflow" query filter for more info

I restarted Airbyte and kept the terminal open to see the logs. Airbyte apparently thinks the image does not exist when it tries to pull it. To fix this, I first SSH’ed into my VM and pulled the Docker image. I then imported it via Airbyte, since it was now available locally, and this worked fine! Do you have any idea why Airbyte is not finding the image from the valid Docker repo and tag?

/shrug

It’s most likely that something in the URI to the docker image is not right.