Generate ConfiguredAirbyteCatalog for new connector with dynamic schema

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: MacOS
  • Memory / Disk: 64 GB / 2 TB
  • Deployment: Running locally with python main.py read ... in new connector
  • Airbyte Version: git hash 39220f4b1ebee7fce23acebcdbed7e1d259229b6
  • Source name/version: I’m developing a new source
  • Destination name/version: n/a
  • Step: tutorial Step 6: Read Data | Airbyte Documentation
  • Description:

I am following the tutorial to create a new source connector, connected to a custom API which returns streams with custom schemas – the streams and their schemas are determined by API calls. I implemented Stream.get_json_schema as described here Full Refresh Streams | Airbyte Documentation. This command prints the AirbyteCatalog:

python main.py discover --config secrets/config.json

Now I want to call

python main.py read --config secrets/config.json --catalog CATALOG

Where do I get the catalog? How can I make read infer the catalog from discover? If I save the output of discover to a file and pass it to read, read complains, because discover emits an AirbyteCatalog, not a ConfiguredAirbyteCatalog. I can modify the JSON by hand to nest each stream under a stream field and add sync_mode and destination_sync_mode for each stream. Is there a script that does this automatically?
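For anyone landing here with the same question: the manual transformation described above (wrap each stream, add sync modes) is mechanical enough to script. Below is a minimal sketch, not an official Airbyte tool; it assumes discover prints newline-delimited AirbyteMessage JSON with a CATALOG message, and it picks the first supported sync mode plus append as defaults, which you may need to change for your destination.

```python
import json

def configure_catalog(discover_output: str) -> dict:
    """Convert `discover` output (AirbyteCatalog message) into a
    ConfiguredAirbyteCatalog dict by wrapping each stream and
    filling in default sync modes."""
    configured = {"streams": []}
    for line in discover_output.splitlines():
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        # Skip log lines and other non-catalog messages.
        if msg.get("type") != "CATALOG":
            continue
        for stream in msg["catalog"]["streams"]:
            configured["streams"].append({
                "stream": stream,
                # Default to the first supported mode; adjust as needed.
                "sync_mode": stream.get("supported_sync_modes", ["full_refresh"])[0],
                "destination_sync_mode": "append",
            })
    return configured
```

A hypothetical invocation, assuming you save the function in a file named configure_catalog.py with a small stdin/stdout wrapper: `python main.py discover --config secrets/config.json | python configure_catalog.py > catalog.json`, then `python main.py read --config secrets/config.json --catalog catalog.json`.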

  • What I tried:

I looked at https://docs.airbyte.com/understanding-airbyte/airbyte-protocol#configuredairbytestream which describes the ConfiguredAirbyteCatalog but not how to generate one.

https://docs.airbyte.com/understanding-airbyte/beginners-guide-to-catalog/#dynamic-streams-example has a section “Dynamic Streams example” which is not an example of dynamic streams – it’s an example of static streams. Also note the section on ConfiguredAirbyteCatalog doesn’t describe anything dynamic, and the example is invalid because it’s missing destination_sync_mode.


Usually the CATALOG used is the one inside /integration_tests/configured_catalog.json, where you don’t need to specify the json_schema, only the other values.
See example:

{
  "streams": [
    {
      "stream": {
        "name": "agents",
        "json_schema": {},
        "supported_sync_modes": ["full_refresh"]
      },
      "sync_mode": "full_refresh",
      "destination_sync_mode": "append"
    },
    {
      "stream": {
        "name": "business_hours",
        "json_schema": {},
        "supported_sync_modes": ["full_refresh"]
      },
      "sync_mode": "full_refresh",
      "destination_sync_mode": "append"
    }
  ]
}

Thanks!

To close the loop on my confusion, the ConfiguredAirbyteCatalog is generated automatically from discover’s output when going through the UI. The only time you need to use python main.py read ... is for integration tests, which is also the only time you need to create a ConfiguredAirbyteCatalog manually.

I was especially confused because the “Standard Tests” run via ./gradlew :airbyte-integrations:connectors:source-<source-name>:integrationTest don’t work for me (after installing the correct Java version, Gradle started complaining about the Python virtualenv setup on macOS). And I couldn’t test with the UI because the UI doesn’t support local docker builds out of the box. I worked around that by editing airbyte-config/init/src/main/resources/seed/source_specs.yaml to include the new connector. After using the connector in the UI, I was able to piece together how the ConfiguredAirbyteCatalog worked.
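For reference, a sketch of what such a seed entry can look like. This is an assumption based on the shape of existing entries in source_specs.yaml at the time; the exact schema may differ in your Airbyte version, and the image name, tag, and spec fields below are placeholders for illustration only.

```yaml
# Hypothetical entry appended to source_specs.yaml (fields are illustrative).
- dockerImage: "airbyte/source-my-connector:0.1.0"
  spec:
    documentationUrl: "https://example.com/docs"
    connectionSpecification:
      $schema: "http://json-schema.org/draft-07/schema#"
      title: "My Connector Spec"
      type: "object"
      required: ["api_key"]
      properties:
        api_key:
          type: "string"
          airbyte_secret: true
```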

I see, sorry to hear that, Lee. Airbyte uses Gradle for automation tasks, but the Java dependency can be tricky. Do you have any remaining blockers, or are you good?