Unable to update connection syncCatalog using the Airbyte Configuration API

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 4 GB / 40 GB
  • Deployment: Docker
  • Airbyte Version: 0.36.3-alpha
  • Source name/version: BigQuery
  • Destination name/version: GCS
  • Step: Please see description below
  • Description:

Hi, I’m trying to exclude certain columns (PII data) from a connection. I understand there is an open issue on GitHub, “Allow a user to filter fields/columns from a table” (Issue #2227, airbytehq/airbyte), and that one (hacky) option is to modify the connection schema in the DB (I can verify that manual modification does work).

Since Airbyte provides a Configuration API, I wrote a script to programmatically update a connection’s name and the schema in the syncCatalog field. It managed to update the name but not the schema.

The “Update a connection” example in the API doc does not give details on the format of the jsonSchema field. Could you share some insight into how to format the data properly, or whether it is possible to update the schema through the API at all? Here’s the script I have so far:

import requests

url = "http://localhost:8000/api/v1/connections/update"
connection_details = {
    "connectionId": "d267acb1-12fa-XXXX-XXXX-XXXXX",
    "name": "Some connection I want to update",
    "namespaceDefinition": "destination",
    "namespaceFormat": "${SOURCE_NAMESPACE}",
    "syncCatalog": {
         "streams": [
            {
              "stream": {
                 "name": "Some table I want to sync",
                 "jsonSchema": {
                     "type": "object",
                     "properties": {
                        "field1": {
                           "type": "number"
                        },
                        "field2": {
                           "type": "string"
                        }
                     }
                 },
                 "supportedSyncModes": [
                     "full_refresh"
                 ],
                 "namespace": "my table"
               },
               "config": {
                   "syncMode": "full_refresh",
                   "destinationSyncMode": "append",
                   "selected": False
               }
            }
        ]
    },
    "schedule": {
         "units": 24,
         "timeUnit": "hours"
    },
    "status": "inactive"
}
response = requests.post(url, json=connection_details)
print(response.status_code)
print(response.json())

The script executes and returns with the following output:

200
{'connectionId': 'd267acb1-12fa-XXXX-XXXX-XXXXX', 'name': 'Some connection I want to update', 'namespaceDefinition': 'destination', 'namespaceFormat': '${SOURCE_NAMESPACE}', 'prefix': None, 'sourceId': '122b7875-3923-48cc-XXXX-XXXX', 'destinationId': '0163edc2-9fcc-47ed-XXXX-XXXXXX', 'operationIds': [], 'syncCatalog': {'streams': []}, 'schedule': {'units': 24, 'timeUnit': 'hours'}, 'status': 'inactive', 'resourceRequirements': None}

As you can see, the streams list in syncCatalog is empty. Am I using the right format?
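
A quick way to spot this kind of mismatch programmatically is to compare the stream names in the request with those echoed back in the response. The following is only a minimal sketch: `stream_names` and `missing_streams` are hypothetical helpers, not part of the Airbyte API, and they assume nothing beyond the syncCatalog shape used in the script above.

```python
def stream_names(catalog):
    """Return the set of stream names in a syncCatalog-shaped dict."""
    return {entry["stream"]["name"] for entry in catalog.get("streams", [])}


def missing_streams(sent_catalog, returned_catalog):
    """Names present in the request catalog but absent from the response."""
    return stream_names(sent_catalog) - stream_names(returned_catalog)


# Trimmed-down versions of the request and response catalogs from this post.
sent = {"streams": [{"stream": {"name": "Some table I want to sync"}}]}
returned = {"streams": []}
print(missing_streams(sent, returned))
```

With the real script, the same check could be run as `missing_streams(connection_details["syncCatalog"], response.json()["syncCatalog"])`.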

Would it be possible to use a view to filter the columns you need, Jing?

Hi @marcosmarxm, thanks for the suggestion. Do you mean creating a view on the source table without the PII columns and then loading from there? I don’t have full control of the source, but this is worth exploring.

Regarding the format, do you spot anything incorrect?

Jing, I didn’t have time today to investigate and validate whether it’s possible to update the catalog using the API. I’ll try on Monday.

Sorry for the delay, Jing. Currently you can’t edit the catalog generated by Airbyte. Because of this, we recommend creating a view that removes the PII fields. You can follow the issue here: https://github.com/airbytehq/airbyte/issues/2227
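
For reference, the view workaround could look roughly like the sketch below, which only builds the SQL text and does not run it. It assumes BigQuery’s `SELECT * EXCEPT` syntax; `build_pii_free_view_sql` is a hypothetical helper, and the project, dataset, table, and column names are placeholders, not values from this thread.

```python
def build_pii_free_view_sql(source_table, view_name, pii_columns):
    """Build a BigQuery CREATE VIEW statement that omits the given columns."""
    if not pii_columns:
        raise ValueError("pii_columns must not be empty")
    # Backtick-quote each column so reserved words or odd names stay valid.
    excluded = ", ".join(f"`{col}`" for col in pii_columns)
    return (
        f"CREATE OR REPLACE VIEW `{view_name}` AS\n"
        f"SELECT * EXCEPT ({excluded})\n"
        f"FROM `{source_table}`"
    )


# Placeholder names for illustration only.
sql = build_pii_free_view_sql(
    "my_project.my_dataset.customers",
    "my_project.my_dataset.customers_no_pii",
    ["email", "phone"],
)
print(sql)
```

The connection would then sync the view in place of the underlying table, provided the source account is allowed to create and read it.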