- Is this your first time deploying Airbyte?: Yes
- OS Version / Instance: Ubuntu
- Memory / Disk: 4 GB / 40 GB
- Deployment: Docker
- Airbyte Version: 0.36.3-alpha
- Source name/version: BigQuery
- Destination name/version: GCS
- Step: Please see description below
- Description:
Hi, I’m trying to exclude certain columns (PII data) from a connection. I understand there is an open issue on GitHub, "Allow a user to filter fields/columns from a table" (Issue #2227, airbytehq/airbyte), and that one (hacky) option is to modify the connection schema directly in the DB. I can verify that this manual modification does work.
Since Airbyte provides a Configuration API, I wrote a script to programmatically update a connection’s name and the schema in its syncCatalog field. The script managed to update the name but not the schema.
The “Update a connection” example in the API doc does not give details on the format of the jsonSchema field. Could you share some insight on how to format this data properly, or whether it is possible to update the schema through the API at all? Here’s the script I have so far:
import requests

url = "http://localhost:8000/api/v1/connections/update"
connection_details = {
    "connectionId": "d267acb1-12fa-XXXX-XXXX-XXXXX",
    "name": "Some connection I want to update",
    "namespaceDefinition": "destination",
    "namespaceFormat": "${SOURCE_NAMESPACE}",
    "syncCatalog": {
         "streams": [
            {
              "stream": {
                 "name": "Some table I want to sync",
                 "jsonSchema": {
                     "type": "object",
                     "properties": {
                        "field1": {
                           "type": "number"
                        },
                        "field2": {
                           "type": "string"
                        }
                     }
                 },
                 "supportedSyncModes": [
                     "full_refresh"
                 ],
                 "namespace": "my table"
               },
               "config": {
                   "syncMode": "full_refresh",
                   "destinationSyncMode": "append",
                   "selected": False
               }
            }
        ]
    },
    "schedule": {
         "units": 24,
         "timeUnit": "hours"
    },
    "status": "inactive"
}
response = requests.post(url, json=connection_details)
print(response.status_code)
print(response.json())
The script runs without errors and prints the following output:
200
{'connectionId': 'd267acb1-12fa-XXXX-XXXX-XXXXX', 'name': 'Some connection I want to update', 'namespaceDefinition': 'destination', 'namespaceFormat': '${SOURCE_NAMESPACE}', 'prefix': None, 'sourceId': '122b7875-3923-48cc-XXXX-XXXX', 'destinationId': '0163edc2-9fcc-47ed-XXXX-XXXXXX', 'operationIds': [], 'syncCatalog': {'streams': []}, 'schedule': {'units': 24, 'timeUnit': 'hours'}, 'status': 'inactive', 'resourceRequirements': None}
As you can see, the streams list in syncCatalog comes back empty. Am I using the right format?
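
For reference, here is a minimal sketch of the column-exclusion step on its own, separate from the API call. It assumes the syncCatalog has the same shape as in the script above (a "streams" list whose entries hold "stream" and "config" objects). The helper name drop_fields and the sample catalog are hypothetical and only for illustration; this does not show whether the update endpoint will accept the resulting catalog.

```python
import copy


def drop_fields(sync_catalog, stream_name, fields_to_drop):
    """Return a copy of sync_catalog with some fields removed from one stream.

    Hypothetical helper: it only edits the in-memory dict and never
    talks to the Airbyte API.
    """
    catalog = copy.deepcopy(sync_catalog)  # leave the caller's dict untouched
    for entry in catalog.get("streams", []):
        stream = entry.get("stream", {})
        if stream.get("name") != stream_name:
            continue
        properties = stream.get("jsonSchema", {}).get("properties", {})
        for field in fields_to_drop:
            properties.pop(field, None)  # silently skip fields that are absent
    return catalog


# Sample catalog mirroring the structure used in the script above.
catalog = {
    "streams": [
        {
            "stream": {
                "name": "Some table I want to sync",
                "jsonSchema": {
                    "type": "object",
                    "properties": {
                        "field1": {"type": "number"},
                        "field2": {"type": "string"},
                    },
                },
                "supportedSyncModes": ["full_refresh"],
            },
            "config": {
                "syncMode": "full_refresh",
                "destinationSyncMode": "append",
                "selected": True,
            },
        }
    ]
}

filtered = drop_fields(catalog, "Some table I want to sync", ["field2"])
print(list(filtered["streams"][0]["stream"]["jsonSchema"]["properties"]))
```

The idea would be to start from the catalog Airbyte already stores for the connection rather than writing one by hand, remove the PII columns with a helper like this, and pass the result as syncCatalog in the update payload. Whether the catalog can be fetched and written back this way through the Configuration API is an assumption I have not been able to confirm.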