- Is this your first time deploying Airbyte?: Yes
- OS Version / Instance: Ubuntu
- Memory / Disk: 4 GB / 40 GB
- Deployment: Docker
- Airbyte Version: 0.36.3-alpha
- Source name/version: BigQuery
- Destination name/version: GCS
- Step: Please see description below
- Description:
Hi, I’m trying to exclude certain columns (PII data) from a connection. I understand there is an open issue on GitHub, “Allow a user to filter fields/columns from a table” (Issue #2227, airbytehq/airbyte), and that one (hacky) option is to modify the connection schema directly in the DB. I can verify that this manual modification does work.
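For context, the change I want boils down to removing the PII properties from one stream's jsonSchema. A minimal sketch of that step, assuming the catalog has the same shape as the syncCatalog in my script below (the helper name drop_fields is just illustrative, not part of Airbyte):

```python
import copy


def drop_fields(sync_catalog, stream_name, fields_to_drop):
    """Return a copy of sync_catalog with fields removed from one stream's jsonSchema."""
    pruned = copy.deepcopy(sync_catalog)  # leave the caller's catalog untouched
    for entry in pruned.get("streams", []):
        stream = entry.get("stream", {})
        if stream.get("name") != stream_name:
            continue
        properties = stream.get("jsonSchema", {}).get("properties", {})
        for field in fields_to_drop:
            properties.pop(field, None)  # ignore fields that are not present
    return pruned


# Example catalog, shaped like the syncCatalog in the script below (assumed shape).
catalog = {
    "streams": [
        {
            "stream": {
                "name": "Some table I want to sync",
                "jsonSchema": {
                    "type": "object",
                    "properties": {
                        "field1": {"type": "number"},
                        "field2": {"type": "string"},
                    },
                },
            },
            "config": {"syncMode": "full_refresh", "selected": True},
        }
    ]
}

pruned = drop_fields(catalog, "Some table I want to sync", ["field2"])
print(sorted(pruned["streams"][0]["stream"]["jsonSchema"]["properties"]))
```

Whether the API accepts a catalog pruned this way is exactly what I'm unsure about.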
Since Airbyte provides a Configuration API, I wrote a script to programmatically update a connection’s name and the schema in the syncCatalog field. It managed to update the name but not the schema.
The “Update a connection” example in the API docs does not give details on the format of the jsonSchema field. Could you share some insight on how to format the data properly, or whether it is possible to update the schema through the API at all? Here’s the script I have so far:
import requests

url = "http://localhost:8000/api/v1/connections/update"
connection_details = {
    "connectionId": "d267acb1-12fa-XXXX-XXXX-XXXXX",
    "name": "Some connection I want to update",
    "namespaceDefinition": "destination",
    "namespaceFormat": "${SOURCE_NAMESPACE}",
    "syncCatalog": {
        "streams": [
            {
                "stream": {
                    "name": "Some table I want to sync",
                    # Only the columns I want to keep are listed here.
                    "jsonSchema": {
                        "type": "object",
                        "properties": {
                            "field1": {
                                "type": "number"
                            },
                            "field2": {
                                "type": "string"
                            }
                        }
                    },
                    "supportedSyncModes": [
                        "full_refresh"
                    ],
                    "namespace": "my table"
                },
                "config": {
                    "syncMode": "full_refresh",
                    "destinationSyncMode": "append",
                    "selected": False
                }
            }
        ]
    },
    "schedule": {
        "units": 24,
        "timeUnit": "hours"
    },
    "status": "inactive"
}
# Send the update and print the status code and the connection returned by the API.
response = requests.post(url, json=connection_details)
print(response.status_code)
print(response.json())
The script executes and returns with the following output:
200
{'connectionId': 'd267acb1-12fa-XXXX-XXXX-XXXXX', 'name': 'Some connection I want to update', 'namespaceDefinition': 'destination', 'namespaceFormat': '${SOURCE_NAMESPACE}', 'prefix': None, 'sourceId': '122b7875-3923-48cc-XXXX-XXXX', 'destinationId': '0163edc2-9fcc-47ed-XXXX-XXXXXX', 'operationIds': [], 'syncCatalog': {'streams': []}, 'schedule': {'units': 24, 'timeUnit': 'hours'}, 'status': 'inactive', 'resourceRequirements': None}
As you can see, the streams list in syncCatalog is empty. Am I using the right format?
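Since the call returns 200 even though the streams are gone, a small check on the response would make this easier to spot. This is only a sketch: the helper names are made up for illustration, and the two dicts stand in for the syncCatalog I sent and the one in the response above.

```python
def stream_names(catalog):
    """Collect the stream names present in a syncCatalog-shaped dict."""
    return {entry["stream"]["name"] for entry in (catalog or {}).get("streams", [])}


def dropped_streams(sent_catalog, returned_catalog):
    """Return names of streams that were sent but are absent from the response."""
    return sorted(stream_names(sent_catalog) - stream_names(returned_catalog))


# Stand-ins for the request payload and the response shown above.
sent = {"streams": [{"stream": {"name": "Some table I want to sync"}}]}
returned = {"streams": []}

dropped = dropped_streams(sent, returned)
if dropped:
    print("Streams missing from the response:", dropped)
```

In the real script the two arguments would be connection_details["syncCatalog"] and response.json()["syncCatalog"].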