Question about hiding columns in the connection schema

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: COS on GCP
  • Memory / Disk: 4 GB / 1 TB
  • Deployment: Docker
  • Airbyte Version: 0.36
  • Source name/version: Salesforce 1.0.11
  • Destination name/version:
  • Step: The issue is happening during sync.
  • Description:

Hi, I’m trying to hide PII data from the source and would like to confirm a few things.

I understand Airbyte doesn’t provide column selection yet, and I have posted a question about this before.

This time the connection syncs from Salesforce (not a database) to GCS, so the table view solution does not apply. Unfortunately, there are PII columns such as name that cannot be hidden in Salesforce (unless there is a way?).

I tried the hacky solution provided in the GitHub issue, which is to edit the connection catalog directly and remove the PII columns. It seems to work, since I don’t find them in the files on GCS.
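
For reference, the catalog edit described above can be sketched roughly as follows. This is only an illustration, not Airbyte's own tooling: the helper name `drop_fields` is hypothetical, and the layout of the catalog (a `streams` list whose entries hold a `stream` object with a `name` and a JSON schema under `json_schema` or `jsonSchema`) is an assumption that should be checked against the catalog actually stored for the connection.

```python
import copy


def drop_fields(catalog, stream_name, fields):
    """Return a copy of `catalog` with `fields` removed from one stream's schema.

    Hypothetical helper; the catalog layout is an assumption (see above).
    """
    result = copy.deepcopy(catalog)  # leave the caller's catalog untouched
    for entry in result.get("streams", []):
        stream = entry.get("stream", {})
        if stream.get("name") != stream_name:
            continue
        # Assumption: the schema lives under "json_schema" or "jsonSchema".
        schema = stream.get("json_schema") or stream.get("jsonSchema") or {}
        properties = schema.get("properties", {})
        for field in fields:
            properties.pop(field, None)  # ignore fields that are not present
        if "required" in schema:
            schema["required"] = [r for r in schema["required"] if r not in fields]
    return result


# Made-up example catalog with a single Salesforce-like stream.
example_catalog = {
    "streams": [
        {
            "stream": {
                "name": "Contact",
                "json_schema": {
                    "type": "object",
                    "properties": {
                        "Id": {"type": "string"},
                        "Name": {"type": "string"},
                        "Email": {"type": "string"},
                    },
                },
            }
        }
    ]
}

trimmed_catalog = drop_fields(example_catalog, "Contact", ["Name", "Email"])
```

Note that this only changes what the catalog declares. Whether the source still extracts the removed fields is exactly the open question in this post.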

Then this comment got me worried again:
“I tried that solution and realised that the unwanted columns are still being synced in the raw table (they are removed in the normalised version). Is there a way to tackle that problem?”

So my questions are:

  • If the connection catalog does not include a column of a stream, does that mean data from that column won’t be extracted from the source?
  • Or does it depend on the type of connection? For example, mine syncs to GCS, so it doesn’t have the normalization option, but it could be different if the destination is BigQuery, as mentioned in the GitHub comment?
  • Where is the buffered data saved? In the log I see things like Records read: 50000 (120 MB), Flushing all 1 current buffers (180 MB in total), and Finished writing data to ee4b2c65-5193-49ef-a5a5-c6233e04eef110743XXXXXXXX.csv. Is the buffered data saved on the host that runs Airbyte? I ask because I want to inspect the file ee4b2c65-5193-49ef-a5a5-c6233e04eef110743XXXXXXXX.csv to make sure that no PII columns are extracted. Do you know where I can find this file, or whether it makes sense to check it?
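
The kind of check I have in mind for such a CSV file could look like the sketch below. It assumes the file is either flattened (one column per field) or carries each record as a JSON object in a column named `_airbyte_data`; that column name and the helper name `find_pii_columns` are assumptions for illustration, and the sample rows are made up.

```python
import csv
import io
import json


def find_pii_columns(csv_text, pii_fields, json_column="_airbyte_data"):
    """Return the PII field names that appear in the CSV header or in JSON records."""
    pii = set(pii_fields)
    reader = csv.DictReader(io.StringIO(csv_text))
    # Case 1: flattened output, where PII fields would show up as column names.
    found = pii & set(reader.fieldnames or [])
    # Case 2: each record stored as a JSON object in a single column (assumed name).
    for row in reader:
        blob = row.get(json_column)
        if not blob:
            continue
        try:
            record = json.loads(blob)
        except json.JSONDecodeError:
            continue  # not JSON, nothing to inspect in this row
        if isinstance(record, dict):
            found |= pii & set(record)
    return found


def build_csv(rows):
    """Build CSV text from rows with the csv module so quoting is handled correctly."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up samples: one flattened file and one with a JSON record column.
flattened_sample = build_csv([["Id", "Name"], ["001", "Alice"]])
raw_sample = build_csv([
    ["_airbyte_ab_id", "_airbyte_data"],
    ["abc", json.dumps({"Id": "001", "Email": "a@example.com"})],
])

flattened_hits = find_pii_columns(flattened_sample, ["Name", "Email"])
raw_hits = find_pii_columns(raw_sample, ["Name", "Email"])
```

An empty result for a real file would only show that the listed field names are absent from that file, not that the source never read them.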

Thank you :)

Hey @Jing,

This is an excellent question! Right off the bat I can say that yes, if the catalog does not include a column, it will not be in the stream.

The other points you asked about I’ll need some time to look into. Hope to get back to you with more info soon!

Hi @natalyjazzviolin,

I just want to add an update: even after editing the catalog to remove a column, that column can still show up in the destination.

I used a BigQuery destination in a connection and removed a column from the catalog. The final (normalized) table looks OK, but unfortunately the raw table still contains the column. This confirms the answer in the GitHub issue “Allow a user to filter fields/columns from a table” (Issue #2227, airbytehq/airbyte).

I really wish Airbyte could prioritize the Column Selection feature (Harvestr app), as it has become a major pain point when PII data is involved.

Thanks,
Jing