Summary
Exploring options for custom data transformations within Airbyte connectors without adding a new layer like dbt
Question
Hi everyone, I'm new to Airbyte and exploring whether it works for my use case, and whether there is any configuration I'm missing. I want to get data from some sources, transform it, and load it into a database that will be queried by Superset. With that general description Airbyte seems like a good option. But I need to transform the data, i.e. build some pipelines that customize the stored data a bit and make things easier when building dashboards in Superset. I have an Airbyte instance running and already connected to two different sources; the problem is that I get raw data, with some Airbyte metadata columns (like date of access, and so on). My question is: is there any hook or something I can do to add custom transformations between extraction and loading that does not involve adding a new layer to the infrastructure, like dbt? (I'm attaching a picture with an example of the data I'm getting from HubSpot via Airbyte.)
This topic has been created from a Slack thread to give it more visibility.
It will be in read-only mode here.
["custom-data-transformation", "airbyte-connector", "data-pipelines", "superset", "data-transformation"]
It’s worth noting that the example you’re showing isn’t Airbyte’s output, but rather its raw copy of the source data (used for incremental syncing and such). Here the response from the API is basically just stored as JSON along with some metadata. There’s also a “final” output table that is more structured that you likely want to use instead. Take a look at that and see if it meets your needs (but note, the final table won’t usually exist until the sync is completed the first time). Where this table is located is based on the settings of the connection (depending on the destination, namespace usually maps to something like dataset, then the table names will be the source’s table names with a prefix applied if specified).
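To make the raw-vs-final distinction concrete, here is a minimal sketch. It uses Python's stdlib `sqlite3` as a stand-in for your real destination, and the table and column names are simplified illustrations, not Airbyte's exact internal schema: the raw table stores each API response as one JSON blob plus sync metadata, while the final table is that same record unpacked into real columns.

```python
import json
import sqlite3

# In-memory SQLite as a stand-in for a real destination (BigQuery, Postgres, ...).
conn = sqlite3.connect(":memory:")

# A raw-style table: the whole source record as one JSON blob plus sync metadata.
# (Names are illustrative, not Airbyte's exact internal schema.)
conn.execute("""
    CREATE TABLE raw_contacts (
        _airbyte_raw_id TEXT,
        _airbyte_extracted_at TEXT,
        _airbyte_data TEXT  -- the source record, stored as JSON
    )
""")
conn.execute(
    "INSERT INTO raw_contacts VALUES (?, ?, ?)",
    ("abc-123", "2024-01-01T00:00:00Z",
     json.dumps({"id": 42, "email": "ada@example.com", "company": "Acme"})),
)

# The "final" table is the same data unpacked into real columns,
# which is the shape you would point Superset at.
rows = conn.execute("SELECT _airbyte_data FROM raw_contacts").fetchall()
contacts = [json.loads(r[0]) for r in rows]
print(contacts[0]["email"])  # ada@example.com
```

In a real warehouse Airbyte builds that final table for you after the first completed sync; the sketch is only meant to show why querying the raw table directly gives you JSON blobs rather than columns.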
When it comes to additional transformations without using something like dbt, some folks will use query-based transformations, but this configuration is very dependent on your destination. For example, in BigQuery you could listen for the write operation against the destination, then trigger a “scheduled” query (set to manual) to further transform the data. Or just run that query on a schedule if recency isn’t as critical.
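The query-based approach can be sketched roughly like this. Again using stdlib `sqlite3` purely for illustration (your destination's SQL dialect and trigger mechanism will differ), the idea is: let Airbyte land the data untouched, then run a SQL step afterwards that rebuilds a reporting table shaped for your dashboards. The `transform` function and the `contacts_per_company` table are made-up names for this sketch.

```python
import sqlite3

# Stand-in destination with data as Airbyte would have landed it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER, email TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [(1, "a@acme.com", "Acme"), (2, "b@acme.com", "Acme"), (3, "c@other.io", "Other")],
)

def transform(conn: sqlite3.Connection) -> None:
    # The post-sync "scheduled query": rebuild a reporting table
    # from the synced data, shaped for the Superset dashboard.
    conn.executescript("""
        DROP TABLE IF EXISTS contacts_per_company;
        CREATE TABLE contacts_per_company AS
        SELECT company, COUNT(*) AS n_contacts
        FROM contacts
        GROUP BY company;
    """)

# In practice this step would be kicked off after each Airbyte sync completes
# (e.g. a BigQuery scheduled query set to manual, or a cron job on the warehouse).
transform(conn)
print(conn.execute(
    "SELECT company, n_contacts FROM contacts_per_company ORDER BY company"
).fetchall())  # [('Acme', 2), ('Other', 1)]
```

Because the transformation is just SQL against the destination, it adds no new service to the infrastructure; the trade-off is that the trigger wiring is destination-specific.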