Using dbt tranformation to aggregate data during a sync?

jschuur · December 13, 2022, 11:07am

New to Airbyte and dbt, and I’d like to transform data from sessions in a source database to aggregates in a destination during a sync.

Overly simplified, I’d like to append/upsert the results of this query from the source to the destination and run this hourly:

SELECT 
  DATE(created_at) AS date,
  COUNT(1) AS count
FROM
  sessions
WHERE
  DATE(created_at) > NOW() - INTERVAL '1 days'
GROUP BY
  date

The actual query is more complex, since I’d be counting the amont of times certain events occurred during each session. I already have a query that does that just fine, so I’m keeping it simple for this post.

Is this something I’d do via dbt during an Airbyte sync under the transformation section in a connection?

Looking through the guide on transformations, it looks like those run on the destination data. So I’m actually transferring all my session data (a much larger data set) to the destination first and then reducing it to aggregates?

That doesn’t quite feel like what I want to do, if I could transform the data using the source dataset instead. I also don’t want the full session data in the destination at all (maybe the transformation can clean it up afterwards?).

marcosmarxm · December 13, 2022, 4:02pm

Hello there! You are receiving this message because none of your fellow community members has stepped in to respond to your topic post. (If you are a community member and you are reading this response, feel free to jump in if you have the answer!) As a result, the Community Assistance Team has been made aware of this topic and will be investigating and responding as quickly as possible.
Some important considerations that will help your to get your issue solved faster:

It is best to use our topic creation template; if you haven’t yet, we recommend posting a followup with the requested information. With that information the team will be able to more quickly search for similar issues with connectors and the platform and troubleshoot more quickly your specific question or problem.
Make sure to upload the complete log file; a common investigation roadblock is that sometimes the error for the issue happens well before the problem is surfaced to the user, and so having the tail of the log is less useful than having the whole log to scan through.
Be as descriptive and specific as possible; when investigating it is extremely valuable to know what steps were taken to encounter the issue, what version of connector / platform / Java / Python / docker / k8s was used, etc. The more context supplied, the quicker the investigation can start on your topic and the faster we can drive towards an answer.
We in the Community Assistance Team are glad you’ve made yourself part of our community, and we’ll do our best to answer your questions and resolve the problems as quickly as possible. Expect to hear from a specific team member as soon as possible.

Thank you for your time and attention.
Best,
The Community Assistance Team

marcosmarxm · December 14, 2022, 12:16am

You can use custom transformation to run an analytic function after every sync https://docs.airbyte.com/operator-guides/transformation-and-normalization/transformations-with-airbyte

jschuur · December 15, 2022, 11:12am

I’ve seen those docs and in my scenario, that means transferring a lot of data to the destination first and then reducing it down to a much smaller aggregate dataset. I’d be transferring 1000 or more times the amount of data than I need. This seems a little inefficient, without even considering the bandwidth implications.

Is there a fundamental reason why AirByte can’t be more selective at what data is picked up at the source end before it syncs?

I know there are multiple sync modes which abstract away the query used to read in the data, but it seems like if I could provide custom logic at this stage, this would ensure only the aggregated data I want ends up being sent to the destination.

I fully understand that AirByte might just not be the kind of product I need here.

marcosmarxm · December 15, 2022, 7:41pm

Joost, Airbyte follows ELT paradigms which moves all your data to your data warehouse. This paradigms acts like this because storage is cheap and if you need to reprocess anything you already have your data in your analytics tool.
If you’re using a source Database you can always create a view in the source and ingest data from it.

Topic		Replies	Views
Fetching data from Azure Table Storage to Postgres Q&A	3	232	February 27, 2023
Best practices for handling data resets Q&A normalization , getting-started , data-loading , schema	4	1883	July 14, 2022
Custom Data Transformation in Airbyte Connector Questions airbyte-connector , connector , question , data-transformation , custom-data-transformation	1	1006	July 31, 2024
Using Airbyte Cloud for Running DBT Transformations Platform Questions platform , airbyte-cloud , question , dbt-transformations , raw-data	0	58	May 14, 2024
Maintain datetime/timestamp datatype in destination (IE Postgres & SQL Server sources) Connector Questions & Issues normalization , data-loading , connectors	1	839	October 25, 2022

Using dbt tranformation to aggregate data during a sync?

Related topics