Summary
How to retrieve children blocks in the correct order from a Notion page synced to BigQuery using Airbyte connectors
Question
I want to reconstruct a Notion page synced to BigQuery using Airbyte and it’s Notion and BigQuery source and destination connectors. The problem is that while I can locate a parent page/block using parent
column in blocks
table unfortunately naughtier blocks nor pagers have children column. How can I use airbyte to list the children blocks in the correct order?
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here. Click here if you want to access the original thread.
Join the conversation on Slack
["notion-source-connector", "bigquery-destination-connector", "airbyte", "children-blocks", "order"]
hey Maciej - I just wrote some parsing logic for this, and what I ended up doing was recursively rebuilding the page as a tree, then traversing it and flattening it with an accumulator (i.e. with reduce
) to pull out just the plain_text
(which is all we needed for our purposes). Fortunately as far as I can tell the blocks do get written in order to the data file (at least for JSONL, which is what we’re using). The reason it needs to be a tree is that a block could have another block as its parent, we found all blocks do have a parent id (either a page_id
or a block_id
), so you can determine the children.
Building the tree basically looks like:
• get the id from the page, store it as a key on a dictionary/object/map (depending on your language of choice), along with another key children
, and
• create a function to run on children
that maps across the values in the blocks
data file and checks if the parent_id
matches, if no results return an empty array, otherwise return the a dictionary/object/map with a text
field on it (and store the text from the record there) and a children
field, then run the same function for its children
There’s some additional data munging because based on i.e. the type
of the parent
field you’ll need to check whether you should be getting the page_id
or block_id
, and maybe a couple other things. Feel free to message me if you have questions though
Thank you <@U05V19A54QN> for your answer. In my case I would like to use a big query as a destination so I can’t depend on file creation dates.
I’m at the point when I’m starting to think about forking the Notion connector so it exports child blocks.
Another option is to write a cloud function that will be using a Notification API to augment the “blocks” table.
Every solution looks like an ugly hack to me. I can’t believe that the notion connector is so bad. I think it is such an obvious use case that someone had to notice this problem.
Another thing that borders me when I think about a time stamp based approach is data consistency. What will happen when my parsing job will be executed before the airbite sync is finished? How to handle blocks deletion?