Summary
The user is stuck on building a GitBook connector and needs help with handling ‘sub pages’.
Question
I am stuck on building this GitBook connector. I am 95% there, but not sure how to handle the “sub pages”.
https://github.com/airbytehq/airbyte/discussions/38072
This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here.
`["gitbook-connector", "building", "sub-pages", "airbyte"]`
So sometimes the best way to work through this is to look at a platform’s existing client libraries. In this case, I see them implementing a method called `listPages`, which calls an (undocumented AFAIK) endpoint of `/spaces/${spaceId}/content/pages`. So maybe try that and see if it gets you what you want in a more sane way. If it’s in their client library, it’s unlikely to be deprecated.
That full code is available here:
https://www.npmjs.com/package/@gitbook/api?activeTab=code
The note above is based on line ~1131 in `/@gitbook/api/dist/index.js`. This is the code I’m referencing:
```
    path: `/spaces/${spaceId}/content/pages`,
    method: "GET",
    query,
    secure: true,
    format: "json",
    ...params
}),
```
There’s also this around line ~5229 in `/@gitbook/api/dist/client.d.ts`:
```
/**
 * No description
 *
 * @name ListPages
 * @summary List all pages for the main revision content of a space
 * @request GET:/spaces/{spaceId}/content/pages
 * @secure
 */
listPages: (spaceId: string, query?: {
    /**
     * If `false` is passed, "git" mutable metadata will not returned. Passing `false` can optimize performances of the lookup.
     * @default true
     */
    metadata?: boolean;
}, params?: RequestParams) => Promise<HttpResponse<{
    pages: RevisionPage[];
}, Error>>;
```
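To try the endpoint outside the client library, a minimal sketch of building the request URL might look like the following. Only the path and the `metadata` query parameter come from the excerpts above; the base URL `https://api.gitbook.com/v1` and the helper name `listPagesUrl` are assumptions for illustration and should be checked against GitBook's own API docs.

```typescript
// Assumption: the public API base URL. Not taken from the thread.
const GITBOOK_API_BASE = "https://api.gitbook.com/v1";

// Hypothetical helper: builds the URL for GET /spaces/{spaceId}/content/pages.
function listPagesUrl(spaceId: string, metadata?: boolean): string {
  const path = `/spaces/${encodeURIComponent(spaceId)}/content/pages`;
  // The type definition says `metadata` defaults to true, so only send it when set.
  const query = metadata === undefined ? "" : `?metadata=${String(metadata)}`;
  return `${GITBOOK_API_BASE}${path}${query}`;
}

console.log(listPagesUrl("my-space-id", false));
```

Passing `false` for `metadata` mirrors the hint in the type definition that skipping the "git" metadata can make the lookup cheaper.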
Thank you <@U035912NS77>, digging through your comment now.
<@U035912NS77> just tried it and it is very similar to the call I am making already… It still has the nested pages array in the pages objects.
Looking at their docs, I think those nested pages may be revisions. But at least with this endpoint you know you’re already getting all the pages, so you could just drop that key. Or try to parse it to get just the list of IDs if you think this is actually a hierarchy and not revisions (although maybe `path` would get you that?).
You could also extract the same endpoint twice: once for the pages, then again for the hierarchy, using the Record Selector to pull just the nested page values for each page (maybe calling this as a child stream so you can use a transformation to add the parent’s ID as a field).
<@U035912NS77> first off I am super appreciative of your thoughts.
I believe the “sub pages” are not revisions; they are nested pages in the docs. I have access to the GitBook engineering folks, so I will check with them to verify.
This is an example of output I am seeing. I did not think of dropping the key(s) and using the `path`. I think that is a great idea and will try that next.
Well, looking at this again, I actually need all the nested `id` values, not `path`.
Example response from your first suggestion above. Do I need to get all the `id` values from all the pages? I guess that would be something like `*,id`?
OK, I am failing to work out which technique to use to extract every `id` from the JSON graph.
Technically, since you only want that one field, you could just use the default extractor and only map the ID field in the schema (unless their API gives you an option to limit the fields returned, which would just save bytes on the wire).
I want all the “id” values from all of the JSON. Is that the same answer as you gave ^?
If you look at the screenshot, there can be many `id` values in the one response (nested at many levels).
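To make the goal concrete, pulling every `id` out of a nested response at any depth can be sketched as a small recursive walk. This is plain TypeScript for illustration, not Connector Builder config. The `PageNode` shape (an `id` plus an optional nested `pages` array) is an assumption based on the description in this thread, and the sample data is made up.

```typescript
// Assumed shape: each page has an `id` and may carry a nested `pages` array.
interface PageNode {
  id: string;
  pages?: PageNode[];
}

// Depth-first walk that returns every `id`, at any nesting depth.
function collectIds(pages: PageNode[]): string[] {
  const ids: string[] = [];
  for (const page of pages) {
    ids.push(page.id);
    ids.push(...collectIds(page.pages ?? []));
  }
  return ids;
}

// Made-up sample data for illustration only.
const sample: PageNode[] = [
  { id: "a", pages: [{ id: "a1", pages: [{ id: "a1x" }] }, { id: "a2" }] },
  { id: "b" },
];

console.log(collectIds(sample)); // [ 'a', 'a1', 'a1x', 'a2', 'b' ]
```

The recursion is what a single-level selector cannot express: each level's children are handled by the same function that handled their parent.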
(sorry if I am asking dumb questions. This is my first “real” connector)
That seems to yield the nested `id` values, but it does not seem to yield the nested-nested pages, which makes sense I guess. So maybe I can somehow have this be a parent stream of itself to keep going for each level? (Does it seem like I am on the right track, or maybe going in the wrong direction?)
Probably not the easiest one to learn on, but you’ll know it a lot better than most!
I know in transformations you can use `**` to match across levels, but I’m not sure in an extractor. Might be worth trying though.
But you may not need to. Logic-ing through this:
- If `content/pages` returns all pages, this gives us all our pages
- Then a child stream could return the first level of page `id`s under that, and associate the children with the parents
- In the child stream, you can use a transformation to add a field like `parent_id` based on the current partition from the parent stream, and then only pull the `id` in the schema
- For clarity, you could also use a transformation to add the child `id` as a new field named `child_id` and then remove the unprefixed one, so you effectively get a list of `parent_id` and `child_id` to build any hierarchy you need
- So now you have a stream for `pages` and a stream for relationships
That feels sane to me, probably more sane than having any source with nesting. You just need to make sure that the child stream is non-incremental (since you want to know the current hierarchy any time a record is synced).
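As a rough picture of the two outputs described above (a flat `pages` stream and a `parent_id`/`child_id` relationships stream), here is a sketch in plain TypeScript. It is not Connector Builder config, it walks every level rather than just the first, and the `NestedPage` shape, the function name `splitStreams`, and the sample data are all assumptions for illustration.

```typescript
// Assumed record shape: an `id`, a `path`, and an optional nested `pages` array.
interface NestedPage {
  id: string;
  path: string;
  pages?: NestedPage[];
}

interface FlatPage { id: string; path: string; }
interface Relationship { parent_id: string; child_id: string; }

// Walks the tree once, emitting one flat page record per node and one
// (parent_id, child_id) record per direct parent/child link.
function splitStreams(roots: NestedPage[]): { pages: FlatPage[]; relationships: Relationship[] } {
  const pages: FlatPage[] = [];
  const relationships: Relationship[] = [];

  const visit = (node: NestedPage, parentId?: string): void => {
    pages.push({ id: node.id, path: node.path }); // nested `pages` key is dropped
    if (parentId !== undefined) {
      relationships.push({ parent_id: parentId, child_id: node.id });
    }
    for (const child of node.pages ?? []) {
      visit(child, node.id);
    }
  };

  for (const root of roots) {
    visit(root);
  }
  return { pages, relationships };
}

// Made-up sample data for illustration only.
const demo: NestedPage[] = [
  { id: "a", path: "a", pages: [{ id: "a1", path: "a/a1", pages: [{ id: "a1x", path: "a/a1/a1x" }] }] },
  { id: "b", path: "b" },
];

const result = splitStreams(demo);
console.log(result.relationships);
// [ { parent_id: 'a', child_id: 'a1' }, { parent_id: 'a1', child_id: 'a1x' } ]
```

With the two lists side by side, any hierarchy can be rebuilt downstream by joining `relationships` back onto `pages`, which is the reason for keeping the source itself free of nesting.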
(If you haven’t seen the parent stream config yet, you can take a look at [this page in the docs](https://docs.airbyte.com/connector-development/connector-builder-ui/partitioning) to wrap your head around that and the broader concept of partitioning.)
That’s where you’ll get a value like `{{ stream_partition.parent_id }}` (at least if you call the “Current Parent Key Value Identifier” `parent_id`; that’s a name you choose).
When I talk about changing the output schema: you can run it once with autodetect on, then toggle it off and edit the schema so that it only contains the fields you specify and isn’t redundant to the parent. (And again, some APIs have a param you can pass to only return certain fields, which may help speed up the requests too.)