Handling sub pages in GitBook connector

slack-user-airbyte · May 14, 2024, 2:32pm

Summary

The user is stuck on building a GitBook connector and needs help with handling ‘sub pages’.

Question

I am stuck on building this GitBook connector. I am 95% there but not sure how to handle the “sub pages”

https://github.com/airbytehq/airbyte/discussions/38072

This topic has been created from a Slack thread to give it more visibility.
It will be on Read-Only mode here.


slack-user-airbyte · May 14, 2024, 2:45pm

So sometimes the best way to work through this is to look at a platform’s existing client libraries. In this case, I see them implementing a method called listPages which calls an (undocumented AFAIK) endpoint of /spaces/${spaceId}/content/pages. So maybe try that and see if it gets you what you want in a more sane way. If it’s in their client library, it’s unlikely to be deprecated.

That full code is available here:
https://www.npmjs.com/package/@gitbook/api?activeTab=code

The note above is based on line ~1131 in /@gitbook/api/dist/index.js — this is the code I’m referencing:

```
  path: `/spaces/${spaceId}/content/pages`,
  method: "GET",
  query,
  secure: true,
  format: "json",
  ...params
}),
```

slack-user-airbyte · May 14, 2024, 2:45pm

There’s also this around line ~5229 in /@gitbook/api/dist/client.d.ts:

```
/**
 * No description
 *
 * @name ListPages
 * @summary List all pages for the main revision content of a space
 * @request GET:/spaces/{spaceId}/content/pages
 * @secure
 */
listPages: (spaceId: string, query?: {
    /**
     * If `false` is passed, "git" mutable metadata will not returned. Passing `false` can optimize performances of the lookup.
     * @default true
     */
    metadata?: boolean;
}, params?: RequestParams) => Promise<HttpResponse<{
    pages: RevisionPage[];
}, Error>>;
```
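For anyone calling this endpoint outside the client library (for example from the Connector Builder), here is a minimal sketch of the URL that request resolves to. The base URL `https://api.gitbook.com/v1` is an assumption, since it does not appear in the snippets above, and `buildListPagesUrl` is a hypothetical helper, not part of `@gitbook/api`.

```typescript
// Assumption: the API base URL; it is not shown in the snippets above.
const GITBOOK_API_BASE = "https://api.gitbook.com/v1";

// Hypothetical helper: builds the URL for GET /spaces/{spaceId}/content/pages.
// `metadata` mirrors the optional query parameter declared in client.d.ts.
function buildListPagesUrl(spaceId: string, metadata?: boolean): string {
  const url = `${GITBOOK_API_BASE}/spaces/${encodeURIComponent(spaceId)}/content/pages`;
  // Only append the query string when the caller sets the flag explicitly.
  return metadata === undefined ? url : `${url}?metadata=${metadata}`;
}
```

Passing `metadata` as `false` corresponds to the optimization described in the doc comment; the request itself still needs whatever authentication the `@secure` tag implies.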

slack-user-airbyte · May 14, 2024, 2:45pm

Thank you, digging through your comment now.

slack-user-airbyte · May 14, 2024, 2:45pm

Just tried it, and it is very similar to the call I am already making. It still has the nested pages array in the pages objects.

slack-user-airbyte · May 14, 2024, 4:08pm

Looking at their docs, I think those nested pages may be revisions. But at least with this endpoint you know you’re already getting all the pages, so you could just drop that key. Or try to parse it to get just the list of IDs, if you think this is actually a hierarchy and not revisions (although maybe path would get you that?).

You could also extract the same endpoint twice; once for the pages, then again for the hierarchy using the Record Selector to pull just the nested page values for each page (maybe calling this as a child stream so you can use a transformation to add the parent’s ID as a field)
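If `path` does turn out to encode the hierarchy, a parent could be derived by dropping the last path segment. This is only a sketch: it assumes slash-separated paths such as `guides/setup/install` (a made-up example, not verified against real GitBook output), and `parentPath` is a hypothetical helper.

```typescript
// Hypothetical helper: returns the parent path, or null for a top-level page.
// Assumes `path` uses "/" as the separator between hierarchy levels.
function parentPath(path: string): string | null {
  const idx = path.lastIndexOf("/");
  return idx === -1 ? null : path.slice(0, idx);
}
```

This yields parent paths rather than parent ids, so it only helps if a path-based hierarchy is enough.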

slack-user-airbyte · May 14, 2024, 4:08pm

First off, I am super appreciative of your thoughts.

I believe the “sub pages” are not revisions; they are nested pages in the docs. I have access to the GitBook engineering folks, so I will check with them to verify.

This is an example of the output I am seeing. I did not think of dropping the key(s) and using the path. I think that is a great idea and will try that next.

slack-user-airbyte · May 14, 2024, 4:08pm

Well, looking at this again, I actually need all the nested id values, not the path.

slack-user-airbyte · May 14, 2024, 4:08pm

Example response from your first suggestion above. Do I need to get all the id values from all the pages? I guess that would be something like *,id?

slack-user-airbyte · May 14, 2024, 4:08pm

OK, I am failing to work out which technique to use to extract every id from the JSON graph.

slack-user-airbyte · May 14, 2024, 4:08pm

Technically, since you only want that one field, you could just use the default extractor and only map the ID field in the schema (unless their API gives you an option to limit the fields returned, which would just save bytes on the wire).

slack-user-airbyte · May 14, 2024, 4:08pm

I want all the “id” values from all of the JSON. Is that the same answer as the one you gave above?

slack-user-airbyte · May 14, 2024, 4:08pm

If you look at the screenshot, there can be many id values in one response (nested at many levels).
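To make the goal concrete: pulling every id out of an arbitrarily deep `pages` tree is a recursive walk. The sketch below shows the traversal itself, not a Connector Builder feature. The `PageNode` shape (an `id` plus an optional nested `pages` array) is an assumption based on the response described in this thread, and `collectIds` is a hypothetical name.

```typescript
// Assumed shape: each page has an `id` and may carry nested `pages`.
interface PageNode {
  id: string;
  pages?: PageNode[];
}

// Depth-first walk that returns every id at every nesting level.
function collectIds(pages: PageNode[]): string[] {
  const ids: string[] = [];
  for (const page of pages) {
    ids.push(page.id);
    if (page.pages) {
      // Recurse into the nested pages, however deep they go.
      ids.push(...collectIds(page.pages));
    }
  }
  return ids;
}

// Made-up sample with three levels of nesting.
const samplePages: PageNode[] = [
  { id: "a", pages: [{ id: "a1", pages: [{ id: "a1x" }] }, { id: "a2" }] },
  { id: "b" },
];
const allIds = collectIds(samplePages); // ["a", "a1", "a1x", "a2", "b"]
```

Something like this could run as post-processing on the synced records or inside a custom component if the Builder’s extractor cannot reach every level.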

slack-user-airbyte · May 14, 2024, 4:32pm

(sorry if I am asking dumb questions. This is my first “real” connector)

slack-user-airbyte · May 14, 2024, 4:32pm

I think this is correct

slack-user-airbyte · May 14, 2024, 4:32pm

That seems to yield nested id

slack-user-airbyte · May 14, 2024, 4:32pm

But it does not seem to yield the nested-nested pages, which makes sense, I guess. So maybe I can somehow have this be a parent stream of itself to keep going for each level? (Does that seem like I am on the right track, or maybe going in the wrong direction?)

slack-user-airbyte · May 14, 2024, 4:32pm

Probably not the easiest one to learn on, but you’ll know it a lot better than most!

I know in transformations you can use ** to match across levels, but I’m not sure in an extractor. Might be worth trying though.

But you may not need to. Reasoning through this:

content/pages returns all pages, so that gives us every page.
A child stream could then return the first level of page ids under each page and associate the children with their parents.
In the child stream, you can use a transformation to add a field like parent_id based on the current partition from the parent stream, and then keep only the id in the schema.
For clarity, you could also use a transformation to add the child id as a new field named child_id and then remove the unprefixed one, so you effectively get a list of parent_id and child_id pairs to build any hierarchy you need.
So now you have a stream for pages and a stream for relationships.

That feels sane to me, probably more sane than having any source with nesting. You just need to make sure that the child stream is non-incremental (since you want to know the current hierarchy any time a record is synced).
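The relationships stream described above boils down to emitting one `parent_id` / `child_id` row per nested page. Here is a rough sketch of that output shape, again assuming each page has an `id` and an optional nested `pages` array; `toRelationships` is a hypothetical name, and the recursion into deeper levels goes beyond the first-level child stream described above.

```typescript
// Assumed shape: each page has an `id` and may carry nested `pages`.
interface PageNode {
  id: string;
  pages?: PageNode[];
}

// One row per parent/child edge in the page tree.
interface PageRelationship {
  parent_id: string;
  child_id: string;
}

function toRelationships(pages: PageNode[]): PageRelationship[] {
  const rows: PageRelationship[] = [];
  for (const parent of pages) {
    const children = parent.pages ?? [];
    // Direct children of this page.
    for (const child of children) {
      rows.push({ parent_id: parent.id, child_id: child.id });
    }
    // Recurse so that deeper levels also produce rows.
    rows.push(...toRelationships(children));
  }
  return rows;
}
```

With rows like these, any hierarchy can be rebuilt downstream by joining the relationships back to the pages stream on id.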

slack-user-airbyte · May 14, 2024, 4:32pm

(If you haven’t seen the parent stream config yet, you can take a look at this page in the docs to wrap your head around that and the broader concept of partitioning: https://docs.airbyte.com/connector-development/connector-builder-ui/partitioning)

That’s where you’ll get a value like {{ stream_partition.parent_id }} (at least if you call the “Current Parent Key Value Identifier” parent_id—that’s a name you choose)

slack-user-airbyte · May 14, 2024, 4:32pm

When I talk about changing the output schema: you can run it once with autodetect on, then toggle it off and edit the schema so that it contains only the fields you specify and isn’t redundant with the parent. (And again, some APIs have a param you can pass to return only certain fields, which may help speed up the requests too.)
