Handling sub pages in GitBook connector

Summary

The user is stuck on building a GitBook connector and needs help with handling ‘sub pages’.


Question

I am stuck on building this GitBook connector. I am 95% there but not sure how to handle the “sub pages”.

https://github.com/airbytehq/airbyte/discussions/38072



This topic has been created from a Slack thread to give it more visibility.

`["gitbook-connector", "building", "sub-pages", "airbyte"]`

So sometimes the best way to work through this is to look at a platform’s existing client libraries. In this case, I see them implementing a method called listPages which calls an (undocumented AFAIK) endpoint of /spaces/${spaceId}/content/pages. So maybe try that and see if it gets you what you want in a more sane way. If it’s in their client library, it’s unlikely to be deprecated.

That full code is available here:
https://www.npmjs.com/package/@gitbook/api?activeTab=code

The note above is based on line ~1131 in /@gitbook/api/dist/index.js — this is the code I’m referencing:

```js
  path: `/spaces/${spaceId}/content/pages`,
  method: "GET",
  query,
  secure: true,
  format: "json",
  ...params
}),
```

There’s also this around line ~5229 in /@gitbook/api/dist/client.d.ts:

```ts
/**
 * No description
 *
 * @name ListPages
 * @summary List all pages for the main revision content of a space
 * @request GET:/spaces/{spaceId}/content/pages
 * @secure
 */
listPages: (spaceId: string, query?: {
    /**
     * If `false` is passed, "git" mutable metadata will not returned. Passing `false` can optimize performances of the lookup.
     * @default true
     */
    metadata?: boolean;
}, params?: RequestParams) => Promise<HttpResponse<{
    pages: RevisionPage[];
}, Error>>;
```
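For anyone who wants to hit that endpoint without the client library, here is a minimal sketch of how the request could be assembled. The base URL `https://api.gitbook.com/v1`, the Bearer token header, and the helper name `buildListPagesRequest` are assumptions added for illustration and do not come from the thread; only the path and the optional `metadata` query parameter come from the client code quoted above.

```typescript
// Assumption: GitBook's public API base URL. Not stated in the thread.
const API_BASE = "https://api.gitbook.com/v1";

interface ListPagesRequest {
  url: string;
  method: "GET";
  headers: Record<string, string>;
}

// Hypothetical helper: builds (but does not send) the listPages request.
function buildListPagesRequest(
  spaceId: string,
  token: string,
  metadata?: boolean,
): ListPagesRequest {
  const path = `/spaces/${encodeURIComponent(spaceId)}/content/pages`;
  // `metadata` mirrors the optional query parameter in the client typings;
  // it is left off the URL entirely when the caller does not set it.
  const query = metadata === undefined ? "" : `?metadata=${metadata}`;
  return {
    url: `${API_BASE}${path}${query}`,
    method: "GET",
    // Assumption: token-based Bearer auth.
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Placeholder space id and token, for illustration only.
const listPagesRequest = buildListPagesRequest("my-space-id", "my-token", false);
```

In the Connector Builder the same pieces would map to the URL path, the authenticator, and a request parameter rather than to hand-written code.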

Thank you, digging through your comment now.

Just tried it and it is very similar to the call I am making already… It still has the nested pages array in the pages objects.

Looking at their docs, I think those nested pages may be revisions. But at least with this endpoint you know you’re already getting all the pages, so you could just drop that key. Or try to parse it to get just the list of IDs if you think this is actually a hierarchy and not revisions (although maybe path would get you that?).
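As a rough illustration of those two options (drop the nested key, or keep only the child IDs), here is a small sketch in plain TypeScript, outside the Connector Builder. The `PageNode` shape is an assumption based on what the thread describes: an `id`, a `path`, and a nested `pages` array. The function names are made up for the example.

```typescript
// Assumed shape of one entry in the response's `pages` array.
interface PageNode {
  id: string;
  path?: string;
  pages?: PageNode[];
}

// Option 1: return a copy of the page without its nested `pages` array.
function dropNestedPages(page: PageNode): PageNode {
  const copy: PageNode = { ...page };
  delete copy.pages;
  return copy;
}

// Option 2: keep only the ids of the direct children, if there are any.
function directChildIds(page: PageNode): string[] {
  return (page.pages ?? []).map((child) => child.id);
}

// Made-up sample data for illustration.
const samplePage: PageNode = {
  id: "a",
  path: "intro",
  pages: [{ id: "b", path: "intro/setup" }],
};

const flatPage = dropNestedPages(samplePage);
const childIds = directChildIds(samplePage);
```

Neither function mutates its input, so both views can be derived from the same response.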

You could also extract the same endpoint twice; once for the pages, then again for the hierarchy using the Record Selector to pull just the nested page values for each page (maybe calling this as a child stream so you can use a transformation to add the parent’s ID as a field)

First off, I am super appreciative of your thoughts.

I believe the “sub pages” are not revisions; they are nested pages in the docs. I have access to the GitBook engineering folks, so I will check with them to verify.

This is an example of the output I am seeing. I did not think of dropping the key(s) and using the path. I think that is a great idea and will try that next.

Well, looking at this again, I actually need all the nested id values, not the path.

Example response from your first suggestion above. Do I need to get the id from every page? I guess that would be something like *,id?

OK, I am stuck on which technique to use to extract every id from the JSON graph.

Technically, since you only want that one field, you could just use the default extractor and only map the ID field in the schema (unless their API gives you an option to limit the fields returned, which would just save bytes on the wire).

I want all the “id” from all of the JSON. Is that the same answer as you gave ^

If you look at the screenshot, there can be many id values in the one response (nested at many levels).
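To make the goal concrete, here is a sketch of what "every id at every level" means, written as a recursive walk in plain TypeScript. It only illustrates the target output and is not a claim about what the Connector Builder's record selector supports. The `PageNode` shape and the sample tree are assumptions.

```typescript
// Assumed shape of one entry in the response's `pages` array.
interface PageNode {
  id: string;
  path?: string;
  pages?: PageNode[];
}

// Depth-first walk: each page's id, followed by the ids of its descendants.
function collectAllIds(pages: PageNode[]): string[] {
  const ids: string[] = [];
  for (const page of pages) {
    ids.push(page.id);
    ids.push(...collectAllIds(page.pages ?? []));
  }
  return ids;
}

// Made-up tree: a contains b, b contains c, and d is a second top-level page.
const sampleTree: PageNode[] = [
  { id: "a", pages: [{ id: "b", pages: [{ id: "c" }] }] },
  { id: "d" },
];

const allIds = collectAllIds(sampleTree); // ["a", "b", "c", "d"]
```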

(sorry if I am asking dumb questions. This is my first “real” connector)

I think this is correct

That seems to yield nested id

But it does not seem to yield the nested-nested pages, which makes sense I guess. So maybe I can somehow have this be a parent stream of itself to keep going for each level? (Does that seem like I am on the right track, or am I going in the wrong direction?)

Probably not the easiest one to learn on, but you’ll know it a lot better than most!

I know in transformations you can use ** to match across levels, but I’m not sure in an extractor. Might be worth trying though.

But you may not need to. Logic-ing through this:

  1. If content/pages returns all pages, this gives us all our pages.
  2. Then a child stream could return the first level of page ids under each page, and associate the children with the parents.
  3. In the child stream, you can use a transformation to add a field like parent_id based on the current partition from the parent stream, and then only pull the id in the schema.
  4. For clarity, you could also use a transformation to add the child id as a new field named child_id and then remove the unprefixed one, so you effectively get a list of parent_id and child_id pairs to build any hierarchy you need.
  5. So now you have a stream for pages and a stream for relationships.

That feels sane to me, probably more sane than having any source with nesting. You just need to make sure that the child stream is non-incremental (since you want to know the current hierarchy any time a record is synced).
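The end result of the steps above can be sketched in plain TypeScript: one flat list of page records and one list of parent_id/child_id pairs. This is only a model of what the two streams would contain, not Connector Builder configuration. The `PageNode` shape, the record types, and the name `buildStreams` are assumptions for illustration, and unlike the child stream described above, this sketch walks every level of nesting in one pass.

```typescript
// Assumed shape of one entry in the response's `pages` array.
interface PageNode {
  id: string;
  path?: string;
  pages?: PageNode[];
}

// One record per page, with the nested `pages` key left out.
interface PageRecord {
  id: string;
  path: string | null;
}

// One record per direct parent/child link.
interface RelationshipRecord {
  parent_id: string;
  child_id: string;
}

function buildStreams(roots: PageNode[]): {
  pages: PageRecord[];
  relationships: RelationshipRecord[];
} {
  const pages: PageRecord[] = [];
  const relationships: RelationshipRecord[] = [];

  const visit = (page: PageNode): void => {
    pages.push({ id: page.id, path: page.path ?? null });
    for (const child of page.pages ?? []) {
      // Record the direct link, then descend into the child.
      relationships.push({ parent_id: page.id, child_id: child.id });
      visit(child);
    }
  };

  for (const root of roots) {
    visit(root);
  }
  return { pages, relationships };
}

// Made-up tree: a contains b, b contains c, and d is a second top-level page.
const hierarchySample: PageNode[] = [
  { id: "a", path: "a", pages: [{ id: "b", path: "a/b", pages: [{ id: "c", path: "a/b/c" }] }] },
  { id: "d", path: "d" },
];

const streams = buildStreams(hierarchySample);
```

With the sample tree, `streams.pages` holds four records (a, b, c, d) and `streams.relationships` holds two pairs (a to b, b to c). Top-level pages appear in no pair as a `child_id`, which is how a consumer can tell they are roots.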

(If you haven’t seen the parent stream config yet, you can take a look at this page in the docs to wrap your head around that and the broader concept of partitioning: https://docs.airbyte.com/connector-development/connector-builder-ui/partitioning)

That’s where you’ll get a value like {{ stream_partition.parent_id }} (at least if you call the “Current Parent Key Value Identifier” parent_id—that’s a name you choose)

When I talk about changing the output schema: you can run it once with autodetect on, then toggle it off and edit the schema so that it only includes the fields you specify and isn’t redundant to the parent. (And again, some APIs have a param you can pass to only return certain fields, which may help speed up the requests too.)