Currently one has to run `docker-compose up`, then go to the UI and configure a source/destination to do a sync.
Is there a convenient place where I can find a docker-free (“native”) series of commands to do the same? In other words, instructions to set up a local Python env, install dependencies, and then either a sequence of terminal commands or Python invocations to accomplish the same thing. I suppose I could look at the Dockerfile… but I suspect it may not be as simple as that.
Thanks
Hey @arbitrer,
If you are specifically interested in building a local environment to work on the normalization, you can check this README. It explains how to set up the Python virtualenv and run the tests. Feel free to share a bit more about what you are trying to achieve so that I can give more practical examples.
Thanks @alafanechere, yes my main interest is in obtaining the dbt normalization models for a source that I may not necessarily have credentials for. And yes, I was actually able to run the tests under `base-normalization`, and even was able to manually create a `catalog.json` for Marketo (for which I don’t have credentials), and added a script to splice in the specific schemas of each type (e.g. `campaign`, `lead`) into the `streams` list in the catalog (which has empty `json_schema: {}` stubs).
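For reference, the splice script is roughly along these lines (a simplified sketch: the schema directory layout, stream matching by name, and the example paths are illustrative, not how every connector is laid out):

```python
import json
from pathlib import Path

# Rough sketch of the splice script: fill each stream's empty `json_schema: {}`
# stub in catalog.json from a per-stream schema file. Assumes schemas sit in a
# single directory named after the stream (e.g. schemas/campaign.json), which,
# as noted below, is not how every connector is organized.
def splice_schemas(catalog_path: str, schemas_dir: str) -> None:
    catalog = json.loads(Path(catalog_path).read_text())
    for stream in catalog["streams"]:
        schema_file = Path(schemas_dir) / f"{stream['name']}.json"
        if schema_file.exists():
            stream["json_schema"] = json.loads(schema_file.read_text())
    Path(catalog_path).write_text(json.dumps(catalog, indent=2))

# e.g. splice_schemas("catalog.json", "source_marketo/schemas")  # paths illustrative
```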
I tried to generalize my script to other sources, but saw that they were all organized a bit differently (e.g. some have `schemas/blah.json` and others have `schemas.json`, etc.). In any case, I thought it would be good to know the overall end-to-end docker-free sequence of commands to go from source to destination and final normalization. If I know that, then I can peek into the stage where a `catalog.json` is created and hack together a script that directly produces this catalog and simply produces the normalization models without having to depend on any source credentials.
The nice thing about having the normalization models for sources X, Y, Z is that I can write downstream dbt models to transform these into a final unified table-structure on which my analytics tool can run. Then I can go to customers that use X, Y, Z and say we are ready to do analytics for them. I am aware that one needs actual data from these sources to get the fully normalized tables, but I think that is something that can be done once a customer actually connects their X, Y, Z tools.
Hi @arbitrer,
Unfortunately there’s no single end-to-end Python process for this use case. Airbyte’s protocol leverages Docker containerization and is not language-specific, which is why we also have Java source connectors.
The standard way of obtaining a catalog is running the `discover` command on a source connector, e.g.:
docker run airbyte/my-source:dev discover --config path_to_secret.json
As you can see from the command I shared above, the `discover` command requires secrets, hence access to data. This is because some connectors have dynamically discovered schemas (and not hard-coded JSON schemas).
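If you do have credentials, a rough sketch of capturing the discovered catalog looks like this. The `discover` output is a stream of JSON messages, and the one with `type: "CATALOG"` carries the catalog; the image name, mount path, and config file name below are illustrative, not a supported API:

```python
import json
import os
import subprocess

# Sketch: run a source connector's `discover` command and keep the catalog from
# the JSON messages it prints on stdout. Image name and paths are illustrative.
def discover_catalog(image: str, config_path: str = "secrets/config.json") -> dict:
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/data",  # make the local config visible inside the container
            image, "discover", "--config", f"/data/{config_path}",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    for line in result.stdout.splitlines():
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # connectors may also print plain log lines
        if message.get("type") == "CATALOG":
            return message["catalog"]
    raise RuntimeError("no CATALOG message found in discover output")

# e.g. json.dump(discover_catalog("airbyte/source-marketo:dev"), open("catalog.json", "w"), indent=2)
```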
TL;DR: the standard way of getting a source catalog is running the `discover` command on a source connector. To get it you need access to the source, because some connectors dynamically generate their schemas. This is why normalization models are generated at sync time and not before the sync run.