With Github source and Postgres destination, each sync misses some data

  • Is this your first time deploying Airbyte: No
  • OS Version / Instance: Mac OS
  • Memory / Disk: 8Gb / 256 SSD
  • Deployment: Docker
  • Airbyte Version: 0.35.59-alpha
  • Source name/version: Github
  • Destination name/version: Postgres
  • Step: Setting new connection, source / On sync
  • Description: With Github source and Postgres destination, each sync misses some data.

Specifically, for a given set of repos, the _airbyte_raw_issues table contains, for example, 1041 rows after the first sync. Yet if I rerun the experiment from scratch, the table might contain 1030 rows, so the result is not reproducible. I ran many experiments today, and Airbyte seems to miss at least a few records in the raw issues table on each sync.

On subsequent syncs, it’ll sometimes fetch a few more records (e.g. 1045) but still not all of them in one sync.

  • I’m using Airbyte API (with dagster).

  • In Github Configuration, my list of repos is a long string with > 20 repos e.g. “repo1 repo2 …”

  • This discrepancy doesn’t seem to occur in other streams/tables (e.g. comments)
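To pin down which issues go missing between two runs (rather than only comparing row counts), something like the following could help. It is a minimal sketch, assuming the rows of _airbyte_raw_issues have been exported as dicts with an `_airbyte_data` field holding the record as JSON, and that each issue record carries an `id` field; the helper names `issue_ids` and `diff_runs` are hypothetical.

```python
import json


def issue_ids(raw_rows):
    """Collect the set of issue ids found in exported raw rows.

    Assumption: each row is a dict whose `_airbyte_data` value is either a
    JSON string or an already-parsed dict containing an `id` field.
    """
    ids = set()
    for row in raw_rows:
        data = row["_airbyte_data"]
        if isinstance(data, str):
            data = json.loads(data)
        ids.add(data["id"])
    return ids


def diff_runs(run_a, run_b):
    """Report which issue ids appear in only one of two sync runs."""
    a, b = issue_ids(run_a), issue_ids(run_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "common": len(a & b),
    }
```

Running `diff_runs` on the exports of two from-scratch syncs would list the ids that one run fetched and the other did not, which is easier to cross-check against the sync logs than a bare count.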

Would you mind uploading two sync logs so that we can read them and see what happened?

Thanks for your response, @marcosmarxm. Yes, absolutely, PFA the five log files.

conn_1_logs_large_org.txt (171.5 KB)
conn_2_logs_medium_org.txt (153.0 KB)
conn_3_logs_small_org.txt (138.3 KB)
conn_4_logs_small_org.txt (152.9 KB)
dagster_debug_file.txt (467.5 KB)

Some Context

  • I’m trying to extract GitHub data from four different GitHub organisations. I need to create four different Airbyte connections because each org has its own access token; hence, four Airbyte log files are attached (one for each org/connection).
  • The fifth log file is from dagster which invokes these airbyte connections.
  • All four connections are writing to the same Postgres DB in the same tables.

Let me know if I can provide anything else. Thank you.

P.S. I just reran a sync from scratch, and some streams missed some records, i.e. fewer records were fetched than yesterday. Some streams fetched all records. If I run enough syncs, most missing records will eventually be fetched.

Thanks again!

I also noticed a discrepancy with basic normalisation (default).

  • With append_dedup sync mode, the Issues stream (final table) sometimes misses records, i.e. _airbyte_raw_issues contains more unique records than the normalised Issues table. This discrepancy is observed only for the issues table; all other streams work as expected.
  • With append sync mode, for normalisation, all streams (including Issues) work as expected i.e. the same number of unique records in raw and normalised table.

This can happen because the raw table also stores every updated version of a record, whereas the final table should keep only the unique values. Do you have some numbers for the discrepancy?
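To illustrate the point about raw versus final tables, here is a rough sketch of deduplication by primary key. It is not Airbyte's normalisation code, only an illustration under assumptions: records are plain dicts, `id` is the primary key, and `updated_at` is the cursor, stored as ISO-8601 strings in one consistent format so that string comparison matches time order. The names `dedupe_latest` and `compare_counts` are hypothetical.

```python
def dedupe_latest(records, key="id", cursor="updated_at"):
    """Keep only the most recently updated version of each record."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        # On a cursor tie, the later row in the input wins.
        if current is None or rec[cursor] >= current[cursor]:
            latest[rec[key]] = rec
    return list(latest.values())


def compare_counts(raw_records, final_records, key="id"):
    """Compare unique keys in the raw data with the rows of the final table."""
    raw_ids = {r[key] for r in raw_records}
    final_ids = {r[key] for r in final_records}
    return {
        "raw_rows": len(raw_records),
        "raw_unique": len(raw_ids),
        "final_rows": len(final_records),
        "missing_from_final": sorted(raw_ids - final_ids),
    }
```

With this framing, `raw_rows` being larger than `final_rows` is expected whenever an issue was updated between syncs. What would point to a real problem is a non-empty `missing_from_final`, meaning a unique id exists in the raw data but has no row in the final table.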

Sorry for the long delay in replying, @HAMZA310. Tomorrow I’ll spend more time investigating your issue.