Partial Data Fetching From Source To Destination

  • Is this your first time deploying Airbyte?: Yes
  • OS Version / Instance: Ubuntu
  • Memory / Disk: 8 GB / 100 GB
  • Deployment: Docker
  • Airbyte Version: 0.40.15
  • Source name/version: MongoDB
  • Destination name/version: BigQuery
  • Step: The issue is happening during sync, creating the connection or a new source?

Description:
This is my first topic here, so apologies for any mistakes. I am trying to introduce Airbyte to my company. For the POC, I need to show that Airbyte can fetch partial data from the source rather than pulling everything from the source DB on the first run. I am using incremental append. Can anyone suggest how to do that?

For instance, assume I have a 200 GB production table, but I do not want to fetch all of it for the POC; I would like to start with the most recent month of data and continue incrementally from there. How can I achieve that?

What I tried:
I tried using the Airbyte API to set the connection state to a specific date and then ran a sync. The state was set correctly, but when I triggered the sync from the UI, it started from scratch and fetched all the data.
After the initial full fetch, Airbyte behaves perfectly in incremental-append mode, but for my POC I want to skip that initial full fetch.

Request payload sent to the endpoint: http://0.0.0.0:8000/api/v1/state/create_or_update

{
  "connectionId": CONNECTION_ID,
  "connectionState": {
    "stateType": "legacy",
    "connectionId": CONNECTION_ID,
    "state": {
      "cdc": false,
      "streams": [
        {
          "cursor": "2022-11-27T08:51:45.914Z",
          "stream_name": "partial_some_table",
          "cursor_field": ["updatedAt"],
          "stream_namespace": "some_stream"
        }
      ]
    },
    "streamState": [],
    "globalState": {}
  }
}
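For reference, this is roughly the script I used to send the payload. It is a minimal sketch, assuming Airbyte's Config API is reachable at localhost:8000; `CONNECTION_ID`, the cursor value, and the stream names are placeholders for my actual setup.

```python
# Sketch: set a connection's legacy state via the Airbyte Config API
# before the first sync. All IDs and stream names are placeholders.
import requests

AIRBYTE_STATE_URL = "http://0.0.0.0:8000/api/v1/state/create_or_update"


def build_state_payload(connection_id, cursor, stream_name,
                        cursor_field, namespace):
    """Build the legacy-state payload for the create_or_update endpoint."""
    return {
        "connectionId": connection_id,
        "connectionState": {
            "stateType": "legacy",
            "connectionId": connection_id,
            "state": {
                "cdc": False,
                "streams": [
                    {
                        "cursor": cursor,
                        "stream_name": stream_name,
                        "cursor_field": cursor_field,
                        "stream_namespace": namespace,
                    }
                ],
            },
            "streamState": [],
            "globalState": {},
        },
    }


def set_connection_state(connection_id, cursor, stream_name,
                         cursor_field, namespace):
    """POST the state to Airbyte and return the parsed JSON response."""
    payload = build_state_payload(connection_id, cursor, stream_name,
                                  cursor_field, namespace)
    resp = requests.post(AIRBYTE_STATE_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()
```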

Response:

{
  "stateType": "legacy",
  "connectionId": CONNECTION_ID,
  "state": {
    "cdc": false,
    "streams": [
      {
        "cursor": "2022-11-27T08:51:45.914Z",
        "stream_name": "partial_some_table",
        "cursor_field": ["updatedAt"],
        "stream_namespace": "some_stream"
      }
    ]
  }
}

Thanks in advance for any suggestions. :slight_smile:

Hey @RakibulHoque, so currently this isn’t possible - the first sync is always a ‘full refresh’ sync that gets all the data. A workaround I would suggest is creating views with the data you want and syncing those!
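To make the view workaround concrete for a MongoDB source: a read-only view can expose only the last month of a collection, and the Airbyte source then syncs the view instead of the full collection. This is a hypothetical sketch using pymongo; the database name, collection name, view name, and `updatedAt` field are placeholders.

```python
# Sketch of the suggested workaround: a MongoDB view restricted to
# recent documents, which Airbyte can then sync instead of the full
# 200 GB collection. Names below are placeholders.
from datetime import datetime, timedelta, timezone


def recent_month_pipeline(cursor_field="updatedAt", days=30):
    """Aggregation pipeline keeping only the last `days` of data."""
    since = datetime.now(timezone.utc) - timedelta(days=days)
    return [{"$match": {cursor_field: {"$gte": since}}}]


def create_recent_view(db, source_coll="some_table",
                       view_name="partial_some_table"):
    # `db` is a pymongo Database handle, e.g. MongoClient(...)["mydb"].
    # MongoDB views are read-only and computed on the fly, so no data
    # is duplicated; Airbyte only sees what the view exposes.
    db.command("create", view_name,
               viewOn=source_coll,
               pipeline=recent_month_pipeline())
```

Once the initial sync of the view has completed, the incremental-append cursor continues from there, which matches the "recent 1 month of data and continue on" goal.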

Thank you so much for providing a workaround @natalyjazzviolin. I shall try it this way.