Issue with long JSON field and mysql source

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: EC2
  • Memory / Disk: t2.large
  • Deployment: Docker
  • Airbyte Version: 39.21
  • Source name/version: mysql 0.5.13
  • Destination name/version: redshift 0.3.39
  • Step: During sync

I’m banging my head against the wall on this one. I have a field in a source MySQL database that holds a very long JSON string; the column uses the JSON data type. When I try to bring the data into Redshift, records longer than ~65K characters get dropped because Airbyte tries to format the field as a string in the destination. It should instead be formatted as a SUPER datatype (which I think would solve this issue). The part I’m not understanding is that this data never even makes it into the _raw table … it gets dropped on import, before normalization.
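For context, here is a minimal sketch of the kind of source data involved. The table, column, and connection details are all made up for illustration; the point is just a JSON column whose serialized value is well over 65K characters.

// Hypothetical reproduction: a MySQL table with a JSON column holding a document
// whose serialized form is well over 65,535 characters.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class LongJsonRepro {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/source_db", "user", "password")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS events (id INT PRIMARY KEY, payload JSON)");
            }

            // Build a JSON array of ~10,000 small objects (~100K characters in total).
            StringBuilder json = new StringBuilder("[");
            for (int i = 0; i < 10_000; i++) {
                if (i > 0) json.append(',');
                json.append("{\"k\":").append(i).append("}");
            }
            json.append(']');

            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                ps.setString(2, json.toString());
                ps.executeUpdate();
            }
        }
    }
}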

Would love any assistance to troubleshoot this…

Quick follow-up question … is there a way to coerce the datatype by modifying the destination_catalog.json file? The field in question is currently typed as “string” when it should in fact be typed as “array.”
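To make the question concrete, this is roughly the kind of edit I have in mind. The field name and file path are placeholders, and I’m assuming the catalog keeps each stream’s schema under stream.json_schema.properties:

// Hypothetical sketch: rewrite the declared type of one field in a local copy of
// destination_catalog.json using Jackson. Field name and path are placeholders.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.File;

public class PatchCatalogType {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        File catalogFile = new File("destination_catalog.json"); // local copy
        JsonNode catalog = mapper.readTree(catalogFile);

        for (JsonNode stream : catalog.get("streams")) {
            // Assumed layout: streams[].stream.json_schema.properties.<field>
            JsonNode field = stream.at("/stream/json_schema/properties/payload");
            if (field.isObject()) {
                ((ObjectNode) field).put("type", "array"); // was "string"
            }
        }
        mapper.writerWithDefaultPrettyPrinter().writeValue(catalogFile, catalog);
    }
}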

ETA: On deeper inspection of the MySQL source connector, fields with the JSON datatype are tied to the STRING type in Airbyte. It seems this should be an object, no (per the test entry below)?

addDataTypeTestData(
    TestDataHolder.builder()
        .sourceType("json")
        .airbyteType(JsonSchemaType.STRING)
        .addInsertValues("null", "'{\"a\": 10, \"b\": 15}'")
        .addExpectedValues(null, "{\"a\": 10, \"b\": 15}")
        .build());
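For comparison, here is roughly what I imagine that entry could look like if json were tied to an object schema instead. This is only a sketch on my part, assuming the protocol library exposes a JsonSchemaType.OBJECT constant; it is not actual connector code.

// Hypothetical alternative, not actual connector code: map MySQL's json type to
// an object schema rather than a string (assumes JsonSchemaType.OBJECT exists).
addDataTypeTestData(
    TestDataHolder.builder()
        .sourceType("json")
        .airbyteType(JsonSchemaType.OBJECT)
        .addInsertValues("null", "'{\"a\": 10, \"b\": 15}'")
        .addExpectedValues(null, "{\"a\": 10, \"b\": 15}")
        .build());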

Hey @dra,
Thank you for reporting this.
From my understanding you raised two problems:

  1. A MySQL column typed as JSON Data Type is serialized to a string in the Airbyte raw table.

  2. Strings > 65k characters are dropped by the Destination Redshift connector.

To address each point:

  1. JSON Data Type serialized to string:
    We are actively working on improving our database connectors (check out our roadmap). Better coverage of all the RDBMS native types is among the planned improvements. I will open a GitHub issue to make sure we work on this specific type. It would help if you could share your MySQL table schema and your Airbyte catalog (instructions on how to extract it are here).

  2. Strings > 65K characters dropped:
    This is expected behavior. According to our source code, the connector ignores string fields longer than 65535 characters (see the sketch below).
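Not the actual connector code, just a plain-Java illustration of the kind of guard described in point 2: string values whose UTF-8 size exceeds Redshift’s 65535-byte VARCHAR limit are skipped rather than inserted (the class and method names here are mine, not the connector’s).

import java.nio.charset.StandardCharsets;

public class VarcharLimitCheck {
    // Redshift's maximum VARCHAR length, in bytes.
    private static final int REDSHIFT_VARCHAR_MAX_BYTES = 65535;

    /** Returns true when the serialized value is too large for a Redshift VARCHAR column. */
    public static boolean exceedsVarcharLimit(String serializedValue) {
        return serializedValue.getBytes(StandardCharsets.UTF_8).length > REDSHIFT_VARCHAR_MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(exceedsVarcharLimit("{\"a\": 10}"));      // false -> value kept
        System.out.println(exceedsVarcharLimit("x".repeat(70_000))); // true  -> value dropped
    }
}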

Hi @alafanechere – thanks for the reply. A couple of responses to the points you raised:

  1. Correct – I would think the JSON Data Type from the MySQL source connector should be serialized as an object, not a string.

  2. Also correct re: records being dropped. Upon deeper reading of the docs last night, I think this is expected behavior because the SUPER data type does not exist as an option for the Redshift destination connector.

Adding object serialization to the MySQL source connector for JSON data would certainly be an improvement, but I don’t think it would actually solve the issue I’m facing. I think to solve my issue, the Redshift destination connector would need to add the SUPER datatype (which can ingest JSON data that is longer than 65K characters).

Let me know if that logic is making sense … it’s very possible that I’m missing something.

Hey @davidfromtandym,

  1. I will open an issue for this. Could you share your MySQL table schema and your Airbyte catalog (instructions on how to extract them are here)?
  2. I dug a bit into the connector code and realized that the _airbyte_data column of your raw tables should already be set to the SUPER type according to this code. Can you confirm your _airbyte_data columns are SUPER (see the sketch below)? Could you also please upgrade your connector to the latest version? We’re now on 0.3.46 for destination-redshift. Even if the column is SUPER, the 65K-byte limit still applies to an individual VARCHAR field of a SUPER object, according to this doc. This is why we drop string fields larger than 65K bytes before inserting into the destination.
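If it helps, here is a rough way to check the raw table column types from JDBC. The connection string, schema, and table names are placeholders, and the exact type name reported can differ between system views.

// Hypothetical check: list the column types of an Airbyte raw table in Redshift.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CheckRawColumnType {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:redshift://my-cluster.example.com:5439/dev", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT column_name, data_type FROM svv_columns "
                 + "WHERE table_schema = ? AND table_name = ?")) {
            ps.setString(1, "airbyte");             // placeholder schema
            ps.setString(2, "_airbyte_raw_events"); // placeholder raw table
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // _airbyte_data is expected to report a SUPER type on recent versions.
                    System.out.println(rs.getString("column_name") + " -> " + rs.getString("data_type"));
                }
            }
        }
    }
}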

Adding object serialization to the MySQL source connector for JSON data would certainly be an improvement, but I don’t think it would actually solve the issue I’m facing

I think this will solve your problem, as your field value would no longer be treated as a VARCHAR but as a “structured” object. The limit would then be 1 MB for your whole record, rather than 65K bytes for the individual field value; see the sketch below.
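To illustrate the difference, here is a rough sketch of loading a large JSON document straight into a SUPER column via JSON_PARSE (table name and connection details are placeholders). The whole value is then bounded by the roughly 1 MB record limit rather than by the 65535-byte VARCHAR limit on a single string field.

// Hypothetical sketch: insert a >65K-character JSON document into a SUPER column.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class SuperInsertSketch {
    public static void main(String[] args) throws Exception {
        // ~90K characters of valid JSON: larger than a VARCHAR allows, well under 1 MB.
        String largeJson = "[" + "{\"k\":1},".repeat(9_999) + "{\"k\":1}]";

        try (Connection conn = DriverManager.getConnection(
                "jdbc:redshift://my-cluster.example.com:5439/dev", "user", "password")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS events_super (id INT, payload SUPER)");
            }
            // JSON_PARSE converts the string into a SUPER value (used with INSERT ... SELECT).
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events_super SELECT 1, JSON_PARSE(?)")) {
                ps.setString(1, largeJson);
                ps.executeUpdate();
            }
        }
    }
}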