Issue with long JSON field and mysql source

  • Is this your first time deploying Airbyte?: No
  • OS Version / Instance: EC2
  • Memory / Disk: t2.large
  • Deployment: Docker
  • Airbyte Version: 0.39.21
  • Source name/version: mysql 0.5.13
  • Destination name/version: redshift 0.3.39
  • Step: During sync

I’m banging my head against the wall on this one. I have a field in a source MySQL database that holds a very long JSON string; the column is a JSON data type. When I try to bring the data into Redshift, records longer than ~65K characters are getting dropped because Airbyte maps the field to a string type in the destination. It should instead map to a SUPER data type (which I think would solve this issue). The part I’m not understanding is that this data never even makes it into the _raw table … it gets dropped on import, before normalization.
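A quick way to see which rows would hit that threshold is to measure the serialized JSON length on the source side. The sketch below is only illustrative: the connection details are placeholders, and the table/column names (api_data, apiDataID, jsonResponse) come from the catalog shared later in this thread; it measures bytes because Redshift’s VARCHAR limit is 65535 bytes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FindOversizedJsonRows {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; replace with your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/production_tandym_db", "user", "password");
             Statement stmt = conn.createStatement();
             // MySQL LENGTH() returns bytes, which matches Redshift's 65535-byte VARCHAR limit.
             ResultSet rs = stmt.executeQuery(
                 "SELECT apiDataID, LENGTH(CAST(jsonResponse AS CHAR)) AS json_bytes "
                     + "FROM api_data "
                     + "WHERE LENGTH(CAST(jsonResponse AS CHAR)) > 65535")) {
            while (rs.next()) {
                System.out.printf("%s: %d bytes%n", rs.getString("apiDataID"), rs.getLong("json_bytes"));
            }
        }
    }
}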

Would love any assistance to troubleshoot this…

Quick follow-up question … is there a way to coerce the datatype by modifying the destination_catalog.json file? The field in question is currently typed as “string” when it should in fact be typed as “array.”

ETA: On deeper inspection of the MySQL source connector, fields with the JSON data type are mapped to the STRING type in Airbyte. It seems this should be object, no (per this)?

addDataTypeTestData(
    TestDataHolder.builder()
        .sourceType("json")
        .airbyteType(JsonSchemaType.STRING)
        .addInsertValues("null", "'{\"a\": 10, \"b\": 15}'")
        .addExpectedValues(null, "{\"a\": 10, \"b\": 15}")
        .build());
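For comparison, a hypothetical version of that test case, if the connector mapped the MySQL json type to an object instead of a string, might look like the following. It assumes JsonSchemaType exposes an OBJECT constant (or equivalent), and the expected values would likely need adjusting to parsed JSON rather than raw strings:

addDataTypeTestData(
    TestDataHolder.builder()
        .sourceType("json")
        // Hypothetical: map MySQL JSON columns to an object type instead of STRING.
        .airbyteType(JsonSchemaType.OBJECT)
        .addInsertValues("null", "'{\"a\": 10, \"b\": 15}'")
        .addExpectedValues(null, "{\"a\": 10, \"b\": 15}")
        .build());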

Hey @dra,
Thank you for reporting this.
From my understanding you raised two problems:

  1. A MySQL column typed as JSON Data Type is serialized to a string in the Airbyte raw table.

  2. Strings > 65k characters are dropped by the Destination Redshift connector.

  1. A MySQL column typed as JSON Data Type is serialized to a string:
    We are actively working on improving our database connectors (check out our roadmap). Better coverage of all the RDBMS native types is among these planned improvements. I will open a GitHub issue to make sure we work on this specific type. It would help if you could share your MySQL table schema and your Airbyte catalog (instructions on how to extract it here).

  2. Strings > 65K characters are dropped:
    This is expected behavior. According to our source code, the connector ignores strings longer than 65535 characters (see the sketch below for an illustration of this kind of check).
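For illustration only, the kind of size guard described in point 2 could look roughly like the sketch below; this is not the actual destination-redshift code.

import java.nio.charset.StandardCharsets;

class RedshiftSizeGuardSketch {
    private static final int REDSHIFT_VARCHAR_MAX_BYTES = 65535;

    // Returns true if the value fits in a Redshift VARCHAR; the limit is in bytes,
    // so the UTF-8 encoding is measured rather than the character count.
    static boolean fitsInRedshiftVarchar(String value) {
        return value == null
            || value.getBytes(StandardCharsets.UTF_8).length <= REDSHIFT_VARCHAR_MAX_BYTES;
    }
}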

Hi @alafanechere – thanks for the reply. A couple of responses to the points you raised:

  1. Correct – I would think the JSON Data Type from the MySQL source connector should be serialized as an object, not a string.

  2. Also correct re: records being dropped. Upon deeper reading of the docs last night, I think this is expected behavior because the SUPER data type does not exist as an option for the Redshift destination connector.

Adding object serialization to the MySQL source connector for JSON data would certainly be an improvement, but I don’t think it would actually solve the issue I’m facing. I think to solve my issue, the Redshift destination connector would need to add the SUPER datatype (which can ingest JSON data that is longer than 65K characters).

Let me know if that logic is making sense … it’s very possible that I’m missing something.

Hey @davidfromtandym,

  1. I will open an issue for this. Could you share your MySQL table schema and your Airbyte catalog (instructions on how to extract them here)?
  2. I dug a bit into the connector code and realized that the _airbyte_data column of your raw tables should already be set to the SUPER type according to this code. Can you confirm your _airbyte_data columns are SUPER? Could you also please upgrade your connector to the latest version? We’re now on 0.3.46 for destination-redshift. Even if the column is SUPER, the 65K-byte limit still applies to an individual VARCHAR field of a SUPER object, according to this doc. This is why we drop string fields larger than 65K bytes before inserting into the destination.

Adding object serialization to the MySQL source connector for JSON data would certainly be an improvement, but I don’t think it would actually solve the issue I’m facing

I think this will solve your problem, as your field value will not be treated as a VARCHAR but as a “structured” object. The limit will then be 1MB for your whole record size, rather than 65K bytes for your field value.
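To illustrate the distinction, here is a minimal sketch against Redshift with placeholder connection details. JSON_PARSE stores the value as a parsed SUPER structure, which is bound only by the overall SUPER record size limit, whereas a plain string cast to SUPER remains a VARCHAR scalar subject to the 65535-byte limit.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SuperVsVarcharSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder Redshift connection details; replace with your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:redshift://example-cluster:5439/dev", "user", "password");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE TEMP TABLE demo (id INT, payload SUPER)");

            // Parsed structure inside SUPER: only the overall SUPER record size limit applies.
            stmt.execute("INSERT INTO demo SELECT 1, JSON_PARSE('{\"a\": 10, \"b\": 15}')");

            // String scalar inside SUPER: treated as a VARCHAR value and therefore
            // subject to the 65535-byte per-value limit.
            stmt.execute("INSERT INTO demo SELECT 2, '{\"a\": 10, \"b\": 15}'::SUPER");
        }
    }
}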

Hi @alafanechere … I missed your response, apologies for the delay in addressing your question.

  1. Please see below for the source and destination catalog output. The jsonResponse field is stored as a JSON data type in the MySQL source, but it shows up in the source catalog as a string.
  2. I can confirm that the _airbyte_data column in the raw tables are SUPER.

I know there is an open issue for the solution you proposed in the source-mysql connector. Hopefully that can be addressed soon.

{
    "streams": [
        {
            "stream": {
                "name": "api_data",
                "json_schema": {
                    "type": "object",
                    "properties": {
                        "path": {
                            "type": "string"
                        },
                        "method": {
                            "type": "string"
                        },
                        "userID": {
                            "type": "string"
                        },
                        "provider": {
                            "type": "string"
                        },
                        "apiDataID": {
                            "type": "string"
                        },
                        "createdAt": {
                            "type": "string"
                        },
                        "updatedAt": {
                            "type": "string"
                        },
                        "apiVersion": {
                            "type": "string"
                        },
                        "jsonResponse": {
                            "type": "string"
                        },
                        "requestPayload": {
                            "type": "string"
                        },
                        "serializedPath": {
                            "type": "string"
                        }
                    }
                },
                "supported_sync_modes": [
                    "full_refresh",
                    "incremental"
                ],
                "default_cursor_field": [],
                "source_defined_primary_key": [
                    [
                        "apiDataID"
                    ]
                ],
                "namespace": "production_tandym_db"
            },
            "sync_mode": "full_refresh",
            "cursor_field": [],
            "destination_sync_mode": "overwrite",
            "primary_key": [
                [
                    "apiDataID"
                ]
            ]
        }
    ]
}
{
    "streams": [
        {
            "stream": {
                "name": "tdym_api_data",
                "json_schema": {
                    "type": "object",
                    "properties": {
                        "path": {
                            "type": "string"
                        },
                        "method": {
                            "type": "string"
                        },
                        "userID": {
                            "type": "string"
                        },
                        "provider": {
                            "type": "string"
                        },
                        "apiDataID": {
                            "type": "string"
                        },
                        "createdAt": {
                            "type": "string"
                        },
                        "updatedAt": {
                            "type": "string"
                        },
                        "apiVersion": {
                            "type": "string"
                        },
                        "jsonResponse": {
                            "type": "string"
                        },
                        "requestPayload": {
                            "type": "string"
                        },
                        "serializedPath": {
                            "type": "string"
                        }
                    }
                },
                "supported_sync_modes": [
                    "full_refresh",
                    "incremental"
                ],
                "default_cursor_field": [],
                "source_defined_primary_key": [
                    [
                        "apiDataID"
                    ]
                ],
                "namespace": "tdym_processing"
            },
            "sync_mode": "full_refresh",
            "cursor_field": [],
            "destination_sync_mode": "overwrite",
            "primary_key": [
                [
                    "apiDataID"
                ]
            ]
        }
    ]
}
