Optimizing data loading from AWS S3 to Teradata using custom connector in Java and Kotlin

Summary

The user is looking to optimize data loading from AWS S3 to Teradata using a custom connector in Java and Kotlin. They want to modify the batch size and implement additional tweaks to improve performance.


Question

Hello, I am coding my own connector in Java and Kotlin, fetching data from an S3 bucket (AWS) into Teradata. I have about 36 GB of data to load; each record contains an id, a JSON object, and a date (162M records in total). The load is taking too long: 7+ hours, and I want to optimize it. How can I modify the batch_size? The default value seems to be 25 MB.
Can you suggest additional tweaks to improve performance, please?

Info: I am using the com.teradata.jdbc:terajdbc4:17.20.00.12 driver.
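One common way to speed up bulk INSERTs with terajdbc4 is to enable JDBC FastLoad via the `TYPE=FASTLOAD` connection URL parameter and send rows in large PreparedStatement batches with autocommit off. A minimal sketch under those assumptions (the host, table name, column names, and `SESSIONS` value are illustrative, not from the thread):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class FastLoadSketch {
    // TYPE=FASTLOAD and SESSIONS are terajdbc4 connection URL parameters;
    // host and database names here are hypothetical.
    static String fastloadUrl(String host, String database) {
        return "jdbc:teradata://" + host + "/DATABASE=" + database
             + ",TYPE=FASTLOAD,SESSIONS=8";
    }

    // Batch-insert (id, json payload, date) rows, flushing every batchSize rows.
    static void load(Iterable<Object[]> rows, Connection conn, int batchSize) throws Exception {
        conn.setAutoCommit(false); // FastLoad batching requires autocommit off
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (id, payload, created_at) VALUES (?, ?, ?)")) {
            int n = 0;
            for (Object[] row : rows) {
                ps.setString(1, (String) row[0]);
                ps.setString(2, (String) row[1]);
                ps.setTimestamp(3, (Timestamp) row[2]);
                ps.addBatch();
                if (++n % batchSize == 0) {
                    ps.executeBatch(); // ship one batch to Teradata
                }
            }
            ps.executeBatch();         // flush the remainder
            conn.commit();
        }
    }
}
```

With 162M rows, raising the batch size well above the default (tens of thousands of rows per `executeBatch()`) usually cuts round trips dramatically; the right value depends on row width and driver memory.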



This topic has been created from a Slack thread to give it more visibility.

["optimize-performance", "aws-s3", "teradata", "java", "kotlin", "custom-connector", "batch-size", "terajdbc4"]

Just to be sure, are you using the source-s3 connector and writing your own Teradata connector?

I might be wrong, but I think the batch size is based on 10_000 records: https://github.com/airbytehq/airbyte/blob/ea68817cf96976d54ca041551b4c409683868eab/airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py#L17

Depending on how your data is organized, it might be possible to configure multiple connections, one per S3 prefix, so you could synchronize the data in parallel.
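The per-prefix idea above can be sketched with a fixed thread pool, one worker per S3 prefix, each worker owning its own JDBC connection. This is only an outline under that assumption; the prefix names are hypothetical and `loadPrefix` is a placeholder for the actual list-and-insert logic:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelPrefixLoad {
    // Tracks which prefixes were handled, so the sketch is observable;
    // a real worker would open its own Connection and batch-insert instead.
    static final Set<String> done = ConcurrentHashMap.newKeySet();

    static void loadPrefix(String prefix) {
        // placeholder for: list s3://bucket/<prefix>, stream rows, batch insert
        done.add(prefix);
    }

    static void loadAll(List<String> prefixes, int parallelism) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        for (String p : prefixes) {
            pool.submit(() -> loadPrefix(p));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Note that plain JDBC FastLoad sessions may not combine freely with many parallel writers into the same table, so it is worth testing whether parallelism pays off per-table or only across separate staging tables.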

I’m just asking because there is a destination-teradata connector: https://docs.airbyte.com/integrations/destinations/teradata