Add support for Parquet format in AWS Datalake destination

I have been using the S3 destination for some initial prototyping, and recently discovered the AWS Datalake connector, which is very helpful. My main question is: has there been any discussion of adding support for the Parquet file format to the AWS Datalake destination?

I like that the S3 destination supports the Parquet format because the data type and schema information are embedded with the data itself. The lack of that in the Datalake destination is somewhat mitigated by the schema information the Lake Formation plugin keeps in the Glue catalog, but that doesn’t solve every use case.
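To illustrate what I mean, here’s a rough sketch using pyarrow (not tied to any Airbyte connector; the columns are made up): a Parquet file carries its own schema, so a reader can recover column names and types without consulting an external catalog.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical records standing in for synced data.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "email": pa.array(["a@example.com", "b@example.com", "c@example.com"], type=pa.string()),
    "signed_up_at": pa.array([1700000000000, 1700000001000, 1700000002000], type=pa.timestamp("ms")),
})
pq.write_table(table, "users.parquet")

# Reading just the file metadata recovers the full schema -- no catalog needed.
print(pq.read_schema("users.parquet"))
# user_id: int64
# email: string
# signed_up_at: timestamp[ms]
```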

I also like that the AWS Datalake destination automatically handles registering and maintaining schema information in the Glue catalog, without having to rely on Glue crawlers or direct integration with the Glue table API.
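For comparison, this is roughly what registering a Parquet-backed table directly against the Glue table API looks like with boto3 (the database, table, columns, and S3 location here are all hypothetical). As I understand it, the Datalake destination takes care of this kind of bookkeeping on each sync, which is exactly the part I’d rather not script myself.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Register an external Parquet table pointing at an S3 prefix.
glue.create_table(
    DatabaseName="analytics_lake",  # hypothetical Glue database
    TableInput={
        "Name": "users",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "user_id", "Type": "bigint"},
                {"Name": "email", "Type": "string"},
            ],
            "Location": "s3://my-datalake-bucket/users/",  # hypothetical S3 path
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```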

Is there anything on the roadmap to merge or cross-pollinate functionality between these plugins? Are there any particular design/product questions that would need to be answered if I or my team were to take a stab at this work?

Thank you for the great product!

Hey @blarghmatey, thanks for the feature suggestion. I don’t believe the AWS Datalake connector is a particularly popular one, so I don’t think it’s a high priority for the team. That being said, we’re always open to new PRs and contributions for our connectors. Feel free to take a stab at it: submit a draft PR and someone from our side will follow up with feedback. Thanks again for the request and for using our product!

I’m curious if you have any sense of how people are using Airbyte with data lake architectures (that aren’t Databricks). We started off with the S3 connector, which is very robust, but that leaves open the question of how to register that data as tables. The AWS Glue crawler is an option (rough sketch below), but crawlers are flaky and hard to manage beyond simple cases.
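For context on the crawler route, this is the kind of thing we end up maintaining today (boto3 sketch; names, ARNs, and paths are placeholders): point a crawler at the prefix the S3 destination writes to and let it infer the table. That works for simple layouts, but schema drift and partition handling are where it gets fiddly.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawl the prefix the Airbyte S3 destination writes to and infer a table from it.
glue.create_crawler(
    Name="airbyte_users_crawler",                            # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder IAM role
    DatabaseName="analytics_lake",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/users/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)
glue.start_crawler(Name="airbyte_users_crawler")
```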

Hey @blarghmatey, sorry for the delay. Unfortunately, I don’t have an answer for you at the moment, but I’ll update this post when I do.

We’ve recently published an article/tutorial on how to use Airbyte with Dremio: https://airbyte.com/tutorials/build-an-open-data-lakehouse-with-dremio

I’m not sure how helpful this is or whether it answers your question directly, but it may be of some value here. Thanks again for your suggestions!