XML as a source in Airbyte

slack-user-airbyte · May 14, 2024, 6:16pm

Summary

Airbyte does not currently support XML as a source. This thread discusses what to consider before implementing one.

Question

Hi, Airbyte does not have XML as a source, like it has CSV or JSON. Are there any problems I should know about before I start implementing XML as a source?

This topic has been created from a Slack thread to give it more visibility.

Tags: xml-source, airbyte-connector, data-integration

slack-user-airbyte · May 16, 2024, 1:24pm

I’m not an authoritative source on this, but my guess is two things:

XML needs a schema to describe the structure of the data. These XSDs tend to be quite large and may require extra processing that JSON and CSV do not.
XML is always wrapped in a root element, since it is a tree. So unlike JSONL (where each line-delimited object is a row), each document would be structured something like this: <rows><row/><row/></rows>, and you cannot actually evaluate whether the document matches the schema without reading the whole file. I.e. in the example above you only know it is valid after reading the closing </rows> tag, which could be at the end of a 100GB file.
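
To make the root-element point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. It is not Airbyte code; the `stream_rows` helper and the `row` tag name are assumptions for illustration. It shows that rows can be streamed one at a time, but a truncated file is only detected after rows have already been emitted.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, row_tag="row"):
    """Yield one dict of attributes per <row> element (hypothetical helper)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield dict(elem.attrib)
            elem.clear()  # release the content of the finished element


# A complete document: every row is yielded and parsing ends cleanly.
good = io.BytesIO(b'<rows><row id="1"/><row id="2"/></rows>')
print(list(stream_rows(good)))

# A truncated document (missing </rows>): rows are still yielded first,
# and the parser only raises once it reaches the end of the input.
bad = io.BytesIO(b'<rows><row id="1"/><row id="2"/>')
seen = []
try:
    for row in stream_rows(bad):
        seen.append(row)
except ET.ParseError:
    print("rows emitted before the error:", seen)
```

A connector built this way would have to decide what to do with records it already emitted when the error finally shows up.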

slack-user-airbyte · May 16, 2024, 1:24pm

Hi Justen, what you are saying makes sense. A global XML connector would possibly produce a lot of problems and compatibility issues.

slack-user-airbyte · May 16, 2024, 1:24pm

Also, what about <row key="name">value</row> vs <row><key>name</key><value>val</value></row>? Both are valid, so it would need some kind of specification for how to extract the data.
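
As a rough illustration of that extraction problem, the sketch below reads both shapes with Python's standard `xml.etree.ElementTree`. The extractor functions and `read_rows` are hypothetical names, not part of any Airbyte API, and the sample uses the same value text in both documents so the results can be compared.

```python
import xml.etree.ElementTree as ET

ATTR_STYLE = '<rows><row key="name">value</row></rows>'
CHILD_STYLE = '<rows><row><key>name</key><value>value</value></row></rows>'


def extract_attr_style(row):
    # The key lives in an attribute; the value is the element text.
    return {row.get("key"): row.text}


def extract_child_style(row):
    # The key and the value are both child elements.
    return {row.findtext("key"): row.findtext("value")}


def read_rows(xml_text, extractor):
    """Apply a per-format extractor to every <row> in the document."""
    root = ET.fromstring(xml_text)
    return [extractor(row) for row in root.iter("row")]


a = read_rows(ATTR_STYLE, extract_attr_style)
b = read_rows(CHILD_STYLE, extract_child_style)
print(a == b)  # same record, but only because each format got its own extractor
```

Each layout needs its own mapping, which is the kind of per-format specification a generic connector would have to ask the user for.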

slack-user-airbyte · May 16, 2024, 1:25pm

I think it might be possible for Airbyte to support a specific pre-defined XML schema, but not general XML.

slack-user-airbyte · May 16, 2024, 1:25pm

Yeah, I will implement a connector to read the formats I use.

slack-user-airbyte · May 16, 2024, 1:25pm

Thank you very much for your reply and ideas.

slack-user-airbyte · May 16, 2024, 1:25pm

Technically it already does: Excel uses the xlsx format, which is XML under the hood (a zip archive of XML files), just with the pre-defined Excel document schema.

slack-user-airbyte · May 16, 2024, 1:25pm

Yes, you are absolutely right. Even xlsx could be dangerous because of conversions from other formats to xlsx.

slack-user-airbyte · May 16, 2024, 1:25pm

As a single document it is probably not useful, but the ability to ingest a bunch of XML files/messages from S3, Azure, or GCP would be handy.
It can be structured, i.e. all files have the same schema and are parsed by Airbyte,
or unstructured, i.e. every file has its own schema.

It can be useful to load the file content into Postgres, for example, and use dbt + xmlparse for processing.

slack-user-airbyte · May 16, 2024, 1:25pm

Even if you chose the source format to be one row per document, I believe the issue is that the canonical format of the data during replication is JSON, described by a JSON Schema (i.e. the catalog), so you need to translate the XML document into that for it to be emitted over the wire. You could maybe just ship the whole document as a single base64-encoded field, but it’s likely not useful in that format. You could use something like JSON-ML, I think, but there is a lot of overhead (i.e. even more than XML).
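
One possible way to translate an XML document into a JSON-friendly record is sketched below with Python's standard library. The convention used here (`@` prefix for attributes, `#text` for element text, repeated children collected into lists) is an assumption made for this example, not JSON-ML and not something Airbyte defines. Text mixed between child elements is dropped.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element into a JSON-serializable value (hypothetical convention)."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        if not node:
            return text  # leaf without attributes becomes a plain string
        if text:
            node["#text"] = text
        return node
    # Children are grouped by tag name; each tag maps to a list of values.
    for child in children:
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


doc = '<order id="7"><item sku="a1">Widget</item><item sku="b2">Gadget</item></order>'
record = element_to_dict(ET.fromstring(doc))
print(json.dumps(record, indent=2))
```

Even with a fixed convention like this, the JSON Schema for the catalog still has to be derived from somewhere, such as an XSD or a sample of documents.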

slack-user-airbyte · May 16, 2024, 1:25pm

Pass the entire file content as a string in JSON?

slack-user-airbyte · May 16, 2024, 1:25pm

Yeah, but how useful is that? I guess it really depends on the use case, but then it’s not really an XML source; it’s a “text file” source.
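
For reference, the "whole file as one field" idea could look roughly like the sketch below. The `file_to_record` function and its field names are hypothetical; it falls back to base64 when the bytes are not valid UTF-8, as suggested earlier in the thread.

```python
import base64
import json


def file_to_record(file_name, raw_bytes):
    """Wrap a whole file in a single JSON-serializable record (hypothetical fields)."""
    try:
        content, encoding = raw_bytes.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Non-text payloads are shipped as base64 so the record stays valid JSON.
        content, encoding = base64.b64encode(raw_bytes).decode("ascii"), "base64"
    return {"file_name": file_name, "encoding": encoding, "content": content}


record = file_to_record("orders.xml", b"<rows><row/><row/></rows>")
print(json.dumps(record))
```

The destination then receives one opaque string per file, so any XML parsing has to happen downstream.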

slack-user-airbyte · May 16, 2024, 1:25pm

By itself, probably not very useful.

slack-user-airbyte · May 16, 2024, 1:25pm

But if the existing S3 and other sources could support generic text files, that would be great.

slack-user-airbyte · May 16, 2024, 1:25pm

Maybe it is, but this diverges from the original question about XML support. It would probably be achievable as a file format in the file source, but the community would probably need to weigh in to see how popular a request it is.
