XML as a source in Airbyte

Summary

Airbyte does not currently support XML as a source. This thread covers what to consider before implementing one.


Question

Hi, Airbyte does not have XML as a source, like it has CSV or JSON. Are there any problems I should know about before starting to implement XML as a source?



This topic has been created from a Slack thread to give it more visibility.

["xml-source", "airbyte-connector", "data-integration"]

I’m not an authoritative source on this, but my guess is two things:

  1. XML requires a schema to understand the structure of the data. These XSDs tend to be quite large and may require extra processing that JSON and CSV do not.
  2. XML is always wrapped in a root element, since it is a tree. So unlike JSONL (where each object is a row, line-delimited), each document would be structured something like this: <rows><row/><row/></rows>, and so you cannot actually evaluate whether the document matches the schema without reading the whole file. That is, in the above example you only know it is valid after reading the closing </rows> element, which could be at the end of a 100GB file.
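
To make the second point concrete, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree.iterparse`. The `stream_rows` helper and the `row` tag name are made up for illustration and are not part of any Airbyte connector. Rows can be yielded as they are read, but a missing closing root tag is only detected once the parser reaches the end of the input, after the rows have already been handed downstream.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, row_tag="row"):
    """Yield the attributes of each <row> once its end tag has been parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield dict(elem.attrib)
            # Drop the element's contents; the root still keeps an empty stub.
            elem.clear()


# Well-formed document: every row is yielded, no error.
good = io.BytesIO(b'<rows><row id="1"/><row id="2"/></rows>')
print(list(stream_rows(good)))

# Truncated document: the closing </rows> never arrives.
truncated = io.BytesIO(b'<rows><row id="1"/><row id="2"/>')
seen = []
error = None
try:
    for row in stream_rows(truncated):
        seen.append(row)  # rows are consumed before the problem is known
except ET.ParseError as exc:
    error = exc

print("rows emitted before the error:", seen)
print("parse error:", error)
```

So streaming is possible, but a connector would have to decide what to do with rows it already emitted from a file that later turns out to be broken.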

Hi Justen, what you are saying makes sense. A global XML connector would possibly produce a lot of compatibility problems.

Also, what about <row key="name">value</row> vs <row><key>name</key><value>val</value></row>? Both are valid, so it would need some kind of specification about how to extract this.
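
As a rough illustration of why an extraction specification is needed, the sketch below maps both layouts to the same kind of key/value record. The function names and the `layout` setting are hypothetical, standing in for whatever configuration a real connector would expose.

```python
import xml.etree.ElementTree as ET


def from_attribute_style(xml_text):
    # <row key="name">value</row>  ->  {"name": "value"}
    row = ET.fromstring(xml_text)
    return {row.attrib["key"]: row.text}


def from_element_style(xml_text):
    # <row><key>name</key><value>val</value></row>  ->  {"name": "val"}
    row = ET.fromstring(xml_text)
    return {row.findtext("key"): row.findtext("value")}


# Hypothetical connector setting: the user has to say which layout applies,
# because the XML alone does not tell us which parts are keys and which are values.
EXTRACTORS = {
    "attribute": from_attribute_style,
    "element": from_element_style,
}


def extract(xml_text, layout):
    return EXTRACTORS[layout](xml_text)


print(extract('<row key="name">value</row>', "attribute"))
print(extract('<row><key>name</key><value>val</value></row>', "element"))
```

Two layouts already need two code paths; a general XML source would need a way to describe arbitrarily many.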

I think it might be possible for Airbyte to support a specific pre-defined XML schema, but not general XML.

Yeah, I will implement a connector that reads the formats I use.

Thank you very much for your reply and ideas.

Like, technically it already does: Excel's xlsx data format is XML under the hood, just with the pre-defined Excel document schema.

Yes, you are absolutely right. Even xlsx could be dangerous because of conversions from other formats to xlsx.

As a single document it is probably not useful, but the ability to ingest a bunch of XML files or messages from S3, Azure, or GCP would be handy.
They can be structured, i.e. all files have the same schema and are parsed with Airbyte,
or unstructured, i.e. every file has its own schema.

It can be useful to load the file content into Postgres, for example, and use dbt + xmlparse for processing.

Even if you chose the source format to be one row per document, I believe the issue there is that the canonical format of the data during replication is JSON, described by a JSON Schema (i.e. the catalog), so you need to translate the XML document into that for it to be emitted over the wire. You could maybe just ship the whole document as a single base64-encoded field, but it is likely not useful in that format. You could use something like JSON-ML, I think, but there is a lot of overhead (i.e. even more than XML).
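
The two options in that message can be sketched as follows, treating one XML document as one record. This is only an illustration: `element_to_dict` is a deliberately naive, hypothetical mapping, and the record envelope is loosely modelled on an Airbyte record message (`type`, `stream`, `data`, `emitted_at`) as an assumption, built by hand rather than with the Airbyte CDK.

```python
import base64
import json
import time
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Naive mapping: attributes and child tags become keys, leaf text the value."""
    children = list(elem)
    if not children and not elem.attrib:
        return elem.text
    data = dict(elem.attrib)
    for child in children:
        # Repeated child tags overwrite each other here; a real mapping
        # would need a rule for turning them into lists.
        data[child.tag] = element_to_dict(child)
    if elem.text and elem.text.strip():
        data["#text"] = elem.text.strip()
    return data


def as_record(stream, data):
    # Assumed envelope shape, for illustration only.
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": int(time.time() * 1000),
        },
    }


doc = '<order id="7"><customer>Ada</customer><total>12.50</total></order>'

# Option 1: translate the document into JSON fields.
parsed = as_record("orders", element_to_dict(ET.fromstring(doc)))

# Option 2: ship the whole document as one opaque base64-encoded field.
encoded = base64.b64encode(doc.encode("utf-8")).decode("ascii")
opaque = as_record("orders_raw", {"content_b64": encoded})

print(json.dumps(parsed["record"]["data"]))
print(json.dumps(opaque["record"]["data"]))
```

The first option needs a mapping rule for every structural choice in the XML; the second pushes all of the parsing to the destination.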

Pass the entire file content as a string in JSON?

Yeah, but how useful is that? I guess it really depends on the use case, but then it's not really an XML source; it's a “text file” source.

By itself, probably not very useful.

But if the existing S3 and other sources could support generic text files, that would be great.

Maybe it is, but this diverges from the original question about XML support. It would probably be achievable as a file format in the file source, but the community would probably need to weigh in to see how popular a request it is.