When I took my current position, I was invited to teach a course on unstructured data. It is a sensible topic for a course: some say that between 80% and 90% of all enterprise data is unstructured. But I objected to the title for marketing reasons. How many students would take a course on unstructured data? I can hear the students asking “what’s that course about?” Thus, I proposed a better title for the course: information retrieval and filtering. Indeed, everyone wants to filter and retrieve data, right?

Meanwhile, there were already courses on structured data (that is, on databases and information systems). However, there was no course on semi-structured data. So I proposed one. But I couldn’t call it semi-structured data as hardly any student would know what the title meant. Instead, I proposed a course which, roughly translated, is called “Information Management with XML.”

Immediately, I got into trouble: how could I dare omit SOAP and web services from a course on XML? I was annoyed by these comments. With some sense of irony, I decided to start dumping some SOAP examples on my students so that they could see the “beauty” [I'm being ironic] of using XML for data exchange on the web. So, there I was, trying to teach my students about semi-structured data, and I was asked to tell them about remote procedure calls, an irrelevant topic for my purposes.

Thankfully, it appears that history is on my side. Developers got tired of getting these annoying XML payloads. In time, they started using JSON, a much more appropriate format for passing small loads of structured data between a server and an ECMAScript client. It uses fewer bytes and, more importantly, JSON is an order of magnitude faster than XML. When you ask on Stack Overflow whether you should be using SOAP, you are told to avoid SOAP at all costs. The developers have spoken. And as a result, the organization behind the SOAP stack decided to close shop.
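To make the "fewer bytes" point concrete, here is a minimal sketch in Python that serializes the same small record both ways and compares the sizes. The record and its field names are invented for illustration, and the XML layout (one child element per field) is only one of several reasonable encodings.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are made up for this illustration.
record = {"id": 42, "name": "Ada", "email": "ada@example.org"}


def to_json_bytes(rec):
    """Serialize the record as compact JSON (no extra whitespace)."""
    return json.dumps(rec, separators=(",", ":")).encode("utf-8")


def to_xml_bytes(rec):
    """Serialize the record as XML, one child element per field."""
    root = ET.Element("user")
    for key, value in rec.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root)


json_payload = to_json_bytes(record)
xml_payload = to_xml_bytes(record)

print(json_payload.decode("utf-8"), len(json_payload))
print(xml_payload.decode("ascii"), len(xml_payload))
```

The closing tags account for most of the difference here. This says nothing about parsing speed, which depends on the parsers involved.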

Where does that leave XML? Precisely where it started. XML is a great meta-example of how to deal with semi-structured data. And it is just as useful as ever. Want to deal with documents? DocBook and OpenDocument are great formats. Want to add semantic information to web pages? Microformats can do it. You want to exchange complex business data? The Universal Business Language probably does what you need. Some people are having luck with the SVG image format. You want to subscribe to my blog? Grab my Atom feed. For these applications, you couldn’t easily replace XML with flat files or JSON. Nor should you try.

Alas, we ended up torturing XML by applying it to ill-suited purposes. We must learn how to select the best format. Does your data look like a table? Can a flat file do the job? Do you need a key-value format like JSON? Or maybe a simple text file? Or is your data more like an XML document? Take a good look at your data before picking a format for it.
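As a small illustration of the "does your data look like a table?" question, here is a sketch of tabular data kept in a flat file and read with Python's standard `csv` module. The course names and enrollment numbers are made up.

```python
import csv
import io

# Hypothetical flat file: one header line, then one row per course.
flat_file = (
    "course,students\n"
    "Information retrieval and filtering,35\n"
    "Information management with XML,28\n"
)

# Each row becomes a dictionary keyed by the header names.
rows = list(csv.DictReader(io.StringIO(flat_file)))
total_students = sum(int(row["students"]) for row in rows)

for row in rows:
    print(row["course"], row["students"])
print("total:", total_students)
```

When every record has the same fields and no nesting, there is little left for XML or JSON to add.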

Further reading: Indexing XML and Native XML databases: have they taken the world over yet?

Update: I don’t include configuration files in my list of proper XML applications.

26 Comments

  1. As far as I see it, json is the format that is preferred by a browser frontend, and xml / flat file is the solution to everything else.

    Comment by Yoav — 17/11/2010 @ 21:04

  2. Hi, enjoyed the post!

    Considering the immediate utility of JSON and YAML, I think it may be most helpful for your students to focus on exactly why DocBook, Atom, and OpenDocument are *not* JSON.

    They needed to be standards in order to be of any use, and the only way to standardize something is with a formal, implementation-agnostic definition. I think the reason they are XML has little or nothing to do with the data they represent, or whether or not that data could be considered “semi-structured.”

    It just happens that XML has the most complete and accepted tooling around schemas and validation. My opinion, anyway. Your post inspired me to write a longer, semi-rant over here: http://alan.dipert.org/post/1606016275/an-xml-rant

    Comment by Alan Dipert — 17/11/2010 @ 22:13

  3. Hi,

    just wondering what you think about XML Databases ( http://en.wikipedia.org/wiki/XML_database), or if you even taught them in this course?

    I did some work with XML:DB a few years back (mainly eXist, XIndice and Berkeley DB), and I really enjoyed it.
    It was great to just dump some (even close to a million) XML documents, and do some XQuery queries on top of that.
    At least, it was much much easier than dealing with the weird XML layers on top of Oracle or IBM’s DB2.
    And not only this, but the even-more-strange SQL queries to retrieve those XML chunks.
    IMHO, XQuery was (is?) a great language to deal with XML data.

    I also did some webapps where the backend was just a simple XML:DB (eXist), and it was quite fast.

    Ok… I guess that’s all for now!

    Cheers, Oscar

    Comment by Oscar — 17/11/2010 @ 22:57

  4. I agree that there are legacy XML formats that are not too horrible, and plenty of existing tools structured around XML that work. But if you were designing a new system, can you give an example of one where you would choose XML over JSON, and why?

    Comment by Brandon — 18/11/2010 @ 0:11

  5. Why couldn’t you easily replace XML with JSON in the case of Atom? It seems to me that it would be quite easy to do so.

    Comment by Bill Mill — 18/11/2010 @ 0:31

  6. JSON/XML are extremely similar. They’re both tree structures permitting untyped ordered nodes.

    There are two (to me) significant differences between JSON/XML:
    – tooling: XML has far better support in almost all environments, but (ironically) not necessarily on the web. That includes highly tuned parsers, syntax highlighting in editors, availability of query languages, query API’s, serialization support etc. JSON is catching up here – but slowly, and I don’t expect it will ever close the gap. XPATH, XQUERY, XSLT are all extremely powerful and rather useful.
    – wordiness: JSON is shorter, which is of course better. But the difference is small, particularly when compressed. The big json savings come from the lack of element names on array elements – but those make inspection easier and are a natural means of schema extensibility without any planning necessary. By contrast, it’s not so automatic to make an extensible json format, nor will it be quite as inspectable (though the difference is often nil).

    So if you’re on the web client side, exchanging small mostly untyped bits of information, compressed JSON works well. Browsers transparently decompress and json parsing can be trivial. To put it in perspective, if size is a key issue, enabling compression is a larger factor than xml vs. json. Elsewhere, tooling almost certainly matters more than details of encoding on the wire. And if in fact you do need to inspect the encoded data, element names are nice to have. But most relevantly, it just doesn’t matter much; and standardization is simplicity: using two almost identical formats is just making your life harder.

    For a similar reason, I’d say the suggestion to use “flat files” or whatever is generally unwise. Except in the most trivial of cases, json or xml have little enough overhead to make reinventing the wheel simply more effort than it’s worth.

    On the topic of messed up bits of XML: wtf is up with namespaces? What a bloated, unhandy mess. Fortunately mostly avoidable. And such a shame that XSLT2 never caught on on the web – that would be a neat bit of tech in a browser.

    Comment by Eamon Nerbonne — 18/11/2010 @ 4:27
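The claim in the comment above, that enabling compression matters more than the choice between XML and JSON, can be checked with a rough sketch like the following. The records are invented and highly repetitive, so the exact ratios are only indicative; `zlib` stands in here for whatever compression a web server would apply.

```python
import json
import zlib
import xml.etree.ElementTree as ET

# Hypothetical records, invented for this illustration.
records = [{"id": i, "name": f"user{i}", "city": "Montreal"} for i in range(200)]


def as_json(recs):
    """Compact JSON array of objects."""
    return json.dumps(recs, separators=(",", ":")).encode("utf-8")


def as_xml(recs):
    """One <user> element per record, one child element per field."""
    root = ET.Element("users")
    for rec in recs:
        user = ET.SubElement(root, "user")
        for key, value in rec.items():
            ET.SubElement(user, key).text = str(value)
    return ET.tostring(root)


sizes = {
    "json": len(as_json(records)),
    "xml": len(as_xml(records)),
    "json+deflate": len(zlib.compress(as_json(records))),
    "xml+deflate": len(zlib.compress(as_xml(records))),
}

for label, size in sizes.items():
    print(label, size)
```

On data this repetitive, the compressed XML comes out smaller than the uncompressed JSON. Real payloads will compress less well, so measure on your own data before drawing conclusions.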

  7. @Oscar

    I did link at the bottom of my post to an older post on XML databases. I think they are just not getting much traction at all.

    @Brandon and @Mill

    For “new” applications that do not fit well in the established XML applications I have listed in my post, I would be tempted to consider JSON or flat files.

    JSON was designed for small loads of structured data. It is a poor match as a document format.

    @Nerbonne

    There are excellent tools to work with flat files. Flat files are fast (parsing is minimal). They are also simple and human readable (more so sometimes than JSON/XML).

    I agree that namespaces are difficult. However, they are useful to support things like microformats.

    XSLT2 is difficult to implement efficiently in a browser. It is a very rich language. I must point out that XQuery has also received little support in the browsers. In fact, browser developers seem to have given up on XML technology beyond what they already have.

    Comment by Daniel Lemire — 18/11/2010 @ 10:09

  8. @Daniel, I understand that switching working formats and tools from XML to JSON might be a waste of time, so for established systems and XML standards, XML will remain in use. You say that JSON is a poor match as a document format. Why? What is it that XML does that JSON cannot do (or that XML does more elegantly)? I can’t think of a time when I would use XML, other than to take advantage of legacy tools or formats.

    Comment by Brandon — 18/11/2010 @ 10:40

  9. @Brandon

    You say that JSON is a poor match as a document format. Why?

    See Sean McGrath’s post on mixed content for a related explanation:

    http://seanmcgrath.blogspot.com/2007/01/mixed-content-trying-to-understand-json.html

    Comment by Daniel Lemire — 18/11/2010 @ 11:11
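For readers who do not follow the link, here is a minimal sketch of what mixed content looks like in practice, using Python's standard `xml.etree.ElementTree`. The sentence is made up, and the JSON encoding at the end is just one ad hoc convention among many, not a standard.

```python
import json
import xml.etree.ElementTree as ET

# Mixed content: text and child elements interleaved inside one element.
paragraph = ET.fromstring("<p>XML is <em>great</em> for documents.</p>")
emphasis = paragraph.find("em")

# ElementTree keeps the text before the child on the parent (.text)
# and the text after the child on the child itself (.tail).
print(repr(paragraph.text), repr(emphasis.text), repr(emphasis.tail))

# Recovering the plain text is a one-liner.
plain_text = "".join(paragraph.itertext())
print(plain_text)

# An ad hoc JSON encoding (an assumption for illustration): a list that
# alternates between strings and single-key objects.
as_json = json.dumps(["XML is ", {"em": "great"}, " for documents."])
print(as_json)
```

The JSON version works, but every consumer has to know the convention, and it gets unwieldy as soon as the markup nests.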

  10. Flat files can be fine if all you want is a plain sequence of characters, or perhaps a sequence of lines – or just maybe if you have a comma separated file.

    What I see a little too often is feature-creep; what started as a flat file now has some structured stuff up front- “headers” if you will, and some structure in each line – at which point, why not just use an existing format, and one that’s trivially usable by come what may.

    Comment by Eamon Nerbonne — 18/11/2010 @ 11:18

  11. One aspect that is underrepresented in this discussion, and also in our basic computer science education, is that of data independence.

    In this XML vs. JSON discussion, most of the arguments are really mixing up the physical and logical level – one can also represent XML in a highly compressed format; even though many of today’s implementations of XML technologies may not properly do this. And one can also represent semi-structured data using JSON, even though then manipulating it may prove a big challenge.

    On a practical issue, I think a standard JSON 2 XML converter would help us use JSON data in our XML query languages without effort, and it should not be difficult to do this.

    Comment by Arjen P. de Vries — 18/11/2010 @ 11:31
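A JSON-to-XML converter of the kind suggested above can be sketched in a few lines of Python. The mapping used here is an assumption made for illustration (object keys become element names, array entries become `<item>` elements), and it presumes that every key is a valid XML name. The function name `json_to_xml` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(value, tag="root"):
    """Convert a parsed JSON value into an ElementTree element.

    Assumed convention: dict keys become child element names,
    list entries become <item> children, scalars become text.
    """
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(json_to_xml(child, key))
    elif isinstance(value, list):
        for child in value:
            elem.append(json_to_xml(child, "item"))
    elif isinstance(value, bool):
        elem.text = "true" if value else "false"
    elif value is not None:
        elem.text = str(value)
    return elem


# Hypothetical feed-like document.
doc = json.loads(
    '{"title": "My blog", "public": true,'
    ' "entries": [{"title": "XML"}, {"title": "JSON"}]}'
)
root = json_to_xml(doc, "feed")

# Once converted, the data can be queried with XPath-style expressions.
titles = [e.text for e in root.findall("./entries/item/title")]
print(titles)
print(ET.tostring(root).decode("ascii"))
```

ElementTree only supports a subset of XPath, but the same tree could be handed to any XML query tool. The hard part of a real converter is agreeing on the mapping, not writing the code.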

  12. Most students here in the UK aren’t even taught about the concept of web services, let alone the difference between the SOAP/XML approach and the RESTful / JSON approach.

    In fact I personally think XML is very valuable in web services – for business applications where you’re passing large amounts of structured data. In this arena XML can be highly effective. The problem is very few students or academics ever see such web service systems in action. And many home brew developers don’t have the patience or desire to learn about SOAP/XML because that’s ‘enterprise’ stuff and it’s all a bit scary.

    I think what it boils down to is that the real problem (and this is a major problem in software I will continue to bang on about at every opportunity) is that:

    The majority of developers are way behind the minority who are developing systems using modern technologies like web services and SOA. And I would be willing to bet most students are even further behind them in terms of useful knowledge of the landscape of cutting edge business software.

    Before you can even begin to talk about the appropriate approach to something like web services, you have to get them to understand what they are and why they’re important.

    Comment by Stewart Sims — 18/11/2010 @ 11:35

  13. I think a lot of us developers that started using json merely did it to get around the cross-domain issues of fetching xml client-side. If it wasn’t for that, I wouldn’t have actively looked for another solution (JSON).

    http://tech.rawsignal.com

    Comment by Troy — 18/11/2010 @ 11:40

  14. @Daniel, great link, and I get that. XML (HTML) is great for text formatting markup, and the mixed content problem with text markup would be very difficult to solve in XML-free JSON. If I needed to send HTML-formatted text from server to client, I certainly wouldn’t try to translate it into JSON. I’m still struggling, though, with a use case for XML in a modern (non-legacy-driven) data-exchange situation. Is there an example you can give me (other than presentation markup) where XML would be a better choice than JSON? Thanks for your responses — I still think I misunderstand XML :)

    Comment by Brandon — 18/11/2010 @ 11:54

  15. @Brandon

    I’m still struggling, though, with a use case for XML in a modern (non-legacy-driven) data-exchange situation

    Firstly, you have to determine whether you are dealing with structured, semi-structured or unstructured data. If, and only if, you are dealing with semi-structured data, then XML is a candidate. If you are dealing with simple, mostly structured data, and you are on the Web, then JSON is probably best.

    XML is not, and has never been, a generic data format.

    One annoying application of XML is for configuration files… because often, all that was required is a simple key/value list.

    Comment by Daniel Lemire — 18/11/2010 @ 13:11
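To illustrate the point about configuration files, here is a sketch of a simple key/value configuration read with Python's standard `configparser` module. The section and key names are invented.

```python
import configparser

# Hypothetical configuration: plain key/value pairs grouped in a section.
config_text = """
[server]
host = localhost
port = 8080
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

host = parser.get("server", "host")
port = parser.getint("server", "port")  # converted to an integer on read

print(host, port)
```

A file like this is easy to read and to edit by hand, which is often all a configuration file needs.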

  16. Both JSON and XML are formats. It is possible to perform 1:1 mapping from one to another, as both of them have different ways to describe the same data.
    Data formats are needed to send data from place A to place B, and therefore the only thing that matters is how the format is supported in A and B. If browser providers built an XML engine that was faster than the JSON engine, then there would be no need to use JSON. If the tools and standards support for JSON were better than the support for XML, then there would be no need to use XML. It’s all in the tool + standards support. Readability and network payload are to be handled by tools. What matters is the data, formats are just the paper that carries the text.

    Comment by Yoav — 18/11/2010 @ 13:11

  17. Thanks for a great article that calls out what I’ve said for years, i.e., the structure of the data should be thoroughly understood before picking an encoding method. As you note, there are some things that XML works well for and others that it simply doesn’t. Unfortunately, XML was taken to be the silver bullet for all data formats (as typically happens with new things) and applied in ways that it should really never have been.

    Comment by Chuck — 18/11/2010 @ 13:18

    At our university in Austria we got a course called “semi-structured data”. It’s made mandatory for most students, therefore there is no problem with a lack of students attending :)

    Comment by anty — 18/11/2010 @ 15:50

  19. Hmm, maybe I am completely wrong, but isn’t the main difference between XML and json/yaml the fact that XML has the possibility to validate vs. a schema? Which itself is described in the same format …

    Oh, and next time I would suggest to go for “semi-structured data” in the course title. It is about time that students learn about those 80 – 90% of enterprise data. Imho there is no such thing as completely unstructured data in enterprises, but that is another discussion ….

    Comment by KoW — 21/11/2010 @ 1:05

    When it comes to true unstructured data, which covers multimedia and any file format, neither XML nor JSON is efficient for handling it.
    It’s not that one is better than the other, it’s using the right one for the right job, knowing their limitations.
    A private paper I recently wrote addresses and highlights this (available on request – email marcelle@piction.com) and covers the mentality of the relationist.

    Comment by Marcelle Kratochvil — 21/11/2010 @ 17:18

  21. @KoW

    I use the expression “unstructured data” as a technical term in this blog post. (One could argue that only random noise has no structure.)

    Comment by Daniel Lemire — 21/11/2010 @ 19:40

    XML can be a really good format for configuration. With key/value pairs, more validation has to be done by the programmer and the user can’t benefit from editor assistance (completion, validation…).

    XML/JSON provide a lot of flexibility for managing tree-like data, but this flexibility comes with a big cost in performance and size. The main thing: if your data isn’t easily expressed as a tree (images, pictures, sound…), then avoid XML/JSON.

    You want your messages to be small and processing to be fast? Stay away from XML or JSON.

    You want a portable format with minimal cost for the design of the format and the tooling? Then these formats can help you a lot.

    Comment by Anonymous — 22/11/2010 @ 12:09

    Actually, I disagree about using XML for configs. The validation has to be performed at some point by a developer whether it involves validating key-value pairs in code or creating a DTD or schema to do the validation. If the assumption is that the tool will be generating / managing the config, then the only difference will be the time it takes for the tool to validate, regardless of the file format. The fallacy of XML in many of these cases is that it makes the file unreadable by a human, which may be moot if a human is not expected to hand edit it.

    Comment by Chuck — 24/11/2010 @ 14:25

  24. As a recent CS graduate, I think your marketing sense is exactly backwards from mine. “Information retrieval and filtering” sounds like the world’s most boring course. Who’d want to sit through that?

    “Unstructured” or “semi-structured data” sounds somewhat interesting: it’s something no other CS course at my university has even tried to address.

    I would never worry about not understanding the title. The most fun and interesting (and even, dare I say it, useful) courses were the ones whose titles I didn’t understand. In some cases it had words I’d never even heard before!

    Comment by Ken — 30/11/2010 @ 15:17

  25. @Ken

    Thanks Ken. You might be right.

    Comment by Daniel Lemire — 30/11/2010 @ 15:32

  26. IMO JSON has not at all replaced SOAP-XML in web services. Not all platforms support JSON out of the box, and it does not provide service contracts. SOAP is not meant for human readability but for intra-software readability, and it works well for that and is in widespread use. RPC technology has a complex history, its own dirty politics, and a simplistic essay like this does not do justice to the decision to choose JSON over SOAP. SOAP is a protocol, provides a *standardized* way of allowing the endpoint to return error messages to the calling application, and is in widespread use for good reasons, which apparently you do not yet understand.

    Comment by P. G. Palmer — 1/5/2011 @ 11:05
