Adam Turoff on 24 Nov 2003 22:52:08 -0500
On Mon, Nov 24, 2003 at 10:19:31PM -0500, Jeff Abrahamson wrote:

> What are we losing, I would have asked Richie, as more and more files
> become XML?

We're losing a *LOT* of simplicity. Had Unix or Plan 9 been predicated on
XML instead of simple line-oriented text, it might never have gotten out
the door.

> I know some of you, especially sys admins, have extensive experience
> both with XML and with the need to query in an ad hoc way using these
> text-based tools. What are your experiences, pro and con, with this
> issue?

One of the basic tenets of the Unix philosophy of files is that
*everything* is the same format: /etc/passwd, /etc/rc.conf, ~/.login and
so on are all line-oriented files. Look at a portion of any of those files
(from head, tail, or grep), and the result is still a line-oriented file.

Each of those files may have substructure: /etc/passwd is a set of
colon-delimited fields, /etc/rc.conf is a shell program, a Makefile is a
little language with a regular syntax, and so on. grep alone may not be
able to pull out an entire make rule, but a crafty use of sed/awk/perl can
pull out a single rule without too much trickery.

What about XML? XML is a hierarchical data structure that happens to be
serialized as text. The benefit of it being text is that you can use
emacs, vi, perl, and printf() when dealing with it. In a pinch, you can
still use grep, sed and more on an XML file, but there's no guarantee that
they'll be *as* useful as they are with a simple line-oriented format. The
only guarantee is that sed, grep and friends are more useful with XML than
they are with JPEG, GIF, tarballs or MP3 files.

Case in point: selecting a few lines out of an XML file does not produce
an XML file; it produces a fragment, which is more often than not
completely unparsable. Furthermore, if you are looking for a specific
well-formed fragment, it may start in the middle of line 3 and end in the
middle of line 7. Are the line beginnings relevant, or are they noise to
be stripped?
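The contrast above can be sketched in a few lines (my own illustration,
not from the original post; the sample data and names are made up). The
same "select matching lines" operation that leaves passwd-format data
intact leaves an XML document unparsable:

```python
import re
import xml.etree.ElementTree as ET

# grep-style selection on passwd-style records: the result is still
# valid passwd-format text, one record per line.
passwd = ("root:x:0:0:root:/root:/bin/sh\n"
          "www:x:80:80::/var/www:/bin/false\n")
selected = "\n".join(l for l in passwd.splitlines() if re.search("root", l))
print(selected)

# The same line selection on an XML document yields a fragment -- here a
# lone, unclosed start tag -- that no XML parser will accept.
doc = """<users>
  <user id="0">
    <name>root</name>
  </user>
</users>"""
fragment = "\n".join(l for l in doc.splitlines() if "id=" in l)
try:
    ET.fromstring(fragment)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(repr(fragment), "well-formed?", well_formed)
```

The passwd selection is itself a complete passwd-format file; the XML
selection needs knowledge of the tree to be completed into anything a
parser will take.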
Is the hierarchical structure in relation to the rest of the document
relevant? Are aspects of the elements that follow relevant?

Next, there's the issue of parsing XML. The act of parsing XML files
"denatures" them. The data returned by a parser is *conceptually* the same
information, but not *exactly* the same information. A cryptographic
signature taken over the input will not necessarily match a signature of
the post-parsed information.

Operations like grep and head are quite simple on line-oriented files --
just copy the input to the output. This cannot be done with XML, because
content characters like <, > and & need to be converted to &lt;, &gt; and
&amp; on output if the document (fragment) is to be parsed again, for
example by another stage of the pipeline.

Oh, and then there's the need to make tools XML-aware. A tool like spell
works because it's dealing with just text. What should spell do with XML
data? Does it fix up element names? Does it fix up attribute names or
attribute contents? How does it limit itself to just text (and possibly
comments) while ignoring elements (and possibly PIs)?

Now, look at all of the issues that must be handled to process XML text as
fluidly as Unix tools handle line-oriented text. Hm... *all* of these
features would need to be implemented in *every* little tool. Each time
you'd do a simple read() or write() operation, these nasty issues would
have to be addressed. Even if they could be handled in the depths of
stdio, stdio would need to have about 12 or more different knobs to tweak
-- a *LOT* more complex than read/write or gets/printf.

And that's just the tip of the iceberg. XML has been around about 5 years
now, and for the most part the simple processing pipelines don't exist to
manage XML data with ad-hoc queries. Cocoon and other web-based XML
publishing environments offer pipeline processing for XML, but they
operate in terms of linking custom-written programs together, not linking
simple command-line tools together.
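The "denaturing" and re-escaping points above are easy to demonstrate (a
sketch of my own, with made-up sample data): a parse/serialize round trip
preserves the information but not the bytes, so digests over the two
differ; and raw character data must be re-escaped before the next stage
can parse it.

```python
import hashlib
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

original = b"<config>\n  <opt name='debug'>on&amp;off</opt>\n</config>\n"
tree = ET.fromstring(original)
round_tripped = ET.tostring(tree)   # the parser's own quoting and whitespace

# Conceptually the same document, but the bytes (and so any signature)
# differ: single quotes became double quotes, the trailing newline is gone.
print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(round_tripped).hexdigest())
print(original == round_tripped)

# And a naive copy-through of text content is not safe either: character
# data must be re-escaped before it can be parsed by the next stage.
print(escape("on & off < limit"))
```

This is exactly why XML signatures need a canonicalization step before
hashing, where plain text can just be hashed as-is.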
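As for the spell question above, here is a sketch (my construction, not
the post's; `visible_text` is a hypothetical helper) of what an XML-aware
filter must do that a line-oriented one gets for free: walk the tree and
touch only character data, leaving element names, attributes and structure
alone.

```python
import xml.etree.ElementTree as ET

doc = '<note lang="en">Plese <em>chek</em> this</note>'

def visible_text(elem):
    """Collect character data only, skipping tags, attributes and names."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.extend(visible_text(child))
        if child.tail:
            parts.append(child.tail)
    return parts

words = " ".join(visible_text(ET.fromstring(doc))).split()
print(words)   # only the prose would reach the spell checker
```

Note that this already bakes in policy decisions -- comments and
processing instructions are silently skipped here -- and every such tool
would have to make them independently.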
They also work by avoiding the read/parse/encode/serialize loop (the
simple read/write loop) in favor of passing pre-parsed XML events/trees
from one stage of the pipeline to the next, *not* the canonical text-based
format.

Thanks for such a thought-provoking question. :-)

HTH,

Z.

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug