Adam Turoff on 24 Nov 2003 22:52:08 -0500



Re: [PLUG] XML, text, and the development of unix


On Mon, Nov 24, 2003 at 10:19:31PM -0500, Jeff Abrahamson wrote:
> What are we losing, I would have asked Ritchie, as more and more files
> become XML?

We're losing a *LOT* of simplicity.  Had Unix or Plan9 been predicated
on XML instead of simple line-oriented text, it might never have gotten
out the door.

> I know some of you, especially sys admins, have extensive experience
> both with XML and with the need to query in an ad hoc way using these
> text-based tools.  What are your experiences, pro and con, with this
> issue?

One of the basic tenets of the Unix philosophy is that *everything* is
the same format: /etc/passwd, /etc/rc.conf, ~/.login and so on are all
line-oriented files.  Look at a portion of any of those files (through
head, tail, or grep), and the result is still a line-oriented file.
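A minimal sketch of that closure property, in Python rather than the
shell tools named above, with made-up /etc/passwd contents: any subset
of lines is still a file in the same format, so the trivial
colon-splitting parse keeps working after a grep-style selection.

```python
# Hypothetical /etc/passwd contents, inlined for the example.
passwd = """root:x:0:0:root:/root:/bin/sh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
"""

# "grep bash": keep only the matching lines.  The survivors are still
# colon-delimited lines in exactly the original format.
bash_users = [line for line in passwd.splitlines() if "bash" in line]

# So the same trivial parse rule applies to the filtered output.
for line in bash_users:
    name, _, uid, gid, gecos, home, shell = line.split(":")
    print(name, shell)
```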

Each of those files may have substructure: /etc/passwd is a set of colon
delimited fields, /etc/rc.conf is a shell program, Makefile is a little
language with a regular syntax, and so on.  grep alone may not be able to
pull out an entire make rule, but a crafty use of sed/awk/perl should be
able to pull out a single rule without too much trickery.
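The "crafty sed/awk/perl" trick for make rules amounts to a few lines
of line-oriented state: a rule starts at `target:` and runs through the
tab-indented lines below it.  Here is a sketch of that logic in Python
against a made-up Makefile (the awk version would be a one-liner on the
same principle):

```python
# Hypothetical Makefile contents for the example.
makefile = """all: build test

build:
\tcc -o prog prog.c

test: build
\t./prog --self-test
"""

def extract_rule(text, target):
    """Pull out one make rule using only line-oriented logic:
    grab the 'target:' line, then every tab-indented line after it."""
    out = []
    grabbing = False
    for line in text.splitlines():
        if line.startswith(target + ":"):
            grabbing = True
            out.append(line)
        elif grabbing and line.startswith("\t"):
            out.append(line)
        elif grabbing:
            break  # first non-indented line ends the rule
    return "\n".join(out)

print(extract_rule(makefile, "build"))
```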

What about XML?  XML is a hierarchical data structure that happens to be
serialized as text.  The benefit to it being text is that you can use
emacs, vi, perl, and printf() when dealing with it.  In a pinch, you can
still use grep, sed and more on an XML file, but there's no guarantee
that they'll be *as* useful as they are on a simple line-oriented format.
The only guarantee is that sed, grep and friends are more useful with XML
than they are with JPEG, GIF, tarballs or MP3 files.

Case in point: selecting a few lines out of an XML file does not produce
an XML file, it produces a fragment, which is more often than not
completely unparsable.  Furthermore, if you are looking for a specific
well-formed fragment, it may start in the middle of line 3 and end in the
middle of line 7.  Are the partial lines at either end relevant, or are
they noise to be stripped?  Is the fragment's place in the hierarchy of
the rest of the document relevant?  Are aspects of the elements that
follow relevant?
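A sketch of the fragment problem, using Python's standard XML parser on
a made-up document: select "the matching lines" exactly the way grep
would, and what comes out is no longer parsable XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document for the example.
doc = """<users>
  <user><name>alice</name>
    <shell>/bin/bash</shell></user>
  <user><name>bob</name>
    <shell>/bin/sh</shell></user>
</users>"""

# The whole document parses fine.
ET.fromstring(doc)

# "grep bash": keep only the lines mentioning bash.
fragment = "\n".join(line for line in doc.splitlines() if "bash" in line)

# The selected lines carry a close tag (</user>) whose open tag was on
# a line grep threw away, so the fragment is not well-formed.
try:
    ET.fromstring(fragment)
    parsed = True
except ET.ParseError:
    parsed = False

print(fragment)
print(parsed)
```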

Next, there's the issue of parsing XML.  The act of parsing XML files
"denatures" them.  The data returned by a parser is *conceptually* the
same information, but not *exactly* the same information.  A
cryptographic signature taken over the input will not necessarily match
one taken over the re-serialized output.  Operations like grep and head
are quite simple on line-oriented files -- just copy the input to the
output.  This cannot be done with XML, because content characters like
<, > and & need to be converted back to &lt;, &gt; and &amp; on output
if the document (or fragment) is to be parsed again, for example by
another stage of the pipeline.
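The denaturing is easy to demonstrate with Python's standard parser and
a made-up document: parse it and write it straight back out, and the
bytes differ even though the information is conceptually identical,
because the parser resolved entities and CDATA and the serializer must
re-escape its own way.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical input: an entity reference in an attribute, and a CDATA
# section in the content.
original = '<msg note="a&#62;b"><![CDATA[x < y]]></msg>'

tree = ET.fromstring(original)
roundtrip = ET.tostring(tree, encoding="unicode")

# The parser turned &#62; into ">" and dissolved the CDATA section;
# the serializer re-escapes <, > and & in its own style, so a checksum
# over the bytes no longer matches.
same_bytes = (hashlib.sha256(original.encode()).hexdigest()
              == hashlib.sha256(roundtrip.encode()).hexdigest())

print(roundtrip)
print(same_bytes)
```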

Oh, and then there's the need to make tools XML-aware.  A tool like spell
works because it's dealing with just text.  What should spell do with XML
data?  Does it fix up element names?  Does it fix up attribute names or
attribute contents?  How does it limit itself to just the text (and
possibly comments) while ignoring elements (and possibly PIs)?
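To make that concrete: an XML-aware spell would first have to do
something like the sketch below (Python, with a made-up document full of
misspellings) just to decide what its input even is -- extract the
character data and throw away element names, attribute names and
attribute values.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: misspellings in markup and in content.
doc = ('<note importanse="high">'
       '<titel>Shoping list</titel> buy some bred </note>')

# itertext() walks only the character data, skipping tags and
# attributes entirely (this parser drops comments and PIs anyway).
text_only = " ".join("".join(ET.fromstring(doc).itertext()).split())

print(text_only)
```

Note what gets lost: a plain spell would also flag "importanse" and
"titel", but those live in the markup this extraction discards -- which
is precisely the policy question posed above.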

Now, look at all of the issues that must be addressed before XML text
can be handled as fluidly as Unix tools handle line-oriented text.

Hm...*all* of these features would need to be implemented on *every* little
tool.  Each time you'd do a simple read() or write() operation, these
nasty issues would have to be addressed.  Even if they could be handled
in the depths of stdio, stdio would need to have about 12 or more different
knobs to tweak -- a *LOT* more complex than read/write or gets/printf.

And that's just the tip of the iceberg.  XML has been around about 5 years
now, and for the most part, the simple processing pipelines don't exist
to manage XML data with ad-hoc queries.  Cocoon and other web-based XML
publishing environments offer pipeline processing for XML, but they operate
in terms of linking custom-written programs together, not linking simple
command-line tools together.  They also avoid the
read/parse/encode/serialize loop (the XML analog of the simple
read/write loop) by passing pre-parsed XML events/trees from one stage
of the pipeline to the next, *not* the canonical text-based format.


Thanks for such a thought provoking question.  :-)

HTH, 

Z.

___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug