Jason Proctor on 23 Oct 2003 13:47:25 -0400



text-parsing question


I don't have my old Data Munging book handy, and I'm looking at a text
parsing problem that I know has to be common enough that someone else
has already solved it, but I don't see any obvious candidates on CPAN.

I have a large (several hundred MB) file that looks roughly like this:

Record-marker
unique id: 001
other field 1: abc
other field 2: xyz
other field 2: alt-xyz
Person-marker
person name: J. Smith
person addr: 123 Whatever St
Person-marker
person name: J. Doe
person addr: 321 Main St
Link-marker
link reference: 005
link date: 20020202
Link-marker
link reference: 013
link date: 20030303
Comment-marker
note: foo
note: bar
Record-marker
unique id: 002
[...]

This is obviously ripe for some sort of database or XML-derived
representation.  Each record can have one or more Person-markers, zero or
more Link-markers, and zero or one Comment-marker (which, if it exists,
has one or more notes).  Each record also has some number of each of the
"other field"s.  The exact rule depends on the field, but the rules are
all of a fairly standard kind (exactly 1, 0 or 1, 1 or more, etc.) and
don't depend on the presence, absence, or content of the other fields.
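
Since the occurrence rules are independent of each other, they could
probably be written down as a plain table of (minimum, maximum) counts
and checked per record.  The sketch below is in Python purely for
illustration; the RULES table and the check_counts name are hypothetical,
the bounds are just the ones listed above, and it only covers the flat
per-record counts, not the nested "one or more notes" rule.

```python
from collections import Counter

# Hypothetical rule table: name -> (minimum, maximum).
# A maximum of None means "no upper bound".
RULES = {
    "unique id": (1, 1),
    "Person-marker": (1, None),
    "Link-marker": (0, None),
    "Comment-marker": (0, 1),
}


def check_counts(names, rules=RULES):
    """Return a list of rule violations for one record.

    `names` is the sequence of marker and field names seen in the record.
    """
    counts = Counter(names)
    problems = []
    for name, (lo, hi) in rules.items():
        n = counts[name]  # Counter gives 0 for names never seen
        if n < lo or (hi is not None and n > hi):
            upper = "unbounded" if hi is None else hi
            problems.append(f"{name}: found {n}, expected {lo}..{upper}")
    return problems
```

Adding a new field would then only mean adding a row to the table, not
changing the checking code.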

Here's what I'd like:

Something that will chew on this file and assemble a data structure one
record at a time (callback on identifying "Record-marker"?), which I can
then shuffle off into some more readily reaccessed place like a database
or an XML file.
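
To make the "one record at a time" idea concrete, here is a minimal
sketch of the specific case, again in Python for illustration only.  It
assumes that "Record-marker" starts a record, that any other line ending
in "-marker" opens a sub-block, and that everything else is a
"key: value" line; the records name and the dict layout are hypothetical.
It reads lazily, so the whole file never has to sit in memory.

```python
RECORD = "Record-marker"  # assumed literal marker that starts a new record


def records(lines):
    """Yield one dict per record, consuming `lines` lazily."""
    record = None
    target = None  # dict currently receiving "key: value" lines
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == RECORD:
            if record is not None:
                yield record  # hand off the finished record
            record = {"fields": {}, "blocks": []}
            target = record["fields"]
        elif record is None:
            continue  # ignore anything before the first record
        elif line.endswith("-marker"):
            # Assumption: every other "-marker" line opens a sub-block.
            block = {}
            record["blocks"].append((line, block))
            target = block
        else:
            key, sep, value = line.partition(":")
            if sep:  # lines without a colon are skipped
                # Repeated keys accumulate in a list, in file order.
                target.setdefault(key.strip(), []).append(value.strip())
    if record is not None:
        yield record  # the last record has no following marker
```

Looping over records(open(path)) would then give one structure per
iteration, ready to be written to a database or serialized as XML.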

I'm easily capable of solving the specific case here, but I'd rather use
someone else's solution of the general problem--a sort of flexible input
template.

And if no such thing exists yet, but there's some encouragement that it
could be developed, I'll do that too.  Any ideas what to call it so that
someone else with this kind of task could pick it out of the Data::,
File::, or Text:: namespaces at CPAN?


j