text-parsing question

Jason Proctor on 23 Oct 2003 13:47:25 -0400

From: Jason Proctor <jproctor@persons.marlboro.edu>

To: phl@lists.pm.org, jax@lists.pm.org

Subject: text-parsing question

Date: Thu, 23 Oct 2003 13:48:11 -0400 (EDT)

Reply-to: phl@lists.pm.org

Sender: owner-phl@lists.pm.org

I don't have my old Data Munging book handy, and I'm looking at a text parsing problem that I know has to be common enough that someone else solved it, but I don't see any obvious candidates on CPAN.

I have a large (several hundred MB) file that looks roughly like this:

  Record-marker
  unique id: 001
  other field 1: abc
  other field 2: xyz
  other field 2: alt-xyz
  Person-marker
  person name: J. Smith
  person addr: 123 Whatever St
  Person-marker
  person name: J. Doe
  person addr: 321 Main St
  Link-marker
  link reference: 005
  link date: 20020202
  Link-marker
  link reference: 013
  link date: 20030303
  Comment-marker
  note: foo
  note: bar
  Record-marker
  unique id: 002
  [...]

This is obviously ripe for some sort of database or XML-derived representation. Each record can have one or more Person-markers, zero or more Link-markers, zero or one Comment-marker (which, if it exists, has one or more notes), and some number of each of the "other field"s, the exact rules for which depend on the nature of the field, but are all of fairly standard character (exactly 1, 0 or 1, 1 or more, etc.) and don't depend on presence, absence, or content of the other fields.

Here's what I'd like: Something that will chew on this file and assemble a data structure one record at a time (callback on identifying "Record-marker"?), which I can then shuffle off into some more readily reaccessed place like a database or an XML file.

I'm easily capable of solving the specific case here, but I'd rather use someone else's solution of the general problem--a sort of flexible input template. And if no such thing exists yet, but there's some encouragement that it could be developed, I'll do that too. Any ideas what to call it so that someone else with this kind of task could pick it out of the Data::, File::, or Text:: namespaces at CPAN?

j
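The record-at-a-time idea described above can be sketched roughly as follows. This is only an illustration in Python, not an existing CPAN module, and it rests on assumptions the post does not confirm: every marker and every "name: value" pair sits on its own line, and the marker names are exactly those in the sample. The names parse_records, check_sections, and SECTION_RULES are made up for the example.

```python
from typing import Iterable, Iterator, List, Optional

# Hypothetical "input template": the marker that starts a record, plus the
# allowed (min, max) count of each sub-section marker; max None = unbounded.
RECORD_MARKER = "Record-marker"
SECTION_RULES = {
    "Person-marker": (1, None),   # one or more
    "Link-marker": (0, None),     # zero or more
    "Comment-marker": (0, 1),     # zero or one
}


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per record, consuming the input one line at a time."""
    record: Optional[dict] = None   # record currently being assembled
    target: Optional[dict] = None   # dict that receives the next field line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == RECORD_MARKER:
            # A new record begins, so the previous one is complete.
            if record is not None:
                yield record
            record = {"fields": {}, "sections": {}}
            target = record["fields"]
        elif line in SECTION_RULES:
            if record is None:
                raise ValueError(f"{line} appears before any {RECORD_MARKER}")
            # Each occurrence of a section marker opens a fresh sub-dict.
            target = {}
            record["sections"].setdefault(line, []).append(target)
        elif ":" in line and target is not None:
            name, value = line.split(":", 1)
            # Fields may repeat, so every value is stored in a list.
            target.setdefault(name.strip(), []).append(value.strip())
        else:
            raise ValueError(f"unexpected line: {line!r}")
    if record is not None:
        yield record


def check_sections(record: dict) -> List[str]:
    """Return a list of section-count violations (empty if none)."""
    problems = []
    for marker, (lo, hi) in SECTION_RULES.items():
        n = len(record["sections"].get(marker, []))
        if n < lo or (hi is not None and n > hi):
            problems.append(f"{marker}: found {n}")
    return problems
```

Because parse_records is a generator, passing it an open file handle keeps only one record in memory at a time, which matters for a file of several hundred MB. Each yielded record could then be validated with check_sections and written to a database or XML file by the caller. Per-field rules (exactly 1, 0 or 1, and so on) are not checked here; they could be added to the template in the same (min, max) style.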
