Gay, Jerry on 23 Oct 2003 13:52:43 -0400
well, Parse::RecDescent is made for this, but perhaps there's something with less overhead that someone else can suggest.

--jerry

-----Original Message-----
From: Jason Proctor [mailto:jproctor@persons.marlboro.edu]
Sent: Thursday, October 23, 2003 1:48 PM
To: phl@lists.pm.org; jax@lists.pm.org
Subject: text-parsing question

I don't have my old Data Munging book handy, and I'm looking at a text parsing problem that I know has to be common enough that someone else solved it, but I don't see any obvious candidates on CPAN.

I have a large (several hundred MB) file that looks roughly like this:

    Record-marker
    unique id: 001
    other field 1: abc
    other field 2: xyz
    other field 2: alt-xyz
    Person-marker
    person name: J. Smith
    person addr: 123 Whatever St
    Person-marker
    person name: J. Doe
    person addr: 321 Main St
    Link-marker
    link reference: 005
    link date: 20020202
    Link-marker
    link reference: 013
    link date: 20030303
    Comment-marker
    note: foo
    note: bar
    Record-marker
    unique id: 002
    [...]

This is obviously ripe for some sort of database or XML-derived representation. Each record can have one or more Person-markers, zero or more Link-markers, zero or one Comment-marker (which, if it exists, has one or more notes), and some number of each of the "other field"s, the exact rules for which depend on the nature of the field, but are all of fairly standard character (exactly 1, 0 or 1, 1 or more, etc.) and don't depend on presence, absence, or content of the other fields.

Here's what I'd like: Something that will chew on this file and assemble a data structure one record at a time (callback on identifying "Record-marker"?), which I can then shuffle off into some more readily reaccessed place like a database or an XML file.

I'm easily capable of solving the specific case here, but I'd rather use someone else's solution of the general problem--a sort of flexible input template. And if no such thing exists yet, but there's some encouragement that it could be developed, I'll do that too.
Any ideas what to call it so that someone else with this kind of task could pick it out of the Data::, File::, or Text:: namespaces at CPAN?

j
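
For illustration only, the record-at-a-time idea described in the quoted message could be sketched roughly as follows. This is a minimal sketch in Python, not Perl, and it is not an existing CPAN module or anything either poster wrote. It assumes each marker sits on its own line and each field is a "name: value" line, as in the sample. The names iter_records, check_record, and SECTION_RULES are hypothetical, and the rule table only encodes the section counts stated in the message (one or more Person-markers, zero or more Link-markers, zero or one Comment-marker).

```python
from typing import Dict, Iterable, Iterator, List

# Marker names come from the sample in the message. The record's own
# top-level fields are stored under the "Record-marker" key.
RECORD_MARKER = "Record-marker"
SECTION_MARKERS = {"Person-marker", "Link-marker", "Comment-marker"}

# Hypothetical template: marker -> (min, max) sections per record.
# None means "no upper bound".
SECTION_RULES = {
    "Person-marker": (1, None),
    "Link-marker": (0, None),
    "Comment-marker": (0, 1),
}

Section = Dict[str, List[str]]      # field name -> every value seen, in order
Record = Dict[str, List[Section]]   # marker name -> its sections, in order


def iter_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one record at a time so the whole file never sits in memory."""
    record = None
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == RECORD_MARKER:
            # A new record starts: hand the finished one to the caller.
            if record is not None:
                yield record
            section = {}
            record = {RECORD_MARKER: [section]}
        elif line in SECTION_MARKERS:
            if record is None:
                raise ValueError(f"{line} seen before any {RECORD_MARKER}")
            section = {}
            record.setdefault(line, []).append(section)
        else:
            if section is None:
                raise ValueError(f"field seen before any marker: {line!r}")
            name, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"not a marker or 'name: value' line: {line!r}")
            # Repeated fields (e.g. "other field 2") accumulate in a list.
            section.setdefault(name.strip(), []).append(value.strip())
    if record is not None:
        yield record


def check_record(record: Record) -> List[str]:
    """Return a list of section-count problems; empty means the record passes."""
    problems = []
    for marker, (lo, hi) in SECTION_RULES.items():
        n = len(record.get(marker, []))
        if n < lo or (hi is not None and n > hi):
            upper = "many" if hi is None else hi
            problems.append(f"{marker}: found {n}, expected {lo}..{upper}")
    return problems
```

Because iter_records is a generator, passing it an open file handle reads the file line by line, and the body of a "for record in iter_records(fh):" loop plays the role of the callback: each finished record can be checked with check_record and then written to a database or XML file before the next one is read. Per-field rules (exactly 1, 0 or 1, 1 or more) could be added as a second table of the same shape; that part is left out here.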