Re: Perl Ripping Event: GNU info file reader

mjd-perl-pm on Mon, 28 Feb 2000 12:53:28 -0500 (EST)

[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: Perl Ripping Event: GNU info file reader

From: mjd-perl-pm@plover.com

To: phl@lists.pm.org

Subject: Re: Perl Ripping Event: GNU info file reader

Date: Mon, 28 Feb 2000 12:53:09 -0500

Reply-to: phl@lists.pm.org

Sender: owner-phl@lists.pm.org

I thought maybe I'd take a shot at this. There are two sorts of things you can say about this program. It was written in Perl 4, and there are a lot of ways that it could and should have been done differently if Perl 5 had been available when it was written. And then there are also some things that should have been done differently even though Perl 5 was not available. The biggest criticism I have is that the entire interface centers around a collection of global variables, which is not what you want a subroutine library to do. If you were writing a program that happened to be using a filehandle named INFO, or a variable named $info_file, $info_node, etc., then you would not be able to use this library without rewriting. I would replace this: > &open_info_file(INFO_FILENAME); with > &open_info_file(INFO_FILEHANDLE, INFO_FILENAME); By changing the definition of open_info_file to read: sub open_info_file { local ($info_fh, $info_filename) = @_; open($info_fh, $info_filename) || die "Couldn't open $info_filename: $!"; return &start_info_file; } Now the user can specify which filehandle to open. I also localized the $info_filename and $info_fh variables. Formerly, opening an info file would clobber the value of $info_filename elsewhere in your program; and this wasn't even documented. I also removed some spurious punctuation. It looks as though the author makes the same mistake that a lot of beginning Perl programmers do: He doesn't understand when it's necessary to quote a variable. The original code had open(INFO, "$info_filename") and the "..." marks here are not doing anything. You use "..." when you want to make a string, but $info_filename is already a string, and doesn't need to be made into one. > Then call > > &get_next_node; (This is an error; the code says `read_next_node', not `get_next_node'.) > > repeatedly to read the next node in the info file; the variables > $info_file > $info_node > $info_prev > $info_next > $info_up > > are set if the corresponding fields appear in the node's > header, and if the node has a menu, it is loaded into > %info_menu. When `get_next-node' returns false, you have > reached end-of-file or there has been an error. Here I want to just have read_next_node() return the information about the node, rather than setting half a dozen global variables. It's easy to see how to do this: ($file, $node, $prev, $next, $up, %menu) = &get_next_node; Depending on how it is to be used, it might also be good to have get_next_node cache the menu in some static variable inside the library, and allow the menu information to be retrieved with a separate function call: ($file, $node, $prev, $next, $up) = &get_next_node; %menu = &get_current_menu; This does not require many changes in the read_next_node function. You just replace the global variables with local variables, and then return the appropriate list at the end. For example, replace > ($info_file) = /File:\s*([^,]*)/; with local ($file) = /File:\s*([^,]*)/; and > return 1; with if (wantarray) { return ($file, $node, $prev, $next, $up, %menu); } else { return 1; } This requires a very small change to start_next_part: The > return 0; has to change to return; in case the functions (which are mutually recursive) are called in list context. Bare `return' yields a false value in both scalar and list context; it would have been more robust to write that instead of `return 0' in any case. I can't make up my mind whether I like the way read_next_node and start_next_part call one another. It's almost as if the author was trying to write coroutines in Perl. I think the decision about whether it was a success or not would have to depend on an attempt to write it some other way; then you'd looks to see which way was simpler, and throw away the version that wasn't. But I can't think of a better way to do it offhand, so I'm going to leave it the way it is. Maybe someone else will revisit this. The other major change I'd make is to reduce the number of uses of eof(). It's very rare to use the eof() function correctly and it's very unusual to really need it, in C or in Perl. I think most people who use eof() are coming from a Pascal world, and Pascal I/O is hamstrung by its eof() function. So although it's possible that the author used it correctly, seeing eof() all over makes me uneasy. > $_ = <INFO>; # Header line > if (eof(INFO)) { > return &start_next_part && &read_next_node; > } Here, for example, I would certainly have written $_ = <INFO>; # Header line if (not defined $_) { return &start_next_part && &read_next_node; } instead. Looking at this, I'm sure it's correct; with the eof() version I wasn't sure. I was worried wbout why both start_next_part and read_next_node were called here, but not further down, and I realized it's because the call further down has already gathered the information from a note and has it ready to return, so it doesn't need to make a recursive call to read_next_node. This is tipping my judgement away from the mututal recursion, but I'm reluctant to make up my mind without seeing it both ways. > ($info_file) = /File:\s*([^,]*)/; > ($info_node) = /Node:\s*([^,]*)/; > ($info_prev) = /Prev:\s*([^,]*)/; > ($info_next) = /Next:\s*([^,]*)/; > ($info_up) = /Up:\s*([^,]*)/; Here's a typical line that this is supposed to parse: File: r4rs.info, Node: Numerical types, Next: Exactness, Prev: Numbers, Up: Numbers I had a moment of doubt about whether I would rewrite this as a single regex, but I think the original author got it right here for several reasons. First, if you used a single regex, it would be unreadable: ($file, $node, $prev, $next, $up) = /File:\s*([^,]*),\s*Node:\s*([^,]*),\s*Next:\s*([^,]*),\s*Prev:\s*([^,]*),\s*Up:\s*([^,]*)/; And remember that /x was not available. Second, the original code is perfectly clear. Finally, it turns out that not all of the items will necessarily be available. for example, the last node in the document has no `Next' field. Fixing this in the one-liner above would require adding a whole bunch of parentheses and ?s, and making the result list into a list of ten variables instead of five (since (?:...) was not available in 1994.) In the next section there's a block with strange control flow: > $_ = <INFO> until /^(\* Menu:|\037)/ || eof(INFO); > if (eof(INFO)) { > return &start_next_part; > } elsif (/^\037/) { > return 1; # end of node, so return success. > } I would rewrite this into a more normal form: while (<INFO>) { return 1 if /^\037/; last if /^\* Menu:/; } return &start_next_part unless defined $_; Hmm. Now that I've done that, I'm not sure that I actually think it's better. I thought that using the super-common perl construction `while (<FILEHANDLE>)' would make the code more familiar, but the control flow inside the loop is rather weird. However, a point in its favor is the very similar control structure in the following block. If I left the original structure, I would replace `eof()' with `!defined($_)' > # read menu Here's a sample menu from an info file: * Menu: * Numerical types:: * Exactness:: * Implementation restrictions:: * Syntax of numerical constants:: * Numerical operations:: * Numerical input and output:: Each item has a title (which is displayed to the user) and a node name (which is the node that the user visits next if they select that menu item.) If the title and node name are different, the menu item looks like this: * The title: The node name. if they're the same (as is often the case) they end in :: as in the examples above. > while (<INFO>) { > return 1 if /^\037/; # end of node, menu is finished, success. > next unless /^\* \S/; # next unless lines is a menu item I like this; it's robust. At this point in the program the author knows that the only thing he or she cares about is finding menu items, and is quite happy to ignore anything that isn't a menu item. > if (/^\* ([^:]*)::/) { > $key = $ref = $1; > } elsif (/^\* ([^:]*):\s*([^.]*)[.]/) { > ($key, $ref) = ($1, $2); > } else { > print STDERR "Couldn't parse menu item\n\t$_"; > next; > } The person writing this program is really funny. They used the regex > if (/^\* ([^:]*)::/) { instead of > if (/^\* (.*)::/) { which also would have worked. But the one they did use is the one I would have written, because it is more efficient. Why did the original author use this? Was it because they didn't realize that the simpler one would work, or were afraid that it wouldn't? Or was it because they knew that the version with ([^:]*) would be more efficient? Based on their use of "$variable" above, I guess it was the former. In the error message, I think I would have included a filename and line number rather than printing the text of the bad menu line verbatim. Otherwise yo umight have trouble finding the bad line. Well, that's enough for one day. I have a strong sense that the code in start_next_part() could be about half the size it is, and I have some comments about how I would change the design if I were writing this program in Perl 5, but I'm going to leave it for a while. I'd be really interested to hear what other people think. **Majordomo list services provided by PANIX <URL:http://www.panix.com>** **To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**

References:

Re: Perl Ripping Event
From: mjd-perl-pm@plover.com

Prev by Date: Re: Tomorrow - Hao!

Next by Date: php arrays

Previous by thread: Re: Perl Ripping Event

Next by thread: Perl Ripping Event

Index(es):

Date

Thread