mjd-perl-pm on Mon, 28 Feb 2000 12:53:28 -0500 (EST)


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: Perl Ripping Event: GNU info file reader


I thought maybe I'd take a shot at this.

There are two sorts of things you can say about this program.  It was
written in Perl 4, and there are a lot of ways that it could and
should have been done differently if Perl 5 had been available when it
was written.  And then there are also some things that should have
been done differently even though Perl 5 was not available.

The biggest criticism I have is that the entire interface centers
around a collection of global variables, which is not what you want 
a subroutine library to do.  If you were writing a program that
happened to be using a filehandle named INFO, or a variable named
$info_file, $info_node, etc., then you would not be able to use this
library without rewriting.

I would replace this:

>         	&open_info_file(INFO_FILENAME);

with

>         	&open_info_file(INFO_FILEHANDLE, INFO_FILENAME);

By changing the definition of open_info_file to read:        

        sub open_info_file {
            local ($info_fh, $info_filename) = @_;
            open($info_fh, $info_filename) 
              || die "Couldn't open $info_filename: $!";
            return &start_info_file;
        }

Now the user can specify which filehandle to open.  I also localized
the $info_filename and $info_fh variables.  Formerly, opening an info
file would clobber the value of $info_filename elsewhere in your
program; and this wasn't even documented.

I also removed some spurious punctuation.  It looks as though the
author makes the same mistake that a lot of beginning Perl programmers
do:  He doesn't understand when it's necessary to quote a variable.
The original code had

             open(INFO, "$info_filename")

and the "..." marks here are not doing anything.  You use "..." when
you want to make a string, but $info_filename is already a string, and
doesn't need to be made into one.



>         Then call 
>         
>         	&get_next_node;

(This is an error; the code says `read_next_node', not `get_next_node'.)

>         
>         repeatedly to read the next node in the info file; the variables
>         	$info_file
>         	$info_node
>         	$info_prev
>         	$info_next
>         	$info_up
>         
>         are set if the corresponding fields appear in the node's
>         header, and if the node has a menu, it is loaded into
>         %info_menu.  When `get_next-node' returns false, you have
>         reached end-of-file or there has been an error.

Here I want to just have read_next_node() return the information about
the node, rather than setting half a dozen global variables.  It's
easy to see how to do this:

          ($file, $node, $prev, $next, $up, %menu) = &get_next_node;

Depending on how it is to be used, it might also be good to have
get_next_node cache the menu in some static variable inside the
library, and allow the menu information to be retrieved with a
separate function call:

          ($file, $node, $prev, $next, $up) = &get_next_node;
          %menu = &get_current_menu;


This does not require many changes in the read_next_node function.
You just replace the global variables with local variables, and then
return the appropriate list at the end.  For example, replace

>     ($info_file) = /File:\s*([^,]*)/;

with

      local ($file) = /File:\s*([^,]*)/;

and 

>     return 1;

with

      if (wantarray) {
        return ($file, $node, $prev, $next, $up, %menu);
      } else {
        return 1;
      }

This requires a very small change to start_next_part:  The

> 	    return 0;

has to change to

            return;

in case the functions (which are mutually recursive) are called in
list context.  Bare `return' yields a false value in both scalar and
list context; it would have been more robust to write that instead of
`return 0' in any case.

I can't make up my mind whether I like the way read_next_node and
start_next_part call one another.  It's almost as if the author was
trying to write coroutines in Perl.  I think the decision about
whether it was a success or not would have to depend on an attempt to
write it some other way; then you'd looks to see which way was
simpler, and throw away the version that wasn't.  But I can't think of
a better way to do it offhand, so I'm going to leave it the way it
is.  Maybe someone else will revisit this.

The other major change I'd make is to reduce the number of uses of
eof().  It's very rare to use the eof() function correctly and it's
very unusual to really need it, in C or in Perl.  I think most people
who use eof() are coming from a Pascal world, and Pascal I/O is
hamstrung by its eof() function.  So although it's possible that the
author used it correctly, seeing eof() all over makes me uneasy.

>     $_ = <INFO>;		# Header line
>     if (eof(INFO)) {
> 	return &start_next_part && &read_next_node;
>     }

Here, for example, I would certainly have written

      $_ = <INFO>;		# Header line
      if (not defined $_) {
  	return &start_next_part && &read_next_node;
      }

instead.  Looking at this, I'm sure it's correct; with the eof()
version I wasn't sure.  

I was worried wbout why both start_next_part and read_next_node were
called here, but not further down, and I realized it's because the
call further down has already gathered the information from a note and
has it ready to return, so it doesn't need to make a recursive call to
read_next_node.  This is tipping my judgement away from the mututal
recursion, but I'm reluctant to make up my mind without seeing it both
ways.



>     ($info_file) = /File:\s*([^,]*)/;
>     ($info_node) = /Node:\s*([^,]*)/;
>     ($info_prev) = /Prev:\s*([^,]*)/;
>     ($info_next) = /Next:\s*([^,]*)/;
>     ($info_up)   = /Up:\s*([^,]*)/;

Here's a typical line that this is supposed to parse:

        File: r4rs.info,  Node: Numerical types,  Next: Exactness,  Prev: Numbers,  Up: Numbers

I had a moment of doubt about whether I would rewrite this as a single
regex, but I think the original author got it right here for several
reasons.  First, if you used a single regex, it would be unreadable:

        ($file, $node, $prev, $next, $up) = 
          /File:\s*([^,]*),\s*Node:\s*([^,]*),\s*Next:\s*([^,]*),\s*Prev:\s*([^,]*),\s*Up:\s*([^,]*)/;

And remember that /x was not available.  Second, the original code is
perfectly clear.  Finally, it turns out that not all of the items will
necessarily be available.  for example, the last node in the document
has no `Next' field.  Fixing this in the one-liner above would require
adding a whole bunch of parentheses and ?s, and making the result list
into a list of ten variables instead of five (since (?:...) was not
available in 1994.)  


In the next section there's a block with strange control flow:

>     $_ = <INFO> until /^(\* Menu:|\037)/ || eof(INFO);
>     if (eof(INFO)) {
> 	return &start_next_part;
>     } elsif (/^\037/) { 
> 	return 1;		# end of node, so return success.
>     }

I would rewrite this into a more normal form:

      while (<INFO>) {
        return 1 if /^\037/;
        last if /^\* Menu:/;
      }
      return &start_next_part unless defined $_;

Hmm.  Now that I've done that, I'm not sure that I actually think it's
better.  I thought that using the super-common perl construction
`while (<FILEHANDLE>)' would make the code more familiar, but the
control flow inside the loop is rather weird.  However, a point in its
favor is the very similar control structure in the following block.
If I left the original structure, I would replace `eof()' with
`!defined($_)'

>     # read menu

Here's a sample menu from an info file:

        * Menu:
        
        * Numerical types::
        * Exactness::
        * Implementation restrictions::
        * Syntax of numerical constants::
        * Numerical operations::
        * Numerical input and output::

Each item has a title (which is displayed to the user) and a node name
(which is the node that the user visits next if they select that menu
item.)  If the title and node name are different, the menu item looks
like this:

        * The title:       The node name.

if they're the same (as is often the case) they end in :: as in the
examples above.


>     while (<INFO>) {    
> 	return 1 if /^\037/;    # end of node, menu is finished, success.
> 	next unless /^\* \S/;   # next unless lines is a menu item

I like this; it's robust.  At this point in the program the author
knows that the only thing he or she cares about is finding menu items,
and is quite happy to ignore anything that isn't a menu item.

> 	if (/^\* ([^:]*)::/) {
> 	    $key = $ref = $1;
> 	} elsif (/^\* ([^:]*):\s*([^.]*)[.]/) {
> 	    ($key, $ref) = ($1, $2);
> 	} else {
> 	    print STDERR "Couldn't parse menu item\n\t$_";
> 	    next;
> 	}

The person writing this program is really funny.  They used the regex

> 	if (/^\* ([^:]*)::/) {

instead of

> 	if (/^\* (.*)::/) {

which also would have worked.  But the one they did use is the one I
would have written, because it is more efficient.  Why did the
original author use this?  Was it because they didn't realize that the
simpler one would work, or were afraid that it wouldn't?  Or was it
because they knew that the version with ([^:]*) would be more
efficient?  Based on their use of "$variable" above, I guess it was
the former.

In the error message, I think I would have included a filename and
line number rather than printing the text of the bad menu line
verbatim.  Otherwise yo umight have trouble finding the bad line.

Well, that's enough for one day.  I have a strong sense that the code
in start_next_part() could be about half the size it is, and I have
some comments about how I would change the design if I were writing
this program in Perl 5, but I'm going to leave it for a while.

I'd be really interested to hear what other people think.
**Majordomo list services provided by PANIX <URL:http://www.panix.com>**
**To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**