Eric Roode on 27 Oct 2020 12:46:15 -0700



[Philadelphia-pm] Unicode BOM in input files


Hello fellow mongers!

    Today I opened and read a file.  Advanced stuff, right?  :-)

open my $fh, '<', 'file.dat' or die "Cannot open file.dat: $!";
my $line = <$fh>;    # first line, read as raw bytes (no encoding layer)
if ($line =~ /^Your data:/) { ... }

    The problem is that the input file begins with a Unicode BOM (byte-order mark) encoded as UTF-8, so the first three bytes of the string are in fact 0xEF, 0xBB, and 0xBF.  So the match fails, even though if you look at the file in an editor, it looks like it begins with "Your data:".  It took me a fair amount of time to figure this out.
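
    To make the failure concrete, here is a minimal sketch that simulates the first line of such a file as raw bytes and shows the match failing until the three BOM bytes are removed.  The helper name strip_utf8_bom and the sample text "Your data: 42" are made up for illustration; this is one possible workaround, not a recommended fix.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) from a
# string that was read without any decoding layer.
sub strip_utf8_bom {
    my ($line) = @_;
    $line =~ s/\A\xEF\xBB\xBF//;
    return $line;
}

# Simulated first line of the file, as raw bytes (sample content is assumed).
my $raw = "\xEF\xBB\xBFYour data: 42\n";

my $before = $raw =~ /^Your data:/ ? "match" : "no match";
my $clean  = strip_utf8_bom($raw);
my $after  = $clean =~ /^Your data:/ ? "match" : "no match";

print "before stripping: $before\n";
print "after stripping:  $after\n";
```

    This only handles the UTF-8 form of the BOM and leaves the rest of the line as undecoded bytes, so it is a stopgap at best.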

    I am shocked that I have never encountered this before.  But I'm even more shocked that Perl doesn't automagically handle this internally.  What the heck, Perl??  This has me re-thinking how I open all text files!  If I am opening text files of unknown encoding, am I expected to read the BOM (if present) and then change the PerlIO encoding via 'binmode' myself?  For each and every input text file I open and read?  That's BS.  I've gotta be missing some obvious step.
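
    For reference, the "read the BOM and then binmode" approach I am describing might look roughly like the sketch below.  The function name open_text_with_bom, the UTF-8 fallback, and the short list of BOMs it recognizes are all my own assumptions; it ignores UTF-32 entirely, and the File::BOM module on CPAN may well be a better packaged version of the same idea.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a file raw, sniff the first bytes for a BOM,
# then push a matching PerlIO :encoding layer. Falls back to $default
# (assumed UTF-8 here) when no BOM is found. UTF-32 BOMs are not handled.
sub open_text_with_bom {
    my ($path, $default) = @_;
    $default //= 'UTF-8';

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 3);
    die "Cannot read $path: $!" unless defined $got;
    $head = '' unless defined $head;

    my ($enc, $skip) = ($default, 0);
    if    ($head =~ /\A\xEF\xBB\xBF/) { ($enc, $skip) = ('UTF-8',    3) }
    elsif ($head =~ /\A\xFF\xFE/)     { ($enc, $skip) = ('UTF-16LE', 2) }
    elsif ($head =~ /\A\xFE\xFF/)     { ($enc, $skip) = ('UTF-16BE', 2) }

    # Position just past the BOM (or back at the start), then decode from there.
    seek($fh, $skip, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($enc)");
    return $fh;
}

# Demo: write a temporary file that starts with a UTF-8 BOM, then read it back.
my ($tmp, $name) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBFYour data: 42\n";
close $tmp or die "Cannot close $name: $!";

my $in   = open_text_with_bom($name);
my $line = <$in>;
close $in;

my $result = $line =~ /^Your data:/ ? "match" : "no match";
print "$result\n";
```

    Having to call something like this for every text file is exactly the overhead I am complaining about, which is why I suspect there is a more standard idiom.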

    Any wisdom from the hive mind would be appreciated.  Thanks!

-- Eric Roode

_______________________________________________
Philadelphia-pm mailing list
Philadelphia-pm@pm.org
https://mail.pm.org/mailman/listinfo/philadelphia-pm