Brian Duggan on 18 Nov 2016 04:44:10 -0800
Re: [Philadelphia-pm] selective splitting?
Hi All,

I'll go ahead and throw in a perl 6 solution:

    $ cat split.pl
    #!/usr/bin/env perl6

    my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

    grammar log {
        token TOP      { [ <outer> | <balanced> ]+ %% ';' }
        token outer    { <-[;()]>+ }
        token inner    { <-[()]>+ }
        token balanced { [ <outer>? '(' <inner> ')' <outer>? ] + }
    }

    say log.parse($v);

And the output:

    $ perl6 split.pl
    「20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14」
     outer => 「20161116172606Z」
     outer => 「accepted-terms-of-use via CAS」
     outer => 「192.168.1.5」
     balanced => 「Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14」
      outer => 「Mozilla/5.0 」
      inner => 「Macintosh; Intel Mac OS X 10_12_1」
      outer => 「 AppleWebKit/602.2.14 」
      inner => 「KHTML, like Gecko」
      outer => 「 Version/10.0.1 Safari/602.2.14」

Brian

On Friday, November 18, Morgan Jones wrote:
> Nate,
>
> That’s an elegant and simple solution, thanks. It’s also much more readable than what I was working on. I’ll integrate it tomorrow.
>
> -morgan
>
>
>
> > On Nov 17, 2016, at 21:40, Nate Smith <nate@perlhack.com> wrote:
> >
> >
> > Hi Morgan,
> >
> > I totes agree re: peer review!
> >
> > Lookaround assertions are what I'd reach for first for your problem, too, but I think they fall short:
> >
> > my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';
> > my @naive_parts = split /;/, $v;
> > my @parts = split /(?<!\(.+);(?!.+\))/, $v;
> > map { print "$_\n" } @parts;
> >
> > If you run that, it'll say
> >
> > Variable length lookbehind not implemented in regex m/(?<!\(.+);(?!.+\))/
> >
> > So my understanding is that the RE engine can't validate a variable width look-behind assertion, though I don't know why.
> >
> > Workarounds people have come up with are using the '\K' escape (see perldoc perlre), or reversing the string and doing a look-ahead instead!
> >
> > I've never used the '\K' method and don't understand it. Reversing the string won't work for you b/c you want both look-ahead /and/ look-behind in the same re.
> >
> > Given all of that, my brain wants to treat this as a two step process like a compiler might.
> >
> > 1) using either another regex or the range operator[s], substitute a placeholder for all the semicolons that are inside parens
> > 2) perform your split with a dead simple split regex, /;/
> > 3) replace the placeholders with semicolons on each part after it's been split
> >
> > See attached sample code!
> >
> > Cheers,
> > Nate
> >
> > PS Nice meeting you all on Monday!
> >
> > On Thu, Nov 17, 2016 at 08:40:37PM -0500, Morgan Jones wrote:
> >> mjd’s talk Monday has me thinking about peer review and how helpful it can be. So here goes. I can certainly work around this but as a learning experience I’m wondering if someone has a straightforward answer. Can I split on only instances of a character that is not surrounded by in this case parentheses?
> >>
> >> I have a semicolon separated string that contains a date, a string, an ip address and a user agent string.
> >> The catch is the user agent string contains a semicolon however it’s between parentheses. So what I want is to split on semicolons that are not surrounded by parentheses.
> >>
> >> For example:
> >> $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';
> >>
> >> It seems to me I should be able to split like this:
> >> my ($date, $ignore, $ip, $agent) = split /[^\(]+[^\;]*\;[^\)]*[^\)]+/, $v;
> >>
> >> From a little reading I may need to use look aheads which are new to me. Here’s an attempt at that that is of course not working:
> >> my ($date, $ignore, $ip, $agent) =
> >> split /(?<!()
> >> \;
> >> (?!))/x, $v;
> >>
> >>
> >> Does anyone have a suggestion or see what I’m missing?
> >>
> >> thanks,
> >>
> >> -morgan
> >> _______________________________________________
> >> Philadelphia-pm mailing list
> >> Philadelphia-pm@pm.org
> >> http://mail.pm.org/mailman/listinfo/philadelphia-pm
> > <morgan.pl.txt>
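
The attachment Nate mentions (morgan.pl.txt) is not part of this archive, so what follows is not his code. It is only a minimal Perl 5 sketch of the three steps he lists (mask the semicolons inside parens, split on /;/, restore the semicolons). It assumes the parentheses are never nested and that the placeholder byte \x1F never occurs in the log line. The sub name split_outside_parens is made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split a line on semicolons that are not inside
# parentheses. Assumes parens are not nested and \x1F is absent from the data.
sub split_outside_parens {
    my ($line) = @_;

    # Step 1: inside each (...) group, swap every ';' for the placeholder.
    (my $masked = $line) =~ s{(\([^()]*\))}{
        my $group = $1;
        $group =~ tr/;/\x1F/;
        $group;
    }ge;

    # Step 2: the remaining semicolons are all field separators.
    my @parts = split /;/, $masked;

    # Step 3: put the original semicolons back in each field.
    tr/\x1F/;/ for @parts;

    return @parts;
}

my $v = '20161116172606Z;accepted-terms-of-use via CAS;192.168.1.5;'
      . 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 '
      . '(KHTML, like Gecko) Version/10.0.1 Safari/602.2.14';

my ($date, $ignore, $ip, $agent) = split_outside_parens($v);
print "$_\n" for $date, $ignore, $ip, $agent;
```

With the sample line from the thread this should print four fields, the last being the whole user agent string with its semicolon intact. If the parentheses could nest, a simple character class like [^()]* would no longer be enough and a real parser (or the grammar approach above) would be the safer choice.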