Parsing Made Easyish
A Guide to Parsing with Marpa
http://cleverdomain.org/opw-marpa/
About Me
Parsing Quick Intro
- Taking structured textual or binary data
- And extracting information from it.
- Traditionally two or three phases
- Lexing (Scanning / Evaluating)
- Parsing
- Produces a parse tree, which can be turned into something useful.
Story Time
- I’ve always had a fear of lexing.
- How do you know exactly what you’re looking at, without context?
- Often you don’t.
Lexerless Parsing
- Some parsers don’t make you write a lexer:
- Parse::RecDescent
- Regexp::Grammars
- Parser::MGC
- Recursive descent
- backtracking
- is usually slow.
Enter Marpa
- A new parsing engine for Perl and C released in 2011.
- Based on the Earley algorithm.
- Very powerful
- Parses all context-free languages.
- No problem with left-recursion.
- And fast
- Parses a large class of languages in linear time.
- XS implementation.
My Favorite Feature of Marpa
- Marpa keeps track of multiple possible parses in parallel.
- At any time, you can ask it what tokens are expected right now.
- “Ruby slippers”: give the parser the token it wants to see.
Standard Input Model: Earleme-per-Token
while ($input_remaining) {
my $token = $lexer->next_token($input);
unless ($parser->read($token)) {
die "Parse error!";
}
}
my $value_ref = $parser->value;
unless ($value_ref) {
die "Parse error!";
}
return $$value_ref;
Earleme-per-Character Model
while ($pos < length $input) {
my $expected = $parser->expected_terminals;
for my $token_name (@$expected) {
if (my $token = $lexer->match($token_name, $input, $pos)) {
$parser->alternative($token);
}
}
try { $parser->earleme_complete } catch { die "Parse error!" };
$pos++;
}
my $value_ref = $parser->value;
unless ($value_ref) {
die "Parse error!";
}
return $$value_ref;
So there’s still a lexer?
- Well, yeah.
- But its job is easier.
- Say whether a given thing starts at a given place.
- Marpa is managing context for us.
- Longest token rule is now optional.
Code Sample: MarpaX::Lex::Easy
TOKEN: for my $token_name (@expected) {
my $token = $tokens->{$token_name};
die "Unknown token $token_name" unless defined $token;
next if $token eq 'passthrough';
my $rule = $token->[0];
pos($input) = $pos;
next TOKEN unless $input =~ $rule;
my $matched_len = $+[0] - $-[0];
my $matched_value = undef;
if (defined( my $val = $token->[1] )) {
if (ref $val eq 'CODE') {
eval { $matched_value = $val->(); 1 } || do { next TOKEN };
} else {
$matched_value = $val;
}
} elsif ($#- > 0) {
$matched_value = $1;
}
push @matches, [ $token_name, \$matched_value, $matched_len + $whitespace_consumed ];
}
return @matches;
Math Parser: Grammar
my $grammar = Marpa::R2::Grammar->new({
actions => 'Math::Parser::Actions',
start => 'Expression',
rules => q{
Expression ::= NUMBER action => number
| (LPAREN) Expression (RPAREN) action => parens assoc => group
|| Expression (ASTERISK) Expression action => multiply
| Expression (SLASH) Expression action => divide
|| Expression (PLUS) Expression action => add
| Expression (MINUS) Expression action => subtract
},
});
$grammar->precompute;
Math Parser: Tokens
my %tokens = (
'LPAREN' => [ qr/\G[(]/ ],
'RPAREN' => [ qr/\G[)]/ ],
'ASTERISK' => [ qr/\G[*]/ ],
'SLASH' => [ qr#\G[/]# ],
'PLUS' => [ qr/\G[+]/ ],
'MINUS' => [ qr/\G[-]/ ],
'NUMBER' => [ qr/\G([-+]?[0-9.]+(?:e[+-]?[0-9]+)?)/, sub { 0 + $1 } ],
);
Math Parser: Parse Method
method parse_line ($line) {
my $rec = Marpa::R2::Recognizer->new({
grammar => $grammar,
ranking_method => 'rule',
});
my $lex = MarpaX::Lex::Easy->new(
tokens => \%tokens,
recognizer => $rec,
automatic_whitespace => 1,
whitespace_pattern => qr/\G\s+/,
);
return $lex->read_and_parse($line);
}
Math Parser: Actions
package Math::Parser::Actions;
sub parens { $_[1] }
sub multiply { $_[1] * $_[2] }
sub divide { $_[1] / $_[2] }
sub add { $_[1] + $_[2] }
sub subtract { $_[1] - $_[2] }
sub number { $_[1] }
Bonus Technique: Two Parsers
- TAP::Spec::Parser is live on CPAN
- Line parser
- Earleme-per-character
- Matches any single line of TAP
- Stream parser
- Earleme-per-token
- Uses line parser’s output as input tokens
- Matches an entire TAP document
Two Parsers Cont.
- Chunking makes TAP’s “Junk Line” behavior easy
- Any line that’s not a valid line of TAP is ignored
- But a valid line in an invalid position is an error
- The error reporting is better too
Expected Result|Comment, found Plan at...
TAP: Line Parser
method parse_line ($line) {
my $rec = Marpa::R2::Recognizer->new(
{ grammar => $self->line_grammar, ranking_method => 'rule' });
for my $pos (0 .. length($line) - 1) {
my $expected_tokens = $rec->terminals_expected;
if (@$expected_tokens) {
my @matching_tokens = $self->lex(\$line, $pos, $expected_tokens);
$rec->alternative( @$_ ) for @matching_tokens;
}
my $ok = eval { $rec->earleme_complete; 1 };
if (!$ok) {
return [ 'Junk_Line', $line ];
}
}
$rec->end_input;
return ${$rec->value};
}
TAP: Stream Parser
method parse {
my $rec = Marpa::R2::Recognizer->new(
{ grammar => $self->stream_grammar, ranking_method => 'rule' });
my $reader = $self->reader;
while (defined( my $line = $reader->() )) {
my $line_token = $self->parse_line($line);
unless (defined $rec->read(@$line_token)) {
my $expected = $rec->terminals_expected;
die "Parse error, expecting [@$expected], got $line_token->[0]";
}
}
$rec->read('EOF');
return ${$rec->value};
}
The Future
- I wanted to say that MarpaX::Lex::Easy is on CPAN and you should use it.
- But more exciting things are in progress.
Watch This Space
MarpaX::Lex::Easy
(me)
Marpa::R2::Scanless
(Jeffrey & Peter)
MarpaX::Repa
(Ruslan)
Marpa::R3
(Marpa Team)
Thanks!