Parsing Made Easyish
A Guide to Parsing with Marpa
http://cleverdomain.org/yapc2014/
About Me
Parsing Quick Intro
- Taking structured textual or binary data
- And extracting information from it.
- Traditionally two or three phases
- Lexing (Scanning / Evaluating)
- Parsing
- Produces a parse tree, which can be turned into something useful.
Perl Parsing Tools
Yapp is based on Yacc, which might be better known today as GNU Bison
Marpa
- A newer library (first released 2011, actively developed).
- Uses the Earley parsing algorithm.
- Handles all kinds of grammar without fear.
- Can modify grammar at runtime.
- Has advanced error reporting.
- Fast (C + Perl).
- Lots of features!
I am:
- Not a Parsing Expert.
- Not a Marpa Expert.
- More of an evangelist.
i.e., from the Greek, a bringer of good news. I think
Marpa is pretty good news.
A Simple Parser
my $grammar = Marpa::R2::Scanless::G->new({
source => \q{
:default ::= action => [values]
lexeme default = latm => 1
Expression ::= Number bless => Number
| ('(') Expression (')') action => ::first assoc => group
|| Expression ('*') Expression bless => Multiply
| Expression ('/') Expression bless => Divide
|| Expression ('+') Expression bless => Add
| Expression ('-') Expression bless => Subtract
Number ~ [\d]+
Whitespace ~ [\s]+
:discard ~ Whitespace
},
bless_package => 'Math',
});
Rule Definitions
Addition ::= Expression '+' Expression
- Rule name on the left
::=
defines a rule
- A list of rule names, token names, and literals on the right.
Alternatives
Expression ::= Expression '*' Expression
| Expression '/' Expression
|| Expression '+' Expression
| Expression '-' Expression
|
divides two definitions of a rule with equal priority.
||
divides definitions with higher priority from definitions with lower priority.
- Defining a rule multiple times is the same as
|
.
Non-capturing Parentheses
Addition ::= Expression ('+') Expression
- Parentheses cause what a subrule matches to be ignored and not passed to actions.
- Pretty much the opposite of Perl.
Tokens
Number ~ [\d]+
- Use a very limited regex syntax:
- Character classes
+
- Concatenation
- Can refer to other tokens
- Literal strings like
'+'
are just auto-generated tokens.
Actions
- Subroutines that get passed the values matched by subrules and tokens.
- (Except the ones in parentheses).
- Determine the value of the current rule.
[values]
is a good default that just returns an arrayref.
Blessing
Expression ::= Number ('+') Number action => [values] bless => Add
- Match:
1 + 2
- Returns
bless [ 1, 2 ], 'SomeClass::Add'
bless_package
in the grammar constructor controls SomeClass
Displaying Parse Tree
while (<>) {
my $rec = Marpa::R2::Scanless::R->new({
grammar => $grammar
});
$rec->read(\$_);
my $ret = $rec->value();
if (defined $ret) {
&p($$ret);
} else {
print STDERR "Parse error\n";;
}
}
Adding Behavior
sub Math::Number::evaluate { my $self = shift;
return 0 + $self->[0]
}
sub Math::Multiply::evaluate { my $self = shift;
return $self->[0]->evaluate * $self->[1]->evaluate
}
sub Math::Divide::evaluate { my $self = shift;
return $self->[0]->evaluate / $self->[1]->evaluate
}
sub Math::Add::evaluate { my $self = shift;
return $self->[0]->evaluate + $self->[1]->evaluate
}
sub Math::Subtract::evaluate { my $self = shift;
return $self->[0]->evaluate - $self->[1]->evaluate
}
Evaluating Math
while (<>) {
my $rec = Marpa::R2::Scanless::R->new({
grammar => $grammar
});
$rec->read(\$_);
my $ret = $rec->value();
if (defined $ret) {
print $$ret->evaluate, "\n";
} else {
print STDERR "Parse error\n";;
}
}
Other Ways of Adding Behavior
- We used a default
action => [values]
.
- Each rule returns an (optionally blessed) arrayref.
- We can also have custom actions.
- Derive values as we parse,
- Don’t necessarily have a parse tree at all.
- Can take actions like modifying a data structure, or even modifying the
grammar.
Marpa runs the actions when you call value
on the
recognizer, not during parsing. This makes parsing faster.
Custom Actions Example
Number ::= HexNumber action => parse_hex
|| DecimalNumber action => parse_decimal
DecimalNumber ~ [\d]+
HexNumber ~ '0x' HexDigits
HexDigits ~ [0-9A-Fa-f]+
And lots of other stuff…
- Lists & associativity
- More advanced OO support
- Dynamic grammar modification
- Error reporting
My Favorite Feature of Marpa
- Marpa keeps track of multiple possible parses in parallel.
- As long as a parse keeps working, it’s kept alive.
- A parse that can’t continue is pruned.
- If no parses stay alive, that’s a parse error.
- At any time, you can ask Marpa what tokens are expected right now among all
possible parses.
- “Ruby slippers”: give the parser the token it wants to see.
HTML says that when you run into mis-nested HTML,
un-closed tags, etc., the parser should just pretend that it saw what it needs
to to get into a valid state. Marpa supports this.
If your grammar is ambiguous enough that there are
multiple alternatives even at the end of input, Marpa can give you all of them.
Rule ranks can control which parse is preferred.
Story Time
- I’ve always had a fear of lexing.
- How do you know exactly what you’re looking at, without context?
- Often you don’t.
Lexerless Parsing
- Some parsers don’t make you write a lexer:
- Parse::RecDescent
- Regexp::Grammars
- Parser::MGC
- Recursive descent
- backtracking
- is usually slow.
Enter Marpa
- With speculative parsing, why not allow many tokens to match?
- Marpa will tell us which tokens are expected by the grammar at a given point.
- Try matching all of them, and pass all matches to the parser as alternatives.
- Any ambiguity will be resolved at a higher level.
Standard Input Model: Token at a Time
while ($input_remaining) {
my $token = $lexer->next_token($input);
die "Parse error!" unless $parser->read($token);
}
my $value_ref = $parser->value;
die "Parse error!" unless $value_ref;
return $$value_ref;
- When I started using Marpa this was the standard way to use it.
- Provide your own lexer, provide tokens one at a time.
- Lots of parsers work this way.
But I had a weird idea
- I read about Marpa’s features:
- Asking the parser which tokens are expected,
- Multiple alternative tokens at one point,
- Tokens can overlap.
- What if I read a character a time?
Character at a Time with Speculative Lexer
while ($pos < length $input) {
my $expected = $parser->expected_terminals;
for my $token_name (@$expected) {
if (my $token = $lexer->match($token_name, $input, $pos)) {
$parser->alternative($token);
}
}
try { $parser->earleme_complete } catch { die "Parse error!" };
$pos++;
}
my $value_ref = $parser->value;
die "Parse error!" unless $value_ref;
return $$value_ref;
So there’s still a lexer?
- Well, yeah.
- But its job is easier.
- Say whether a given thing starts at a given place.
- Marpa is managing context for us.
- Longest token rule is now optional.
Brute Force Lexer
TOKEN: for my $token_name (@expected) {
my $token = $tokens->{$token_name};
die "Unknown token $token_name" unless defined $token;
my $rule = $token->[0];
pos($input) = $pos;
next TOKEN unless $input =~ $rule;
my $matched_len = $+[0] - $-[0];
my $matched_value = undef;
if (defined( my $val = $token->[1] )) {
if (ref $val eq 'CODE') {
eval { $matched_value = $val->(); 1 } || do { next TOKEN };
} else {
$matched_value = $val;
}
} elsif ($#- > 0) {
$matched_value = $1;
}
push @matches, [ $token_name, \$matched_value, $matched_len + $whitespace_consumed ];
}
return @matches;
So I wrote this thing to prove that it could be done. This
is 99% complete code for a lexer that’s compatible with the parser from 2 slides
ago.
Token Table
my %tokens = (
'LPAREN' => [ qr/\G[(]/ ],
'RPAREN' => [ qr/\G[)]/ ],
'ASTERISK' => [ qr/\G[*]/ ],
'SLASH' => [ qr#\G[/]# ],
'PLUS' => [ qr/\G[+]/ ],
'MINUS' => [ qr/\G[-]/ ],
'NUMBER' => [ qr/\G([-+]?[0-9.]+(?:e[+-]?[0-9]+)?)/, sub { 0 + $1 } ],
);
Another Crazy Idea: Two Parsers
- TAP::Spec::Parser is live on CPAN
- Line parser
- Character-at-a-time
- Matches any single line of TAP
- Stream parser
- Token-at-a-time
- Uses line parser’s output as input tokens
- Matches an entire TAP document
Two Parsers Cont.
- Chunking makes TAP’s “Junk Line” behavior easy
- Any line that’s not a valid line of TAP is ignored
- But a valid line in an invalid position is an error
- The error reporting is better too
Expected Result|Comment, found Plan at...
The Present: Marpa::R2::Scanless
- I showed off my TAP::Spec::Parser to the Marpa mailing list.
- Marpa’s author was intrigued by the idea of nesting parsers.
- Led to Marpa::R2::Scanless
- AKA SLIF (Scanless Interface).
Marpa::R2::Scanless (cont.)
- Parsing rules and tokens in a single grammar.
- Rules have
::=
separating LHS and RHS, tokens have ~
.
- LATM lexing rule is stricter than Lex::Easy, but almost as useful, and faster.
- Warning: LTM is the default! Turn on LATM!
LATM = Longest Acceptable Tokens Match. Only tokens that
are expected by the grammar at a given position are considered. Among those, the
longest one wins. If multiple tie for longest, they’re all chosen as
alternatives. Lex::Easy could be considered “all acceptable tokens match”.
Thanks!