Parsing Made Easyish

A Guide to Parsing with Marpa

http://cleverdomain.org/yapc2014/

About Me

Andrew Rodland
IRC: hobbs
Github: arodland
CPAN: ARODLAND
Employer: Shutterstock

Parsing Quick Intro

Taking structured textual or binary data
And extracting information from it.
Traditionally two or three phases
- Lexing (Scanning / Evaluating)
- Parsing
Produces a parse tree, which can be turned into something useful.

Perl Parsing Tools

Regexes (while, logic, and m/\G.../gc)
Parser::MGC
Yapp
Parse::RecDescent
Regexp::Grammars

Yapp is based on Yacc, which might be better known today as GNU Bison

Marpa

A newer library (first released 2011, actively developed).
Uses the Earley parsing algorithm.
Handles all kinds of grammar without fear.
Can modify grammar at runtime.
Has advanced error reporting.
Fast (C + Perl).
Lots of features!

I am:

Not a Parsing Expert.
Not a Marpa Expert.
More of an evangelist.

i.e., from the Greek, a bringer of good news. I think Marpa is pretty good news.

A Simple Parser

my $grammar = Marpa::R2::Scanless::G->new({
  source => \q{
    :default ::= action => [values]
    lexeme default = latm => 1

    Expression ::= Number                       bless => Number
                |  ('(') Expression (')')       action => ::first assoc => group
                || Expression ('*') Expression  bless => Multiply
                |  Expression ('/') Expression  bless => Divide
                || Expression ('+') Expression  bless => Add
                |  Expression ('-') Expression  bless => Subtract

    Number ~ [\d]+
    Whitespace ~ [\s]+
    :discard ~ Whitespace
  },
  bless_package => 'Math',
});

Rule Definitions

Addition ::= Expression '+' Expression

Rule name on the left
::= defines a rule
A list of rule names, token names, and literals on the right.

Alternatives

Expression ::= Expression '*' Expression
             | Expression '/' Expression
            || Expression '+' Expression
             | Expression '-' Expression

| divides two definitions of a rule with equal priority.
|| divides definitions with higher priority from definitions with lower priority.
Defining a rule multiple times is the same as |.

Non-capturing Parentheses

Addition ::= Expression ('+') Expression

Parentheses cause what a subrule matches to be ignored and not passed to actions.
Pretty much the opposite of Perl.

Tokens

Number ~ [\d]+

Use a very limited regex syntax:
- Character classes
- +
- Concatenation
Can refer to other tokens
Literal strings like '+' are just auto-generated tokens.

Actions

Subroutines that get passed the values matched by subrules and tokens.
- (Except the ones in parentheses).
Determine the value of the current rule.
[values] is a good default that just returns an arrayref.

Blessing

Expression ::= Number ('+') Number  action => [values] bless => Add

Match: 1 + 2
Returns bless [ 1, 2 ], 'SomeClass::Add'
bless_package in the grammar constructor controls SomeClass

Displaying Parse Tree

while (<>) {
  my $rec = Marpa::R2::Scanless::R->new({
    grammar => $grammar
  });

  $rec->read(\$_);
  my $ret = $rec->value();
  if (defined $ret) {
    &p($$ret);
  } else {
    print STDERR "Parse error\n";;
  }
}

Adding Behavior

sub Math::Number::evaluate { my $self = shift;
  return 0 + $self->[0]
}
sub Math::Multiply::evaluate { my $self = shift;
  return $self->[0]->evaluate * $self->[1]->evaluate
}
sub Math::Divide::evaluate { my $self = shift;
  return $self->[0]->evaluate / $self->[1]->evaluate
}
sub Math::Add::evaluate { my $self = shift;
  return $self->[0]->evaluate + $self->[1]->evaluate
}
sub Math::Subtract::evaluate { my $self = shift;
  return $self->[0]->evaluate - $self->[1]->evaluate
}

Evaluating Math

while (<>) {
  my $rec = Marpa::R2::Scanless::R->new({
    grammar => $grammar
  });

  $rec->read(\$_);
  my $ret = $rec->value();
  if (defined $ret) {
    print $$ret->evaluate, "\n";
  } else {
    print STDERR "Parse error\n";;
  }
}

Other Ways of Adding Behavior

We used a default action => [values].
Each rule returns an (optionally blessed) arrayref.
We can also have custom actions.
- Derive values as we parse,
- Don’t necessarily have a parse tree at all.
- Can take actions like modifying a data structure, or even modifying the grammar.

Marpa runs the actions when you call value on the recognizer, not during parsing. This makes parsing faster.

Custom Actions Example

Number ::= HexNumber      action => parse_hex
        || DecimalNumber  action => parse_decimal

DecimalNumber ~ [\d]+
HexNumber     ~ '0x' HexDigits
HexDigits     ~ [0-9A-Fa-f]+

And lots of other stuff…

Lists & associativity
More advanced OO support
Dynamic grammar modification
Error reporting

My Favorite Feature of Marpa

Marpa keeps track of multiple possible parses in parallel.
As long as a parse keeps working, it’s kept alive.
A parse that can’t continue is pruned.
If no parses stay alive, that’s a parse error.
At any time, you can ask Marpa what tokens are expected right now among all possible parses.
“Ruby slippers”: give the parser the token it wants to see.
- Ex: parsing HTML.

HTML says that when you run into mis-nested HTML, un-closed tags, etc., the parser should just pretend that it saw what it needs to to get into a valid state. Marpa supports this.

If your grammar is ambiguous enough that there are multiple alternatives even at the end of input, Marpa can give you all of them. Rule ranks can control which parse is preferred.

Story Time

I’ve always had a fear of lexing.
How do you know exactly what you’re looking at, without context?
Often you don’t.
- Hairy tricks.

Lexerless Parsing

Some parsers don’t make you write a lexer:
- Parse::RecDescent
- Regexp::Grammars
- Parser::MGC
Recursive descent
- backtracking
- is usually slow.

Enter Marpa

With speculative parsing, why not allow many tokens to match?
Marpa will tell us which tokens are expected by the grammar at a given point.
Try matching all of them, and pass all matches to the parser as alternatives.
Any ambiguity will be resolved at a higher level.

Standard Input Model: Token at a Time

while ($input_remaining) {
  my $token = $lexer->next_token($input);
  die "Parse error!" unless $parser->read($token);
}
my $value_ref = $parser->value;
die "Parse error!" unless $value_ref;
return $$value_ref;

When I started using Marpa this was the standard way to use it.
Provide your own lexer, provide tokens one at a time.
Lots of parsers work this way.

But I had a weird idea

I read about Marpa’s features:
- Asking the parser which tokens are expected,
- Multiple alternative tokens at one point,
- Tokens can overlap.
What if I read a character a time?

Character at a Time with Speculative Lexer

while ($pos < length $input) {
  my $expected = $parser->expected_terminals;
  for my $token_name (@$expected) {
    if (my $token = $lexer->match($token_name, $input, $pos)) {
      $parser->alternative($token);
    }
  }
  try { $parser->earleme_complete } catch { die "Parse error!" };
  $pos++;
}
my $value_ref = $parser->value;
die "Parse error!" unless $value_ref;
return $$value_ref;

Like this?

So there’s still a lexer?

Well, yeah.
But its job is easier.
- Say whether a given thing starts at a given place.
Marpa is managing context for us.
Longest token rule is now optional.

Brute Force Lexer

TOKEN: for my $token_name (@expected) {
  my $token = $tokens->{$token_name};
  die "Unknown token $token_name" unless defined $token;
  my $rule = $token->[0];
  pos($input) = $pos;
  next TOKEN unless $input =~ $rule;
  my $matched_len = $+[0] - $-[0];
  my $matched_value = undef;

  if (defined( my $val = $token->[1] )) {
    if (ref $val eq 'CODE') {
      eval { $matched_value = $val->(); 1 } || do { next TOKEN };
    } else {
      $matched_value = $val;
    }
  } elsif ($#- > 0) { # Captured a value
    $matched_value = $1;
  }

  push @matches, [ $token_name, \$matched_value, $matched_len + $whitespace_consumed ];
}
return @matches;

So I wrote this thing to prove that it could be done. This is 99% complete code for a lexer that’s compatible with the parser from 2 slides ago.

Token Table

my %tokens = (
  'LPAREN'   => [ qr/\G[(]/ ],
  'RPAREN'   => [ qr/\G[)]/ ],
  'ASTERISK' => [ qr/\G[*]/ ],
  'SLASH'    => [ qr#\G[/]# ],
  'PLUS'     => [ qr/\G[+]/ ],
  'MINUS'    => [ qr/\G[-]/ ],
  'NUMBER'   => [ qr/\G([-+]?[0-9.]+(?:e[+-]?[0-9]+)?)/, sub { 0 + $1 } ],
);

Another Crazy Idea: Two Parsers

TAP::Spec::Parser is live on CPAN
Line parser
- Character-at-a-time
- Matches any single line of TAP
Stream parser
- Token-at-a-time
- Uses line parser’s output as input tokens
- Matches an entire TAP document

Two Parsers Cont.

Chunking makes TAP’s “Junk Line” behavior easy
- Any line that’s not a valid line of TAP is ignored
- But a valid line in an invalid position is an error
The error reporting is better too
- Expected Result|Comment, found Plan at...

The Present: Marpa::R2::Scanless

I showed off my TAP::Spec::Parser to the Marpa mailing list.
Marpa’s author was intrigued by the idea of nesting parsers.
Led to Marpa::R2::Scanless
- AKA SLIF (Scanless Interface).

Marpa::R2::Scanless (cont.)

Parsing rules and tokens in a single grammar.
Rules have ::= separating LHS and RHS, tokens have ~.
LATM lexing rule is stricter than Lex::Easy, but almost as useful, and faster.
- Warning: LTM is the default! Turn on LATM!

LATM = Longest Acceptable Tokens Match. Only tokens that are expected by the grammar at a given position are considered. Among those, the longest one wins. If multiple tie for longest, they’re all chosen as alternatives. Lex::Easy could be considered “all acceptable tokens match”.