Parsing Made Easyish

A Guide to Parsing with Marpa

http://cleverdomain.org/yapc2014/

About Me

Parsing Quick Intro

Perl Parsing Tools

Yapp is based on Yacc, which might be better known today as GNU Bison

Marpa

I am:

i.e., from the Greek, a bringer of good news. I think Marpa is pretty good news.

A Simple Parser

my $grammar = Marpa::R2::Scanless::G->new({
  source => \q{
    :default ::= action => [values]
    lexeme default = latm => 1

    Expression ::= Number                       bless => Number
                |  ('(') Expression (')')       action => ::first assoc => group
                || Expression ('*') Expression  bless => Multiply
                |  Expression ('/') Expression  bless => Divide
                || Expression ('+') Expression  bless => Add
                |  Expression ('-') Expression  bless => Subtract

    Number ~ [\d]+
    Whitespace ~ [\s]+
    :discard ~ Whitespace
  },
  bless_package => 'Math',
});

Rule Definitions

Addition ::= Expression '+' Expression

Alternatives

Expression ::= Expression '*' Expression
             | Expression '/' Expression
            || Expression '+' Expression
             | Expression '-' Expression

Non-capturing Parentheses

Addition ::= Expression ('+') Expression

Tokens

Number ~ [\d]+

Actions

Blessing

Expression ::= Number ('+') Number  action => [values] bless => Add

Displaying Parse Tree

while (<>) {
  my $rec = Marpa::R2::Scanless::R->new({
    grammar => $grammar
  });

  $rec->read(\$_);
  my $ret = $rec->value();
  if (defined $ret) {
    &p($$ret);
  } else {
    print STDERR "Parse error\n";;
  }
}

Adding Behavior

sub Math::Number::evaluate { my $self = shift;
  return 0 + $self->[0]
}
sub Math::Multiply::evaluate { my $self = shift;
  return $self->[0]->evaluate * $self->[1]->evaluate
}
sub Math::Divide::evaluate { my $self = shift;
  return $self->[0]->evaluate / $self->[1]->evaluate
}
sub Math::Add::evaluate { my $self = shift;
  return $self->[0]->evaluate + $self->[1]->evaluate
}
sub Math::Subtract::evaluate { my $self = shift;
  return $self->[0]->evaluate - $self->[1]->evaluate
} 

Evaluating Math

while (<>) {
  my $rec = Marpa::R2::Scanless::R->new({
    grammar => $grammar
  });

  $rec->read(\$_);
  my $ret = $rec->value();
  if (defined $ret) {
    print $$ret->evaluate, "\n";
  } else {
    print STDERR "Parse error\n";;
  }
}

Other Ways of Adding Behavior

Marpa runs the actions when you call value on the recognizer, not during parsing. This makes parsing faster.

Custom Actions Example

Number ::= HexNumber      action => parse_hex
        || DecimalNumber  action => parse_decimal

DecimalNumber ~ [\d]+
HexNumber     ~ '0x' HexDigits
HexDigits     ~ [0-9A-Fa-f]+

And lots of other stuff…

My Favorite Feature of Marpa

HTML says that when you run into mis-nested HTML, un-closed tags, etc., the parser should just pretend that it saw what it needs to to get into a valid state. Marpa supports this.
If your grammar is ambiguous enough that there are multiple alternatives even at the end of input, Marpa can give you all of them. Rule ranks can control which parse is preferred.

Story Time

Lexerless Parsing

Enter Marpa

Standard Input Model: Token at a Time

while ($input_remaining) {
  my $token = $lexer->next_token($input);
  die "Parse error!" unless $parser->read($token);
}
my $value_ref = $parser->value;
die "Parse error!" unless $value_ref;
return $$value_ref;

But I had a weird idea

Character at a Time with Speculative Lexer

while ($pos < length $input) {
  my $expected = $parser->expected_terminals;
  for my $token_name (@$expected) {
    if (my $token = $lexer->match($token_name, $input, $pos)) {
      $parser->alternative($token);
    }
  }
  try { $parser->earleme_complete } catch { die "Parse error!" };
  $pos++;
}
my $value_ref = $parser->value;
die "Parse error!" unless $value_ref;
return $$value_ref;

So there’s still a lexer?

Brute Force Lexer

TOKEN: for my $token_name (@expected) {
  my $token = $tokens->{$token_name};
  die "Unknown token $token_name" unless defined $token;
  my $rule = $token->[0];
  pos($input) = $pos;
  next TOKEN unless $input =~ $rule;
  my $matched_len = $+[0] - $-[0];
  my $matched_value = undef;

  if (defined( my $val = $token->[1] )) {
    if (ref $val eq 'CODE') {
      eval { $matched_value = $val->(); 1 } || do { next TOKEN };
    } else {
      $matched_value = $val;
    }
  } elsif ($#- > 0) { # Captured a value
    $matched_value = $1;
  }

  push @matches, [ $token_name, \$matched_value, $matched_len + $whitespace_consumed ];
}
return @matches;
So I wrote this thing to prove that it could be done. This is 99% complete code for a lexer that’s compatible with the parser from 2 slides ago.

Token Table

my %tokens = (
  'LPAREN'   => [ qr/\G[(]/ ],
  'RPAREN'   => [ qr/\G[)]/ ],
  'ASTERISK' => [ qr/\G[*]/ ],
  'SLASH'    => [ qr#\G[/]# ],
  'PLUS'     => [ qr/\G[+]/ ],
  'MINUS'    => [ qr/\G[-]/ ],
  'NUMBER'   => [ qr/\G([-+]?[0-9.]+(?:e[+-]?[0-9]+)?)/, sub { 0 + $1 } ],
);

Another Crazy Idea: Two Parsers

Two Parsers Cont.

The Present: Marpa::R2::Scanless

Marpa::R2::Scanless (cont.)

LATM = Longest Acceptable Tokens Match. Only tokens that are expected by the grammar at a given position are considered. Among those, the longest one wins. If multiple tie for longest, they’re all chosen as alternatives. Lex::Easy could be considered “all acceptable tokens match”.

Marpa Community

Thanks!