Parsing Made Easyish

A Guide to Parsing with Marpa

http://cleverdomain.org/opw-marpa/

About Me

Andrew Rodland
IRC: hobbs
Github: arodland
CPAN: ARODLAND
Employer: Shutterstock

Parsing Quick Intro

Taking structured textual or binary data
And extracting information from it.
Traditionally two or three phases
- Lexing (Scanning / Evaluating)
- Parsing
Produces a parse tree, which can be turned into something useful.

Perl Parsing Tools

Regexes (while, logic, and m/\G.../gc)
Parser::MGC
Yapp
Parse::RecDescent
Regexp::Grammars

Story Time

I’ve always had a fear of lexing.
How do you know exactly what you’re looking at, without context?
Often you don’t.
- Hairy tricks.

Lexerless Parsing

Some parsers don’t make you write a lexer:
- Parse::RecDescent
- Regexp::Grammars
- Parser::MGC
Recursive descent
- backtracking
- is usually slow.

Enter Marpa

A new parsing engine for Perl and C released in 2011.
Based on the Earley algorithm.
Very powerful
- Parses all context-free languages.
- No problem with left-recursion.
And fast
- Parses a large class of languages in linear time.
- XS implementation.

My Favorite Feature of Marpa

Marpa keeps track of multiple possible parses in parallel.
At any time, you can ask it what tokens are expected right now.
“Ruby slippers”: give the parser the token it wants to see.
- Ex: parsing HTML.

Standard Input Model: Earleme-per-Token

while ($input_remaining) {
  my $token = $lexer->next_token($input);
  unless ($parser->read($token)) {
    die "Parse error!";
  }
}
my $value_ref = $parser->value;
unless ($value_ref) {
  die "Parse error!";
}
return $$value_ref;

Earleme-per-Character Model

while ($pos < length $input) {
  my $expected = $parser->expected_terminals;
  for my $token_name (@$expected) {
    if (my $token = $lexer->match($token_name, $input, $pos)) {
      $parser->alternative($token);
    }
  }
  try { $parser->earleme_complete } catch { die "Parse error!" };
  $pos++;
}
my $value_ref = $parser->value;
unless ($value_ref) {
  die "Parse error!";
}
return $$value_ref;

So there’s still a lexer?

Well, yeah.
But its job is easier.
- Say whether a given thing starts at a given place.
Marpa is managing context for us.
Longest token rule is now optional.

Code Sample: MarpaX::Lex::Easy

TOKEN: for my $token_name (@expected) {
  my $token = $tokens->{$token_name};
  die "Unknown token $token_name" unless defined $token;
  next if $token eq 'passthrough';
  my $rule = $token->[0];
  pos($input) = $pos;
  next TOKEN unless $input =~ $rule;
  my $matched_len = $+[0] - $-[0];
  my $matched_value = undef;

  if (defined( my $val = $token->[1] )) {
    if (ref $val eq 'CODE') {
      eval { $matched_value = $val->(); 1 } || do { next TOKEN };
    } else {
      $matched_value = $val;
    }
  } elsif ($#- > 0) { # Captured a value
    $matched_value = $1;
  }

  push @matches, [ $token_name, \$matched_value, $matched_len + $whitespace_consumed ];
}
return @matches;

Math Parser: Grammar

my $grammar = Marpa::R2::Grammar->new({
  actions => 'Math::Parser::Actions',
  start => 'Expression',
  rules => q{
    Expression ::= NUMBER                           action => number
                |  (LPAREN) Expression (RPAREN)     action => parens assoc => group
                || Expression (ASTERISK) Expression action => multiply
                |  Expression (SLASH) Expression    action => divide
                || Expression (PLUS) Expression     action => add
                |  Expression (MINUS) Expression    action => subtract
  },
});
$grammar->precompute;

Math Parser: Tokens

my %tokens = (
  'LPAREN'   => [ qr/\G[(]/ ],
  'RPAREN'   => [ qr/\G[)]/ ],
  'ASTERISK' => [ qr/\G[*]/ ],
  'SLASH'    => [ qr#\G[/]# ],
  'PLUS'     => [ qr/\G[+]/ ],
  'MINUS'    => [ qr/\G[-]/ ],
  'NUMBER'   => [ qr/\G([-+]?[0-9.]+(?:e[+-]?[0-9]+)?)/, sub { 0 + $1 } ],
);

Math Parser: Parse Method

method parse_line ($line) {
  my $rec = Marpa::R2::Recognizer->new({
      grammar => $grammar,
      ranking_method => 'rule',
  });

  my $lex = MarpaX::Lex::Easy->new(
    tokens => \%tokens,
    recognizer => $rec,
    automatic_whitespace => 1,
    whitespace_pattern => qr/\G\s+/,
  );

  return $lex->read_and_parse($line);
}

Math Parser: Actions

package Math::Parser::Actions;

sub parens   { $_[1] }
sub multiply { $_[1] * $_[2] }
sub divide   { $_[1] / $_[2] }
sub add      { $_[1] + $_[2] }
sub subtract { $_[1] - $_[2] }
sub number   { $_[1] }

Bonus Technique: Two Parsers

TAP::Spec::Parser is live on CPAN
Line parser
- Earleme-per-character
- Matches any single line of TAP
Stream parser
- Earleme-per-token
- Uses line parser’s output as input tokens
- Matches an entire TAP document

Two Parsers Cont.

Chunking makes TAP’s “Junk Line” behavior easy
- Any line that’s not a valid line of TAP is ignored
- But a valid line in an invalid position is an error
The error reporting is better too
- Expected Result|Comment, found Plan at...

TAP: Line Parser

method parse_line ($line) {
  my $rec = Marpa::R2::Recognizer->new(
      { grammar => $self->line_grammar, ranking_method => 'rule' });

  for my $pos (0 .. length($line) - 1) {
    my $expected_tokens = $rec->terminals_expected;
    if (@$expected_tokens) {
      my @matching_tokens = $self->lex(\$line, $pos, $expected_tokens);
      $rec->alternative( @$_ ) for @matching_tokens;
    }
    my $ok = eval { $rec->earleme_complete; 1 };
    if (!$ok) {
      return [ 'Junk_Line', $line ];
    }
  }
  $rec->end_input;
  return ${$rec->value};
}

TAP: Stream Parser

method parse {
  my $rec = Marpa::R2::Recognizer->new(
      { grammar => $self->stream_grammar, ranking_method => 'rule' });
  my $reader = $self->reader;

  while (defined( my $line = $reader->() )) {
    my $line_token = $self->parse_line($line);
    unless (defined $rec->read(@$line_token)) {
      my $expected = $rec->terminals_expected;
      die "Parse error, expecting [@$expected], got $line_token->[0]";
    }
  }
  $rec->read('EOF');
  return ${$rec->value};
}

The Future

I wanted to say that MarpaX::Lex::Easy is on CPAN and you should use it.
But more exciting things are in progress.

Watch This Space

MarpaX::Lex::Easy (me)
Marpa::R2::Scanless (Jeffrey & Peter)
MarpaX::Repa (Ruslan)
Marpa::R3 (Marpa Team)

Parsing Made Easyish

A Guide to Parsing with Marpa

About Me

Parsing Quick Intro

Perl Parsing Tools

Story Time

Lexerless Parsing

Enter Marpa

My Favorite Feature of Marpa

Standard Input Model: Earleme-per-Token

Earleme-per-Character Model

So there’s still a lexer?

Code Sample: MarpaX::Lex::Easy

Math Parser: Grammar

Math Parser: Tokens

Math Parser: Parse Method

Math Parser: Actions

Bonus Technique: Two Parsers

Two Parsers Cont.

TAP: Line Parser

TAP: Stream Parser

The Future

Watch This Space

Marpa Community

Thanks!