On Sun, Aug 16, 2015 at 5:58 PM, Holger Freyther <holger@freyther.de> wrote:

Hi,

once again I am not sure if this is the right list. The first parser I wrote using
PetitParser was a SIP (and then MGCP) parser. I have recently ported[1] the
code to Pharo and with Pharo it is very tempting to Use BlockClosure>>#bench
to get an idea of the speed.

I have two performance “issues” and wonder if others hand similar issues with
PetitParser and if there is a general approach to this.

1.) Combining two PPCharsetPredicates does not combine the “classification”
table it had. One could create a PPPredicateObjectParser subclass that is
special casing >>#/ to build a combined classification table.

2.) When blindly following a BNF enumeration of "A or B or C or D or E
or CatchAll” and each “A, B” follow common pattern (e.g. token COLON value)
one pays a high cost in the backtracking and constructing the PPFailure for
each failed case.

In my SIPGrammar I have action parsers for To ==>.. From ==> and would
like to keep that. At the same time I would be happy if the token in front of the
colon is only consumed once and then delegated to the right parser and if that
one failed use the ‘catch all’ one.

I don’t know which abstraction would be needed to allow creating optimized
PetitParsers for such grammars.

sorry for the long mail, long details and context is below.

kind regards
holger

Full details:

1.) CharSetPredicate

| aParser bParser combinedParser aTime bTime cTime |

aParser := #digit asParser.
bParser := #letter asParser.
combinedParser := aParser / bParser.

aTime := [ aParser parse: 'b'] bench.
bTime := [ bParser parse: 'b'] bench.
cTime := [ combinedParser parse: 'b'] bench.
{ aTime. bTime. cTime }

cTime is bounded by the time execution time of of the slowest
of these parsers + overhead (e.g. PPFailure creation).

e.g.

#('559,000 per second.' '1,010,000 per second.' '429,000 per second.')

With a proof of concept PPPredicateCharSetParser

#('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.’)

The noise is pretty string here but what is important is that bParser and the
combinedParser are in the same ballpark.

2.) Choice Parser

The BNF grammar of the parser is roughly:

Request = Request-Line
*( message-header )
CRLF
[ message-body ]

message-header = (Accept
…
/ To
/ From
/ Via
/ extension-header) CRLF

Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param)
Accept = "Accept" HCOLON
[ accept-range *(COMMA accept-range) ]

So there can be several lines of “message-header”. And each method header
starts with a token/word, a colon and then the parameter. “extension-header”
is kind of a catch all if no other rule matched. E.g. if a client sends a To which is
wrongly encoded it would end up with the extension-header.

I transferred the above naively to PetitParser and end up with something like
parsing ~500 messages a second. The main cost appears to come from the
choice parser that needs to create a PetitFailure all the time. E.g. if you have a
line like this:

‘From: “Holger Freyther” <sip:323234@foo.de>’

The choice parser will start with the “Accept” rule, parse the token (“From” and
then create a PPFailure, then … rules, then “To”, parse the token.. So we have
parsing the same token more than once and creating PPFailures all the time. I
ended up creating a custom parser that will peek the token, have a horrible chain
of token = ‘XYZ’ ifTrue and then dispatch to the other rule.

It would be nice if PetitParser could be taught to only parse the token once and
then delegate to the param rule. E.g. a PPAnyOfParser that allows to specify the
token to match, the parser to continue with and a fallback parser?

[1]
http://smalltalkhub.com/#!/~osmocom/SIP
http://smalltalkhub.com/#!/~osmocom/MGCP

www.tudorgirba.com

"Every thing has its own flow"