14.1.2 Manipulating the output of parsing strings and numbers

Let us now tackle the parsing of MSE strings and numbers. A valid string starts and ends with a single-quote character, and in between it may contain any characters other than a single quote:

STRING := ( "'" [^'] * "'" ) +

The tricky part is expressing "any character except a single quote". This is achieved through negate. We specify our parser as a sequence in which the contents of the string are given by the negated single-quote parser, repeated zero or more times:

string := $' asParser , 
$' asParser negate star ,
$' asParser.
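To see what negate contributes, it can help to try the pieces in isolation. The following is a minimal sketch, assuming a Pharo image with PetitParser loaded; the variable name notQuote is only illustrative and does not come from the text.

```smalltalk
| notQuote string |
"negate consumes exactly one character, provided the receiver does not match at that position."
notQuote := $' asParser negate.
notQuote parse: 'a'.                     "--> $a"
(notQuote parse: '''') isPetitFailure.   "--> true, the input is a single quote"

"Opening quote, any run of non-quote characters, closing quote."
string := $' asParser , notQuote star , $' asParser.
string parse: '''abc'''.                 "--> #($' #($a $b $c) $')"
(string parse: 'abc') isPetitFailure.    "--> true, the opening quote is missing"
```

A failed parse does not raise an error; it returns a failure object, which is why the sketch checks the result with isPetitFailure.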

A more challenging task would be to add support for escape characters. For example, in MSE, it is allowed to have the character single-quote if it is escaped by another single-quote. The parser could be written as:

string := $' asParser , 
('''''' asParser / $' asParser negate) star ,
$' asParser.

The contents of the string are now a repetition in which each step consumes either two consecutive single quotes or one character that is not a single quote.

Why are there six single-quotes? This happens because the code is written in Smalltalk, and in Smalltalk strings are also marked with single-quotes. For this reason we have two single-quotes surrounding the Smalltalk string, and inside we need to escape the two single-quotes with two more as required by Smalltalk. In total, we get six.
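The quote counting can be checked directly in a workspace. This small sketch uses only standard string messages and needs no parser at all:

```smalltalk
"Six quote characters in the source denote a string of two quote characters."
'''''' size.            "--> 2"
'''''' first.           "--> $'"

"The test input used below: 'qu''ote' including its own delimiters."
'''qu''''ote''' size.   "--> 9"
```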

We can test the code:

string parse: '''string'''.  "--> #($' #($s $t $r $i $n $g) $')"
string parse: '''qu''''ote'''. "--> #($' #($q $u '''''' $o $t $e) $')"

The grammar works fine, but the result we obtain is not very convenient. PetitParser has two major responsibilities: to consume the input according to a grammar, and to transform it into a desired output. Given that the consuming boils down to a traversal of an input stream, the default result is nothing but a nested collection, where the nesting mirrors the specified grammar. For example, #($' #($s $t $r $i $n $g) $') is a Smalltalk array with three elements:

  • $',
  • #($s $t $r $i $n $g), which is a nested array produced by star, and
  • $'.

Let us produce a more convenient output in the form of a regular Smalltalk string. First, we want the second element in the resulting array to be a flattened string rather than an array of characters. For this we have a convenient flatten operator.

string := $' asParser , 
('''''' asParser / $' asParser negate) star flatten ,
$' asParser.

The new result looks better:

string parse: '''string'''.  "--> #($' 'string' $')"
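The effect of flatten is easiest to see on a smaller parser. A minimal sketch, assuming PetitParser is loaded; the name digits is illustrative:

```smalltalk
| digits |
digits := #digit asParser plus.

"Without flatten, the repetition answers an array of characters."
digits parse: '123'.            "--> #($1 $2 $3)"

"With flatten, the result is the stretch of input that was consumed."
digits flatten parse: '123'.    "--> '123'"
```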

Ultimately, we would want our little parser to ignore the first and the last element and return only the second one. For this, PetitParser offers the possibility to specify a custom transformation through the ==> operator:

string := ($' asParser , 
('''''' asParser / $' asParser negate) star flatten ,
$' asParser)
==> [ :token | token second ].

The transformation operator can be applied to any parser and it takes a block with one argument. The value of the argument is given by the default result of the current parser.

In our case, the token argument holds the array #($' 'string' $') and we simply say we want to return the second element. The result is finally what we want:

string parse: '''string'''. "--> 'string'"
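One detail is worth noting: flatten returns the consumed text verbatim, so an escaped quote comes back still doubled. If the unescaped contents are wanted (an assumption about the intended use, not something the text requires), the transformation block can collapse the pair. The sketch below uses copyReplaceAll:with:, a standard Pharo collection method:

```smalltalk
| string |
string := ($' asParser ,
    ('''''' asParser / $' asParser negate) star flatten ,
    $' asParser)
    ==> [ :token |
        "Replace each escaped pair of quotes with a single quote."
        token second copyReplaceAll: '''''' with: '''' ].

string parse: '''string'''.     "--> 'string'"
string parse: '''qu''''ote'''.  "--> 'qu''ote', that is, the six characters qu'ote"
```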

Now, let us move on and create a parser for numbers that produces Smalltalk numbers. To keep the discussion simple, let us focus on a smaller grammar:

NATURAL := digit +
NUMBER := "-" ? digit + ( "." digit + ) ?

The corresponding parser can look like this:

natural := #digit asParser plus.
number := ($- asParser optional , natural , ($. asParser , natural) optional).

While the above parser is a direct translation of the grammar definition, we can decompose it better to make the several parts more explicit:

natural        := #digit asParser plus. 
decimalPart := ($. asParser , natural).
positiveNumber := natural , decimalPart optional.
negativeNumber := $- asParser , positiveNumber.
number := positiveNumber / negativeNumber.

In the end, we want to produce a number. To achieve this we use the transformation blocks for each parser.

natural        := #digit asParser plus flatten 
==> [:token | token asNumber].
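"Keep the digits as text so that leading zeros (as in .05) still count as decimal places."
decimalPart := ($. asParser , #digit asParser plus flatten)
==> [:token | (token at: 2) asNumber / (10 raisedTo: (token at: 2) size) asFloat ].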
positiveNumber := natural , decimalPart optional
==> [:token | (token at: 1) + ((token at: 2) ifNil: [0]) ].
negativeNumber := $- asParser , positiveNumber
==> [:token | 0 - (token at: 2)] .
number := positiveNumber / negativeNumber.

We start with the natural parser and simply use the asNumber method, available on Smalltalk strings, which transforms the contents of the string into a number. From this point on, whenever the natural parser is used, its output is a Smalltalk number and no longer a string. Using this approach we can build up the result out of fine-grained pieces.

One thing to notice is what happens with an optional part. For example, the positiveNumber has an optional decimalPart. This means that if the decimal part is present, the second element in the token array will hold the value of applying the decimalPart parser to the input, but if the decimal part is missing, the corresponding value will be nil. Thus, we typically have to guard the manipulation of an optional part with a nil check in the transformation block. In our case, we want 0 when no decimal part is specified.
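The nil placeholder can be observed on a tiny parser. A minimal sketch, assuming PetitParser is loaded; the names sign and signed are illustrative:

```smalltalk
| sign signed |
sign := $- asParser optional.
sign parse: '-'.      "--> $-"
sign parse: '5'.      "--> nil, nothing was consumed"

"In a sequence, the optional part still occupies its slot in the result array."
signed := sign , #digit asParser.
signed parse: '-5'.   "--> #($- $5)"
signed parse: '5'.    "--> #(nil $5)"
```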

One way to avoid an optional is to use the choice parser combinator (i.e., /). For example, even though the original grammar has two optional productions (one for the sign and one for the decimal part), the transformation blocks contain only one nil check. This happens because we modeled the optional sign as a choice between a positiveNumber and a negativeNumber. Both approaches have their benefits, so it is important to know that both exist and to choose the one that better fits the problem at hand.
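To make the trade-off concrete, here are the two styles side by side for a signed natural number. This is a sketch under the assumption that PetitParser is loaded; the names viaChoice and viaOptional are hypothetical and the decimal part is left out to keep the comparison short.

```smalltalk
| natural viaChoice viaOptional |
natural := #digit asParser plus flatten ==> [ :token | token asNumber ].

"Choice: each alternative has its own block, and no nil check is needed."
viaChoice := natural
    / (($- asParser , natural) ==> [ :token | (token at: 2) negated ]).

"Optional: a single sequence, but the block has to test the sign slot for nil."
viaOptional := ($- asParser optional , natural)
    ==> [ :token |
        (token at: 1) isNil
            ifTrue: [ token at: 2 ]
            ifFalse: [ (token at: 2) negated ] ].

viaChoice parse: '-12'.     "--> -12"
viaOptional parse: '-12'.   "--> -12"
viaChoice parse: '7'.       "--> 7"
viaOptional parse: '7'.     "--> 7"
```

The choice version spreads the logic over two small blocks, while the optional version keeps one parser but pays for it with a conditional.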

User Contributed Notes

rafatals3ode (13 August 2012, 3:10 am)

how i can compare between 10 MSE file to extract the common and variable part for each product or system

tudor (24 July 2011, 11:56 am)

@stephan: what is the issue with strings?

stephan (23 July 2011, 6:37 pm)

strings

tudor (25 May 2011, 9:00 am)

Thanks. Fixed.

renggli (23 May 2011, 8:05 pm)

'numberPart' is referred to several times, but not defined.