XML mapping DSL (SimpleXO)


Steffen Märcker

1 Nov 2011

6:58 p.m.

Hi,

I am currently working on an XML-to-object mapping library. I came up with an early version for a project, since I had to map weird/complex/poor XML files. I'm in contact with Stéphane Ducasse about a planned Pharo port. He suggested that I ask here for your opinions and ideas on this DSL/library.

SimpleXO consists of two parts: a Smalltalk builder API that constructs a parser object, and a non-Smalltalk syntax meant for external binding configuration. This post is about the API.

SimpleXO uses two concepts: types and mappings. A type describes how to construct an object. It is configured with a class, a constructor and mappings. A mapping defines which type a set of nodes is mapped to. The nodes are given by an expression similar to XPath.

Short example:

    <geo id="1">
      <rect>
        <pos x="2" y="3" />
        <width>4</width>
        <height>5</height>
      </rect>
    </geo>

    "Smalltalk'ish way:"
    builder := SimpleXOParserBuilder new.
    (builder defineElement: 'Rect')
        class: 'Rectangle'; "given as string to postpone resolving"
        mapPath: ('pos' /@ 'x') toType: 'Int';
        mapPath: ('pos' /@ 'y') toType: 'Int';
        mapPath: ('width') toType: 'Int';
        mapPath: ('height') toType: 'Int';
        mapPath: (AnyNodeTest /@ 'id') toType: 'Int'.
    (builder defineCData: 'Int')
        class: 'Integer';
        constructor: #fromString:.
    parser := builder buildParser: 'rect'.
    parser mapNode: xmlNode.

    "this gives the following object:"
    (Rectangle new) x: 2; y: 3; width: 4; height: 5.

Actually we see two kinds of types: element and cdata. The only difference is that elements are constructed using a unary constructor, and cdata with a one-argument message that is sent with the string-value of the current node.

Of course, SimpleXO supports id resolving, collecting values and tokenizing attribute values. You can find further information at https://wiki.aleturo.com/alpha/simplexo:start

I am really interested in your impressions, questions and ideas so far.

Regards,
Steffen
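For readers without a Smalltalk image at hand, the type/mapping idea can be sketched in plain Python with the standard library. This is not SimpleXO and does not reproduce its API: the names Rect, RECT_MAPPINGS and map_rect are invented for this illustration, and the parent-id mapping from the post is left out. Each mapping pairs a path with a target field and a one-argument constructor (the cdata-like part), while the Rect object itself is built with a no-argument constructor (the element-like part).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

XML = """<geo id="1">
  <rect>
    <pos x="2" y="3"/>
    <width>4</width>
    <height>5</height>
  </rect>
</geo>"""


@dataclass
class Rect:
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0


# Hypothetical mapping table:
# (child path, attribute name or None, target field, one-argument constructor)
RECT_MAPPINGS = [
    ("pos", "x", "x", int),
    ("pos", "y", "y", int),
    ("width", None, "width", int),
    ("height", None, "height", int),
]


def map_rect(node):
    rect = Rect()  # "element" type: built without using node data
    for path, attr, field, ctor in RECT_MAPPINGS:
        target = node.find(path)
        raw = target.get(attr) if attr else target.text
        setattr(rect, field, ctor(raw))  # "cdata" type: ctor applied to a string
    return rect


rect = map_rect(ET.fromstring(XML).find("rect"))
print(rect)  # Rect(x=2, y=3, width=4, height=5)
```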


Stéphane Ducasse

3 Nov

8:58 a.m.

Thanks Steffen


Right now I do not have a concrete case, but for Moose I would like to have a library that I can use to map XML to Moose objects.

...

Regards, Steffen

_______________________________________________
Moose-dev mailing list
Moose-dev(a)iam.unibe.ch
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

Norbert Hartl

4 Nov

11:06 a.m.

On 01.11.2011 at 18:58, Steffen Märcker wrote:

...

I don't know the SimpleXO stuff. But something came to my mind: I would be really careful to get the mental mapping right in the first place before mapping the XML :)

Your coarse dimensions are element and cdata. Cdata is associated with primitive types. But then in your "width" mapping you map an element to a native type. That only makes sense if this is a convenient way of writing it.

Basically I would rather think of the schema types SimpleType and ComplexType instead of element and cdata, or of primitive and complex/composed types. An element in XML can be both a SimpleType and a ComplexType. If you then map SimpleType to primitive classes (e.g. Integer) and ComplexType to complex classes (e.g. Rectangle), you might think that the "width" case is a problem. Well, it isn't really a problem. What you actually map is not

    width -> Int

but

    width/text() -> Int

text() retrieves the text nodes from an element. And voila, your mapping is straight again. To make your example above valid again you would need to add auto-coercion, meaning your mapping is still SimpleType -> primitive and ComplexType -> complex type. So:

    width -> Int
    => ComplexType -> primitive
       *coerce width to SimpleType by adding /text()*
    => width/text() -> Int

If we solve the SimpleType -> SimpleType case, what about the ComplexType -> ComplexType case? From your example I cannot see whether all of this is supposed to support recursion. As we fixed the mapping, we should be able to map

    mapPath: ('width') toType: 'Rectangle';

Now we have the ComplexType element width mapping to the ComplexType Rectangle. All that should be needed now is a type lookup that includes all types created via defineElement: and defineCData:.

Hope this adds something,
Norbert
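The auto-coercion and the shared type lookup described above can be sketched as follows. This is an illustration in standard-library Python under stated assumptions, not SimpleXO code: PRIMITIVE_TYPES, COMPLEX_TYPES and map_node are hypothetical names, and complex types are represented as plain dictionaries. A path that targets a primitive type is read as if "/text()" had been appended, and a path that targets a complex type recurses through the same lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical registries: one lookup covers primitive and complex types.
PRIMITIVE_TYPES = {"Int": int}
COMPLEX_TYPES = {
    # type name -> list of (child path, result key, type name)
    "Rect": [("width", "width", "Int"), ("height", "height", "Int")],
    "Geo": [("rect", "rect", "Rect")],
}


def map_node(node, type_name):
    if type_name in PRIMITIVE_TYPES:
        # Auto-coercion: `width -> Int` is treated as `width/text() -> Int`.
        return PRIMITIVE_TYPES[type_name](node.text)
    result = {}
    for path, key, child_type in COMPLEX_TYPES[type_name]:
        # ComplexType -> ComplexType is handled by recursing into the lookup.
        result[key] = map_node(node.find(path), child_type)
    return result


geo = ET.fromstring(
    "<geo><rect><width>4</width><height>5</height></rect></geo>"
)
mapped = map_node(geo, "Geo")
print(mapped)  # {'rect': {'width': 4, 'height': 5}}
```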

Steffen Märcker

7 Nov

8:07 a.m.

Hi Norbert, thanks a lot for your extensive reply.

...

I don't know the simpleXO stuff.

If you like, please load it from the Cincom public repository and play with it a bit. The Pharo port will take some additional time.

...

Cdata is associated with primitive types. But then in your "width" mapping you map an element to a native type. That makes only sense if this a convenient way of writing it.

My idea behind the name 'cdata' was to hint that these types use the string-value (see XPath) of an XML node to build objects. I had in mind that these types should not be restricted to simple (schema) types. E.g. a use case requires extracting the text from an HTML paragraph element:

    <p>This is an <em>emphasized text</em>.</p>
    <div><p>And this a text as well.</p></div>

    "Consider the mapping"
    doc := (builder defineElement: 'Document' class: 'OrderedCollection').
    (doc mapPath: (RootStep // 'p') toType: 'Paragraph') setter: #add:.
    (builder defineCData: 'Paragraph' class: 'Text') constructor: #fromString:.

    "Which gives us"
    (OrderedCollection new)
        add: (Text fromString: 'This is an emph ...');
        add: (Text fromString: 'And this is ...')

What's your opinion on that example? Are 'element' and 'cdata' still misleading? I am not sure whether 'simple' and 'complex' reflect the nature of the types well, since a cdata type may have its own mappings as well:

    text := (builder defineCData: 'Paragraph' class: 'Text') constructor: #fromString:.
    (text mapPath: (AttributeAxis ? 'id') toType: 'String') key: 'p-ids'.

This example uses the value of @id as a key to allow referencing the created object elsewhere. A more sophisticated example could use a parser that parses a given string-value, with the fields of the result being set by SimpleXO later.

I am looking forward to your (and anyone else's) thoughts!

Regards,
Steffen
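As a point of comparison, the XPath string-value of an element (the concatenated text of all its descendants) can be computed with Python's standard library. This is only a sketch of the concept, not of SimpleXO: the <body> wrapper element and the helper name string_value are assumptions added so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The two paragraphs from the post, wrapped in an assumed <body> root.
HTML = """<body>
  <p>This is an <em>emphasized text</em>.</p>
  <div><p>And this a text as well.</p></div>
</body>"""


def string_value(node):
    # XPath string-value of an element: all descendant text, in document order.
    return "".join(node.itertext())


root = ET.fromstring(HTML)
# iter("p") visits every <p> at any depth, similar in spirit to `// 'p'`.
paragraphs = [string_value(p) for p in root.iter("p")]
print(paragraphs)
# ['This is an emphasized text.', 'And this a text as well.']
```

Note that the inline <em> markup disappears from the result, which is the behaviour the 'cdata' types rely on.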

Norbert Hartl

10:49 a.m.

On 07.11.2011 at 08:07, Steffen Märcker wrote:

...

Hi Norbert, thanks a lot for your extensive reply.

I don't know the simpleXO stuff.

If you like, please load it from Cincom public repository and play a bit. The Pharo port will take some additional time.

Cdata is associated with primitive types. But then in your "width" mapping you map an element to a native type. That makes only sense if this a convenient way of writing it.

Well, sort of. I think I'm starting to get it. defineElement: and defineCData: are meant that you get the content _from_ the element or cdata? So defineCData: extracts the text from the Node?

...

From that point of view I think that Cdata is the wrong name anyway. CData and PCData are mainly present in the textual form of XML. After the XML is parsed you usually consider both as text. In PCData, <em> would lead to an element, while in CData it would be the text "<em>". So I think you basically have Element and Text. If the methods were a little bit more intention-revealing, even a person like me would get it :) I mean the builder "builds" "your type" from "another (XML) type".

Did you consider providing a block instead of a constructor: setter? Usually this adds a little complexity, but it opens up a whole set of possible use cases.

...

I am not sure whether 'simple' and 'complex' reflect the nature of the types well, since a cdata type may have its own mappings as well:

    text := (builder defineCData: 'Paragraph' class: 'Text') constructor: #fromString:.
    (text mapPath: (AttributeAxis ? 'id') toType: 'String') key: 'p-ids'.

This example uses the value of @id as a key to allow referencing the created object elsewhere. A more sophisticated example could use a parser that parses a given string-value, with the fields of the result being set by SimpleXO later.

That sounds interesting but I don't get the example. Can you elaborate on this? Or provide a more concrete example.

Norbert


Steffen Märcker

9 Nov

10:25 p.m.

Hi Norbert!

...

Well, sort of. I think I'm starting to get it. defineElement: and defineCData: are meant that you get the content _from_ the element or cdata? So defineCData: extracts the text from the Node?

Yes, that's the idea. #defineElement: and #defineCData: have in common that both create a new object. While #defineElement: uses no data from the node at all, #defineCData: gets the string-value and hands it over to the constructor. (*)

The node at which a type is applied can be viewed as a pivot node from which further processing, namely the type's mappings, starts. Thus, it serves as the context node for the XPath expressions.

(*) Actually, the given class name can refer to any binding. In VW this can be classes (of course), Shared Variables and namespaces as well.

...

From that point of view I think that Cdata is the wrong name anyway. CData and PCData are mainly present in the textual form of XML.

I see your point. =) Perhaps we can go even further and use #defineNode:, because a type can be applied not only to elements but to all kinds of XML nodes. And how about #defineStringValue: instead of #defineText:?

But there is actually another type, available via #defineStruct:. It behaves similarly to the element type but requires that the created objects respond to #at:put:. The default class here is Dictionary. Remember the rectangle example:

    (builder defineStruct: 'Rect')
        mapPath: ('pos' /@ 'x') toType: 'Int'; "... and so on"

Since we use struct now, we get:

    (Dictionary new) at: 'x' put: 2; "... and so on"

This third (and last) type proved to be very useful for rapid prototyping of a mapping. Later in development, the structs can easily be replaced by the actual domain objects.

...

Did you consider instead of having a constructor: setter to provide a block instead. Usually this adds a little complexity but opens a whole set of possible use cases.

Are you thinking of something like a factory block that takes a node and produces a new object? E.g. something like

    (defineFactory: 'Hypothetical' block: [:node | node copy])

This idea looks promising. If it takes the context node, there are indeed plenty of new use cases. =) Perhaps we should allow the binding approach here too, to avoid wrapping existing facilities in blocks, e.g.

    (defineFactory: 'Hypothetical' class: 'MyFactoryClass' call: #processNode:)

This would make it possible to call SimpleXO parsers for specific nodes and thus allow a more modular design in complex situations...
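The two factory variants discussed here (an inline block, and an existing object plus a selector) can be sketched in Python as a registry of callables. Everything in this sketch is hypothetical: define_factory, FACTORIES, PointParser and process_node are illustrative names, and nothing here is part of SimpleXO. The point is only that a lambda and a bound method are interchangeable once the registry stores "something callable with the context node".

```python
import copy
import xml.etree.ElementTree as ET

FACTORIES = {}  # hypothetical registry: type name -> callable(node) -> object


def define_factory(name, factory):
    FACTORIES[name] = factory


# Variant 1: an inline "block" that receives the context node.
define_factory("Copy", lambda node: copy.deepcopy(node))


# Variant 2: an existing facility, referenced via a bound method
# instead of being wrapped in a new block.
class PointParser:
    def process_node(self, node):
        return (int(node.get("x")), int(node.get("y")))


define_factory("Point", PointParser().process_node)

pos = ET.fromstring('<pos x="2" y="3"/>')
point = FACTORIES["Point"](pos)
clone = FACTORIES["Copy"](pos)
print(point)  # (2, 3)
```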

...

[Example on ID resolving]

That sounds interesting but I don't get the example. Can you elaborate on this? Or provide a more concrete example.

The basic idea is the following: A document may contain nodes with ids and other nodes that refer to them by their id. To parse this, we first put all elements with ids in a dictionary at the respective key. Setting referenced values is delayed until all nodes have been parsed. This allows forward references. In fact, a document in general may have several categories of ids, e.g. the attributes 'id' and 'domain:id'. Thus we want to have a separate keychain for each category. When we call #key: or #reference:, the argument is the name of a keychain.

    <ex>
      <list>
        <geo ref="1"/>
        <geo ref="2"/>
      </list>
      <geo id="1">
        <comment value="1"/>
        <rect>
          <pos x="2" y="3"/>
          <width>4</width>
          <height>5</height>
        </rect>
      </geo>
      <geo id="2">
        <comment value="2"/>
        <rect>
          <pos x="6" y="7"/>
          <width>8</width>
          <height>9</height>
        </rect>
      </geo>
      <comments>
        <comment cid="1">First Rectangle</comment>
        <comment cid="2">Second Rectangle</comment>
      </comments>
    </ex>

Now consider:

    rect := builder defineElement: 'Rect' class: 'Rectangle'.
    rect mapPath: ('pos' /@ 'x') toType: 'Int'; "... and so on"
    (rect mapPath: (ParentAxis /@ 'id') toType: 'Int')
        key: 'geo-keychain'.
    (rect mapPath: (ParentAxis / 'comment' /@ 'value') toType: 'Int')
        reference: 'comment-keychain'.

    comment := builder defineCData: 'Comment'.
    (comment mapPath: (AttributeAxis ? 'cid') toType: 'Int')
        key: 'comment-keychain'.

    doc := builder defineElement: 'doc' class: 'Set'.
    (doc mapPath: ('ex' / 'list' / 'geo' /@ 'ref') toType: 'Int')
        reference: 'geo-keychain';
        setter: #add:.
    (doc mapPath: ('ex' / 'geo' / 'rect') toType: 'Rect') transient.
    (doc mapPath: ('ex' / 'comments' / 'comment') toType: 'Comment') transient.

Ignoring my potential typos, we get:

    (Set new)
        add: ((Rectangle new "...") comment: 'First Rectangle');
        add: ((Rectangle new "...") comment: 'Second Rectangle').

Please note #transient in the doc's definition. This setting is used to parse the matched nodes without setting the created objects in their parent. When configuring an id via #key:, the mapping is transient by default, since we rarely want to preserve the XML ids.

Although this example is a bit bigger, I think it illustrates how SimpleXO manages to map a complex XML document straight to a much simpler object tree. Hope this gives you further insights!

Best regards,
Steffen

PS: Using the external DSL, the mapping can be written as follows:

    'element Rect {
        class: Rectangle
        pos/@x >> Int #... and so on
        ../@id >> Int (key: geo-keychain)
        ../comment/@value >> Int (ref: comment-keychain)
    }
    cdata Comment {
        @cid >> Int (key: comment-keychain)
    }
    root element Doc {
        class: Set
        ex/list/geo/@ref >> Int (ref: geo-keychain setter: #add:)
        ex/geo/rect >> Rect (transient)
        ex/comments/comment >> Comment (transient)
    }'
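The keychain mechanism (register objects under a key, record references, resolve them only after the whole document has been visited) can be sketched in standard-library Python. This is an illustration of the idea and not SimpleXO's implementation: register, reference and resolve_all are hypothetical helpers, rectangles are plain dictionaries with only a width, and the result is a list rather than a Set because dictionaries are not hashable.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<ex>
  <list><geo ref="1"/><geo ref="2"/></list>
  <geo id="1"><comment value="1"/><rect><width>4</width></rect></geo>
  <geo id="2"><comment value="2"/><rect><width>8</width></rect></geo>
  <comments>
    <comment cid="1">First Rectangle</comment>
    <comment cid="2">Second Rectangle</comment>
  </comments>
</ex>"""

keychains = defaultdict(dict)  # keychain name -> {key: object}
pending = []                   # deferred (keychain name, key, setter) triples


def register(chain, key, obj):
    keychains[chain][key] = obj


def reference(chain, key, setter):
    # Only record the reference; the target may not have been parsed yet.
    pending.append((chain, key, setter))


def resolve_all():
    for chain, key, setter in pending:
        setter(keychains[chain][key])
    pending.clear()


root = ET.fromstring(XML)
result = []

# <list> precedes the <geo id=...> elements, so these are forward references.
for ref in root.findall("list/geo"):
    reference("geo-keychain", int(ref.get("ref")), result.append)

for geo in root.findall("geo"):
    rect = {"width": int(geo.findtext("rect/width"))}
    register("geo-keychain", int(geo.get("id")), rect)
    comment_id = int(geo.find("comment").get("value"))
    # Bind rect as a default argument so each setter keeps its own rectangle.
    reference("comment-keychain", comment_id,
              lambda text, rect=rect: rect.__setitem__("comment", text))

for comment in root.findall("comments/comment"):
    register("comment-keychain", int(comment.get("cid")), comment.text)

resolve_all()  # all nodes visited; now the references can be set
print(result)
```

The ids themselves never appear in the result, which corresponds to #key: mappings being transient by default.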

Steffen Märcker

10 Nov

8:33 a.m.

It seems I produced a couple of typos in my previous mail. At least this needs to be clarified:

...

Yes, that's the idea. #defineElement: and #defineCData: have in common that both create a new object. While #defineElement: uses no data from the node at all, #defineCData: gets the string-value and hands it over to the constructor.

Ciao, Steffen

On 09.11.2011 at 22:25, Steffen Märcker <merkste(a)web.de> wrote:

...

Hi Norbert!

Well, sort of. I think I'm starting to get it. defineElement: and defineCData: are meant that you get the content _from_ the element or cdata? So defineCData: extracts the text from the Node?

From that point of view I think that Cdata is the wrong name anyway. CData and PCData are mainly present in the textual form of XML.

Did you consider instead of having a constructor: setter to provide a block instead. Usually this adds a little complexity but opens a whole set of possible use cases.

[Example on ID resolving]

That sounds interesting but I don't get the example. Can you elaborate on this? Or provide a more concrete example.
