If your scripts contain string literals with '<script>' or '</script>' in
them (I've seen this before), then your mileage may vary with Tudor's
approach. Also consider that script tags may have attributes, and those
attributes may use single or double quotes. Script tags also may or may
not contain JavaScript: many JavaScript libraries use script tags for HTML
template sources, for instance. These tags you'd probably want to keep (and
perhaps follow the src reference for the third):
<script type='text/javascript'> [code here] </script>
<script type='text/javascript'> document.write('<script
src="somewhere.js"></script>');</script> <!-- here be dragons! -->
<script type='text/javascript' src="path/to/javascript/source.js"></script>
However, something like this you might want to ignore:
<script type='text/html' id='someTemplate'>
<span>{{some template syntax}}</span>
</script>
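To make the pitfalls above concrete, here is a minimal Python sketch of a scan that only knows the literal delimiters '<script>' and '</script>'. It is a rough regex analogue for illustration, not Tudor's PetitParser grammar, and the names NAIVE, naive_scripts and page are made up for this example.

```python
import re

# Matches only the exact opening tag '<script>' (no attributes) and stops
# at the first '</script>' it sees, wherever that occurs.
NAIVE = re.compile(r"<script>(.*?)</script>", re.DOTALL)


def naive_scripts(html):
    """Return the stripped text between literal <script> and </script>."""
    return [body.strip() for body in NAIVE.findall(html)]


page = """
<script>plain();</script>
<script type='text/javascript'>withAttributes();</script>
<script>document.write('<script src="somewhere.js"></script>');</script>
"""

for body in naive_scripts(page):
    print(repr(body))
```

The tag with attributes is skipped entirely, and the document.write body is cut off at the '</script>' that sits inside the string literal.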
If you can make some assumptions about what you're parsing you might be
able to adapt Tudor's solution to be more robust. However, if you're trying
for a general-purpose solution, I'd highly recommend using an existing HTML
parsing library, not an XML parser.
In general, parsing HTML as XML is the wrong approach. HTML is technically
not a subset of XML (closing tags aren't required), so most true XML
parsers are going to barf on it.
Some further reading:
https://en.wikipedia.org/wiki/Tag_soup
https://en.wikipedia.org/wiki/HTML5#XHTML5_.28XML-serialized_HTML5.29
I'm new to Smalltalk, so I can't recommend a library there, but I've used
Tag Soup in Java and Beautiful Soup in Python.
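As a sketch of the parser-library route, the following uses Python's standard-library html.parser, a lenient HTML tokenizer picked here only so the example has no dependencies (it is not Beautiful Soup or Tag Soup). ScriptCollector, extract_scripts and JS_TYPES are hypothetical names, and the set of type values treated as JavaScript is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these type values (or no type at all) mean JavaScript.
JS_TYPES = {"", "text/javascript", "application/javascript", "module"}


class ScriptCollector(HTMLParser):
    """Collect inline JavaScript bodies and external script URLs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.inline = []     # text of inline JavaScript blocks
        self.external = []   # values of src attributes
        self._buffer = None  # a list while inside an inline JS <script>

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attributes = dict(attrs)
        script_type = (attributes.get("type") or "").strip().lower()
        if script_type not in JS_TYPES:
            return  # e.g. text/html templates are ignored
        if attributes.get("src"):
            self.external.append(attributes["src"])
        else:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.inline.append("".join(self._buffer).strip())
            self._buffer = None


def extract_scripts(html):
    """Return (inline_bodies, external_urls) for the given HTML text."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.inline, collector.external


sample = """
<p>a paragraph that is never closed
<script type='text/javascript'> var a = 1; </script>
<script type='text/javascript' src="path/to/javascript/source.js"></script>
<script type='text/html' id='someTemplate'>
<span>{{some template syntax}}</span>
</script>
"""

inline, external = extract_scripts(sample)
print(inline)
print(external)
```

The unclosed paragraph does not upset the parser, the template is skipped because of its type, and single or double quotes around attribute values are handled alike. A string literal containing '</script>' would still end the element here, so that case remains a dragon.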
Hope this helps,
Floyd
On Fri, Aug 14, 2015 at 9:40 AM, Tudor Girba <tudor(a)tudorgirba.com> wrote:
> Hi,
>
> You can also consider using island parsing, this very cool addition to
> PetitParser developed by Jan:
>
> beginScript := '<script>' asParser.
> endScript := '</script>' asParser.
> script := beginScript , endScript negate star flatten , endScript ==> #second.
> islandScripts := (script island ==> #second) star.
>
> If you apply it on:
>
> code := 'uninteresting part
> <script>
> some code
> </script>
> another
> uninteresting part
> <script>
> some other
> code
> </script>
> yet another
> uninteresting part
> '.
>
> You get:
> islandScripts parse: code
> ==> "#('some code' 'some other
> code')"
>
> Quite cool, no? :)
>
> Doru
>
>
> On Fri, Aug 14, 2015 at 1:31 AM, Alexandre Bergel <alexandre.bergel(a)me.com> wrote:
>
>> Hi!
>>
>> Together with Nicolas we are trying to get all the <script …> … </script>
>> elements from HTML files.
>>
>> We have tried to use XMLDOMParser, but many web pages are actually not
>> well formed, so the parser complains.
>>
>> Has anyone tried to get particular tags from HTML files? This looks
>> like a classic thing to do. Maybe some of you have done it.
>>
>> Is there a way to configure the parser to accept broken XML/HTML
>> content?
>>
>> Cheers,
>> Alexandre
>> --
>> _,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
>> Alexandre Bergel http://www.bergel.eu
>> ^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
>>
> --
> www.tudorgirba.com
>
> "Every thing has its own flow"
>
> _______________________________________________
>
> Moose-dev mailing list
>
> Moose-dev(a)iam.unibe.ch
>
> https://www.iam.unibe.ch/mailman/listinfo/moose-dev
>
>
Hi!
I see the project: github.com/jig08/sQuick_new
Can I use it as a replacement for Spotlight on OS X? Is it meant to replace it?
Cheers,
Alexandre
You will probably have a better chance of getting a response on the Moose
mailing list.
Cheers,
Doru
On Sun, Aug 16, 2015 at 5:58 PM, Holger Freyther <holger(a)freyther.de> wrote:
> Hi,
>
> once again I am not sure if this is the right list. The first parser I
> wrote using PetitParser was a SIP (and then MGCP) parser. I have recently
> ported[1] the code to Pharo, and with Pharo it is very tempting to use
> BlockClosure>>#bench to get an idea of the speed.
>
>
> I have two performance “issues” and wonder if others have had similar
> issues with PetitParser and if there is a general approach to this.
>
>
>
> 1.) Combining two PPCharsetPredicates does not combine the “classification”
> tables they hold. One could create a PPPredicateObjectParser subclass that
> special-cases >>#/ to build a combined classification table.
>
>
> 2.) When blindly following a BNF enumeration of “A or B or C or D or E
> or CatchAll” where each of “A, B, ...” follows a common pattern (e.g. token
> COLON value), one pays a high cost for the backtracking and for
> constructing the PPFailure for each failed case.
>
> In my SIPGrammar I have action parsers for To ==>.. From ==> and would
> like to keep that. At the same time I would be happy if the token in front
> of the colon were consumed only once and then delegated to the right
> parser, and if that one failed, the “catch all” one were used.
>
> I don’t know which abstraction would be needed to allow creating optimized
> PetitParsers for such grammars.
>
> Sorry for the long mail; the full details and context are below.
>
>
> kind regards
> holger
>
>
>
>
>
>
> Full details:
>
>
> 1.) CharSetPredicate
>
> | aParser bParser combinedParser aTime bTime cTime |
>
> aParser := #digit asParser.
> bParser := #letter asParser.
> combinedParser := aParser / bParser.
>
> aTime := [ aParser parse: 'b'] bench.
> bTime := [ bParser parse: 'b'] bench.
> cTime := [ combinedParser parse: 'b'] bench.
> { aTime. bTime. cTime }
>
> cTime is bounded by the execution time of the slowest of these parsers
> plus overhead (e.g. PPFailure creation).
>
> e.g.
>
> #('559,000 per second.' '1,010,000 per second.' '429,000 per second.')
>
> With a proof of concept PPPredicateCharSetParser
>
> #('1,330,000 per second.' '1,550,000 per second.' '1,580,000 per second.')
>
> The noise is pretty strong here, but what is important is that bParser and
> the combinedParser are in the same ballpark.
>
> 2.) Choice Parser
>
>
>
> The BNF grammar of the parser is roughly:
>
> Request = Request-Line
> *( message-header )
> CRLF
> [ message-body ]
>
> message-header = (Accept
> …
> / To
> / From
> / Via
> / extension-header) CRLF
>
> Alert-Info = "Alert-Info" HCOLON alert-param *(COMMA alert-param)
> Accept = "Accept" HCOLON
> [ accept-range *(COMMA accept-range) ]
>
>
> So there can be several lines of “message-header”, and each message header
> starts with a token/word, a colon and then the parameter. “extension-header”
> is a kind of catch-all if no other rule matched. E.g. if a client sends a
> To which is wrongly encoded, it would end up as an extension-header.
>
> I transferred the above naively to PetitParser and ended up parsing
> only ~500 messages a second. The main cost appears to come from the
> choice parser, which needs to create a PPFailure all the time. E.g. if
> you have a line like this:
>
> 'From: "Holger Freyther" <sip:323234@foo.de>'
>
> The choice parser will start with the “Accept” rule, parse the token
> (“From”) and then create a PPFailure, then try the … rules, then “To”,
> parse the token again, and so on. So we end up parsing the same token
> more than once and creating PPFailures all the time. I ended up creating
> a custom parser that peeks at the token, runs through a horrible chain
> of token = 'XYZ' ifTrue: checks and then dispatches to the other rule.
>
> It would be nice if PetitParser could be taught to parse the token only
> once and then delegate to the param rule. E.g. a PPAnyOfParser that lets
> you specify the token to match, the parser to continue with and a
> fallback parser?
>
>
>
> [1]
> http://smalltalkhub.com/#!/~osmocom/SIP
> http://smalltalkhub.com/#!/~osmocom/MGCP
>
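The token-dispatch idea from point 2 can be sketched outside PetitParser. The Python sketch below reads the header name once, looks up a value parser by that name, and falls back to a catch-all when there is no specific rule or the specific rule rejects the value. All names (HEADER, HANDLERS, parse_header and the toy value parsers) are hypothetical, the value checks are placeholders rather than real SIP grammar, and nothing here reflects how PetitParser itself would implement such a parser.

```python
import re

# token, colon, raw value: the common shape of every header line
HEADER = re.compile(r"([A-Za-z-]+)\s*:\s*(.*)")


def parse_to(value):
    """Toy stand-in for the To rule and its action."""
    return ("To", value)


def parse_from(value):
    """Toy stand-in for the From rule; rejects values without a SIP URI."""
    if "sip:" not in value:
        raise ValueError("no SIP URI in From value")
    return ("From", value)


def parse_extension(name, value):
    """Catch-all, playing the role of extension-header."""
    return ("extension-header", name, value)


# Header token -> parser for the value part.
# Assumption of this sketch: names are compared case-insensitively.
HANDLERS = {"to": parse_to, "from": parse_from}


def parse_header(line):
    """Parse one header line, reading the leading token only once."""
    match = HEADER.fullmatch(line)
    if match is None:
        raise ValueError(f"not a header line: {line!r}")
    name, value = match.group(1), match.group(2)
    handler = HANDLERS.get(name.lower())
    if handler is not None:
        try:
            return handler(value)
        except ValueError:
            pass  # the specific rule failed, so use the catch-all
    return parse_extension(name, value)


print(parse_header('From: "Holger Freyther" <sip:323234@foo.de>'))
print(parse_header("From: wrongly encoded"))
print(parse_header("X-Custom: 1"))
```

Compared with an ordered choice over all header rules, the lookup replaces the chain of failed alternatives, so a failure object is only built when the one selected rule rejects its value.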
--
www.tudorgirba.com
"Every thing has its own flow"
Hi!
I have noticed a slowdown in Moose. Opening a menu is now particularly slow. Here is a video:
https://dl.dropboxusercontent.com/u/31543901/TMP/slowdown.mov
I have just tried on Pharo 5, and it appears to have a similar problem.
Cheers,
Alexandre
Hi!
I am a bit worried.
The Jenkins build for Moose 6.0 has not been green for ages.
I am using a fresh 6.0, and I get an error when tracing in the debugger:
-=-=-=-=-=-=-=-=-=-=-=-=
DebuggerMethodMapOpal >> tempNamesForContext: aContext
"Answer an Array of all the temp names in scope in aContext starting with
the home's first local (the first argument or first temporary if no arguments)."
^ aContext sourceNode scope allTempNames.
-=-=-=-=-=-=-=-=-=-=-=-=
#scope is sent to nil. A good starting point would be to get Jenkins back to green. Leaving it yellow for too long is not constructive.
cheers,
Alexandre
Hi!
Milton worked on a StackPlot builder.
This is currently still very much a prototype.
Inspect the expression:
RTExperimentalExample new exampleStackOnRoassal
It produces a plot that shows the amount of code in the subclasses of RTShape.
Pretty cool!
Many other examples are contained in the class RTExperimentalExample
It is worth having a look at them.
And yes, it is exportable to HTML.
Cheers,
Alexandre
How is this possible?
aCollection is an array with 1 element (an AdaParameter ...)
but "each" in the do: block is nil (so the add: gives a DNU) !?
I have no idea how this can be possible.
Any clue?
nicolas