If your scripts contain string literals with '<script>' or '</script>' in them (I've seen this before), Tudor's approach may break: it treats the first '</script>' it finds as the end of the script, even when that text sits inside a string. Also consider that script tags may have attributes, and those attributes may use single or double quotes. Script tags also don't necessarily contain JavaScript: many JavaScript libraries use them to hold HTML template sources, for instance. These tags you'd probably want to keep (and, for the third, perhaps follow the src reference):

<script type='text/javascript'> [code here] </script>
<script type='text/javascript'> document.write('<script src="somewhere.js"></script>');</script> <!-- here be dragons! -->
<script type='text/javascript' src="path/to/javascript/source.js"></script>

However, something like this you might want to ignore:
<script type='text/html' id='someTemplate'>
  <span>{{some template syntax}}</span>
</script>

If you can make some assumptions about what you're parsing, you might be able to adapt Tudor's solution to be more robust. However, if you're after a general-purpose solution, I'd highly recommend using an existing HTML parsing library, not an XML parser.

In general, parsing HTML as XML is the wrong approach. HTML is technically not a subset of XML (closing tags aren't always required, and attribute values may be unquoted), so most true XML parsers are going to barf on it.
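
To make that concrete, here is a tiny Python sketch using the standard library's strict XML parser. The helper name parses_as_xml and the two sample strings are made up for illustration; the only point is that perfectly ordinary HTML is rejected as soon as a tag is left unclosed.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Same content twice: once with a self-closed <br/>, once the way HTML allows it.
well_formed = "<p>one<br/>two</p>"
tag_soup = "<p>one<br>two</p>"

print(parses_as_xml(well_formed))
print(parses_as_xml(tag_soup))
```

The second string fails because the XML parser sees </p> while <br> is still open, which is the kind of complaint XMLDOMParser raises on real-world pages.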

Some further reading:

I'm new to Smalltalk, so I can't recommend a library there, but I've used TagSoup in Java and Beautiful Soup in Python.
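
For what it's worth, here is a minimal Python sketch of the parsing-library route. It uses the standard library's html.parser rather than Beautiful Soup, purely so that it runs without installing anything. ScriptCollector, extract_scripts and JS_TYPES are names I made up for illustration, and the set of type values counted as JavaScript is an assumption you would want to adjust.

```python
from html.parser import HTMLParser

# Values of the type attribute treated as JavaScript (an assumption; extend as needed).
# The empty string covers <script> tags with no type attribute at all.
JS_TYPES = {"", "text/javascript", "application/javascript", "module"}


class ScriptCollector(HTMLParser):
    """Collects (attributes, body) pairs for each JavaScript <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished (attrs dict, source text) pairs
        self._attrs = None   # attributes of the <script> currently open, if any
        self._chunks = []    # text pieces seen inside the open <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        # The body may arrive in several pieces, so accumulate them.
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            script_type = (self._attrs.get("type") or "").lower()
            if script_type in JS_TYPES:  # skips text/html templates and the like
                self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def extract_scripts(html):
    """Return a list of (attrs, body) for the JavaScript scripts in html."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.scripts


sample = """
<p>an unclosed paragraph
<script type='text/javascript'>var a = 1;</script>
<script type="text/html" id="someTemplate"><span>{{x}}</span></script>
<script src="path/to/javascript/source.js"></script>
"""

for attrs, body in extract_scripts(sample):
    print(attrs, repr(body))
```

Quote style and unclosed tags are handled by the parser, the template is skipped by its type, and a script with a src attribute comes back with an empty body so you can follow the reference yourself. One caveat: html.parser ends a script element at the first '</script>' it sees, so the document.write case above still comes out truncated.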

Hope this helps,


On Fri, Aug 14, 2015 at 9:40 AM, Tudor Girba <tudor@tudorgirba.com> wrote:


You can also consider using island parsing, a very cool addition to PetitParser developed by Jan:

beginScript := '<script>' asParser.
endScript := '</script>' asParser.
script := beginScript , endScript negate star flatten , endScript ==> #second.
islandScripts := (script island ==> #second) star.

If you apply it on:

code := 'uninteresting part
some code
uninteresting part
some other
yet another
uninteresting part

You get:
islandScripts parse: code
==>  "#('some code' 'some other

Quite cool, no? :)


On Fri, Aug 14, 2015 at 1:31 AM, Alexandre Bergel <alexandre.bergel@me.com> wrote:


Together with Nicolas, we are trying to extract all the <script …> … </script> elements from HTML files.
We have tried to use XMLDOMParser, but many web pages are not actually well formed, so the parser complains.

Has anyone tried to get particular tags out of HTML files? This looks like a classic thing to do, so maybe some of you have done it.
Is there a way to configure the parser to accept broken XML/HTML content?

Alexandre Bergel  http://www.bergel.eu


"Every thing has its own flow"

Moose-dev mailing list