[Moose-dev] Re: [Pharo-dev] Getting some tag in an HTML file

19 Aug 2015


Welcome, Floyd. And thanks for participating :).
Indeed, my suggestion was only a starter and was meant to be used as a
prototype. I wanted to remind people that we have this cool parser engine
that can be used in many ways.
So, I spent 10 more minutes to deal with the cases you just mentioned:
attributes := '>' asParser negate star flatten.
beginScript := '<script' asParser , attributes , '>' asParser ==> #second.
endScript := '</script>' asParser.
string := $' asParser , $' asParser negate star, $' asParser.
code := (string / endScript negate) star flatten.
script := beginScript , code , endScript ==> [:t | t first -> t second].
islandScripts := (script island ==> #second) star.
When applied:
string := '
something irrelevant
<script> [ simple script ] </script>
something else irrelevant
<script type=''text/javascript''> [code here] </script>
yet something
else irrelevant
<script type=''text/javascript''> document.write(''<script
src="somewhere.js"></script>'');</script> <!-- here be dragons! -->'.
(islandScripts parse: string)
You get:
"{''->' [ simple script ] '.
' type=''text/javascript'''->' [code here] '.
' type=''text/javascript'''->' document.write(''<script
src=""somewhere.js""></script>'');'}"
And of course, the playground makes it reasonably easy to prototype:
[image: Inline image 1]
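The grammar above only treats single-quoted literals specially. A minimal sketch of one way to extend it to double-quoted literals as well, both in the script body and in the attributes (so that a '>' inside a quoted attribute value does not end the opening tag). It assumes PetitParser is loaded and uses only the operators already shown in this thread; the names singleQuoted, doubleQuoted, quoted, html and result are made up for the illustration:

```smalltalk
"Sketch only: variable names are illustrative, not part of PetitParser."
singleQuoted := $' asParser , $' asParser negate star , $' asParser.
doubleQuoted := $" asParser , $" asParser negate star , $" asParser.
quoted := singleQuoted / doubleQuoted.

"Attributes may now contain '>' as long as it sits inside quotes."
attributes := (quoted / '>' asParser negate) star flatten.
beginScript := '<script' asParser , attributes , '>' asParser ==> #second.
endScript := '</script>' asParser.

"A '</script>' inside either kind of string literal no longer ends the script."
code := (quoted / endScript negate) star flatten.
script := beginScript , code , endScript ==> [:t | t first -> t second].
islandScripts := (script island ==> #second) star.

html := 'before <script data-note="a > b">document.write("</script>");</script> after'.
result := islandScripts parse: html.
```

This is still a heuristic: a lone apostrophe or double quote in the script body (in a comment, say) falls back to being consumed as an ordinary character only if no matching quote follows, so it can hide a real closing tag.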
Cheers,
Tudor
On Mon, Aug 17, 2015 at 3:24 AM, Floyd May floyd.may+moose-dev@gmail.com
wrote:
...
If your scripts contain string literals with '<script>' or '</script>' in
them (I've seen this before), then your mileage may vary with Tudor's
approach. Also consider that script tags may have attributes, and those
attributes may have single or double quotes. Also, script tags may or may
not refer to javascript. Many javascript libraries use script tags for HTML
template sources, for instance. These tags you'd probably want to keep (and
perhaps follow the reference for the third):
<script type='text/javascript'> [code here] </script>
<script type='text/javascript'> document.write('<script src="somewhere.js"></script>');</script> <!-- here be dragons! -->
<script type='text/javascript' src="path/to/javascript/source.js"></script>
However, something like this you might want to ignore:
<script type='text/html' id='someTemplate'>
  <span>{{some template syntax}}</span>
</script>
If you can make some assumptions about what you're parsing you might be
able to adapt Tudor's solution to be more robust. However, if you're trying
for a general-purpose solution, I'd highly recommend using an existing HTML
parsing library, not an XML parser.
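As a rough illustration of adapting the island parser under such assumptions, the sketch below keeps the attribute text of each script tag and then drops the ones that look like HTML templates. It assumes PetitParser is loaded; the names body, html, found and javascriptOnly are made up for the example, and the substring test on the attribute text is a deliberately naive stand-in for real attribute parsing:

```smalltalk
"Sketch only: variable names are illustrative, not part of PetitParser."
attributes := '>' asParser negate star flatten.
beginScript := '<script' asParser , attributes , '>' asParser ==> #second.
endScript := '</script>' asParser.
body := endScript negate star flatten.
script := beginScript , body , endScript ==> [:t | t first -> t second].
islandScripts := (script island ==> #second) star.

html := '<p>intro</p>
<script type=''text/html'' id=''someTemplate''><span>{{x}}</span></script>
<script type=''text/javascript''>run();</script>
<p>outro</p>'.

"Each result is an association: attribute text -> script body."
found := islandScripts parse: html.

"Naive assumption: a template is any script whose attributes mention text/html."
javascriptOnly := found reject: [:each | each key includesSubstring: 'text/html'].
```

Script tags without a type attribute are kept by this filter, which matches the usual default of treating them as JavaScript.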
In general, parsing HTML as XML is the wrong approach. HTML is technically
not a subset of XML (closing tags aren't required), so most true XML
parsers are going to barf on it.
Some further reading:
https://en.wikipedia.org/wiki/Tag_soup
https://en.wikipedia.org/wiki/HTML5#XHTML5_.28XML-serialized_HTML5.29
I'm new to Smalltalk so I can't recommend a library, but in Java I've used
Tag Soup and in Python I've used Beautiful Soup.
Hope this helps,
Floyd
On Fri, Aug 14, 2015 at 9:40 AM, Tudor Girba tudor@tudorgirba.com wrote:
Hi,
...
You can also consider using island parsing, this very cool addition to
PetitParser developed by Jan:
beginScript := '<script>' asParser.
endScript := '</script>' asParser.
script := beginScript , endScript negate star flatten , endScript ==> #second.
islandScripts := (script island ==> #second) star.
If you apply it on:
code := 'uninteresting part
<script>
some code
</script>
another
uninteresting part
<script>
some other
code
</script>
yet another
uninteresting part
'.
You get:
islandScripts parse: code
==>  "#('some code' 'some other
code')"
Quite cool, no? :)
Doru
On Fri, Aug 14, 2015 at 1:31 AM, Alexandre Bergel <
alexandre.bergel@me.com> wrote:
Hi!
Together with Nicolas we are trying to get all the <script …> … </script>
elements from HTML files.
We have tried to use XMLDOMParser, but many webpages are actually not
well formed, therefore the parser is complaining.
Has anyone tried to get some particular tags from HTML files? This looks
like a classical thing to do. Maybe some of you have done it.
Is there a way to configure the parser to accept broken XML/HTML
content?
Cheers,
Alexandre
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
--
www.tudorgirba.com
"Every thing has its own flow"

Moose-dev mailing list
Moose-dev@iam.unibe.ch
https://www.iam.unibe.ch/mailman/listinfo/moose-dev

