From hannes.hirzel@gmail.com Mon Aug 17 10:33:51 2015
From: "H. Hirzel"
To: moose-dev@list.inf.unibe.ch
Subject: [Moose-dev] Re: [Pharo-dev] Getting some tag in an HTML file
Date: Mon, 17 Aug 2015 10:33:47 +0200

A question about Soup:

https://ci.inria.fr/pharo-contribution/job/Soup/

The test runs for Pharo 2 and Pharo 3. Who needs to be contacted to set
up a test for Pharo 4?

--Hannes

On 8/17/15, Floyd May wrote:
> If your scripts contain string literals with '</script>' in them (I've
> seen this before), then your mileage may vary with Tudor's approach.
> Also consider that script tags may have attributes, and those
> attributes may have single or double quotes. Also, script tags may or
> may not refer to JavaScript. Many JavaScript libraries use script tags
> for HTML template sources, for instance. These tags you'd probably want
> to keep (and perhaps follow the reference for the third):
>
> [The three <script> examples were stripped by the list archive; only
> the fragment "); survives.]
>
> However, something like this you might want to ignore:
>
> [Example stripped by the list archive.]
>
> If you can make some assumptions about what you're parsing, you might
> be able to adapt Tudor's solution to be more robust. However, if you're
> trying for a general-purpose solution, I'd highly recommend using an
> existing HTML parsing library, not an XML parser.
>
> In general, parsing HTML as XML is the wrong approach. HTML is
> technically not a subset of XML (closing tags aren't required), so most
> true XML parsers are going to barf on it.
>
> Some further reading:
> https://en.wikipedia.org/wiki/Tag_soup
> https://en.wikipedia.org/wiki/HTML5#XHTML5_.28XML-serialized_HTML5.29
>
> I'm new to Smalltalk so I can't recommend a library, but in Java I've
> used Tag Soup and I've used Beautiful Soup in Python.
>
> Hope this helps,
>
> Floyd
>
> On Fri, Aug 14, 2015 at 9:40 AM, Tudor Girba wrote:
>
>> Hi,
>>
>> You can also consider using island parsing, this very cool addition to
>> PetitParser developed by Jan:
>>
>> beginScript := '<script>' asParser.
>> endScript := '</script>' asParser.
>> script := beginScript , endScript negate star flatten , endScript
>>     ==> #second.
>> islandScripts := (script island ==> #second) star.
>>
>> If you apply it on:
>>
>> code := 'uninteresting part
>> <script>some code</script>
>> another
>> uninteresting part
>> <script>some other
>> code</script>
>> yet another
>> uninteresting part
>> '.
>>
>> You get:
>>
>> islandScripts parse: code
>> ==> "#('some code' 'some other
>> code')"
>>
>> Quite cool, no? :)
>>
>> Doru
>>
>>
>> On Fri, Aug 14, 2015 at 1:31 AM, Alexandre Bergel wrote:
>>
>>> Hi!
>>>
>>> Together with Nicolas we are trying to get all the <script> tags from
>>> HTML files.
>>> We have tried to use XMLDOMParser, but many webpages are actually not
>>> well formed, therefore the parser is complaining.
>>>
>>> Has anyone tried to get some particular tags from HTML files? This
>>> looks like a classical thing to do. Maybe some of you have done it.
>>> Is there a way to configure the parser to accept broken XML/HTML
>>> content?
>>>
>>> Cheers,
>>> Alexandre
>>> --
>>> _,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
>>> Alexandre Bergel http://www.bergel.eu
>>> ^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
>>>
>>
>> --
>> www.tudorgirba.com
>>
>> "Every thing has its own flow"
>>
>> _______________________________________________
>> Moose-dev mailing list
>> Moose-dev(a)iam.unibe.ch
>> https://www.iam.unibe.ch/mailman/listinfo/moose-dev
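
For illustration only, here is a rough sketch of what Floyd's advice (use a tolerant HTML parser rather than an XML parser) can look like in Python, the language he mentions for Beautiful Soup. It is not Pharo code and does not show the Soup API. To stay dependency-free it swaps in the standard-library html.parser module instead of Beautiful Soup, and the names ScriptCollector and extract_scripts are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect (attributes, body) for every <script> element.

    Hypothetical helper for illustration; html.parser does not require
    well-formed input, so unclosed tags elsewhere are tolerated.
    """

    def __init__(self):
        super().__init__()
        self.scripts = []    # finished (attrs_dict, body) pairs
        self._attrs = None   # attributes of the <script> currently open
        self._chunks = []    # text pieces seen inside that <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        # Only keep text while we are inside a <script> element.
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._attrs is not None:
            self.scripts.append((self._attrs, "".join(self._chunks)))
            self._attrs = None


def extract_scripts(html):
    """Return a list of (attrs, body) pairs, one per <script> element."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


if __name__ == "__main__":
    # Input modelled on Tudor's sample, plus unclosed tags and attributes.
    sample = (
        "uninteresting part <p>never closed\n"
        "<script>some code</script>\n"
        "another <b>uninteresting part\n"
        '<script type="text/template" id="row">some other\ncode</script>\n'
        "yet another uninteresting part\n"
    )
    for attrs, body in extract_scripts(sample):
        print(attrs, repr(body))
```

Keeping the attributes next to each body makes it possible to filter afterwards, for example to keep template sources (a type attribute that is not JavaScript) or to follow a src reference, which are the cases Floyd raises.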