From vincent.blondeau@polytech-lille.net Fri Aug 14 02:31:43 2015
From: Vincent Blondeau
To: moose-dev@list.inf.unibe.ch
Subject: [Moose-dev] Re: Getting some tag in an HTML file
Date: Fri, 14 Aug 2015 02:31:34 +0200
Message-ID: <20150814003140.2D58C8056F@mailhub-lb1.unibe.ch>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============2666028300465280216=="

--===============2666028300465280216==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Hi,

Look at the class side: there is the method parse: namespace: validation: .
Call this method instead of parse:, with false for the last two arguments.
It should work.

Anyway, you should use the SAX parser. It is faster and consumes less
memory. It is very simple to get only one tag.

Cheers
Vincent

On 14 August 2015 01:31, Alexandre Bergel wrote:
>
> Hi!
>
> Together with Nicolas, we are trying to get all the from html files.
> We have tried to use XMLDOMParser, but many web pages are actually not
> well formed, so the parser complains.
>
> Has anyone tried to get particular tags from HTML files? This looks
> like a classic thing to do. Maybe some of you have done it.
> Is there a way to configure the parser to accept broken XML/HTML
> content?
>
> Cheers,
> Alexandre
> --
> _,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
> Alexandre Bergel  http://www.bergel.eu
> ^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
>
>
> _______________________________________________
> Moose-dev mailing list
> Moose-dev(a)iam.unibe.ch
> https://www.iam.unibe.ch/mailman/listinfo/moose-dev

--===============2666028300465280216==--

From hannes.hirzel@gmail.com Fri Aug 14 06:05:50 2015
From: "H. Hirzel"
To: moose-dev@list.inf.unibe.ch
Subject: [Moose-dev] Re: Getting some tag in an HTML file
Date: Fri, 14 Aug 2015 04:05:40 +0000
Message-ID:
In-Reply-To: <20150814003140.2D58C8056F@mailhub-lb1.unibe.ch>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4534846172879161691=="

--===============4534846172879161691==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

http://ss3.gemtalksystems.com/ss/Tabular.html
contains an application example of a SAX parser. You only pick what is
of interest.

On 8/14/15, Vincent Blondeau wrote:
> Hi,
>
> Look at the class side: there is the method parse: namespace: validation: .
> Call this method instead of parse:, with false for the last two arguments.
> It should work.
>
> Anyway, you should use the SAX parser. It is faster and consumes less
> memory. It is very simple to get only one tag.
>
> Cheers
> Vincent

--===============4534846172879161691==--
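
Editorial illustration of the event-driven (SAX-style) approach recommended in this thread. This is a minimal sketch in Python, not Pharo: it uses the standard library's `html.parser.HTMLParser`, which is not the XMLParser API the posters refer to, and it is shown only because that parser also reports tags one at a time and tolerates pages that are not well formed. The tag name `a`, the helper names `TagCollector` and `collect_tags`, and the sample input are all assumptions made for the example; the thread does not say which tag was wanted.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the attributes of every occurrence of one tag, ignoring the rest."""

    def __init__(self, wanted_tag):
        super().__init__()
        # HTMLParser reports tag names in lowercase, so normalize the target too.
        self.wanted_tag = wanted_tag.lower()
        self.found = []  # one dict of attributes per matching start tag

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag as it is read; no document tree is built.
        if tag == self.wanted_tag:
            self.found.append(dict(attrs))


def collect_tags(html_text, wanted_tag):
    """Return the attribute dicts of all `wanted_tag` start tags in `html_text`."""
    collector = TagCollector(wanted_tag)
    collector.feed(html_text)
    collector.close()
    return collector.found


# Deliberately malformed input (hypothetical): unclosed <p>, <b> and <a>,
# and no closing </html>.
broken_html = (
    '<html><body><p>Intro <a href="one.html">first<b> '
    '<a href="two.html">second</body>'
)
links = collect_tags(broken_html, "a")
print([link.get("href") for link in links])
```

Only the matching tags are kept, so nothing proportional to the page size is held in memory, which is in line with the reason given in the thread for preferring a SAX parser over a DOM parser.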