From alexandre.bergel@me.com Fri Aug 14 01:31:46 2015
From: Alexandre Bergel
To: moose-dev@list.inf.unibe.ch
Subject: [Moose-dev] Getting some tag in an HTML file
Date: Thu, 13 Aug 2015 20:31:26 -0300
Message-ID:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============3322859064117668830=="

--===============3322859064117668830==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

Hi!

Together with Nicolas we are trying to get all the <script …> … </script> from HTML files.
We have tried to use XMLDOMParser, but many webpages are actually not well formed, therefore the parser is complaining.

Has anyone tried to get particular tags from HTML files? This looks like a classic thing to do. Maybe some of you have done it.
Is there a way to configure the parser to accept broken XML/HTML content?

Cheers,
Alexandre
-- 
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.

--===============3322859064117668830==--

From tudor@tudorgirba.com Fri Aug 14 16:40:49 2015
From: Tudor Girba
To: moose-dev@list.inf.unibe.ch
Subject: [Moose-dev] Re: [Pharo-dev] Getting some tag in an HTML file
Date: Fri, 14 Aug 2015 16:40:46 +0200
Message-ID:
In-Reply-To:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4006767903888481843=="

--===============4006767903888481843==
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

Hi,

You can also consider using island parsing, this very cool addition to PetitParser developed by Jan:

beginScript := '<script>' asParser.
endScript := '</script>' asParser.
script := beginScript , endScript negate star flatten , endScript ==> #second.
islandScripts := (script island ==> #second) star.

If you apply it on:

code := 'uninteresting part
<script>
some code
</script>
another
uninteresting part
<script>
some other
code
</script>
yet another
uninteresting part
'.

You get:
islandScripts parse: code
==>  "#('some code' 'some other
code')"

Quite cool, no?
:)

Doru

On Fri, Aug 14, 2015 at 1:31 AM, Alexandre Bergel wrote:

> Hi!
>
> Together with Nicolas we are trying to get all the <script …> … </script>
> from HTML files.
> We have tried to use XMLDOMParser, but many webpages are actually not well
> formed, therefore the parser is complaining.
>
> Has anyone tried to get particular tags from HTML files? This looks
> like a classic thing to do. Maybe some of you have done it.
> Is there a way to configure the parser to accept broken XML/HTML content?
>
> Cheers,
> Alexandre
> --
> _,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
> Alexandre Bergel  http://www.bergel.eu
> ^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.
>

-- 
www.tudorgirba.com

"Every thing has its own flow"

--===============4006767903888481843==--