I've created a page on the Swiki
<http://wiki.squeak.org/squeak/572> that
shows how you can use external converters to extract the plain text
from various file formats such as PDF, PostScript, Excel, Word and
PowerPoint. There are also some small examples to evaluate. Perhaps
this will help you or other Pier developers to implement a full-text
search for some external file formats.
Thanks for writing down this summary.
Unfortunately I don't have the time to work on this myself. Maybe it
would make a good Google Summer of Code project? The stemming library
is also interesting, but I am afraid that it only works for English
sources. Maybe there are multilingual stemming libraries that we could
simply pipe the converter output into?
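
To make the piping idea concrete, here is a rough sketch in Python (not Pier or Squeak code). Everything in it is an assumption made for illustration: the names CONVERTERS, extract_text and index_terms are hypothetical, and the command lines assume that pdftotext, ps2ascii and catdoc are installed and write plain text to standard output. The stemmer is left as a pluggable function, so a language-specific one could be passed in per document.

```python
import re
import subprocess
import sys
from pathlib import Path

# Hypothetical mapping from file extension to an external converter command.
# Assumption: each tool is installed and prints plain text to stdout.
# "{path}" is replaced with the file to convert.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".ps": ["ps2ascii", "{path}"],
    ".doc": ["catdoc", "{path}"],
}


def extract_text(path, converters=CONVERTERS):
    """Run the external converter registered for the file's extension."""
    suffix = Path(path).suffix.lower()
    template = converters.get(suffix)
    if template is None:
        raise ValueError(f"no converter registered for {suffix!r}")
    command = [part.replace("{path}", str(path)) for part in template]
    # check=True raises CalledProcessError if the converter fails.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def index_terms(text, stem=lambda word: word):
    """Split text into lowercase words and pass each one through a stemmer.

    The default stemmer is the identity function; a real, language-specific
    stemmer would be supplied by the caller.
    """
    words = re.findall(r"\w+", text.lower())
    return sorted({stem(word) for word in words})
```

Usage would look something like index_terms(extract_text("report.pdf"), stem=some_stemmer), where report.pdf and some_stemmer are placeholders. Keeping the stemmer as a parameter means the converter step does not need to know which language a document is written in.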
Lukas
--
Lukas Renggli
http://www.lukas-renggli.ch