What is the result of this script?
Alexandre
On Jan 17, 2013, at 11:02 AM, Hernán Morales Durand <hernan.morales(a)gmail.com>
wrote:
Hi guys,
For those working in information retrieval (for example, doing tf-idf ranking), you can
find adapted versions of the "Hapax" and "CodeFu" packages in the BioSmalltalk
repository:
http://ss3.gemstone.com/ss/BioSmalltalk.html . I have ported some VisualWorks-specific
code to Pharo 1.4 (on Windows it requires the ProcessWrapper package) and adapted
some Hapax methods to work with corpora in different languages.
This is an example script for a corpus in Spanish:
| corpus tdm documents |
corpus := HXSpanishCorpus new.
documents := 'el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal'.
documents lines doWithIndex: [:doc :index |
    corpus
        addDocument: index asString
        with: (Terms new
            addString: doc
            using: CamelcaseScanner;
            yourself)].
corpus removeStopwords.
corpus stemAll.
tdm := TermDocumentMatrix on: corpus.
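For readers unfamiliar with tf-idf, the weighting that a term-document matrix typically feeds can be sketched with plain Pharo collections. This is only an illustration under assumptions: it does not use the Hapax API, the token lists are hand-picked from the sentences above (they are not the actual output of removeStopwords or stemAll), and the formula tf * ln(N / df) is one common variant, not necessarily the one Hapax applies.

```smalltalk
"Illustrative sketch only: plain collections, not Hapax classes."
| docs docFreq tfidf |

"Hypothetical hand-tokenized documents (assumption, not stemmer output)."
docs := {
    #('rio' 'danubio' 'pasa' 'viena' 'color' 'azul').
    #('caudal' 'rio' 'asciende' 'invierno').
    #('rio' 'rhin' 'rio' 'danubio' 'caudal').
    #('rio' 'navegable' 'caudal') }.

"Document frequency: in how many documents each term occurs."
docFreq := Bag new.
docs do: [:doc | docFreq addAll: doc asSet].

"Weight of a term in a document: tf * ln(N / df).
 Signals ZeroDivide if the term occurs in no document."
tfidf := [:term :doc |
    (doc occurrencesOf: term)
        * (docs size / (docFreq occurrencesOf: term)) ln].

"Evaluate in a workspace, e.g. for 'danubio' in the first document."
tfidf value: 'danubio' value: (docs at: 1).
```

With these token lists, a term such as 'rio' that appears in every document gets weight 0, since ln(4/4) = 0, while a term confined to fewer documents gets a positive weight.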
Feel free to integrate this into any repository. If you want to add a language, look at the
methods whose selectors include "spanish".
Cheers,
Hernán
--
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel
http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.