New subject: [Pharo-users] Hapax/CodeFu changes and example script

17 Jan 2013


Hi guys,
For those working in information retrieval (for example, doing tf-idf 
ranking), adapted versions of the "Hapax" and "CodeFu" packages are 
available in the BioSmalltalk repository 
http://ss3.gemstone.com/ss/BioSmalltalk.html . I have ported some 
VisualWorks (VW) specific code to Pharo 1.4 (on Windows this requires 
the ProcessWrapper package) and adapted some Hapax methods to work with 
corpora in different languages.
This is an example script for a corpus in Spanish:
"Build a Spanish corpus from a four-line sample, one document per line."
| corpus tdm documents |
corpus := HXSpanishCorpus new.
documents := 'el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal'.
"Add each line as a document named after its line number."
documents lines doWithIndex: [:doc :index |
    corpus
        addDocument: index asString
        with: (Terms new
            addString: doc
            using: CamelcaseScanner;
            yourself)].
"Remove stopwords, stem the remaining terms, then build the term-document matrix."
corpus removeStopwords.
corpus stemAll.
tdm := TermDocumentMatrix on: corpus.
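For anyone new to tf-idf, here is a rough sketch of the weighting itself, 
written with plain Pharo collections only. It does not use Hapax or CodeFu 
and is not meant to show how Hapax computes its weights. The variable names 
and the tiny pre-tokenized corpus are made up for illustration, and it 
assumes the common tf * ln(N / df) variant:
"Hypothetical sketch: tf-idf with plain collections, no Hapax classes."
| docs tokenized tfidf |
"Assumed input: documents already lowercased, without accents or stopwords."
docs := #('rio danubio viena azul' 'caudal rio invierno' 'rio rhin rio danubio caudal').
"One Bag of terms per document; a Bag keeps the occurrence counts."
tokenized := docs collect: [:each | each substrings asBag].
tfidf := [:term :bag |
    | tf df |
    "Term frequency: occurrences of the term over all terms in the document."
    tf := (bag occurrencesOf: term) / bag size.
    "Document frequency: number of documents containing the term."
    df := tokenized count: [:b | b includes: term].
    df = 0
        ifTrue: [0]
        ifFalse: [tf * (docs size / df) ln]].
Transcript show: (tfidf value: 'danubio' value: tokenized first) printString; cr.
Transcript show: (tfidf value: 'rio' value: tokenized first) printString; cr.
With this variant a term that occurs in every document, such as 'rio' 
here, gets weight 0 because ln(N / df) = ln 1 = 0, while 'danubio' 
(two of the three documents) gets a positive weight.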
Feel free to integrate these packages into any repository. If you want 
to add support for another language, look at the methods whose selectors 
include "spanish" and use them as a template.
Cheers,
Hernán