[Moose-dev] Re: Hapax/CodeFu changes and example script

17 Jan 2013


The result is a TermDocumentMatrix with word mappings and frequencies for the given documents (each line is treated as a separate document).
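
To make the shape of that result more concrete, here is a rough, language-independent sketch in Python. It is not Hapax code and does not reproduce the TermDocumentMatrix API: the function name `term_document_counts` and the naive tokenizer are assumptions made for illustration, and it skips the stopword removal and stemming steps that the script performs. It only shows the idea of mapping each word to its frequency per document, with one document per line.

```python
from collections import Counter

# Same four sample lines as in the example script; one document per line.
documents = """el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal"""


def term_document_counts(text):
    """Map each term to {document number: frequency} (hypothetical helper)."""
    matrix = {}
    for index, line in enumerate(text.splitlines(), start=1):
        # Naive tokenizer (an assumption): split on whitespace,
        # strip surrounding punctuation, lowercase.
        tokens = [word.strip(",.;:").lower() for word in line.split()]
        for term, count in Counter(t for t in tokens if t).items():
            matrix.setdefault(term, {})[index] = count
    return matrix


matrix = term_document_counts(documents)
print(matrix["río"])     # frequency of "río" in each document
print(matrix["caudal"])  # documents that mention "caudal"
```

With stopword removal and stemming applied first, as in the script, words such as "el" or "es" would be dropped and inflected forms merged before counting.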
On 17/01/2013 13:14, Alexandre Bergel wrote:
...
What is the result of this script?
Alexandre
On Jan 17, 2013, at 11:02 AM, Hernán Morales Durand hernan.morales@gmail.com wrote:
...
Hi guys,
For those working in information retrieval, for example on tf-idf ranking, adapted versions of the "Hapax" and "CodeFu" packages are available in the BioSmalltalk repository http://ss3.gemstone.com/ss/BioSmalltalk.html . I have translated some VW-specific code to Pharo 1.4 (under Windows it requires the ProcessWrapper package) and adapted some Hapax methods to work with corpora in different languages.
This is an example script for a corpus in Spanish:
| corpus tdm documents |
corpus := HXSpanishCorpus new.
documents := 'el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal'.
documents lines doWithIndex: [:doc :index |
   corpus
      addDocument: index asString
      with: (Terms new
         addString: doc
         using: CamelcaseScanner;
         yourself)].
corpus removeStopwords.
corpus stemAll.
tdm := TermDocumentMatrix on: corpus.
Feel free to integrate this into any repository. If you want to add a language, look at the methods whose selectors include "spanish".
Cheers,
Hernán

Moose-dev mailing list
Moose-dev@iam.unibe.ch
https://www.iam.unibe.ch/mailman/listinfo/moose-dev
