Hi guys, for those working in information retrieval, for example doing tf-idf ranking, adapted versions of the "Hapax" and "CodeFu" packages are available in the BioSmalltalk repository http://ss3.gemstone.com/ss/BioSmalltalk.html . I have ported some VisualWorks-specific code to Pharo 1.4 (on Windows it requires the ProcessWrapper package) and adapted some Hapax methods to work with corpora in different languages.
This is an example script for a corpus in Spanish:
| corpus tdm documents |
corpus := HXSpanishCorpus new.
"Each line of the string is one document."
documents := 'el río Danubio pasa por Viena, su color es azul
el caudal de un río asciende en Invierno
el río Rhin y el río Danubio tienen mucho caudal
si un río es navegable, es porque tiene mucho caudal'.
documents lines doWithIndex: [:doc :index |
	corpus
		addDocument: index asString
		with: (Terms new addString: doc using: CamelcaseScanner; yourself)].
corpus removeStopwords.
corpus stemAll.
tdm := TermDocumentMatrix on: corpus.
Feel free to integrate this into any repository. If you want to add a language, look at the methods whose selectors include "spanish". Cheers,
Hernán
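For readers new to the tf-idf ranking mentioned in this message: it weights a term by how often it occurs in a document, discounted by how many documents contain it. The following is only a minimal plain-Pharo sketch and does not use Hapax. It assumes the common tf × ln(N/df) variant (Hapax may weight terms differently), a naive tokenizer that splits on spaces, commas and periods, and the illustrative names `tokenized` and `tfidf`.

```smalltalk
"Plain-Pharo sketch of tf-idf; not the Hapax API."
| documents tokenized tfidf |
documents := #(
	'el río Danubio pasa por Viena'
	'el caudal de un río asciende en Invierno'
	'el río Rhin y el río Danubio tienen mucho caudal').

"One Bag of lowercased terms per document."
tokenized := documents collect: [:doc |
	(doc asLowercase subStrings: ' ,.') asBag].

"tf = occurrences of the term in the document,
 df = number of documents containing the term,
 weight = tf * ln(N / df)."
tfidf := [:term :index |
	| tf df |
	tf := (tokenized at: index) occurrencesOf: term.
	df := (tokenized select: [:bag | bag includes: term]) size.
	df = 0
		ifTrue: [0]
		ifFalse: [tf * (documents size / df) ln]].

Transcript show: (tfidf value: 'viena' value: 1) printString; cr.
Transcript show: (tfidf value: 'el' value: 1) printString; cr.
```

A term such as 'el' that occurs in every document gets weight zero, because ln(N/N) = 0. This is one reason stopword removal and tf-idf weighting tend to be used together.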
What is the result of this script?
Alexandre
On Jan 17, 2013, at 11:02 AM, Hernán Morales Durand hernan.morales@gmail.com wrote:
Moose-dev mailing list Moose-dev@iam.unibe.ch https://www.iam.unibe.ch/mailman/listinfo/moose-dev
The result is a TermDocumentMatrix holding the term-to-document mappings and term frequencies for the given documents (each line is treated as a separate document).
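To make the shape of such a matrix concrete, here is a minimal plain-Pharo sketch that builds term counts per document. It is not the Hapax TermDocumentMatrix implementation: the Dictionary-of-Arrays layout, the variable name `matrix`, and the naive tokenizer (no stopword removal or stemming) are assumptions made for illustration only.

```smalltalk
"Plain-Pharo sketch of a term-document count matrix; not the Hapax API."
| documents matrix |
documents := #(
	'el río Danubio pasa por Viena'
	'el caudal de un río asciende en Invierno'
	'el río Rhin y el río Danubio tienen mucho caudal').

"Each term maps to an Array with one count per document."
matrix := Dictionary new.
documents doWithIndex: [:doc :index |
	(doc asLowercase subStrings: ' ,.') do: [:term |
		| row |
		row := matrix
			at: term
			ifAbsentPut: [Array new: documents size withAll: 0].
		row at: index put: (row at: index) + 1]].

"Print one row per term, in alphabetical order."
matrix keys asSortedCollection do: [:term |
	Transcript show: term; show: ' -> '; show: (matrix at: term) printString; cr].
```

In this sketch the row for 'río' holds the counts 1, 1 and 2, one entry per document.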
On 17/01/2013 13:14, Alexandre Bergel wrote:
Ah okay, a kind of Latent Semantic Indexing system then. Looks good!
Alexandre
On Jan 17, 2013, at 1:12 PM, Hernán Morales Durand hernan.morales@gmail.com wrote:
Thanks, Hernán!
Stef
Thanks, indeed!
It would be great to have this back in Moose. Anyone interested in looking at it?
Cheers, Doru
On Jan 17, 2013, at 10:43 PM, Stéphane Ducasse stephane.ducasse@inria.fr wrote:
-- www.tudorgirba.com
"Every successful trip needs a suitable vehicle."
What's the progress of this integration?
I'm new to the team. Some time ago, I developed Hapax (IR, clustering, visualization) in Java with some improvements, since I had previously built tools for IR and clustering.
Now I would like to contribute and implement these improvements in Hapax. Has there been any progress since the last bundle from 2011, and who worked on it most recently?
-- Gustavo Jansen
-- View this message in context: http://moose-dev.97923.n3.nabble.com/Hapax-CodeFu-changes-and-example-script... Sent from the moose-dev mailing list archive at Nabble.com.
Hi!
I am not really familiar with this, but having strong tools for text analysis is really important.
Alexandre
On Oct 3, 2013, at 9:11 AM, Gustavo Jansen gugajansen@gmail.com wrote:
Welcome, Gustavo!
I am not aware of any effort in this direction, but as Alex said, anything in the area of text manipulation would be greatly appreciated.
Cheers, Doru