Hi all,
I am reproducing some experiments I ran some time ago, and I am again using Hapax (latest version for VW).
FYI, I just noticed that there may be an issue in how the weighting in the TermDocumentMatrix is computed. I include the method here for convenience:
TermDocumentMatrix>>weight
    | newMatrix |
    newMatrix := SparseRowMatrix new: matrix dimension.
    matrix rows with: newMatrix rows do: [ :row :newRow |
        | globalWeight |
        globalWeight := globalWeighting forTerm: row.
        row doSparseWithIndex: [ :each :index |
            newRow at: index put: (localWeighting forValue: each) * globalWeight ]].
    matrix := newMatrix.
This method should apply the tf-idf weighting [1] to the matrix. This weighting is composed of two parts:
- a global weighting (i.e., idf: the more common a term is, the lower its weight);
- a local weighting (i.e., tf: each term count is normalized by the number of terms appearing in the document).
The global weighting is done correctly in the line:
    globalWeight := globalWeighting forTerm: row.
while the local weighting is NOT done correctly in the line:
    newRow at: index put: (localWeighting forValue: each) * globalWeight
In fact, "localWeighting forValue: each" always returns "each" unchanged. This means that no local weighting is applied at all.
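To make the difference concrete, here is a minimal sketch in plain Python. It is not Hapax code: the function names, the dictionary layout, and the toy counts are all made up for illustration. I assume idf = log(N / df), which is one common formulation and may not be the one Hapax uses. The "identity" function mimics the behaviour described above, and "term_frequency" is what I would expect a tf local weighting to do.

```python
import math

def tf_idf(matrix, local_weight):
    """Weight a sparse term-document matrix given as {term: {doc: raw count}}.

    local_weight(count, doc_length) is a hypothetical stand-in for
    'localWeighting forValue:'. Only nonzero counts are assumed to be stored.
    """
    # Total number of term occurrences in each document (column sums).
    doc_lengths = {}
    for row in matrix.values():
        for doc, count in row.items():
            doc_lengths[doc] = doc_lengths.get(doc, 0) + count
    n_docs = len(doc_lengths)

    weighted = {}
    for term, row in matrix.items():
        # Global part (idf): the more documents contain the term,
        # the lower its weight. Assumed variant: log(N / df).
        idf = math.log(n_docs / len(row))
        weighted[term] = {
            doc: local_weight(count, doc_lengths[doc]) * idf
            for doc, count in row.items()
        }
    return weighted

def identity(count, doc_length):
    # Returns the raw count: no local weighting at all.
    return count

def term_frequency(count, doc_length):
    # Normalizes the count by the length of the document.
    return count / doc_length

# Toy data: d1 contains 4 term occurrences, d2 contains 10.
counts = {
    "matrix": {"d1": 2, "d2": 1},
    "weight": {"d1": 2},
    "sparse": {"d2": 9},
}

raw = tf_idf(counts, identity)
normalized = tf_idf(counts, term_frequency)
```

With the identity function, the entries are just raw counts scaled by idf, so terms in long documents get larger weights simply because the documents are long. With the normalized version, the document length is factored out before the idf is applied.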
Right now I am working on fixing this issue. Please let me know if I am wrong, or if you have an elegant solution :)
Cheers, Alberto