Hi all,
I am reproducing some experiments I did some time ago,
and I am again using Hapax (latest version for VW).
FYI, I just noticed that there could be an issue in how the
weighting in the TermDocumentMatrix is evaluated.
I attach the method here for convenience:
TermDocumentMatrix>>weight
    | newMatrix |
    newMatrix := SparseRowMatrix new: matrix dimension.
    matrix rows with: newMatrix rows do: [ :row :newRow |
        | globalWeight |
        globalWeight := globalWeighting forTerm: row.
        row doSparseWithIndex: [ :each :index |
            newRow at: index put: (localWeighting forValue: each) * globalWeight ]].
    matrix := newMatrix.
This method should apply the tf-idf weighting [1] to the matrix.
This weighting is composed of two parts:
- a global weighting (i.e., idf: the more common a term is, the lower its weight)
- a local weighting (i.e., tf: each term count is normalized by the number of
  terms appearing in the document).
The global weighting is correctly done in the line:
globalWeight := globalWeighting forTerm: row.
while the local weighting is NOT done correctly in the line:
    newRow at: index put: (localWeighting forValue: each) * globalWeight
In fact, "localWeighting forValue: each" will always return "each"
unchanged, which means that no local weighting is applied at all.
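As a point of comparison, a minimal stand-alone sketch of tf-idf on a tiny matrix could look like the following. It uses plain Arrays rather than the Hapax classes, assumes that rows are terms and columns are documents, and assumes idf = ln(N / number of documents containing the term); all variable names are made up for illustration.

```smalltalk
"Hypothetical sketch with plain Arrays, not the Hapax API."
| counts docCount docLengths weighted |

"Rows are terms, columns are documents; entries are raw counts."
counts := Array
    with: #(2 0 1)
    with: #(1 1 1)
    with: #(0 3 0).
docCount := 3.

"Length of each document = sum of its column."
docLengths := (1 to: docCount) collect: [ :j |
    counts inject: 0 into: [ :sum :row | sum + (row at: j) ]].

weighted := counts collect: [ :row |
    | docsWithTerm idf |
    "Global part (assumed idf): depends on the whole row."
    docsWithTerm := (row select: [ :each | each > 0 ]) size.
    idf := (docCount / docsWithTerm) ln.
    "Local part (tf): raw count divided by the document length."
    (1 to: docCount) collect: [ :j |
        ((row at: j) / (docLengths at: j)) * idf ]].
```

Note that the tf part needs the per-document totals (docLengths above), which a call that receives only a single value does not have.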
Right now I am working on fixing this issue.
Please let me know if I am wrong, or if you have an elegant solution :)
Cheers,
Alberto
[1] http://en.wikipedia.org/wiki/Tf%E2%80%93idf