Tuesday, March 18, 2008

Homogenization of NICM by covariance

I chose to homogenize by covariance since that looks like the main anti-hub correlate for this data. The plot below shows the log-determinant of the covariance for each component of each model (that's 32 x 897 = 28,704 components). I'm convinced that the components with super tiny variance are just ones that collapsed in EM and should definitely be removed. So, I picked two log-determinant thresholds for now: -300 to remove all the super tiny components, and -150 to remove most everything below that massive band around -100.
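To make the thresholding concrete, here is a minimal sketch of what dropping components by log-determinant could look like. It is not the code used for these experiments. It assumes diagonal covariances (so the log-determinant is just the sum of the log variances) and it assumes the surviving mixture weights get renormalized to sum to one; the names `log_det_diag` and `homogenize` and the toy model are made up for illustration.

```python
import math

def log_det_diag(variances):
    """Log-determinant of a diagonal covariance: the sum of the log variances."""
    return sum(math.log(v) for v in variances)

def homogenize(components, threshold):
    """Drop components whose covariance log-determinant is below `threshold`.

    `components` is a list of (weight, variances) pairs (hypothetical layout).
    The remaining weights are renormalized, which is an assumption here.
    """
    kept = [(w, var) for w, var in components if log_det_diag(var) >= threshold]
    if not kept:
        raise ValueError("threshold removes every component of this model")
    total = sum(w for w, _ in kept)
    return [(w / total, var) for w, var in kept]

# Toy 3-component model in 2 dimensions; the last component has collapsed.
model = [
    (0.5, [1.0, 1.0]),
    (0.3, [2.0, 2.0]),
    (0.2, [1e-80, 1e-80]),  # log-determinant is about -368
]
trimmed = homogenize(model, threshold=-300.0)
```

With full covariance matrices the same idea applies, but the log-determinant would have to be computed from the matrix itself instead of a sum of log variances.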

Artist R-precision increased for both thresholds (I'm not attempting another table):
35.85% for -300
38.24% for -150

That compares to 32.27% without homogenization (different from the last post because of some metadata clean-up). The differences are significant under the Wilcoxon test (p-values ~ 0).

Hubness seems to increase, though.
# of hubs (100-occurrences greater than 200):
105 for no homogenization
121 for -300
119 for -150

# of anti-hubs (100-occurrences less than 20):
121 for no homogenization
114 for -300
131 for -150
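For reference, here is a rough sketch of how hub and anti-hub counts like the ones above can be computed from a distance matrix. It is an illustration, not the script behind these numbers. The k-occurrence of a song is taken to be the number of other songs that have it in their k nearest neighbors; the function names and the toy data are hypothetical. For the counts in this post the settings would be k = 100, hubs above 200 and anti-hubs below 20.

```python
from collections import Counter

def k_occurrences(distances, k):
    """Count how often each item shows up in the other items' k-nearest lists."""
    n = len(distances)
    counts = Counter()
    for query in range(n):
        # Rank every other item by its distance to the query.
        others = [j for j in range(n) if j != query]
        others.sort(key=lambda j: distances[query][j])
        counts.update(others[:k])
    return [counts[j] for j in range(n)]

def count_hubs(occurrences, hub_min, antihub_max):
    """Hubs have more than `hub_min` occurrences, anti-hubs fewer than `antihub_max`."""
    hubs = sum(1 for c in occurrences if c > hub_min)
    antihubs = sum(1 for c in occurrences if c < antihub_max)
    return hubs, antihubs

# Toy example: four points on a line, with k = 1 instead of 100.
positions = [0, 1, 3, 10]
distances = [[abs(a - b) for b in positions] for a in positions]
occurrences = k_occurrences(distances, k=1)
hubs, antihubs = count_hubs(occurrences, hub_min=1, antihub_max=1)
```

In the toy data the point at 10 is nobody's nearest neighbor, so it plays the role of the anti-hub, while the point at 1 is the nearest neighbor of two others.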

So, I guess we're trading a smooth hub distribution for precision.

I'll look into the other homogenization methods soon.