We see that all of the song-to-song distances are significantly changed by homogenization.
The r-precision values are more mixed. For the distance-based method, only the highest threshold (the one with the least effect) did not change significantly. For the activation-based method, only the lowest two thresholds show a significant change. This means our only hope for improvement (activation homogenization at a -100 threshold) doesn't have a statistically significant lead, so it doesn't mean much. Oh well.
The reason we see no significance in the hubness with the t-test is that hubness is a constant-sum measure: all of the changes cancel out, so the mean stays the same (in fact, it equals the number of occurrences we observe, in this case 100). That way, our null hypothesis of a zero-mean difference distribution will always be true.
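To make the constant-sum point concrete, here's a minimal sketch with made-up counts (the totals and song count are arbitrary, not from the experiment): if the before and after occurrence counts are constrained to the same total, their paired differences sum, and therefore average, to exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hubness counts: how often each of 50 songs appears in
# other songs' neighbor lists, before and after homogenization.
# Both draws are constrained to the same total, mimicking the
# constant-sum property of the occurrence counts.
before = rng.multinomial(5000, np.ones(50) / 50)
after = rng.multinomial(5000, np.ones(50) / 50)

diffs = after - before
print(diffs.sum())   # 0 -- the changes cancel exactly
print(diffs.mean())  # 0.0 -- so a t-test on the mean difference can never reject
```

Individual songs can still gain or lose a lot of occurrences; only the mean is pinned down, which is exactly why the paired t-test is blind here.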
Looking at the distributions of hubness differences (dist, act), they don't look normal: some have a marked skew. A better significance test for hubness change is the Wilcoxon signed-rank test, whose null hypothesis is that the median difference between pairs is zero. More tables!
Now we see some significance. For the distance-based method, the top three thresholds have seemingly small difference medians (2, 2, and 1, meaning homogenization decreased the median song's occurrences by 2, 2, and 1, respectively), but large enough to be significant. The top three thresholds for the activation-based method were also significant (at 95% confidence). This is encouraging, but the changes are still small.
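For reference, the signed-rank test is a one-liner in scipy. A minimal sketch with synthetic counts (the Poisson/gamma setup is invented purely to produce skewed, mostly-negative differences, not a model of the real data):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical paired occurrence counts for 50 songs before and after
# homogenization; the differences are deliberately skewed, the situation
# where Wilcoxon is preferable to a paired t-test.
before = rng.poisson(100, size=50)
after = before - rng.gamma(shape=1.0, scale=3.0, size=50).round().astype(int)

diffs = after - before
# wilcoxon tests H0: the median of the paired differences is zero
# (zero differences are dropped by default).
stat, p = wilcoxon(after, before)
print(f"median diff = {np.median(diffs)}, p = {p:.3g}")
```

Since every synthetic difference is zero or negative, the test rejects decisively; with real hubness differences the p-values are what the tables above report.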
I'd love to hear any suggestions or complaints; my stats skills are admittedly a little rusty.