Thursday, January 31, 2008

Statistical significance of homogenization

To see if any of my homogenization experiments are actually meaningful, it's important to check for statistical significance. I first used a standard paired t-test to compare, for each song, the pairwise song-to-song distances, hubness values, and r-precision values obtained under each homogenization method. And I made a table.
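As a sketch of that check (with made-up r-precision numbers standing in for my real per-song results), the paired t-test looks something like this:

```python
# Sketch of the paired significance check. The r-precision values here are
# randomly generated stand-ins, NOT my actual experimental results.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
before = rng.uniform(0.1, 0.5, size=100)             # per-song r-precision, original
after = before + rng.normal(0.02, 0.05, size=100)    # same songs after homogenization

# Paired t-test: null hypothesis is that the mean per-song difference is zero.
t_stat, p_value = ttest_rel(after, before)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The pairing matters here: each song is compared against itself before and after homogenization, rather than pooling the two conditions as independent samples.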

homo by dist (t-test p-values)

               30-thresh   25-thresh   20-thresh   15-thresh
dist           0           0           0           0
hubness        1           1           1           1
r-precision    0.3532      0.032       ~0          ~0


homo by activation (t-test p-values)

               -110-thresh  -100-thresh  -90-thresh  -80-thresh
dist           ~0           ~0           ~0          ~0
hubness        1            1            1           1
r-precision    0.1063       0.398        0.017       ~0


We see that all of the song-to-song distances are significantly changed by homogenization.
The r-precision values are more mixed. For the distance-based method, only the highest threshold (the one with the least effect) was not significantly changed. For the activation-based method, only the -90 and -80 thresholds are significant. This means our only hope for improvement (activation homogenization at a -100 threshold) does not have a statistically significant lead. Oh well.

The reason we see no significance in the hubness with the t-test is that it's a constant-sum measure: all of the changes cancel out, so the mean remains the same (in fact, it equals the number of occurrences we observe, in this case 100). The null hypothesis of a zero-mean difference distribution is therefore always true.
Looking at the distributions of hubness differences (dist, act), it seems they aren't really normal: some have a marked skew. A better significance test for hubness change is the Wilcoxon signed-rank test, whose null hypothesis is that the median difference between pairs is zero. More tables!

homo by dist (signed-rank test p-values)

               30-thresh   25-thresh   20-thresh   15-thresh
dist           0           0           0           0
hubness        ~0          ~0          0.0049      0.8251
r-precision    0.271       0.0112      ~0          ~0


homo by activation (signed-rank test p-values)

               -110-thresh  -100-thresh  -90-thresh  -80-thresh
dist           0            0            0           0
hubness        0.0045       ~0           0.0475     0.3494
r-precision    0.2258       0.1602       0.057      ~0


Now we see some significance. For the distance-based method, the top three thresholds have seemingly small median differences (2, 2, and 1, meaning homogenization decreased the median song's occurrences by 2, 2, and 1 occurrences, respectively), but large enough to be significant. The first three thresholds for the activation-based method (-110, -100, and -90) were also significant (with 95% confidence). This is encouraging, but the changes are still small.
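The contrast between the two tests on a constant-sum measure can be sketched with toy numbers (made up for illustration, not my actual hubness counts):

```python
# Why the paired t-test is blind to hubness changes: hubness is constant-sum,
# so the per-song differences always average to zero, but their median can
# still shift when a few hubs absorb occurrences from many other songs.
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon

# Toy scenario: 90 songs each lose one occurrence; 10 hubs each gain nine.
diffs = np.array([-1] * 90 + [9] * 10, dtype=float)
assert diffs.sum() == 0          # constant-sum: the mean difference is exactly 0

t_p = ttest_1samp(diffs, 0).pvalue   # t statistic is 0, so p = 1 no matter what
w_p = wilcoxon(diffs).pvalue         # but the median difference of -1 is detected
print(f"t-test p = {t_p:.4f}, signed-rank p = {w_p:.2e}")
```

The signed-rank test works on the ranks of the differences rather than their raw mean, so the skewed shape (many small losses, a few big gains) registers as a significant median shift even though the mean never moves.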

I'd love to hear any suggestions or complaints; my stats skills are admittedly a little rusty.

2 comments:

Graham said...

So, if I understand correctly, when super-consistent (in terms of frame features) songs are stealing the show (associating themselves with too many other songs), dropping some non-central clusters of frames from diverse songs pulls them closer to the cluster centers and more evenly distributes the love (recommendations).

Perhaps there are alternate distance metrics or matching techniques which could do something similar... either by explicitly rationing associations within varying sized neighborhoods (isomap or something similar), or scaling distance with density or something like that.

What features are you using? Just frame-based timbral features? Would the same behavior be observed with more temporally-based features or comparisons, such as temporal shape features, or local alignment comparisons over songs?

Excelsior, sir!

Mark T. Godfrey said...

Thanks for your comments!
I have indeed looked at dimensionality reduction and hierarchical clustering to try and find a better space in which to squash these hubs. I should look into it more.
I'm using the uspop collection, so just short-time MFCC frames. I'd like to check out other features at some point, as well as timescales, but one step at a time...