To see whether any of my homogenization experiments are actually meaningful, it's important to check for statistical significance. I first used a standard paired t-test to compare, for both homogenization methods, the pairwise distances between songs, the hubness values, and the R-precision values obtained for each song before and after homogenization. And I made a table.
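As a sketch of what that comparison boils down to (pure Python, with hypothetical toy numbers rather than my actual data), the paired t-test reduces to a single statistic over the per-song differences; the p-value then comes from the t-distribution with n - 1 degrees of freedom:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """Paired t-test statistic: t = mean(d) / (sd(d) / sqrt(n)),
    where d holds the per-song differences (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Toy values (not my data): a small but consistent shift per song.
t = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [1.5, 2.4, 3.6, 4.5])
```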
Homogenization by distance: paired t-test p-values

                30thresh   25thresh   20thresh   15thresh
  dist              0          0          0          0
  hubness           1          1          1          1
  rprecision    0.3532      0.032         ~0         ~0

Homogenization by activation: paired t-test p-values

                110thresh  100thresh  90thresh   80thresh
  dist             ~0         ~0         ~0         ~0
  hubness           1          1          1          1
  rprecision    0.1063      0.398      0.017        ~0

We see that all of the song-to-song distances are significantly changed by homogenization.
The R-precision values are more mixed. For the distance-based method, only the highest threshold (the one with the least effect) was not significantly changed. For the activation-based method, only the lowest two thresholds are significant. This means our only hope for improvement (activation homogenization at a threshold of 100) does not lead by a statistically significant margin. Oh well.
The reason the t-test shows no significance for hubness is that hubness is a constant-sum measure: the total number of occurrences is fixed, so any gain for one song is offset by losses elsewhere, the changes cancel out, and the mean remains the same (in fact, it equals the number of occurrences we observe, in this case 100). This way our null hypothesis of a zero-mean difference distribution will always be true.
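To make the constant-sum point concrete, here's a toy illustration (hypothetical counts, not my data): homogenization can only redistribute a fixed total number of occurrences, so the per-song hubness differences sum to zero and the paired t-test's null hypothesis holds by construction.

```python
# Toy hubness counts before/after homogenization. The total number of
# occurrences is fixed (every query still returns the same number of
# neighbors), so the counts are merely redistributed.
before = [150, 120, 90, 70, 70]   # sums to 500
after  = [110, 115, 95, 90, 90]   # same total of 500, redistributed
diffs = [a - b for a, b in zip(after, before)]

# The differences cancel exactly: their sum (and hence mean) is zero.
assert sum(diffs) == 0
```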
Looking at the distributions of hubness differences (dist, act), they don't appear to be normal: some have a marked skew. A better significance test for hubness change is the Wilcoxon signed-rank test, whose null hypothesis is that the median difference between pairs is zero. More tables!
Homogenization by distance: Wilcoxon signed-rank test p-values

                30thresh   25thresh   20thresh   15thresh
  dist              0          0          0          0
  hubness           ~0         ~0      0.0049     0.8251
  rprecision     0.271      0.0112       ~0         ~0

Homogenization by activation: Wilcoxon signed-rank test p-values

                110thresh  100thresh  90thresh   80thresh
  dist              0          0          0          0
  hubness        0.0045       ~0      0.0475     0.3494
  rprecision     0.2258     0.1602    0.057        ~0

Now we see some significance. For the distance-based method, the top three thresholds have seemingly small difference medians (2, 2, and 1, meaning homogenization decreased the median song's occurrences by 2, 2, and 1 occurrences, respectively), but large enough to be significant. The top three thresholds for the activation-based method were also significant (with 95% confidence). This is encouraging, but the changes are still small.
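For concreteness, here's a minimal pure-Python sketch of the signed-rank statistic the Wilcoxon test is built on, using hypothetical difference values; the p-value then comes from the distribution of W+ under the null (which a stats library would handle):

```python
def signed_rank_statistic(diffs):
    """Wilcoxon signed-rank W+: drop zero differences, rank the
    absolute values (ties share the average rank), and sum the
    ranks belonging to the positive differences."""
    d = [x for x in diffs if x != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):
        j = i
        # group tied absolute values so they share the average rank
        while j + 1 < len(d) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j + 2) / 2.0  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return sum(r for r, x in zip(ranks, d) if x > 0)

# Toy differences: positives get ranks 1, 3, 5, so W+ = 9.
w_plus = signed_rank_statistic([1, -2, 3, -4, 5])
```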
I'd love to hear any suggestions or complaints; my stats skills are admittedly a little rusty.
2 comments:
So, if I understand correctly, when super-consistent (in terms of frame features) songs are stealing the show (associating themselves with too many other songs), dropping some non-central clusters of frames from diverse songs pulls them closer to the cluster centers and more evenly distributes the love (recommendations).
Perhaps there are alternate distance metrics or matching techniques which could do something similar... either by explicitly rationing associations within varying sized neighborhoods (isomap or something similar), or scaling distance with density or something like that.
What features are you using? Just frame-based timbral features? Would the same behavior be observed with more temporally based features or comparisons, such as temporal shape features or local alignment comparisons over songs?
Excelsior, sir!
Thanks for your comments!
I have indeed looked at dimensionality reduction and hierarchical clustering to try and find a better space in which to squash these hubs. I should look into it more.
I'm using the uspop collection, so just short-time MFCC frames. I'd like to check out other features at some point, as well as other timescales, but one step at a time...