Thursday, January 31, 2008

Statistical significance of homogenization

To see if any of my homogenization experiments are actually meaningful, it's important to check for statistical significance. I first used a standard paired t-test to compare the pairwise song-to-song distances, hubness values, and per-song r-precision values obtained under each homogenization method against those of the non-homogenized models. And I made a table.
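For reference, here's a minimal sketch of how such a paired t-test might be run with scipy; the arrays and file names are hypothetical stand-ins for the per-song values from a homogenized run and the non-homogenized baseline.

```python
import numpy as np
from scipy import stats

# Hypothetical per-song measurements, one value per song, in the same
# song order: e.g. r-precision from the baseline and from one
# homogenization run.
baseline = np.load("rprecision_no_homog.npy")           # assumed file name
homogenized = np.load("rprecision_dist_thresh20.npy")   # assumed file name

# Paired t-test: the null hypothesis is that the mean of the per-song
# differences (homogenized - baseline) is zero.
t_stat, p_value = stats.ttest_rel(homogenized, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```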

homo by dist (paired t-test p-values)

             30-thresh   25-thresh   20-thresh   15-thresh
dist         0           0           0           0
hubness      1           1           1           1
r-precision  0.3532      0.032       ~0          ~0


homo by activation (paired t-test p-values)

             -110-thresh  -100-thresh  -90-thresh  -80-thresh
dist         ~0           ~0           ~0          ~0
hubness      1            1            1           1
r-precision  0.1063       0.398        0.017       ~0


We see that all of the song-to-song distances are significantly changed by homogenization.
The r-precision values are more mixed. For the distance-based method, only the highest threshold (the one with the least effect) was not significantly changed. For the activation-based method, only the -90 and -80 thresholds are significant. This means our only hope for improvement (activation homogenization at a -100 threshold) does not show a statistically significant change. Oh well.

The reason we see no significance in the hubness with the t-test is that it's a constant-sum measure, so all of the changes cancel out and the mean remains the same (in fact, it always equals the length of the recommendation lists, in this case 100). So the null hypothesis of a zero-mean difference distribution will always be true.
Looking at the distributions of hubness differences (dist, act), it seems they aren't really normal: some have a marked skew. A better significance test for hubness change is the Wilcoxon signed-rank test, where the null hypothesis is that the median difference between pairs is zero. More tables!
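As a sketch, the signed-rank version is just as easy with scipy; again the arrays and file names are hypothetical placeholders for the per-song 100-occurrence counts.

```python
import numpy as np
from scipy import stats

# Hypothetical per-song hubness (100-occurrence) counts, same song order.
hub_baseline = np.load("hubness_no_homog.npy")        # assumed file name
hub_homog = np.load("hubness_act_thresh-100.npy")     # assumed file name

# Wilcoxon signed-rank test: the null hypothesis is that the median of
# the per-song differences is zero (zero differences are handled by
# scipy's default zero_method).
w_stat, p_value = stats.wilcoxon(hub_homog, hub_baseline)
print(f"W = {w_stat:.1f}, p = {p_value:.4g}")
```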

homo by dist (Wilcoxon signed-rank p-values)

             30-thresh   25-thresh   20-thresh   15-thresh
dist         0           0           0           0
hubness      ~0          ~0          0.0049      0.8251
r-precision  0.271       0.0112      ~0          ~0


homo by activation (Wilcoxon signed-rank p-values)

             -110-thresh  -100-thresh  -90-thresh  -80-thresh
dist         0            0            0           0
hubness      0.0045       ~0           0.0475      0.3494
r-precision  0.2258       0.1602       0.057       ~0


Now we see some significance. For the distance-based method, the top three thresholds have seemingly small difference medians (2, 2, and 1, meaning homogenization decreased the median song's occurrences by 2, 2, and 1, respectively), but these are large enough to be significant. The top three thresholds for the activation-based method were also significant (with 95% confidence). This is encouraging, but the changes are still small.

I'd love to hear any suggestions or complaints; my stats skills are admittedly a little rusty.

Wednesday, January 30, 2008

Homogenization by activation

There was an arguably small improvement in hubness and a decline in r-precision when homogenizing GMMs by distance from their global centroid. Since we saw early on that there are frames in the "activation-gram" showing that certain components are only active (i.e. likely to represent a sample) for a relatively small number of frames, why not base the homogenization criterion on the activation itself, instead of an indirect correlate?
So, I looked at the median activation level for each component over the entirety of a song and simply dropped any component whose median activation did not meet a threshold (again, empirically derived).
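Roughly, the pruning step could look like the sketch below, where the model is assumed to be stored as plain NumPy arrays of weights, means, and diagonal covariances, and "activation" is taken to be the weighted per-component log-likelihood of each frame (my reading of the log-likelihood thresholds discussed below; renormalizing the surviving weights is also my assumption).

```python
import numpy as np

def component_log_activations(frames, weights, means, variances):
    """Weighted per-component log-likelihood of each frame:
    log w_k + log N(x_t | mu_k, diag(var_k)).
    frames: (T, D); weights: (K,); means, variances: (K, D)."""
    diff = frames[:, None, :] - means[None, :, :]                   # (T, K, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)   # (K,)
    log_pdf = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    return np.log(weights)[None, :] + log_pdf                       # (T, K)

def homogenize_by_activation(frames, weights, means, variances, thresh=-100.0):
    """Drop components whose median activation over the song's frames is
    below `thresh`, then renormalize the remaining weights (assumption)."""
    medians = np.median(component_log_activations(frames, weights, means, variances), axis=0)
    keep = medians >= thresh
    if not keep.any():                     # never throw away the whole model
        keep[np.argmax(medians)] = True
    new_weights = weights[keep] / weights[keep].sum()
    return new_weights, means[keep], variances[keep]
```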
Below are the same plots used in the distance-based homogenization: first, hubness vs. number of components removed; second, hubness histograms.
From the first figure, we see that, again, homogenization indeed tends to affect anti-hubs more than hubs, as intended.
The hub counts (more than 200 occurrences in the 100-occurrence measure) for each threshold (-110, -100, -90, and -80 in log-likelihood) were 162, 155, 160, and 153, compared to 156 with no homogenization. The anti-hub counts for each run were 138, 110, 142, and 149, compared to 124 with no homogenization. It seems, and is clear from the histograms, that the only threshold that is helping us (decreasing both hubs and anti-hubs) is -100. We saw over-homogenization adversely affect hubness in the distance-based method as well. I should look into this.
The maximum hub values for each run were 601, 592, 576, and 570, compared to the original 580, so there's at least a monotonic decrease across thresholds.
Interestingly, the -100 threshold also yields a slightly higher r-precision value (0.16339, compared to the non-homogenized 0.16169). The other average r-precisions are 0.15947, 0.15767, and 0.15196 (for the -110, -90, and -80 thresholds; I should learn to make tables). This is in contrast to the distance homogenization, where hubness was seemingly improved but r-precision suffered for all thresholds. Granted, the improvement is small and may not be statistically significant (more on this in a later post).
So, even when "manually" kicking out components that do not contribute much to the model (and that usually correspond to outlier musical sections), we don't see much overall improvement. I must look into this more.

Wednesday, January 23, 2008

R-precision as consistency

Last post, I mentioned r-precision as a way to measure the accuracy of a recommendation algorithm. I thought it might be pertinent to analyze in more detail the r-precision results of the particular bag-of-frames approach I'm working with.
For completeness, the results here are from using the uspop2002 dataset (105 artists, 10 tracks per artist, 20 MFCCs per 66.67 ms frame) modeled with 32-component GMMs and using the KL-divergence-based earth-mover's distance (KL-EMD) as the similarity metric. This is a standard setup introduced years ago, and one I'm inclined to stick with for comparison's sake.
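For concreteness, here's one way a KL-EMD between two diagonal-covariance GMMs could be computed: the symmetrized closed-form KL divergence between component pairs as the ground distance, and a small transportation linear program over the component weights. This is my sketch of the metric as I understand it, not necessarily the exact formulation used in these experiments.

```python
import numpy as np
from scipy.optimize import linprog

def gauss_kl_diag(mu0, var0, mu1, var1):
    """Closed-form KL(N0 || N1) for diagonal-covariance Gaussians."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1
                        - 1.0 + np.log(var1 / var0))

def kl_emd(w_a, mu_a, var_a, w_b, mu_b, var_b):
    """Earth-mover's distance between two GMMs, with the symmetrized KL
    divergence between components as the ground distance."""
    ka, kb = len(w_a), len(w_b)
    ground = np.zeros((ka, kb))
    for i in range(ka):
        for j in range(kb):
            ground[i, j] = (gauss_kl_diag(mu_a[i], var_a[i], mu_b[j], var_b[j]) +
                            gauss_kl_diag(mu_b[j], var_b[j], mu_a[i], var_a[i]))
    # Transportation LP: flows f[i, j] >= 0 with rows summing to w_a and
    # columns summing to w_b (both weight vectors sum to one).
    a_eq = np.zeros((ka + kb, ka * kb))
    for i in range(ka):
        a_eq[i, i * kb:(i + 1) * kb] = 1.0
    for j in range(kb):
        a_eq[ka + j, j::kb] = 1.0
    b_eq = np.concatenate([w_a, w_b])
    res = linprog(ground.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```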
Below are listed the top-ranking artists by r-precision along with their average r-precision values. This means that their songs are more closely connected to each other in the similarity network than other artists'. Again, I'm only modeling timbre, so artists with a highly consistent "sound" will have high average r-precision.
  1. westlife - 0.711
  2. korn - 0.589
  3. mya - 0.456
  4. goo goo dolls - 0.422
  5. lionel richie - 0.411
  6. deftones - 0.378
  7. craig david - 0.378
  8. ricky martin - 0.367
  9. staind - 0.367
  10. savage garden - 0.356
We see Westlife, an Irish boy band, at the top of the list. While I'd like to chalk this up to the homogeneous sound of teen pop music, I think some of these files are only fragments of songs, perhaps making the models particularly distant from others. But Korn and the Goo Goo Dolls don't have this excuse.
Looking at the bottom of the list:
  1. chemical brothers - 0.0
  2. depeche mode - 0.0111
  3. radiohead - 0.0222
  4. fatboy slim - 0.0222
  5. daft punk - 0.0222
  6. coldplay - 0.0222
  7. sting - 0.0333
  8. portishead - 0.0333
  9. pet shop boys - 0.0333
  10. oasis - 0.0333
So, it appears that artists we would naturally associate with being charmingly inconsistent are indeed at the bottom of the r-precision list. Furthermore, the collection actually uses songs from multiple albums for several of these artists (3 + a single for Chemical Bros., 3 + a singles collection for Depeche Mode, and 5 for Radiohead), compared to the top artists (1 each, except for Craig David's 2 + a single).
This shows that my content-based recommendation engine just may be doing what it's supposed to. A track from The Bends would not be the most appropriate result for a query seeded from a Kid A track, something I wouldn't expect a collaborative-filtering-based engine to necessarily deal with. This agnostic power is what appeals to me most about this approach. A machine trained to analyze, and dare I say "understand", music recommends based on the music as it is encoded as audio (which, after all, is how humans perceive it), not by any tags or hype that may be attached to it.

Homogenization by distance

An easy way to remove these distant components seen in anti-hubs is to simply ignore them. So, using several empirically determined thresholds, I removed any component farther than the threshold from the GMM centroid. I did this iteratively, removing the component farthest (see footnote) from the centroid and recomputing the centroid, until all remaining components satisfied the distance requirement. This was to remove the effect of the distant components on the original centroid. The thresholds I ran were 30, 25, 20, and 15 (Euclidean distance in 20-dimensional MFCC space). This is similar to what JJ does in his thesis, but he used prior probabilities to homogenize, and those do not have a strong correlation with their parent model's hubness. This, in itself, is sort of non-intuitive, since one would think priors show the "importance" of a component, but one must remember that with mixture models the components often overlap heavily. A particular component's prior could be relatively low, but together with its neighboring components the weight in that region could be quite large or "important".
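A rough sketch of that iterative pruning is below, with the model again assumed to be stored as NumPy arrays. I take the centroid to be the unweighted mean of the component means and renormalize the surviving priors at the end; both of those details are my assumptions rather than anything stated above.

```python
import numpy as np

def homogenize_by_distance(weights, means, variances, thresh=20.0):
    """Iteratively remove the component farthest from the model centroid
    (recomputing the centroid after each removal) until every remaining
    component lies within `thresh` Euclidean distance of the centroid."""
    keep = np.ones(len(means), dtype=bool)
    while keep.sum() > 1:
        centroid = means[keep].mean(axis=0)      # unweighted centroid (assumption)
        dists = np.linalg.norm(means - centroid, axis=1)
        dists[~keep] = -np.inf                   # ignore components already removed
        worst = int(np.argmax(dists))
        if dists[worst] <= thresh:
            break                                # everything left is close enough
        keep[worst] = False
    new_weights = weights[keep] / weights[keep].sum()
    return new_weights, means[keep], variances[keep]
```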
First, a sanity check: the idea is that hubs are modeled appropriately, while anti-hubs have components modeling timbrally distant song sections, in turn making the models inaccurately distant from others. By this logic, homogenization should affect anti-hubs more than hubs. To verify this, I looked at the difference in the number of components between the homogenized models and the originals (which had 32 components) in relation to the hubness of each song. Below are scatter plots for each homogenization run.
We can see that with slight homogenization (e.g. thresholds of 30 or 25) most strong hubs are unaffected (i.e. difference = 0), but with increased homogenization, songs across the board see their component counts reduced. So, I'd say this is reasonable.
The end results turn out to be mixed. The overall hubness of the set seems to improve (i.e. decrease). Below is the histogram for each homogenization run.
As the models are homogenized, we see the middle of the histogram "fatten" as the number of strong hubs and anti-hubs both decrease. Using the 100-occurrence measure, the number of hubs (h greater than 200) is 157, 150, 146, and 151 for no homogenization and thresholds of 30, 25, 20, and 15, respectively. The number of anti-hubs (h less than 5) is 124, 113, 102, 91, and 78, respectively. This is promising, but it may simply be another sanity check, since I based the homogenization on the observation that there was a strong correlation between hubness and distant components. The real question is whether the recommendations are better. Since there is no real ground truth with this kind of work (although some have sought it), one simple measure to look at is r-precision. This is the proportion of songs by the same artist that are returned in the top-9 recommendations (9 because there are 10 songs per artist in the uspop2002 collection). If an artist is highly consistent, in that each of his songs is closer to his other songs than to any other artist's songs, r-precision will be high. This is of course problematic, since an artist's sound can vary significantly from song to song, not to mention from album to album. But since it's easy and relatively reasonable, I'll use it anyway.
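As a sketch of the measure, assuming a full song-by-song distance matrix and a parallel array of artist labels (both variable names are hypothetical):

```python
import numpy as np

def r_precision(dist, artists, r=9):
    """Per-song r-precision: the fraction of the r nearest neighbours
    (excluding the query itself) that share the query song's artist.
    dist: (N, N) distance matrix; artists: length-N array of labels."""
    artists = np.asarray(artists)
    scores = np.empty(len(artists))
    for i in range(len(artists)):
        order = np.argsort(dist[i])
        top = order[order != i][:r]        # drop the query, keep the top r
        scores[i] = np.mean(artists[top] == artists[i])
    return scores

# Average over all songs, as reported here:
# print(r_precision(dist_matrix, artist_labels).mean())
```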
It turns out that homogenization actually hurts r-precision. Over the same runs as above, the average r-precisions over all songs are 0.16169, 0.15989, 0.1564, 0.1472, and 0.12275. This means that the distant components did have some discriminative power, at least in the problem of artist classification.
So, homogenization, believe it or not, may not be the holy grail to improving this approach.

Footnote: A professor of mine once pointed out that the definitions of farther and further, which are distinct only in their literal or figurative usage (e.g. farther down the road, further into debt), tend to gradually exchange meanings back and forth over time, usually with a period of only a few decades. So, if you're reading this blog in twenty years, know that at the time of this post, farther indeed refers to a physical distance.

Friday, January 18, 2008

Can we really trust EM?

To make my case against the bag-of-frames approach, I've been looking at the origin of "hubs", songs that are found to be unusually similar to many other songs in a database. These not only produce lots of false positives but, because recommendation lists are constant-sum, also lead to false negatives by beating out appropriate recommendations.
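To make the hubness numbers below concrete, here's a minimal sketch of the 100-occurrence measure as I understand it, computed from a full distance matrix (the variable names are made up):

```python
import numpy as np

def hubness_100(dist):
    """100-occurrence hubness: for each song, the number of other songs
    that include it in their top-100 nearest-neighbour list.
    dist: (N, N) distance matrix."""
    n = dist.shape[0]
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        order = np.argsort(dist[i])
        top = order[order != i][:100]      # song i's 100 nearest neighbours
        counts[top] += 1
    return counts
```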
Working from Adam Berenzweig's blogged experiments, I've found a somewhat strong correlation between hub songs and the overall spread of their components. Hubs tend to have components tightly clustered around their centroid, whereas anti-hubs have components significantly far from each other. To verify, I found the median intra-component KL-divergence for each song model. The correlation between this and the song's "hubness" (the number of times it occurs in other songs' top-100 lists, a.k.a. JJ's 100-occurrence measure) was -0.4179 (p-value = 1.21e-45). In other words, the stronger the hub, the more compact the GMM components are in MFCC space.
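For anyone wanting to reproduce that correlation, here's a minimal sketch under the same diagonal-covariance assumption; the `models` and `hubness` variables in the commented usage are hypothetical.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def sym_kl_diag(mu0, var0, mu1, var1):
    """Symmetrized closed-form KL divergence between two
    diagonal-covariance Gaussians."""
    kl01 = 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + np.log(var1 / var0))
    kl10 = 0.5 * np.sum(var1 / var0 + (mu0 - mu1) ** 2 / var0 - 1.0 + np.log(var0 / var1))
    return kl01 + kl10

def median_component_spread(means, variances):
    """Median pairwise symmetrized KL among one model's own components."""
    return np.median([sym_kl_diag(means[i], variances[i], means[j], variances[j])
                      for i, j in combinations(range(len(means)), 2)])

# Hypothetical usage: `models` is a list of (means, variances) pairs, and
# `hubness` the per-song 100-occurrence counts from the sketch above.
# spreads = np.array([median_component_spread(m, v) for m, v in models])
# r, p = pearsonr(spreads, hubness)
```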
Then, I started looking at the activation of the individual GMM components over the MFCC frames of the songs and noticed that the more distant a component is from the GMM's centroid, the more likely it is to have come from a timbrally spurious section of the song. These sections can be as short as a few frames, but EM apparently still devotes components to them. Below is a good example from the GMM (16 components, diagonal covariance) of The Bloodhound Gang's "Right Turn Clyde" (hubness value of zero!). The activations are shown on the right and the Euclidean distance from the GMM centroid on the left. It's clear that at least 7 of the 16 components are devoted to the short section in the middle, and these components are the farthest from the model's centroid.
Hubs, on the other hand, have nice dense activations where every component seems to be modeling a large part of the song. Of course, this is partly due to the song itself being timbrally homogeneous, but it's also due to EM simply modeling it better. The example below is from the top hub, Sugar Ray's "Ours" (a hubness value of 604, or 57.52% of the database!).
So, what can we do about this? Is it reasonable to neglect the outlier sections of songs, which are probably just breaks or intros (as opposed to salient parts like choruses)? Is it right to think songs' models should be more centralized?