An easy way to deal with the distant components seen in anti-hubs is to simply ignore them. So, using several empirically determined thresholds, I removed any component lying farther than the threshold from the GMM centroid. I did this iteratively: remove the component farthest (see footnote) from the centroid, recompute the centroid, and repeat until all remaining components are inside the distance requirement. The recomputation is there to remove the effect of the distant components on the original centroid. The thresholds I ran were 30, 25, 20, and 15 (this is Euclidean distance in 20-dimensional MFCC space). This is similar to what JJ does in his thesis, but he used prior probabilities to homogenize, which do not have a strong correlation with their parent model's hubness. This, in itself, is sort of non-intuitive, since one would think priors show the "importance" of a component, but one must remember that in mixture models components often overlap heavily. In this way, a particular component's prior could be relatively low, but the priors of its neighboring components together could be quite large or "important".
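To make the iterative procedure concrete, here is a minimal sketch of it in Python. It is not the code I actually ran. The function name `homogenize` is made up for illustration, and it assumes that the GMM centroid is the prior-weighted mean of the component means, with priors renormalized over the components still kept. If you define the centroid differently (say, as an unweighted mean), only that one line changes.

```python
import numpy as np

def homogenize(means, weights, threshold):
    """Return a boolean mask of the components kept after homogenization.

    means:     (k, d) array of component means (e.g. d = 20 MFCC dimensions)
    weights:   (k,) array of component priors
    threshold: maximum allowed Euclidean distance from the centroid

    Assumption: the centroid is the prior-weighted mean of the kept
    component means. The post does not spell out this definition.
    """
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = np.ones(len(means), dtype=bool)

    # Always leave at least one component in the model.
    while keep.sum() > 1:
        w = weights[keep] / weights[keep].sum()   # renormalized priors
        centroid = w @ means[keep]                # recomputed every pass
        dists = np.linalg.norm(means[keep] - centroid, axis=1)
        worst = int(np.argmax(dists))
        if dists[worst] <= threshold:
            break                                 # everything is close enough
        # Drop only the single farthest component, then start over.
        keep[np.flatnonzero(keep)[worst]] = False
    return keep
```

Removing one component per pass matters: a single far-off component drags the centroid toward itself, so components that look distant on the first pass may be well within the threshold once it is gone.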
First a sanity check: the idea is that hubs are modeled appropriately, and anti-hubs have components modeling timbrally distant song sections, in turn making the models inaccurately distant from others. By this logic, homogenization should affect anti-hubs more than hubs. To verify this, I looked at the difference in the number of components between the homogenized models and the originals (which had 32 components) in relation to the hubness of each song. Below are scatter plots for each homogenization run.
We can see that with slight homogenization (e.g. 30 or 25) most strong hubs are unaffected (i.e. difference = 0) but with increased homogenization, songs across the board are seeing reduced components. So, I'd say this is reasonable.
The end results turn out to be mixed. The overall hubness of the set seems to improve (i.e. decrease). Below is the histogram for each homogenization run.
As the models are homogenized, we see the middle of the histogram "fatten" as the number of strong hubs and anti-hubs both decrease. Using the 100-occurrences measure, the number of hubs (h greater than 200) is 157, 150, 146, and 151 for no homogenization and thresholds of 30, 25, 20, and 15, respectively. The number of anti-hubs (h less than 5) is 124, 113, 102, 91, and 78, respectively. This is promising but may simply be another sanity check, since I based the homogenization on the observation that there was a strong correlation between hubness and distant components. The real question is whether the recommendations are better. Since there is no real ground truth with this kind of work (although some have sought it), one simple measure to look at is r-precision. This is the proportion of the top-9 recommendations for a song that are by the same artist (9 because there are 10 songs per artist in the uspop2002 collection). If an artist is highly consistent, in that each of his songs is closer to his other songs than to any other artist's songs, r-precision will be high. This is of course problematic since an artist's sound can vary significantly from song to song, not to mention between albums. But since it's easy and relatively reasonable, I'll use it anyway.
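For anyone who wants to try the same measure, r-precision as described above can be sketched in a few lines of Python. This is an illustration, not my actual evaluation code: the function name `r_precision` and the inputs (a precomputed song-to-song distance matrix `dist` and a list of artist labels) are assumptions, and the query song itself is excluded from its own recommendations.

```python
import numpy as np

def r_precision(dist, artists, r=9):
    """Per-song r-precision from a pairwise distance matrix.

    dist:    (n, n) array, dist[i, j] = distance between song models i and j
    artists: length-n sequence of artist labels
    r:       number of recommendations examined (9 when every artist
             has 10 songs, so 9 other songs share the query's artist)
    """
    dist = np.asarray(dist, dtype=float)
    artists = np.asarray(artists)
    n = len(artists)
    if r > n - 1:
        raise ValueError("r must not exceed the number of other songs")

    scores = np.empty(n)
    for i in range(n):
        d = dist[i].copy()
        d[i] = np.inf                          # never recommend the query itself
        top = np.argsort(d, kind="stable")[:r] # the r nearest other songs
        scores[i] = np.mean(artists[top] == artists[i])
    return scores
```

Averaging the returned array over all songs gives a single figure per run.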
It turns out that homogenization actually hurts r-precision. Over the same runs as above, the average r-precisions over all songs are 0.16169, 0.15989, 0.1564, 0.1472, and 0.12275. This means that the distant components did have some discriminative power, at least in the problem of artist classification.
So, homogenization, believe it or not, may not be the holy grail for improving this approach.
Footnote: A professor of mine once pointed out that the definitions of farther and further, which are distinct only in their literal or figurative usage (e.g. farther down the road, further into debt), tend to gradually exchange meanings back and forth over time, usually with a period of only about a few decades. So, if you're reading this blog in twenty years, know that at the time of this post, farther indeed refers to a physical distance.