To make my case against the bag-of-frames approach, I've been looking at the origin of "hubs": songs that are found to be unusually similar to many other songs in a database. These not only produce lots of false positives but, because recommendation lists are constant-sum, they also lead to false negatives by beating out appropriate recommendations.
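To make the constant-sum point concrete, here is a minimal sketch of counting how often each song lands on other songs' top-N lists (the occurrence count used as "hubness" in this post). The function name `hubness_counts` and the toy distance matrix are purely illustrative assumptions, not the code or data behind the numbers reported here.

```python
def hubness_counts(dist, top_n=100):
    """Count how often each song appears in other songs' top-N lists.

    dist is a square list-of-lists of pairwise song distances
    (hypothetical input; smaller means more similar).
    """
    n = len(dist)
    counts = [0] * n
    for query in range(n):
        # Rank every other song by distance to the query song.
        others = [j for j in range(n) if j != query]
        others.sort(key=lambda j: dist[query][j])
        for j in others[:top_n]:
            counts[j] += 1
    return counts


# Toy 4-song database in which song 0 is close to everything.
toy_dist = [
    [0, 1, 2, 3],
    [1, 0, 5, 6],
    [2, 5, 0, 7],
    [3, 6, 7, 0],
]
toy_counts = hubness_counts(toy_dist, top_n=1)
```

Because every query hands out exactly `top_n` slots, the counts always sum to the number of songs times `top_n`, so each slot a hub takes is a slot some other song loses.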
Working from Adam Berenzweig's blogged experiments, I've found a somewhat strong correlation between hub songs and the overall spread of their GMM components. Hubs tend to have components tightly clustered around their centroid, whereas anti-hubs have components that lie far from each other. To verify, I found the median KL-divergence between the components of each song model. The correlation between this and the song's "hubness" (the number of times it occurs on other songs' top-100 lists, aka JJ's 100-occurrence measure) was -0.4179 (p-value = 1.21e-45). In other words, the stronger the hub, the more compact its GMM components are in MFCC space.
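As a rough sketch of the spread measure, the snippet below takes the median KL-divergence over all pairs of components in one song's GMM, using the closed form for diagonal-covariance Gaussians. Symmetrising the divergence and ignoring the mixture weights are my assumptions for illustration; the function names are hypothetical, and the toy models are made up.

```python
import math
from itertools import combinations
from statistics import median


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians (closed form)."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def median_component_spread(means, variances):
    """Median symmetrised KL over all component pairs of one song model.

    Assumption: mixture weights are ignored and KL is symmetrised by
    summing both directions.
    """
    pair_kls = []
    for i, j in combinations(range(len(means)), 2):
        forward = kl_diag_gauss(means[i], variances[i], means[j], variances[j])
        backward = kl_diag_gauss(means[j], variances[j], means[i], variances[i])
        pair_kls.append(forward + backward)
    return median(pair_kls)


# Toy 3-component models in 2-D: one compact, one spread out.
unit_vars = [[1.0, 1.0]] * 3
compact = median_component_spread([[0, 0], [0.1, 0], [0, 0.1]], unit_vars)
spread = median_component_spread([[0, 0], [5, 0], [0, 5]], unit_vars)
```

Computing this value per song and correlating it with the hubness counts is the kind of check described above; a negative correlation means more compact models go with stronger hubs.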
Then, I started looking at the activation of the individual GMM components over the MFCC frames of the songs and noticed that the more distant a component is from the GMM's centroid, the more likely it is to come from a timbrally spurious section of the song. These sections can be as short as a few frames, but EM apparently still devotes components to them. Below is a good example from the GMM (16 components, diagonal covariance) of The Bloodhound Gang's "Right Turn Clyde" (hubness value of zero!). The activations are shown on the right and the Euclidean distance from the GMM centroid is on the left. It's clear that at least 7 of the 16 components are given to the short section in the middle, and these components are the furthest from the model's centroid.
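For anyone wanting to reproduce this kind of plot, here is a hedged sketch of the two quantities involved: each component's Euclidean distance from the GMM centroid, and the per-frame activation of each component. I'm assuming the centroid is the weight-averaged component mean and that "activation" means the posterior responsibility of a component for a frame; both are assumptions, and all names and numbers below are illustrative.

```python
import math


def gmm_centroid(weights, means):
    """Weight-averaged mean of the component means (assumed definition)."""
    total = sum(weights)
    dim = len(means[0])
    return [
        sum(w * m[d] for w, m in zip(weights, means)) / total
        for d in range(dim)
    ]


def centroid_distances(weights, means):
    """Euclidean distance of each component mean from the GMM centroid."""
    centroid = gmm_centroid(weights, means)
    return [math.dist(m, centroid) for m in means]


def log_gauss_diag(x, mu, var):
    """Log-density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mu, var)
    )


def responsibilities(frame, weights, means, variances):
    """Posterior probability of each component given one MFCC frame."""
    logs = [
        math.log(w) + log_gauss_diag(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(value - peak) for value in logs]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


# Toy 3-component model: two components share the bulk of the song,
# one low-weight component sits far away (a short outlier section).
toy_weights = [0.45, 0.45, 0.10]
toy_means = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
toy_vars = [[1.0, 1.0]] * 3
toy_dists = centroid_distances(toy_weights, toy_means)
toy_resp = responsibilities([10.0, 0.0], toy_weights, toy_means, toy_vars)
```

Stacking the responsibilities for every frame gives an activation map like the one described above, and sorting components by `centroid_distances` shows whether the far-out components are the ones that fire only on a short stretch.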
Hubs, on the other hand, have nice dense activations where every component seems to be modeling a large part of the song. Of course, this is due to the song itself being timbrally homogeneous, but it's also due to EM simply modeling it better. The example below is from the top hub, Sugar Ray's "Ours" (hubness value of 604!, 57.52% of the database).
So, what can we do about this? Is it reasonable to neglect the outlier sections of songs, which are probably just breaks or intros (as opposed to salient parts like choruses)? Is it right to think songs' models should be more centralized?