
Friday, February 1, 2008

Intra-artist distance (or adventures in CBR)

Another way to perhaps more directly see the consistency of an artist is to look at the computed distances between songs. If our models are working, songs from the same artist should be relatively similar, so their distances should tend to be low. As a contrast to the r-precision fun we had in a previous post (and to keep those readers who aren't MIR obsessives entertained), I found the mean intra-artist distances.
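(For the curious, here's a minimal sketch of such a computation, assuming a precomputed pairwise song-distance matrix D as a NumPy array and a parallel list of artist labels; the names are hypothetical stand-ins, not my actual code.)

    import numpy as np

    def mean_intra_artist_distance(D, artists):
        """Return {artist: mean pairwise distance among that artist's own songs}."""
        artists = np.asarray(artists)
        result = {}
        for artist in np.unique(artists):
            idx = np.where(artists == artist)[0]
            if len(idx) < 2:
                continue  # need at least two songs to form a pair
            sub = D[np.ix_(idx, idx)]
            pairs = sub[np.triu_indices(len(idx), k=1)]  # each pair counted once
            result[artist] = pairs.mean()
        return result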

Top 10!
  1. ricky martin - 37.5232
  2. smash mouth - 43.5782
  3. steve winwood - 47.0957
  4. third eye blind - 47.3322
  5. korn - 48.7898
  6. fleetwood mac - 49.2755
  7. jennifer paige - 49.5098
  8. sugar ray - 49.724
  9. lionel richie - 51.0326
  10. mya - 51.204
Bottom 10! (ignoring Westlife, whose distances are inaccurately huge)
  1. prince - 708.6668
  2. jamiroquai - 673.2426
  3. natalie imbruglia - 610.4856
  4. oasis - 493.2065
  5. bloodhound gang - 342.4461
  6. daft punk - 231.6445
  7. radiohead - 229.4721
  8. tool - 198.714
  9. miles davis - 197.7051
  10. frank sinatra - 184.6662
We see a few repeats from the r-precision lists, but maybe not as many as we'd expect. There is a statistically significant but only moderately strong negative correlation between r-precision and mean intra-artist distance (corr. coef. = -0.2373, p-value = 0.0153).
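(Checking a correlation like this is a one-liner with SciPy; a sketch assuming per-artist arrays of r-precisions and mean intra-artist distances, with made-up variable names:)

    from scipy.stats import pearsonr

    r, p = pearsonr(r_precisions, mean_intra_dists)  # e.g. r = -0.2373, p = 0.0153
    print(r, p)
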
So, I dug deeper. The r-precision ranks were based on the top recommendations for each song, just what hubs (and anti-hubs) are best at mucking up. The lists above are based on means of distances, which are particularly sensitive to outliers (which, as we have seen, are usually badly modeled anti-hubs).
Let's take Jamiroquai. Below is a visualization of his inter-song log-distances (red = distant).
Looks like we have an outlier, and its name is "Picture Of My Life" from the epic album "Funk Odyssey". This track's hubness (using 100-occurrences) is 2, so it's easily considered an anti-hub. Taking a look at the "activation-gram", we see a weird section about 3.5 minutes in.
Clearly at least 7 of the 32 components were trained solely to model this part of the track. What is this strange musical section, you may ask? It turns out to be the silence between the end of the song and the beginning of the "hidden track", "So Good To Feel Real". You can even see that the second song is not modeled as well as the first. Oh, the joys of content-based recommendation!
After more digging around, it looks like most artists at the bottom of the list above have just one song in their set that, for some weird but understandable reason (e.g., there's a Michael Jackson "song" which seems to just be a bonus voice-over included on the remastered edition of "Off The Wall"), doesn't fit with the others.
There's a statistic that's particularly good at weeding out these outlier songs: the median. It turns out, and makes sense, that the median intra-artist distances are more strongly (negatively) correlated with the average r-precision: corr. coef. = -0.4195, p-value ≈ 0. Using the median indeed replaces the suspect bottom of the above list with our familiar, typically inconsistent artists. So, the median is good and I am a fan (although without first using the mean I would never have listened to those hot Jamiroquai tracks).
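(A toy illustration, with made-up numbers, of why the median shrugs off a single badly modeled track while the mean does not:)

    import numpy as np

    dists = np.array([45.0, 52.0, 48.0, 50.0, 700.0])  # one outlier "song"
    print(dists.mean())      # 179.0: the outlier drags the mean way up
    print(np.median(dists))  # 50.0: the median barely notices
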
I'd also like to preempt a question you may be asking: why not use better data? I could, and have often thought about it, but the uspop collection is something of a standard, which makes it easier for anyone to cross-check my work against their own. Besides, input problems like the ones shown here are realistic problems any good recommendation engine should be able to handle.

Wednesday, January 30, 2008

Homogenization by activation

There was an arguably small improvement in hubness, and a decline in r-precision, when homogenizing GMMs by distance from their global centroid. Since we saw early on in the "activation-gram" that certain components are only active (i.e., likely to represent a sample) for a relatively small number of frames, why not base the homogenization criterion on the activation itself, instead of an indirect correlate?
So, I looked at the median activation level for each component over the entirety of a song and simply dropped any component whose median activation did not meet a threshold (again, empirically derived).
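(Roughly, the pruning might look like the sketch below, assuming a frames-by-components array of per-frame log activations for a song, and assuming that dropping a component means discarding its weight, mean, and covariance and renormalizing the remaining weights; all names are hypothetical.)

    import numpy as np

    def prune_components(log_activations, weights, means, covars, threshold=-100.0):
        """Drop mixture components whose median per-frame log activation is below threshold."""
        med = np.median(log_activations, axis=0)  # one median per component
        keep = med >= threshold
        new_weights = weights[keep]
        new_weights = new_weights / new_weights.sum()  # renormalize the mixture
        return new_weights, means[keep], covars[keep]
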
Below are the same plots used in the distance-based homogenization: first, hubness vs. number of components removed; second, hubness histograms.
From the first figure, we see that, again, homogenization indeed tends to affect anti-hubs more than hubs, as intended.
The number of hubs (more than 200 100-occurrences) for each threshold (-110, -100, -90, and -80 in log likelihood) was 162, 155, 160, and 153, compared to 156 for no homogenization. The number of anti-hubs for each run was 138, 110, 142, and 149, compared to 124 for no homogenization. It seems, and is clear from the histograms, that the only threshold that helps us (decreasing both hubs and anti-hubs) is -100. We saw over-homogenization adversely affect hubness in the distance-based method as well. I should look into this.
Maximum hub values for each run were 601, 592, 576, and 570, compared to the original 580, so there's at least a monotonic decrease across thresholds.
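(For anyone following along, here's a sketch of how these hub statistics might be computed, assuming a full pairwise song-distance matrix D as a hypothetical NumPy array:)

    import numpy as np

    def n_occurrences(D, n=100):
        """For each song, count how many other songs rank it among their n nearest neighbours."""
        counts = np.zeros(len(D), dtype=int)
        for i in range(len(D)):
            d = D[i].copy()
            d[i] = np.inf                    # a song is not its own neighbour
            counts[np.argsort(d)[:n]] += 1   # credit this song's n nearest neighbours
        return counts

    occ = n_occurrences(D, n=100)
    num_hubs = int((occ > 200).sum())  # hubs: more than 200 100-occurrences
    max_hub = int(occ.max())           # the maximum hub value reported above
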
Interestingly, the -100 threshold also yields a slightly higher r-precision (0.16339, compared to the non-homogenized 0.16169). The other average r-precisions are 0.15947, 0.15767, and 0.15196 (for the -110, -90, and -80 thresholds; I should learn to make tables). This is in contrast to the distance homogenization, where hubness was seemingly improved but r-precision suffered for all thresholds. Granted, the improvement is small and may not be statistically significant (more on this in a later post).
So, even with "manually" kicking out components that do not contribute much to the model (and usually corresponding to outlier musical sections), we don't see much overall improvement. I must look into this more.