I chose to homogenize by covariance since it looks like that's the main antihub correlate for this data. The plot below shows the log-determinant for each component of each model (that's 32 x 897 components). I'm convinced that the super tiny variance components are just ones that collapsed in EM and should definitely be removed. So, I picked two thresholds for now: 300 to remove all the super tiny components, and 150 to remove most everything outside of that massive band around 100.
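As a rough sketch of what this pruning looks like (names, array shapes, and the threshold here are illustrative, not my actual code), assuming each model's component weights, means, and covariances are numpy arrays:

```python
import numpy as np

def prune_by_logdet(weights, means, covs, threshold):
    """Drop GMM components whose covariance log-determinant falls below
    `threshold`, then renormalize the remaining mixture weights.

    weights: (K,), means: (K, D), covs: (K, D, D).
    """
    # slogdet is numerically safer than log(det(...)) for the near-singular
    # covariances of components that collapsed in EM
    logdets = np.array([np.linalg.slogdet(c)[1] for c in covs])
    keep = logdets >= threshold
    w = weights[keep]
    return w / w.sum(), means[keep], covs[keep]
```

The renormalization keeps the pruned model a valid mixture; the collapsed components show up as extreme log-determinant outliers, so a single scalar cut is enough to catch them.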
Artist R-precision increased for both homogenization thresholds (I'm not attempting another table):
35.85% for 300
38.24% for 150
Compared to the unhomogenized 32.27% (different from the last post because of some metadata cleanup). Differences are significant under the Wilcoxon test (p-values ~ 0).
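The significance check is just the paired Wilcoxon signed-rank test on the per-song R-precision scores; a minimal sketch with synthetic stand-in data (the arrays here are made up for illustration):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# hypothetical per-song artist R-precision, before and after homogenization
rprec_before = rng.uniform(0.0, 1.0, size=897)
rprec_after = np.clip(rprec_before + rng.normal(0.05, 0.02, size=897), 0.0, 1.0)

# Wilcoxon signed-rank on the paired scores: a tiny p-value means the
# median per-song change is significantly different from zero
stat, p = wilcoxon(rprec_before, rprec_after)
print(f"p-value: {p:.3g}")
```

The signed-rank test only assumes symmetric-ish paired differences, which is why it's used here instead of a t-test.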
Seems the hubness increases though.
# of hubs (100-occurrences greater than 200):
105 for no homogenization
121 for 300
119 for 150
# of antihubs (100-occurrences less than 20):
121 for no homogenization
114 for 300
131 for 150
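The n-occurrence counts behind these tallies can be sketched roughly like this, assuming a full song-to-song distance matrix (the function names are mine, not from the actual experiments):

```python
import numpy as np

def n_occurrences(dist, n=100):
    """For each song, count how many other songs' top-`n` nearest-neighbor
    lists it appears in (the 'n-occurrence' hubness measure)."""
    d = dist.copy().astype(float)
    np.fill_diagonal(d, np.inf)           # a song can't be its own neighbor
    ranks = np.argsort(d, axis=1)[:, :n]  # top-n neighbor indices per row
    return np.bincount(ranks.ravel(), minlength=len(d))

def hub_stats(counts, hub_min=200, antihub_max=20):
    """(# hubs, # antihubs) under the thresholds used in this post."""
    return (counts > hub_min).sum(), (counts < antihub_max).sum()
```

Note that the counts always sum to n times the number of songs, a constant-sum property that matters for the significance testing later on.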
So, I guess we're trading smooth hub distribution for precision.
I'll look into the other homogenization methods soon.
Tuesday, March 18, 2008
Thursday, March 13, 2008
North Indian Classical Music dataset
To compare against results from the uspop dataset, we put together a set of North Indian classical music (NICM) and ran it through the same CBR fun. This was done with my advisor, Parag Chordia, who's done a lot of work with MIR and Indian music. In all, there are 897 tracks from 141 artists.
For ground truth, we can of course look at artist R-precision as before: it came out to be 30.97%, with a random baseline of 2.3%, about the same as I was getting with the uspop set. Parag also labeled each artist with a primary instrument name. With these we can see if the modeling is matching songs based on the timbral characteristics of the main sound source present in the song, or if it's locking onto more abstract qualities (like audio fidelity).
I used a kNN classifier with leave-one-out cross-validation, in the same way Elias did in his thesis. The results are below and hopefully readable. The mean accuracies (as %) are shown for each number of nearest neighbors polled. The means basically represent the average proportion of the k nearest neighbors that share the seed's primary instrument. For a baseline, I averaged the scores over 100 random kernels for each k level; it was about 23.1% for each level.
              k=1    k=3    k=5    k=10   k=20
nicm kernel   81.05  74.96  70.68  64.97  58.12
Not bad, but to ensure the accuracy is based solely on the instrument similarity, we apply an artist filter, as advocated by Elias. This basically removes any other songs from the same artist as the seed from the potential nearest neighbor pool. This removes the chance that neighbor songs are matches only because of other timbral similarities (e.g. producer effect or audio fidelity). Guess what happens?
                        k=1    k=3    k=5    k=10   k=20
nicm kernel (with af)   58.86  57.30  54.69  50.85  46.91
The random baseline is about the same at 21.4%. So, accuracy markedly decreases, but it's still significantly above random. It also doesn't fall off as fast with increased k.
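The leave-one-out evaluation above, with the artist filter as an option, can be sketched like this, assuming a precomputed distance matrix plus instrument and artist labels (names here are hypothetical):

```python
import numpy as np

def knn_instrument_accuracy(dist, labels, artists, k, artist_filter=False):
    """Mean proportion of a seed's k nearest neighbors that share its
    instrument label, leave-one-out; the artist filter removes all of the
    seed's own artist's songs from the candidate pool."""
    labels = np.asarray(labels)
    artists = np.asarray(artists)
    n = len(labels)
    scores = np.empty(n)
    for i in range(n):
        d = dist[i].astype(float).copy()
        d[i] = np.inf                       # leave-one-out: exclude the seed
        if artist_filter:                   # exclude all same-artist songs
            d[artists == artists[i]] = np.inf
        nn = np.argsort(d)[:k]
        scores[i] = np.mean(labels[nn] == labels[i])
    return scores.mean()
```

With the filter on, a seed can only be matched through genuinely different recordings, which is exactly why the accuracy drop is informative.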
Next, I'd like to homogenize the models and see if these scores improve.
Wednesday, February 20, 2008
Nice Gaussian hubs
In trying to prove that antihubs are the root of all CBR evil (at least in my world), I've mainly looked at the fact that they have strange distributions, which typical modeling methods (i.e. GMMs!) tend to model in a way that isn't exactly helpful.
Recently, I've been looking at nonparametric modeling and through this got a nice visualization of the distributions of MFCC frames (probably something I should have done from the beginning!). Below I show the distributions for the first 6 MFCC dimensions (by row, so the first row has MFCCs 1 through 3), first for a prototypical hub (Carly Simon's "We Have No Secrets", 368 100-occurrences) and then for a prototypical antihub (Bloodhound Gang's "Your Only Friends Are Make Believe", 2 100-occurrences).
We see that, indeed, a hub has nice, relatively Gaussian distributions, while the antihub's are nasty and multimodal. This further vindicates the rationale for homogenization: modes exist in the distribution of antihubs' frames that are perhaps not relevant to a good timbral model and we'd like to get rid of them. Homogenization pushed to its extreme, after all, would lead to a nice single Gaussian.
To see if this idea really generalizes, I looked at how well each distribution fit a single Gaussian distribution, parameterized to the distribution's mean and variance. Below is the scatter plot of hubness vs. the log-likelihood.
It's not as strong a correlation (rho = 0.0939, p-value = 0.0196) as I was expecting from just looking at the histograms. We do see that both the most and least likely single Gaussians are antihubs. I think this could be explained by songs with lots of a single timbre (e.g. silence). This would mean lots of samples fall near the mean and there would be a very small variance, leading to high likelihood values, while all of the relevant frames (i.e. the music) are far from this mean and receive low likelihood values. This is the case with our favorite subset of tracks: those with "hidden" tracks, like the previously mentioned Jamiroquai track and "Chris Cayton" by Goldfinger (2 100-occurrences, MFCC histograms below) (check out the comments on its last.fm page).
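The single-Gaussian fit described above amounts to scoring each song's frames under a Gaussian with the frames' own mean and covariance; a minimal sketch (function name is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal, spearmanr

def single_gaussian_loglik(frames):
    """Average log-likelihood of a song's MFCC frames (n_frames x n_dims)
    under a single Gaussian fit to those frames' mean and covariance.
    Multimodal frame distributions inflate the fitted covariance and so
    score lower than tight unimodal ones."""
    mu = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)
    return multivariate_normal(mu, cov, allow_singular=True).logpdf(frames).mean()

# the scatter-plot correlation over the collection would then be, e.g.:
# rho, pval = spearmanr(hubness_counts, [single_gaussian_loglik(f) for f in all_frames])
```

Spearman's rho is the natural choice here since the n-occurrence counts are heavily skewed.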
Friday, February 8, 2008
Strengthening neighborhoods through homogenization
One concern with homogenization is that it may uniformly pull all songs toward a central point in timbral space. Intuitively, as Aucouturier nicely points out in his thesis, homogenized models, after a point, have lost the unique timbral subtleties that provide a system with its discriminative power (the models go "from representing a given song, down to a more global style of music, down to the even simpler fact that it is music"). We hope to homogenize just enough to throw out the outlier GMM components (typical of antihubs) so that these songs are introduced to the pool, thus decreasing hubness and improving recommendation.
So, ideally, we'd expect homogenization (at the right level) to not affect most songs' placement in the timbral space and just bring in the outliers. I see it as strengthening timbral neighborhoods: where nasty components were breaking up these spots before, keeping fine songs from ever getting too close, homogenization (hopefully) comes in to bring these tracks together, where they belong.
Since we can't rely on artists to be self-similar or consistent, any kind of metric involving intra- vs. inter-class relations is inherently flawed to some extent. I'd just like to see if songs are simply clustering better after homogenization. So, I looked at the average distance to the top k nearest neighbors for each song, before and after homogenization. The plots below show most of these distances (truncated for clarity). x = before, y = after (using the distance-based method at a threshold of 15).
k = 20:
k = 100:
The relation seems to be more-or-less linear, with a significant y-offset (from the decrease in all distances). The slope seems to be fairly one-to-one (i.e. songs with lots of close neighbors remain with lots of close neighbors). We see that songs with distant neighbors tend to be affected more by homogenization (i.e. they fall further off the center of the imaginary regression line). Also, the distribution about this imaginary regression line doesn't seem even. How about a histogram of the differences?
k = 20:
The differences are before minus after homogenization, so positive values indicate a decrease in neighbor distances. We do see it's almost bell-like, but with a fatter end on the right. This means that homogenization is bringing more songs closer to their neighbors than pulling them apart. In fact, the mean difference is 18.42, median 11.36, and the skewness 17.23, verifying that long tail. (The differences passed a t-test, p-value ≈ 0.)
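The per-song shift and its summary statistics can be computed along these lines, assuming full before/after distance matrices (names are illustrative):

```python
import numpy as np
from scipy.stats import skew, ttest_1samp

def neighbor_distance_shift(dist_before, dist_after, k=20):
    """Per-song change in mean distance to the k nearest neighbors
    (positive = neighbors moved closer after homogenization).
    Returns (mean, median, skewness, t-test p-value) of the differences."""
    def mean_knn(d):
        d = d.astype(float).copy()
        np.fill_diagonal(d, np.inf)           # exclude self-distances
        return np.sort(d, axis=1)[:, :k].mean(axis=1)

    diffs = mean_knn(dist_before) - mean_knn(dist_after)
    return diffs.mean(), np.median(diffs), skew(diffs), ttest_1samp(diffs, 0.0).pvalue
```

A mean well above the median with large positive skewness is exactly the fat right tail described above.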
So, we see that all songs are indeed getting closer to their neighbors, but a good portion more than others. Is this a sign that clusters are forming? Timbral neighborhood strengthening? Are these new neighbors good (i.e. perceptually valid) neighbors? If all songs were being pulled toward some global center by homogenization, what would we expect to see? More to come after I think.
Monday, February 4, 2008
Homogenization and artist distance
Continuing my look into artist distance, I wanted to see if homogenization has any effect on smoothing out the nastiness that would keep an artist from coming up as self-similar across songs. I decided a good metric would be looking at the same median intra-artist distances as well as inter-artist distances (distances between an artist's songs and every other artist's songs). Ideally, we'd expect an artist's songs to be tightly clustered in some area of the timbral space, reasonably distant from other artists' songs. So, I looked at the differences between baseline (no homogenization) intra- and inter-artist distances and these distances after homogenization (currently only looking at the distance-based method since it seemed more well-behaved). The plots below show these distance differences for each component-distance threshold across artists, with peaks labeled for fun (it can be easy to forget we're not just dealing with numbers).
The first plot shows the intra-artist distance differences. Since we'd like to see songs from the same artist move closer to each other (i.e. decrease in distance), we consider positive differences "successes". The opposite is true for the second plot of inter-artist differences: since we'd like to see artists move away from others, smaller differences are considered "successful" here.
In general, both distance differences tend to be positive, indicating that while we are moving artists closer to themselves, we are also moving them closer to everyone else. In other words, homogenization seems to compact the entire collection in timbre space. So, when we discard outlier components from models, we are in effect making all models more similar. This effect also seems to monotonically increase with the severity of the homogenization, which makes some sense.
It's interesting to note the peaks. Certain artists (like Daft Punk, Bloodhound Gang, and Mike Oldfield) see strong improvements in intra-artist distance through homogenization. These artists also tend to be the ones that are the least self-similar before homogenization, so we are helping the artists who seem to need it the most. But these same artists also tend to be growing increasingly close to other artists, which may not be helpful.
To combine these measures of homogenization success, I next looked at the ratio between these distances for each artist. Using intra over inter, smaller values are better. I again looked at the difference between the baseline and each homogenization run. We'd like to see a positive difference since we'd like homogenization to lower the distance ratio.
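The per-artist ratio can be sketched as follows, assuming a full distance matrix and an artist label per song (names are mine):

```python
import numpy as np

def artist_distance_ratio(dist, artists):
    """Median intra-artist over median inter-artist distance, per artist.
    Smaller is better: a tight artist cluster, far from everyone else."""
    artists = np.asarray(artists)
    ratios = {}
    for a in np.unique(artists):
        mask = artists == a
        intra = dist[np.ix_(mask, mask)]
        intra = intra[~np.eye(mask.sum(), dtype=bool)]  # drop self-distances
        inter = dist[np.ix_(mask, ~mask)]
        ratios[a] = np.median(intra) / np.median(inter)
    return ratios
```

Comparing this dictionary before and after homogenization gives the per-artist differences plotted here.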
We see differences here varying a lot, with no clear across-the-board tendency. We see some of the same artists whose intra-artist distances improved the most here, but not all. And we see homogenization hurts a handful of artists sharply. Both Tool and the Fugees seem to fare significantly worse after homogenization. The Fugees are near the middle of the list of "consistent" artists (by distance), but Tool is second to last, just whom we aimed to help with homogenization. Perhaps Tool's high distances between songs aren't a result of antihubness or bad modeling at all, so homogenization of this kind is of no consequence?
Since I'm just looking at median distances, it'd be interesting to get an idea of how these models are compacting in timbral space. We simply see distances decrease with increased homogenization; we don't know whether the songs are converging to a global center or to localized neighborhoods. Maybe a visualization of the timbral space projected into a lower-dimensional space is in order.
Friday, February 1, 2008
Intraartist distance (or adventures in CBR)
Another way to perhaps more directly see the consistency of an artist is to look at the computed distances between songs. If our models are working, songs from the same artist should be relatively similar, so their distances should tend to be low. To contrast the R-precision fun we had in a previous post (and to keep those reading who aren't MIR obsessives entertained), I found the mean intra-artist distances.
Top 10!
 ricky martin  37.5232
 smash mouth  43.5782
 steve winwood  47.0957
 third eye blind  47.3322
 korn  48.7898
 fleetwood mac  49.2755
 jennifer paige  49.5098
 sugar ray  49.724
 lionel richie  51.0326
 mya  51.204
Bottom 10!
 prince  708.6668
 jamiroquai  673.2426
 natalie imbruglia  610.4856
 oasis  493.2065
 bloodhound gang  342.4461
 daft punk  231.6445
 radiohead  229.4721
 tool  198.714
 miles davis  197.7051
 frank sinatra  184.6662
So, I dig deeper. The R-precision ranks were based on the top recommendations for each song, just what hubs (and antihubs) are best at mucking up. The lists above are based on means of distances, which are particularly sensitive to outliers (which we have seen are usually badly modeled antihubs).
Let's take Jamiroquai. Below is a visualization of his inter-song log-distances (red = distant).
Looks like we have an outlier, and its name is "Picture Of My Life" from the epic album "Funk Odyssey". This track's hubness (using 100-occurrences) is 2, so it's easily considered an antihub. Taking a look at the "activationgram", we see a weird section about 3.5 minutes in.
Clearly at least 7 of the 32 components were trained to solely model this part of the track. What is this strange musical section, you may ask? It turns out to be the silence between the end of the song and the beginning of the "hidden track", "So Good To Feel Real". You can even see that the second song is also not as well modeled as the first. Oh, the joys of contentbased recommendation!
After more looking around, it looks like most artists at the bottom of the list above have just one song in their set that, for some weird but reasonable reason (e.g. there's a Michael Jackson "song" which seems to just be a bonus voiceover included on the remastered edition of "Off The Wall"), doesn't fit with the others.
There's a statistic that's particularly good at weeding out these outlier songs: the median. It turns out, and makes sense, that the median intra-artist distances are more (de)correlated with the average R-precision: corr. coef. = 0.4195, p-value ≈ 0. And we indeed replace the suspect bottom of the above list with our familiar, typically inconsistent artists. So, the median is good and I am a fan (although without first using the mean I would have never listened to those hot Jamiroquai tracks).
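The mean-vs-median contrast is easy to see in a sketch, assuming a distance matrix and artist labels (function name is mine):

```python
import numpy as np

def intra_artist_distance(dist, artists, stat=np.median):
    """Per-artist summary of intra-artist song distances. `np.median`
    resists a single outlier song (e.g. a hidden track); `np.mean` does not."""
    artists = np.asarray(artists)
    out = {}
    for a in np.unique(artists):
        m = artists == a
        d = dist[np.ix_(m, m)]
        out[a] = stat(d[~np.eye(m.sum(), dtype=bool)])  # drop self-distances
    return out
```

One badly modeled song out of ten inflates every mean it touches, while the median barely moves, which is exactly why the ranking changes.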
I'd also like to counter what you may be asking: why not use better data? I could, and have often thought about it, but the uspop collection is something of a standard and will be easier for anyone to cross-check my work against. Besides, input problems like the ones shown here are realistic problems any good recommendation engine should be able to handle.
Labels: activation, antihub, intra-artist distance
Thursday, January 31, 2008
Statistical significance of homogenization
To see if any of my homogenization experiments are actually meaningful, it's important to check for statistical significance. I first used a standard paired t-test to compare the pairwise distances between songs, hubness values, and R-precision values for each song obtained for both homogenization methods. And I made a table.
We see that all of the song-to-song distances are significantly changed by homogenization.
The R-precision values are more mixed. For the distance-based method, only the highest threshold (the one with the least effect) was not significantly changed. For the activation-based method, only the lowest two thresholds are significant. This means our only hope at improvement (activation homogenization at a 100 threshold) does not have a statistically large enough lead to mean anything. Oh well.
The reason we see no significance in the hubness with the t-test is that it's a constant-sum measure, so all of the changes cancel out and the mean remains the same (in fact, it equals the number of occurrences we observe, in this case 100). This way our null hypothesis of a zero-mean difference distribution will always be true.
Looking at the distributions of hubness differences (dist, act), it seems they aren't really normal: some have a marked skew. A better significance test for hubness change is the Wilcoxon signed-rank test, where the null hypothesis is that the median difference between pairs is zero. More tables!
Now, we see some significance. For the distance-based method, the top three thresholds have seemingly small difference medians (2, 2, and 1, meaning the homogenization decreased the median song's occurrences by 2, 2, and 1 occurrences, respectively) but large enough to be significant. The top three thresholds for the activation-based method were also significant (with 95% confidence). This is encouraging, but the changes are still small.
I'd love to hear any suggestions or complaints; my stats skills are admittedly a little rusty.
Wednesday, January 30, 2008
Homogenization by activation
There was an arguably small improvement in hubness and a decline in R-precision when homogenizing GMMs by distance from their global centroid. Since we saw early on that there are frames in the "activationgram" that show certain components are only active (i.e. likely to represent a sample) for a relatively small number of frames, why not base the homogenization criterion on the activation itself, instead of an indirect correlate?
So, I looked at the median activation level for each component over the entirety of a song and simply dropped any component whose median activation did not meet a threshold (again, empirically derived).
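A rough sketch of that criterion, where I'm taking "activation" to mean each component's per-frame weighted log-likelihood (an assumption on my part; names and the threshold are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def prune_by_activation(weights, means, covs, frames, threshold):
    """Drop GMM components whose median per-frame activation (log of the
    weighted component likelihood) stays below `threshold`; renormalize."""
    K = len(weights)
    # log w_k + log p(frame | component k), for every frame, per component
    loglik = np.stack([
        np.log(weights[k]) + multivariate_normal(means[k], covs[k]).logpdf(frames)
        for k in range(K)
    ])
    keep = np.median(loglik, axis=1) >= threshold
    w = weights[keep]
    return w / w.sum(), means[keep], covs[keep]
```

A component trained on a brief outlier section (hidden-track silence, say) is inactive for most frames, so its median activation sits far below that of the components modeling the bulk of the song.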
Below are the same plots used in the distance-based homogenization: first, hubness vs. number of components removed; second, hubness histograms.
From the first figure, we see that, again, homogenization indeed tends to affect antihubs more than hubs, as intended.
The number of hubs (more than 200 100-occurrences) for each threshold (110, 100, 90, 80 in log-likelihood) were 162, 155, 160, and 153, compared to 156 for no homogenization. The number of antihubs for each run were 138, 110, 142, and 149, compared to 124 for no homogenization. It seems, and is clear from the histograms, that the only threshold that helps us (decreasing both hubs and antihubs) is 100. We saw over-homogenization adversely affect hubness in the distance-based method also. I should look into this.
Max hub values for each run were 601, 592, 576, and 570, compared to the original 580, so there's at least a monotonic decrease.
Interestingly, the 100 threshold also yields a slightly higher R-precision value (0.16339, compared to the non-homogenized 0.16169). The other average R-precisions are 0.15947, 0.15767, and 0.15196 (for the 110, 90, and 80 thresholds; I should learn to make tables). This is in contrast to the distance homogenization, where hubness seemingly improved but R-precision suffered for all thresholds. Granted, the improvement is small and may not be statistically significant (more on this in a later post).
So, even with "manually" kicking out components that do not contribute much to the model (and usually corresponding to outlier musical sections), we don't see much overall improvement. I must look into this more.
Wednesday, January 23, 2008
R-precision as consistency
Last post, I mentioned R-precision as a way to measure the accuracy of a recommendation algorithm. I thought it might be pertinent to analyze in more detail the R-precision results of the bag-of-frames approach I'm working with.
For completeness, the results here are from the uspop2002 dataset (105 artists, 10 tracks per artist, 20 MFCCs per 66.67 ms frame) modeled with 32-component GMMs, using the KL-divergence-based earth-mover's distance (KL-EMD) as the similarity metric. This is a standard introduced years ago, and one I'm inclined to stick with for comparison's sake.
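The KL-EMD metric can be sketched as a transportation linear program over a ground-distance matrix of symmetrized KL divergences between component Gaussians. This is my own sketch assuming diagonal covariances, not necessarily the exact variant used in these experiments:

```python
import numpy as np
from scipy.optimize import linprog

def kl_gauss_diag(m1, v1, m2, v2):
    """Symmetrized KL divergence between two diagonal-covariance Gaussians."""
    def kl(ma, va, mb, vb):
        return 0.5 * np.sum(np.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

def kl_emd(w1, mu1, var1, w2, mu2, var2):
    """Earth-mover's distance between two GMMs with a symmetrized-KL ground
    distance, solved as a transportation LP (flows out of component i must
    sum to w1[i]; flows into component j must sum to w2[j])."""
    k1, k2 = len(w1), len(w2)
    D = np.array([[kl_gauss_diag(mu1[i], var1[i], mu2[j], var2[j])
                   for j in range(k2)] for i in range(k1)])
    A_eq = []
    for i in range(k1):                       # row-sum constraints
        row = np.zeros(k1 * k2); row[i * k2:(i + 1) * k2] = 1.0; A_eq.append(row)
    for j in range(k2):                       # column-sum constraints
        row = np.zeros(k1 * k2); row[j::k2] = 1.0; A_eq.append(row)
    b_eq = np.concatenate([w1, w2])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

The distance is zero for identical models (all flow rides the zero-cost diagonal) and grows as the component sets pull apart.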
Below are the top-ranking artists by R-precision, along with their average R-precision values. This means their songs are more closely connected to each other in the similarity network than other artists' are. Again, I'm only modeling timbre, so artists with a highly consistent "sound" will have high average R-precision.
 westlife  0.711
 korn  0.589
 mya  0.456
 goo goo dolls  0.422
 lionel richie  0.411
 deftones  0.378
 craig david  0.378
 ricky martin  0.367
 staind  0.367
 savage garden  0.356
Looking at the bottom of the list:
 chemical brothers  0.0
 depeche mode  0.0111
 radiohead  0.0222
 fatboy slim  0.0222
 daft punk  0.0222
 coldplay  0.0222
 sting  0.0333
 portishead  0.0333
 pet shop boys  0.0333
 oasis  0.0333
This shows that my content-based recommendation engine just may be doing what it's supposed to. A track from The Bends would not be the most appropriate result for a query seeded with a Kid A track, something I wouldn't expect a collaborative-filtering-based engine to necessarily handle. This agnostic power is what appeals to me most about this approach. A machine trained to analyze, and dare I say "understand", music recommends based on the music as it is encoded as audio (which, after all, is how humans perceive it), not by any tags or hype that may be attached to it.
Homogenization by distance
An easy way to remove the distant components seen in antihubs is to simply ignore them. So, using several empirically determined thresholds, I removed any component farther than the threshold from the GMM centroid. I did this iteratively, removing the component farthest (see footnote) from the centroid and recomputing the centroid, until all remaining components met the distance requirement. This was to remove the effect of the distant components on the original centroid. The thresholds I ran were 30, 25, 20, and 15 (Euclidean distance in 20-dimensional MFCC space). This is similar to what JJ does in his thesis, except he used prior probabilities to homogenize, and priors do not correlate strongly with their parent model's hubness. That is, in itself, somewhat unintuitive, since one would think priors show the "importance" of a component, but remember that mixture-model components often overlap heavily. A particular component's prior could be relatively low, yet together with its neighboring components it could still be quite "important".
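The iterative procedure above can be sketched in a few lines. This is my own reconstruction: the function name is mine, it operates on component means only, and it uses an unweighted mean as the "GMM centroid" (the post doesn't say whether mixture weights figured into the centroid):

```python
import numpy as np

def homogenize_by_distance(means, threshold):
    """Iteratively drop the component farthest from the GMM centroid until all
    remaining components lie within `threshold` (Euclidean distance, e.g. in
    20-dimensional MFCC space), recomputing the centroid after each removal.

    means: (n_components, n_dims). Returns indices of the kept components.
    """
    means = np.asarray(means, dtype=float)
    keep = np.arange(len(means))
    while len(keep) > 1:
        centroid = means[keep].mean(axis=0)
        dists = np.linalg.norm(means[keep] - centroid, axis=1)
        far = np.argmax(dists)
        if dists[far] <= threshold:
            break                       # everything is inside the threshold
        keep = np.delete(keep, far)     # drop farthest, then re-center
    return keep
```

Recomputing the centroid after each removal matters: a single far-flung component drags the centroid toward itself, so distances measured against the original centroid would understate how far out it really sits.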
First, a sanity check: the idea is that hubs are modeled appropriately, while antihubs have components modeling timbrally distant song sections, in turn making the models inaccurately distant from others. By this logic, homogenization should affect antihubs more than hubs. To verify this, I looked at the difference between the number of components in each homogenized model and the original (which had 32 components), in relation to the hubness of each song. Below are scatter plots for each homogenization run.
We can see that with slight homogenization (e.g., thresholds of 30 or 25), most strong hubs are unaffected (i.e., difference = 0), but with increased homogenization, songs across the board see reduced component counts. So, I'd say this is reasonable.
The end results turn out to be mixed. The overall hubness of the set seems to improve (i.e., decrease). Below is the histogram for each homogenization run.
As the models are homogenized, we see the middle of the histogram "fatten" as the numbers of strong hubs and antihubs both decrease. Using the 100-occurrence measure, the number of hubs (h greater than 200) is 157, 150, 146, and 151 for no homogenization and thresholds of 30, 25, 20, and 15, respectively. The number of antihubs (h less than 5) is 124, 113, 102, 91, and 78, respectively. This is promising, but it may simply be another sanity check, since I based the homogenization on the observed strong correlation between hubness and distant components. The real question is whether the recommendations are better. Since there is no real ground truth for this kind of work (although some have sought it), one simple measure to look at is R-precision: the proportion of songs by the same artist returned in the top 9 recommendations (9 because there are 10 songs per artist in the uspop2002 collection). If an artist is highly consistent, in that each of their songs is closer to their other songs than to any other artist's songs, R-precision will be high. This is of course problematic, since an artist's sound can vary significantly from song to song, not to mention across albums. But since it's easy and relatively reasonable, I'll use it anyway.
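The R-precision measure described here is straightforward to sketch (the function name and the full-distance-matrix interface are my own; the self-exclusion and r = 9 follow the post):

```python
import numpy as np

def artist_r_precision(dist, artist_ids, r=9):
    """Per-song R-precision: the fraction of a song's top-`r` neighbors
    (excluding the song itself) that share its artist. r = 9 matches the
    10-songs-per-artist layout of uspop2002.

    dist: (n, n) pairwise distance matrix; artist_ids: length-n labels.
    """
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    scores = np.empty(n)
    for i in range(n):
        order = np.argsort(dist[i])
        neighbors = [j for j in order if j != i][:r]   # self excluded
        scores[i] = np.mean([artist_ids[j] == artist_ids[i] for j in neighbors])
    return scores
```

Averaging `scores` over all songs gives the dataset-level numbers quoted in these posts, and averaging within an artist gives the per-artist rankings in the lists above.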
It turns out that homogenization actually hurts R-precision. Over the same runs as above, the average R-precisions over all songs are 0.16169, 0.15989, 0.1564, 0.1472, and 0.12275. This means the distant components did have some discriminative power, at least for the problem of artist classification.
So, homogenization, believe it or not, may not be the holy grail for improving this approach.
Footnote: A professor of mine once pointed out that the definitions of farther and further, which are distinct only in their literal or figurative usage (e.g., farther down the road, further into debt), tend to gradually exchange meanings back and forth over time, usually with a period of only a few decades. So, if you're reading this blog in twenty years, know that at the time of this post, farther indeed refers to a physical distance.
Friday, January 18, 2008
Can we really trust EM?
To make my case against the bag-of-frames approach, I've been looking at the origin of "hubs": songs that are found to be unusually similar to many other songs in a database. These not only produce lots of false positives but, because recommendation lists are constant-sum, also lead to false negatives by beating out appropriate recommendations.
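The hubness measure used throughout these posts counts, for each song, how often it appears in other songs' nearest-neighbor lists. A minimal sketch (naming is mine; k = 100 gives the 100-occurrence measure referenced below):

```python
import numpy as np

def n_occurrence(dist, k=100):
    """For each song, count how many other songs include it in their top-k
    nearest-neighbor list (k = 100 gives the '100-occurrence' hubness).

    dist: (n, n) pairwise distance matrix; the diagonal (self) is excluded.
    """
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        neighbors = [j for j in np.argsort(dist[i]) if j != i][:k]
        for j in neighbors:
            counts[j] += 1
    return counts
```

A hub then has a count far above k's expected average, and an antihub a count near zero, exactly the asymmetry the histograms in these posts visualize.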
Working from Adam Berenzweig's blogged experiments, I've found a somewhat strong correlation between hub songs and the overall spread of their components. Hubs tend to have components tightly clustered around their centroid, whereas antihubs have components significantly far from each other. To verify this, I found the median intra-component KL divergence for each song model. Its correlation with the song's "hubness" (the number of times the song occurs on other songs' top-100 lists, aka JJ's 100-occurrence measure) was 0.4179 (p-value = 1.21e-45). In other words, the stronger the hub, the more compact the GMM components are in MFCC space.
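The compactness statistic above can be sketched as the median of the pairwise symmetrized KL divergences between a model's components. This assumes diagonal covariances, and the symmetrization (KL both ways, summed) is my guess at the variant used:

```python
import numpy as np

def median_intra_kl(means, variances):
    """Median pairwise symmetrized KL divergence between the components of a
    single diagonal-covariance GMM: a compactness measure for the model.

    means, variances: (n_components, n_dims) arrays.
    """
    k = len(means)
    divs = []
    for i in range(k):
        for j in range(i + 1, k):
            # Closed-form KL for diagonal Gaussians, both directions
            kl_ij = 0.5 * np.sum(np.log(variances[j] / variances[i])
                                 + (variances[i] + (means[i] - means[j]) ** 2)
                                 / variances[j] - 1.0)
            kl_ji = 0.5 * np.sum(np.log(variances[i] / variances[j])
                                 + (variances[j] + (means[j] - means[i]) ** 2)
                                 / variances[i] - 1.0)
            divs.append(kl_ij + kl_ji)
    return float(np.median(divs))
```

Correlating this statistic against the per-song 100-occurrence values (e.g., with `scipy.stats.pearsonr`) is then what produces the kind of correlation and p-value reported above.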
Then I started looking at the activation of the individual GMM components over the MFCC frames of the songs, and noticed that the more distant a component is from the GMM's centroid, the more likely it came from a timbrally spurious section of the song. These sections can be as short as a few frames, but EM apparently still devotes components to them. Below is a good example from the GMM (16 components, diagonal covariance) of The Bloodhound Gang's "Right Turn Clyde" (hubness value of zero!). The activations are shown on the right and the Euclidean distance from the GMM centroid on the left. It's clear at least 7 of the 16 components are given to the short section in the middle, and these components are the farthest from the model's centroid.
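"Activation" here can be read as the per-frame posterior responsibility of each component. A minimal sketch for diagonal-covariance GMMs (my own naming and log-sum-exp normalization, not code from the post):

```python
import numpy as np

def component_activations(frames, weights, means, variances):
    """Per-frame posterior activation p(component | frame) for a
    diagonal-covariance GMM. Rows sum to one; a component that only
    'lights up' over a short run of frames is an outlier candidate.

    frames: (n, d); weights: (k,); means, variances: (k, d).
    """
    frames = np.asarray(frames, dtype=float)[:, None, :]      # (n, 1, d)
    mu = np.asarray(means, dtype=float)[None, :, :]           # (1, k, d)
    var = np.asarray(variances, dtype=float)[None, :, :]
    # Per-component Gaussian log-density of each frame
    log_pdf = -0.5 * (np.log(2 * np.pi * var) + (frames - mu) ** 2 / var).sum(-1)
    log_post = np.log(weights)[None, :] + log_pdf
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

Plotting this matrix over time is what produces activation images like the ones described: dense rows for components modeling the whole song, and narrow bright bands for components captured by a short spurious section.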
Hubs, on the other hand, have nice dense activations, where every component seems to be modeling a large part of the song. Of course, this is due to the song itself being timbrally homogeneous, but it's also due to EM simply modeling it better. The example below is from the top hub, Sugar Ray's "Ours" (hubness value of 604!, i.e., 57.52% of the database).
So, what can we do about this? Is it reasonable to neglect the outlier sections of songs, which are probably just breaks or intros (as opposed to salient parts like choruses)? Is it right to think songs' models should be more centralized?