On the Application of Generic Summarization Algorithms to Music

Several generic summarization algorithms were developed in the past and successfully applied in fields such as text and speech summarization. In this paper, we review and apply these algorithms to music. To evaluate this summarization's performance, we adopt an extrinsic approach: we compare a Fado Genre Classifier's performance using truncated contiguous clips against the summaries extracted with those algorithms on 2 different datasets. We show that Maximal Marginal Relevance (MMR), LexRank and Latent Semantic Analysis (LSA) all improve classification performance in both datasets used for testing.

Generic summarization algorithms have also been developed for and are usually applied in text summarization. Their application, in music, to extract a thumbnail is not ideal, because a "good" thumbnail entails requirements such as coherence and clarity. These summaries are composed of small segments from different parts of the song which makes them unsuitable for human enjoyment and thus may help evade copyright issues. Nevertheless, most of these algorithms produce summaries that are both concise and diverse.
We review several summarization algorithms, in order to summarize music for automatic, instead of human, consumption. The idea is that a summary clip contains more relevant and less redundant information and, thus, may improve the performance of certain tasks that rely on processing just a portion of the whole audio signal. We evaluate the summarization's contribution by comparing the performance of a Portuguese music style Fado Genre Classifier [9] using the extracted summaries of the songs against using contiguous clips (truncated from the beginning, middle and end of the song). We summarize music using MMR, LexRank, LSA and also with a method for music summarization called Average Similarity for comparison purposes. We present results on 2 datasets showing that MMR, LexRank and LSA improve classification performance under certain parameter combinations.
Section II reviews related work on summarization. Specifically, the following algorithms are reviewed: Average Similarity in section II-A, MMR in section II-B, LexRank in section II-C and LSA in section II-D. Section III describes the details of the experiments we performed for each algorithm and introduces the Fado Classifier. Section IV reports and discusses our classification results and section V concludes this paper with some remarks and future work.

II. SUMMARIZATION
Several algorithms for both generic and music summarization have been proposed. However, music summarization algorithms were developed to extract an audible summary so that any person can listen to it coherently. Our focus is on automatic consumption, so coherence and clarity are not mandatory requirements for our summaries.
LexRank [10] and TextRank [11] are centrality-based methods that rely on the similarity between every sentence. These are based on Google's PageRank [12] algorithm for ranking web pages and are successfully applied in text summarization. GRASSHOPPER [13] is another method applied in text summarization, as well as social network analysis, focusing on improving diversity in ranking sentences.
MMR [14], [15], applied in speech summarization, is a query-specific method that selects sentences according to their similarity to the query and to the sentences previously selected. LSA [16] is another June 20, 2014 DRAFT method used in text summarization based on the mathematical technique Singular Value Decomposition (SVD).
Music-specific summarization structurally segments songs and then selects which segments to include in the summary. This segmentation aims to extract meaningful segments (e.g. chorus, bridge). [1] presents two approaches for segmentation: using a Hidden Markov Model (HMM) to detect key changes between frames and Dynamic Time Warping (DTW) to detect repeating structure. In [2], segmentation is achieved by correlating a Gaussian-tempered "checkerboard" kernel along the main diagonal of the similarity matrix of the song, outputting segment boundaries. Then, a segment-indexed similarity matrix is built, containing the similarity between every detected segment. SVD is applied to that matrix to find its rank-K approximation. Segments are, then, clustered to output the song's structure. In [3], [4], songs are segmented in 3 stages. First, a similarity matrix is built and it is analyzed for fast changes, outputting segment boundaries. These segments are clustered to output the "middle states". Finally, an HMM is applied to these states, producing the final segmentation. These algorithms then follow some strategies to select the appropriate segments. [5] groups (based on the Kullback-Leibler (KL) divergence) and labels similar segments of the song and then the summary is generated by taking the longest sequence of segments belonging to the same cluster. In [6], [7], a method called Average Similarity is used to extract a thumbnail L seconds long that is most similar to the whole piece. Another method for this task is the Maximum Filtered Correlation [8] which starts by building a similarity matrix and then a filtered time-lag matrix, which has the similarity between extended segments embedded in it. Finding the maximum value in the latter is finding the starting position of the summary.
To apply generic summarization algorithms to music, first we need to segment the song into musical words/terms. This fixed segmentation differs a lot from the structural segmentation used in music-specific algorithms. Fixed segmentation does not take into account the human perception of musical structure.
It simply allows us to look at the variability and repetition of the signal and use them to find the most important parts. Structural segmentation aims to find meaningful segments (to people) of the song so that we can later select those segments to include in the summary. This type of segmentation often leads to audible summaries which violate copyrights of the original songs. Fixed segmentation combined with generic summarization algorithms may help evade those issues.
In the following sections we review the algorithms we chose to evaluate: Average Similarity, MMR, LexRank, and LSA.

A. Average Similarity
This approach to summarization has the purpose of finding a fixed-length continuous music segment, of duration L, most similar to the entire song. This method was introduced in [6] and later used in other research efforts such as [7].
The method consists of building a similarity matrix for the song and calculating an aggregated measure of similarity between the whole song and every L seconds long segment.
In [6], 45 Mel Frequency Cepstral Coefficient (MFCC)s are computed but only the 15 with highest variance are kept. The cosine distance is used to calculate pairwise similarities.
In [7], the first 13 MFCCs and the spectral centre of gravity (sound "brightness") are used. The Tchebychev distance was selected for building the similarity matrix.
Once the similarity between every frame is calculated, we build a similarity matrix S and embed the similarity values between feature vectors v i and v j in it: The average similarity measure can be calculated by summing up columns (or rows, since the similarity matrix is symmetric) of the similarity matrix, according to the desired summary length L, starting from different initial frames. The maximum score will correspond to the segment that is most similar to the whole song. To find the best summary of length L, we must compute the score Q L (i): N is the number of frames in the entire piece. The index 1 ≤ i ≤ (N − L) of the best summary starting frame is the one that maximizes Q L (i).
The evaluations of this method in the literature are subjective (human) evaluations that take into account whether the generated summaries include the most memorable part(s) of the song [6]. Other evaluations are averages of scores given by test subjects, regarding specific qualities of the summary such as Clarity, Conciseness and Coherence [7].

B. Maximal Marginal Relevance
MMR [17], selects sentences from the signal according to their relevance and to their diversity against the already selected sentences in order to output low-redundancy summaries. This approach has been used in speech summarization [14], [15]. It is a query-specific summarization method, though it is possible to produce generic summaries by taking the centroid vector of all the sentences (as in [15]) as the query.
MMR iteratively selects the sentence S i that maximizes the following mathematical model: Sim 1 and Sim 2 are the, possibly different, similarity metrics; S i are the unselected sentences and S j are the previously selected ones; Q is the query and λ is a configurable parameter that allows the selection of the next sentence to be based on its relevance, its diversity or a linear combination of both.
Usually sentences are represented as Term Frequency -Inverse Document Frequency (TF-IDF) scores vectors. The cosine similarity is frequently used as Sim 1 and Sim 2 .

C. LexRank
LexRank [10] is a centrality-based method that relies on the similarity for each sentence pair. This centrality-based method is based on Google's PageRank [12] algorithm for ranking web pages. The output is a list of ranked sentences from which we can extract the most central ones to produce a summary.
First, we compare all sentences, normally represented as TF-IDF scores vectors, to each other using a similarity measure. LexRank uses the cosine similarity. After this step, we build a graph where each sentence is a vertex and edges are created between every sentence according to their pairwise similarity.
Usually, the similarity score must be higher than some threshold to create an edge. LexRank can be used with both weighted and unweighted edges. Then, we perform the following calculation iteratively for each vertex until convergence is achieved (when the error rate of two successive iterations is below a certain threshold for every vertex): d is a damping factor to guarantee the convergence of the method, N is the total number of vertices and S (V i ) is the score of vertex i. This is the case where edges are weighted. When using unweighted edges, the equation is simpler: D (V i ) is the degree (i.e., number of edges) of vertex i. We can construct a summary by taking the highest ranked sentences until a certain summary length is reached. June 20, 2014 DRAFT This method is based on the fact that sentences recommend each other. A sentence very similar to many other sentences will get a high score. Sentence score is also determined by the score of the sentences recommending it.

D. Latent Semantic Analysis
LSA is based on the mathematical technique SVD that was first used for text summarization in [16].
SVD is used to reduce the dimensionality of an original matrix representation of the text. To perform LSA-based text summarization, we start by building a T terms by N sentences matrix A.
Each element of A, a ij = L ij G i , has two weight components: a local weight and a global weight. The local weight is a function of the number of times a term occurs in a specific sentence and the global weight is a function of the number of sentences that contain a specific term.
Applying SVD to matrix A will result in a decomposition formed by three matrices: U , a T × N Each latent dimension corresponds to a topic. We calculate the Rank K approximation by taking the first K columns of U , the K × K sub-matrix of Σ and the first K rows of V T . We can extract the most relevant sentences by iteratively selecting sentences corresponding to the indices of the highest values for each (most relevant) right singular vector.
In [18], two limitations of this approach are discussed: the fact that K is equal to the number of sentences in the summary, which, as it increases, tends to include less significant sentences; and that sentences with high values in several dimensions (topics), but never the highest, will never be included in the summary. To compensate for these problems, a sentence score was introduced and K is chosen so that the K th singular value does not fall under half of the highest singular value:

III. EXPERIMENTS
To evaluate these algorithms on music, we tested their impact on a Fado classifier. This classifier simply classifies a song as Fado or non-Fado. Fado is a Portuguese music genre whose instrumentation usually consists solely of stringed instruments, such as the classical guitar and the Portuguese guitar. The classifier is a Support Vector Machine (SVM) [19]. We used 5-fold cross validation when calculating classification performance. The classification performance was calculated first for the beginning, middle and end sections (of 30s) of the songs to get a baseline and then we compared it with the classification using the summaries (also 30s) for each parameter combination and algorithm.
For feature extraction we used OpenSMILE's [20] implementation, namely, to extract MFCC feature vectors. We also used the Armadillo library [21] for matrix operations and the Marsyas library [22] for synthesizing the summaries.
For Average Similarity, we experimented with 3 different frame sizes (0.25, 0.5, and 1 s) with both 50% and no overlap. We also experimented with MFCC vector sizes of 12 and 24.
To use the generic summarization algorithms, however, we need additional processing steps. We adapted those algorithms to the music domain by mapping the audio signal frames (represented as MFCC vectors) to a discrete representation of words and sentences. For each piece being summarized, we cluster all of its frames using the mlpack's [23] K-Means algorithm implementation which calculates the vocabulary for that song (i.e., each frame is now a word from that vocabulary). Then, we segment the whole piece into fixed-size sentences (e.g., 5-word sentences). This allows us to represent each sentence as a vector of word occurrences/frequencies (depending on the type of weighting chosen) which lets us compare sentences with each other using the cosine distance.
In our implementation of MMR, we calculate the similarity between every sentence only once and then apply the algorithm until the desired summary length is reached. We experimented using 3 different values for λ (0.3, 0.5 and 0.7) and 4 different weighting types: raw (counting of the term), binary (presence of the term), TF-IDF and "dampened" TF-IDF (same as TF-IDF but takes logarithm of TF instead of TF itself).
The damping factor used in LexRank was 0.85 and the convergence threshold was set to 0.0001. We also calculated the similarity between every sentence only once, applying the iterative algorithm and picking sentences until the desired summary length is reached. We also tested LexRank using the same weighting types as for MMR.
We used Armadillo's [21] implementation of the SVD operation to implement LSA. After sentence/word segmentation, we apply SVD to the term by sentences matrix (column-wise concatenation of all sentence vectors). We then take the rank-K approximation of the decomposition where the Kth singular value is not smaller than half of the (K − 1)th singular value. Then, we calculate the sentence score (as explained in section II-D) for each sentence and pick sentences according to that ranking until the desired summary length is reached. We tested LSA with both raw and binary weighting.
We tested MMR, LexRank, and LSA, with all combinations of the following parameter values: frame size of 0.5s with no overlap and with 50% (0.25s hops) overlap; vocabulary size of 25, 50, and 100 words; and sentence size of 5, 10, and 20 words. We used MFCC vectors (of size 12) as features for these experiments, they are widely used in many MIR tasks including music summarization in [2], [5]- [7].

IV. RESULTS
We present only the most interesting results, since we tried many different parameter combinations combination on the dataset 1 yielded 100% accuracy but on dataset 2 it yielded only 90.8% which is lower than the baseline (92%). However, all other parameter combination presented in those tables yield a better result than the baseline for both datasets. We also noticed that smaller values of λ would result in worse accuracy scores.
We can also see that the best parameter combination for LexRank on dataset 1 was also the best on dataset 2. Besides that, all other presented combinations are better when compared to the corresponding baseline, which suggests that these parameter combinations might also be good for other datasets.
Our experiments show that LSA works best with binary weighting when applied to music. This has to do with the fact that some musical sentences, namely, at the beginning of the songs, are strings with very few repeating terms, which increases term-frequency scores. Moreover, those terms might not even appear anywhere else in the song which will, in turn, decrease the document frequency of the term, thus increasing the inverse document frequency score. These issues are detected when LSA chooses those (unwanted) sentences because they will have a high score on a certain latent topic. The binary weighting alleviates these problems because we only check for the presence of a term (not its frequency) and the document frequency of that term is not taken into account. LSA also achieved results above the baseline (table I).

V. CONCLUSIONS AND FUTURE WORK
We evaluated summarization through classification for MMR, LexRank, and LSA in the music domain.
More experimenting should be done to find a set of parameter combinations that will work for most music contexts. Future work includes testing other summarization algorithms, other similarity metrics, other types of features and other types of classifiers. The use of Gaussian Mixture Models may also help in finding more "natural" vocabularies and Beat Detection might be used to find better values for fixed segmentation.