ABSTRACT:
The purpose of this project is to study the
feasibility of a music classification system based on music content using a
neural network. A 1.5 second audio file stored in WAV format is passed to a
feature extraction function. The WAV format for digital audio is simply the
left and right stereo signal samples. The feature extraction function
calculates 124 numerical features that characterize the sample. When training
the system, this feature extraction process is performed on many different
input WAV files to create a matrix of column feature vectors. This matrix is
then preprocessed to reduce the number of inputs to the neural network and then
sent to the neural network for training. After training, single column vectors
can be fed to the preprocessing block, which processes them in the same manner
as the training vectors, and then classified by the neural network.
INTRODUCTION:
Neural
networks have found profound success in the area of pattern recognition. By
repeatedly showing a neural network inputs classified into groups, the network
can be trained to discern the criteria used to classify, and it can do so in a
generalized manner allowing successful classification of new inputs not used
during training. With the explosion of digital music in recent years due to
Napster and the Internet, the application of pattern recognition technology to
digital audio has become increasingly interesting. On the user end, many people
have downloaded large collections of music files (e.g., MP3s and WAVs) that are
often stored in directory structures classified by genre or artist. Thus one
can imagine the usefulness of a program that would automatically classify and
store newly downloaded music using the existing classification system set by
the user. A second useful program would be one that searches through a
collection of files and extracts only those with characteristics chosen by the
user. For instance, a user may want to search through a library of files stored
on a computer in Austria for those of the classical music genre, but due to a
language difference and the Austrian user’s own preferences for file naming,
determining the genre of each file may be very difficult using file names
alone. Thus a program that makes classifications based on music content would
be much more appropriate
and useful. For Napster and the recording industry, classification of music
based on content is necessary for ensuring that copyrighted music is not
freely distributed across the Internet. Filters based on file names have been
found to be very ineffective, because clever users simply alter the names of
the files to circumvent such filters. What is needed is a classification system
that looks only at the content of the file to make its classification
decisions. Such a system would be much more effective, since altering the
content of the file is not a very appealing option to users. Figure 1 below is
a block diagram of the classification system.
SYSTEM SETUP:
This section describes the setup of the digital audio classification system.
The system is composed primarily of the blocks shown in Figure 1 and was
developed in the Matlab environment.
INPUT FILES:
Data for training and testing the system was taken from ten compact discs:
four classified as rock (labeled R01-R04), two classified as classical (C01 and
C02), two classified as soul or R&B (S01 and S02), and two classified as
country and western (W01 and W02). The four rock CDs were recorded by four
different artists. A complete source listing for these CDs can be found in
Appendix A. The tracks on each of these CDs were extracted, converted to WAV
format, and then divided into segments of length 2^18 samples, or about six
seconds at a 44.1 kHz sampling rate. To avoid periods within the music not
characteristic of the whole song, the segments were all taken from the middle
of each track. This procedure produced 2,781 segments of music. Each segment
was then further divided into two sub-segments by extracting the first 2^16
samples (about 1.5 seconds) and the third 2^16 samples. Thus, in total, 5,562
sub-segments of music were generated for training and testing the system. For
classification by genre, CDs R01, R02, C01, C02, S01, S02, W01, and W02 were
used. For classification by artist, the four rock CDs were used.
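The segment arithmetic above can be checked with a short sketch. It assumes that the segment lengths are counted in samples at a 44.1 kHz sampling rate and that "the third" sub-segment means the third block of 2^16 samples within a segment. The function name `split_segment` and the use of a plain list as the signal are illustrative only.

```python
SAMPLE_RATE_HZ = 44_100   # CD-quality sampling rate (assumed)
SEGMENT_LEN = 2 ** 18     # samples per segment
SUB_LEN = 2 ** 16         # samples per sub-segment


def split_segment(segment):
    """Return the first and third blocks of 2**16 samples from one segment."""
    if len(segment) != SEGMENT_LEN:
        raise ValueError("segment must hold exactly 2**18 samples")
    first = segment[0:SUB_LEN]
    third = segment[2 * SUB_LEN:3 * SUB_LEN]
    return first, third


segment_seconds = SEGMENT_LEN / SAMPLE_RATE_HZ   # just under six seconds
sub_seconds = SUB_LEN / SAMPLE_RATE_HZ           # just under 1.5 seconds

# A dummy segment whose sample values equal their own index.
first, third = split_segment(list(range(SEGMENT_LEN)))
```

Taking two non-adjacent blocks from each segment doubles the number of examples while keeping the two sub-segments from overlapping.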
FEATURE EXTRACTION:
Ideally, all the samples in the
WAV file would be passed to the neural network, and then the neural network
would determine the best way to process the data to arrive at a classification
of the file. However, at a sampling rate of 44.1 kHz, even a one second sample
of audio would result in a prohibitive amount of information for the neural
network and Matlab. Therefore, a feature extraction function is needed to reduce
the amount of data passed to the neural network. Extracting useful features
from a digital audio sample is an evolving science and remains a popular
research field. From the infinite number of calculations that could be
performed, this system uses only 124. These features fall into six categories
described below. Table 1 outlines the format of the feature vector.
LINEAR PREDICTIVE CODING TAPS:
In linear predictive coding (LPC), a signal is modeled by the following equation:
$$y_{n+1} = w_0 y_n + w_1 y_{n-1} + w_2 y_{n-2} + \dots + w_{L-1} y_{n-(L-1)} + e_{n+1}$$
The goal of this model is to predict the next sample of the signal by linearly
combining the L most recent samples while minimizing the mean squared error
over the entire signal. The weights (the w_i) are determined by using an
adaptive filter and the LMS algorithm. For this system, the music segments were
modeled using 32 taps (L=32). A block diagram of the adaptive filter used is
shown below in Figure 2.
To speed up the calculation of the LPC taps, the code was written in C and
compiled using the Matlab MEX compiler, which resulted in a very significant
decrease in execution time.
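As a rough illustration of how an LMS adaptive filter can estimate the prediction weights, the following Python sketch applies the standard LMS update to a toy signal. It is not the C/MEX code used in the system. The step size, the function name `lms_lpc_taps`, and the two-tap sinusoid demonstration are assumptions chosen so that the result is easy to verify.

```python
import math


def lms_lpc_taps(signal, num_taps=32, step=0.01, passes=1):
    """Estimate linear-prediction weights with the LMS update rule."""
    w = [0.0] * num_taps
    for _ in range(passes):
        for n in range(num_taps - 1, len(signal) - 1):
            # Most recent sample first: x[i] = y[n - i]
            x = [signal[n - i] for i in range(num_taps)]
            prediction = sum(wi * xi for wi, xi in zip(w, x))
            error = signal[n + 1] - prediction
            # LMS update: step along the instantaneous error gradient.
            w = [wi + 2.0 * step * error * xi for wi, xi in zip(w, x)]
    return w


# Toy signal: a pure sinusoid, which two taps can predict exactly,
# because y[n+1] = 2*cos(omega)*y[n] - y[n-1].
omega = 0.3
tone = [math.sin(omega * n) for n in range(6000)]
taps = lms_lpc_taps(tone, num_taps=2, step=0.05)
```

For this noise-free tone the two weights approach 2*cos(omega) and -1. Real music needs more taps, such as the 32 used here, and a step size small enough to keep the update stable.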
FREQUENCY CONTENT:
Frequency content was found to be an
important feature for classifying music. Three different frequency content
calculations were performed and included in the feature vectors. The first
frequency content features that were calculated were the amplitude values of
the discrete Fourier transform (DFT) of the signal. Because the sampling rate
for the WAV files was 44.1 kHz, the DFT of the audio sample shows only
the frequency content up to about 22 kHz. Initial analysis of the audio
signals being tested revealed that the vast majority of the spectral power lies
in the lower portion of this spectrum; therefore, the signals were downsampled
by a factor of two (T=2) before taking the DFT to effectively zoom in on the
lower half of the spectrum. The positive-frequency values of the DFT spectrum
were then grouped into 32 evenly spaced bins, and the average spectral energies
in each of the bins were reported as 32 features. The second calculation made
was to take the natural
logarithm of the 32 DFT amplitude values and report these values as 32
additional features. These features emphasize the differences in the values at
frequencies with very small DFT amplitude values, which are mostly the higher
frequencies. These features are provided to distinguish different samples by
their higher frequency content. The final calculation made was to take the
inverse DFT of the logarithm of the amplitude of the DFT values. The lower 12
values of this calculation were reported as 12 more features and were included
to further emphasize the higher frequencies of the samples.
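The three frequency calculations can be sketched in Python as follows. This is an illustrative reading of the description, not the original Matlab code. It assumes that downsampling simply keeps every second sample with no anti-aliasing filter, that the logarithm is applied to the 32 binned values, and that the inverse DFT is taken over the full log-magnitude spectrum. The plain O(N^2) DFT is used only to keep the example free of dependencies.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) discrete Fourier transform (fine for short demos)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def frequency_features(signal, num_bins=32, num_cepstral=12):
    # Keep every second sample (T=2); no anti-aliasing filter is applied here.
    decimated = signal[::2]
    n = len(decimated)
    magnitude = [abs(c) for c in dft(decimated)]

    # Average energy in evenly spaced bins over the positive frequencies.
    half = magnitude[:n // 2]
    width = len(half) // num_bins
    binned = [sum(m * m for m in half[b * width:(b + 1) * width]) / width
              for b in range(num_bins)]

    eps = 1e-12  # guard against log(0); a detail of this sketch only
    log_binned = [math.log(v + eps) for v in binned]

    # Inverse DFT of the log magnitude spectrum, keeping the lowest values.
    log_mag = [math.log(m + eps) for m in magnitude]
    cepstrum = [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n)).real / n
                for t in range(num_cepstral)]
    return binned, log_binned, cepstrum


# Toy input: a 512-sample tone that lands in a known bin after decimation.
tone = [math.sin(2 * math.pi * 20 * t / 512) for t in range(512)]
binned, log_binned, cepstrum = frequency_features(tone)
```

Together the three lists contribute 32 + 32 + 12 = 76 of the 124 features.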
MEL-FREQUENCY CEPSTRAL COEFFICIENTS:
Mel-Frequency Cepstral Coefficients (MFCCs)
have been used very successfully in the field of speech recognition as
classification features for speech audio signals. The processing sequence for
finding the MFCCs of an audio signal is the following:
· Window the data with a Hamming window
· Find the amplitude values of the DFT of the data
· Convert the amplitude values to filter bank outputs
· Calculate the log base 10
· Find the cosine transform
The filter bank consists of 40 triangle filters with 13 spaced linearly by 133.33
Hz and 27 spaced logarithmically by a factor of 1.0711703 in frequency. The DFT
amplitude values are combined using these triangle filters to form the filter
bank outputs. Code developed by Malcolm Slaney as a part of his Auditory Toolbox
was used to calculate the MFCC values. Fifteen MFCC values were reported as
features and included in the feature vector [3].
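The five processing steps listed above can be sketched in Python as follows. This is not the Auditory Toolbox code and will not reproduce its numerical output. The filter edges, the unnormalized DCT-II, and all function names are assumptions made for illustration, and the toy example uses six filters rather than the 40 described above.

```python
import cmath
import math


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitude(x):
    """Magnitudes of the non-negative-frequency DFT bins (O(N^2) demo version)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def triangle_filterbank(edges_hz, sample_rate, n_fft):
    """One triangle per consecutive (low, center, high) triple of edges."""
    bin_hz = sample_rate / n_fft
    bank = []
    for low, center, high in zip(edges_hz, edges_hz[1:], edges_hz[2:]):
        weights = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if low < f <= center:
                weights.append((f - low) / (center - low))
            elif center < f < high:
                weights.append((high - f) / (high - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def dct_ii(values, num_coeffs):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(num_coeffs)]


def mfcc(frame, sample_rate, edges_hz, num_coeffs):
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    magnitude = dft_magnitude(windowed)
    bank = triangle_filterbank(edges_hz, sample_rate, len(frame))
    outputs = [sum(w * m for w, m in zip(weights, magnitude)) for weights in bank]
    log_outputs = [math.log10(o + 1e-12) for o in outputs]  # guard against log(0)
    return dct_ii(log_outputs, num_coeffs)


# Hypothetical toy layout: 6 filters on an 8 kHz, 256-sample frame.
edges = [100, 300, 600, 1000, 1500, 2200, 3000, 3900]
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = mfcc(frame, 8000, edges, num_coeffs=4)
```

Replacing `edges` with the 40-filter layout and requesting 15 coefficients would give a feature count matching the text, though scaling conventions may still differ from the toolbox.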
VOLUME:
The volume of a musical piece is easily
calculated as the variance of the samples.
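A minimal Python sketch of this feature is shown below. It assumes the population variance; for sub-segments of tens of thousands of samples the difference from the sample variance is negligible. The function name `volume` is illustrative.

```python
import math
from statistics import pvariance


def volume(samples):
    """Variance of the samples, used as a simple loudness measure."""
    return pvariance(samples)


quiet = [math.sin(0.1 * n) for n in range(1000)]
loud = [2.0 * s for s in quiet]   # same waveform at twice the amplitude
```

Doubling the amplitude quadruples the variance, so this feature grows with the square of the signal level.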
DATA PREPROCESSING:
The feature vectors returned by the feature
extraction block were first preprocessed before inputting them to the neural
network. Two types of preprocessing were performed, one to scale the data to
fall within the range of –1 to 1 and one to reduce the length of the input
vector. The data was divided into three sets, one for training, one for
validation, and one for testing. The preprocessing parameters were determined
using the matrix containing all feature vectors used for training and
validation. For testing, these same parameters were used to preprocess test
feature vectors before passing them to the trained neural network. The first
preprocessing function used was premnmx, which scales the data so that
the minimum and maximum of each feature across all training and validation
feature vectors are –1 and 1. Premnmx returns two parameters, minp and maxp,
which were used with the function tramnmx for preprocessing the test feature
vectors.
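The scaling step can be sketched in Python as follows. The two functions are hypothetical stand-ins that mirror the roles of premnmx and tramnmx; they are not the toolbox functions. For simplicity the sketch stores feature vectors as rows rather than columns and does not handle a feature whose minimum equals its maximum.

```python
def fit_minmax(features):
    """Return per-feature minima and maxima from a list of feature vectors."""
    columns = list(zip(*features))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_minmax(features, minp, maxp):
    """Map each feature linearly so that minp -> -1 and maxp -> +1."""
    return [[2.0 * (v - lo) / (hi - lo) - 1.0
             for v, lo, hi in zip(row, minp, maxp)]
            for row in features]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]

minp, maxp = fit_minmax(train)                 # parameters come from training data only
scaled_train = apply_minmax(train, minp, maxp)
scaled_test = apply_minmax(test, minp, maxp)   # may fall outside [-1, 1]
```

Because the parameters are fixed from the training and validation data, a test value beyond the training range maps outside the –1 to 1 interval, as the second test feature does here.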
The second preprocessing function used was prepca, which performs principal
component analysis on the training and validation feature vectors. Principal
component analysis is used to reduce the dimensionality of the feature vectors
from a length of 124 to a length more manageable by the neural network. It does
this by orthogonalizing the features across all feature vectors, ordering the
resulting components so that those with the most variation come first, and then
removing those that contribute least to the variation [4]. Prepca was used with
a value of 0.001, so that components contributing less than 0.1% of the total
variation were removed. This procedure reduced the length of the feature
vectors by one half. Prepca returns the matrix transMat, which is used with the
function trapca to apply the same principal component transformation to the
test feature vectors as was applied to the training and validation feature
vectors. This was done before passing the test feature vectors to the trained
neural network.
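The effect of the threshold can be illustrated with a small Python sketch. It assumes the threshold is compared against each component's share of the total variance and that the component variances are already available in decreasing order. The function name and the example variances are hypothetical.

```python
def components_to_keep(variances, min_frac=0.001):
    """Count components whose share of the total variance is at least min_frac."""
    total = sum(variances)
    return sum(1 for v in variances if v / total >= min_frac)


# Hypothetical component variances, sorted in decreasing order.
example_variances = [5.0, 3.0, 1.5, 0.4995, 0.0004, 0.0001]
kept = components_to_keep(example_variances)
```

In this toy case the last two components each carry far less than 0.1% of the total variance and are dropped, leaving four inputs instead of six.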
NEURAL NETWORK:
A three-layer feedforward
backpropagation neural network, shown in Figure 3, was used for classifying the
feature vectors. By trial and error, an architecture consisting of 20 adalines
in the input layer, 10 adalines in the middle layer, and 3 adalines in the
output layer was found to provide good performance. The transfer function used
for all adalines was a tangent sigmoid, ‘tansig’. The Levenberg-Marquardt
backpropagation algorithm, ‘trainlm’, was used to train the neural
network.
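The forward pass through a network of this shape can be sketched in Python as follows. The weights are random placeholders rather than trained values, and the input length of 62 is an assumption based on the statement that preprocessing roughly halves the 124 features. Training with Levenberg-Marquardt is not shown.

```python
import math
import random


def tansig(x):
    # The tangent sigmoid 2 / (1 + exp(-2x)) - 1 is mathematically equal to tanh(x).
    return math.tanh(x)


def make_layer(num_inputs, num_units, rng):
    """Random weights and biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(num_inputs)]
               for _ in range(num_units)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(num_units)]
    return weights, biases


def forward(layers, x):
    """Propagate one input vector through every layer in turn."""
    for weights, biases in layers:
        x = [tansig(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


rng = random.Random(0)
num_inputs = 62                      # assumed input length after preprocessing
sizes = [num_inputs, 20, 10, 3]      # 20-10-3 architecture from the text
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
output = forward(layers, [rng.uniform(-1, 1) for _ in range(num_inputs)])
```

Because every unit uses a tangent sigmoid, each of the three outputs is confined to the range –1 to 1, which is why the target points are chosen inside that range.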
CLASSIFICATION VECTOR:
Two music classification systems were
implemented and tested, one to classify by genre and one to classify by artist.
Figure 4 shows the constellations used for each of these classification
systems, and Table 2 lists the specific coordinates of the constellation for
each classification scheme. The constellations were chosen so that all points
were equidistant from each other, all coordinates were within the –1 to 1
range, and the distance between points was maximized. Originally a
two-dimensional constellation was used, but the increased distance between
points gained by moving to three dimensions provided a significant performance
increase. Constellations of dimension greater than three did not provide a
significant enough performance increase to justify the added computational
complexity.
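One set of four points that satisfies these three criteria in three dimensions is the set of alternate corners of the cube with coordinates of –1 and 1, which form a regular tetrahedron. The Python sketch below checks this. The points are an assumption for illustration and are not necessarily the coordinates listed in Table 2.

```python
import itertools
import math

# Hypothetical constellation: alternate corners of the cube [-1, 1]^3.
CONSTELLATION = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

# Distance between every pair of points.
distances = [math.dist(p, q)
             for p, q in itertools.combinations(CONSTELLATION, 2)]
```

All six pairwise distances equal 2*sqrt(2). Four mutually equidistant points cannot be placed in a plane, which is consistent with the improvement observed on moving from two to three dimensions.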
RESULTS:
This section will discuss the
results of training and testing the classification system. Two separate results
will be presented, one for classification by genre and one for classification
by artist.
CLASSIFICATION BY GENRE:
To test the performance of the
music classification system, the system was first configured to classify music
by genre. The four genres used were rock, classical, soul/R&B, and country
and western. The first step in performing this test was to generate the data set.
As discussed above, the data set was taken from eight CDs, two per genre, and
consisted of 4,425 feature vectors. From these 4,425 feature vectors, 2,213
were used for training and the other 2,212 were reserved for testing. Before
training, data preprocessing was performed on the training data, as was
discussed above. After preprocessing, the training data was divided further
into two groups, one for training and one for validation. A validation data set
was needed to ensure that the neural network did not overfit the data. The next
step was to create the neural network discussed above in the system setup
section. The training function used was the Levenberg-Marquardt
backpropagation algorithm, ‘trainlm’. The parameters mu, mu_dec, and mu_inc of
‘trainlm’ were set to 1, 0.8, and 1.5, respectively, to ensure that the
algorithm did not converge too quickly, which helped to limit the amount of
overfitting that occurred before a validation stop of the training. Figure 5
below shows the MSE versus training epoch plot; both the training data MSE and
validation data MSE curves are shown. The MSE reached 0.0228 before a
validation stop occurred.
After training, the system was then
tested using the data set reserved for testing. Before passing the test feature
vectors to the trained neural network, data preprocessing was performed using
the saved parameters from the preprocessing of the training data. The results
are summarized in Tables 3 and 4. Figure 6 shows a three-dimensional plot of
the output vectors of the neural network for each of the test input vectors.
The decision rule used for classifying the output of the neural network was a
minimum distance rule. A decision was made by first calculating the distance
from the output of the neural network to each of the constellation points and
then choosing the constellation point that produced the minimum distance.
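The minimum distance rule described above can be sketched in a few lines of Python. The assignment of genres to points is hypothetical; the actual coordinates are those listed in Table 2.

```python
import math

# Hypothetical genre-to-point assignment, for illustration only.
CONSTELLATION = {
    "rock": (1, 1, 1),
    "classical": (1, -1, -1),
    "soul/R&B": (-1, 1, -1),
    "country and western": (-1, -1, 1),
}


def classify(output):
    """Return the label whose constellation point is nearest to the output."""
    return min(CONSTELLATION,
               key=lambda label: math.dist(output, CONSTELLATION[label]))


first_guess = classify((0.8, -0.7, -0.9))
second_guess = classify((-0.2, -0.9, 0.4))
```

With these points, the first output is assigned to "classical" and the second to "country and western".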
Genre classification was performed at a success rate of 94.8%, with classical
music being classified the most successfully, at 96.7%, and country and
western, soul/R&B, and rock music being classified the least successfully, at
success rates of 91.0%, 93.1%, and 93.3%, respectively. The separation of
success rates between classical music and the other three genres was expected,
since the four genres are not equally distinct in style. Classical music
stands out as the most distinct of the four genres, while country and western,
rock, and soul/R&B can be grouped as genres of a somewhat similar style.
Country and western, rock, and soul/R&B have each influenced one another
throughout their growth into separate musical genres, and thus one would expect
several features of each genre to be mimicked in the other two. Furthermore, of
the three non-classical genres, country and western was the genre most often
misclassified as classical music. This was also an expected result, since
country and western music features instruments that are the most similar to
those used in classical music (i.e., stringed instruments such as the acoustic
guitar and violin).
CLASSIFICATION BY ARTIST:
To further test the music classification
system, the system was configured to classify music by artist. Four rock
artists were used, referred to here as R01, R02, R03, and R04. Data for this test
was taken from the four rock CDs, which are listed in Appendix A. The training
and testing of this system were performed identically to the training and
testing of the system for classifying by genre. From the four CDs, 2,187
feature vectors were extracted and split into two roughly equal groups, one for
training and one for testing. The training data set was then further divided to
form the training and validation data sets. Training was performed using the
same preprocessing, training function, and parameters as described in the
classification by genre section. Figure 7 below shows the MSE versus training
epoch plot – both the training data MSE and validation data MSE curves are
shown. The MSE reached 1.81e-5 before a validation stop occurred. By comparing
Figures 5 and 7, it is evident that more overfitting occurred when training
the system to classify by artist, which is discussed further below.
After training, the system was
tested using the feature vectors reserved for testing. The results are
summarized in Tables 5 and 6, and Figure 8 shows a three-dimensional plot of
the output vectors of the neural network for each of the test input vectors.
MORE ADVANCED FEATURE EXTRACTION:
The field of music feature extraction is a
rich research area, for improving feature extraction will most likely have the
largest impact on the performance of a music classification system. For the
system detailed in this paper, feature vectors were extracted from 1.5-second
music samples, and although the system performed well, 1.5 seconds does not
capture all the characteristics of an entire song. What is needed is a feature
extraction method that looks at more of the song in an attempt to not only
capture “short-time” features but also “long-time” features that describe how
the song evolves over time. One way to implement this is to simply use entire
songs as the input to the feature extractor, but at high sampling rates, this
leads to a prohibitively large amount of data for the feature extractor to
process. A second approach would be to send several small samples, such as 1.5-second
samples, that are equally spaced throughout a song to the feature extractor. The
feature extractor could then extract “short-time” features from each of the
samples and then produce “long-time” features by examining how the extracted
“short-time” features evolve with time. A feature extractor that considers an
entire song would be a start towards developing a more advanced feature
extractor, but even more needs to be done. Probably the toughest problem that
needs to be solved is how to extract features that describe the very personal
performance style of a musical piece. These are the features that will be
necessary for making correct decisions when the differences in the pieces of
music are very subtle, such as occurs when classifying music by artists within
the same genre.
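One possible way to derive "long-time" features from a sequence of "short-time" feature vectors is sketched below in Python. This is only an illustration of the idea, not a method evaluated in this paper. The choice of mean, standard deviation, and average frame-to-frame change as summaries is an assumption, and the function needs at least two short-time vectors.

```python
from statistics import mean, pstdev


def long_time_features(short_time_vectors):
    """Summarize how each short-time feature behaves across a song."""
    columns = list(zip(*short_time_vectors))
    means = [mean(c) for c in columns]
    spreads = [pstdev(c) for c in columns]
    # Average change between consecutive samples: a crude measure of evolution.
    drifts = [mean(abs(b - a) for a, b in zip(c, c[1:])) for c in columns]
    return means + spreads + drifts


# Three equally spaced samples, each described by two short-time features.
vectors = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
summary = long_time_features(vectors)
```

The first feature rises steadily, so it has a nonzero spread and drift, while the constant second feature has neither.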
MORE ADVANCED DECISION RULE:
The minimum distance decision rule used in this system assumes
that the noise that drives the output of the neural network away from
constellation points is equal among all classification categories. From the
results of the classification by genre section, Figure 6, this assumption is obviously
not accurate. A more advanced decision rule that partitions the output space
into classification regions in a more clever manner would definitely lead to
better results. The main approach to implementing this is to observe the
outputs while training and to assign larger regions to the classification
groups experiencing the most noise or deviation. By providing more room for
error for the more noisy classification groups, the error rate will be driven
closer to zero and better balanced among all groups.
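One simple way to give noisier groups larger regions is to divide each distance by a per-group spread measured on the training outputs. The Python sketch below illustrates this idea only; it is an assumed approach that was not implemented or tested in this system, and it does not handle a group whose spread is zero. The one-dimensional example data are made up.

```python
import math
from statistics import mean


def fit_spreads(train_outputs, train_labels, constellation):
    """Average distance of each group's training outputs from its target point."""
    spreads = {}
    for label, point in constellation.items():
        dists = [math.dist(out, point)
                 for out, lab in zip(train_outputs, train_labels) if lab == label]
        spreads[label] = mean(dists)
    return spreads


def classify_scaled(output, constellation, spreads):
    """Nearest point after dividing each distance by that group's spread."""
    return min(constellation,
               key=lambda lab: math.dist(output, constellation[lab]) / spreads[lab])


# Made-up example: group "b" is much noisier than group "a".
constellation = {"a": (-1.0,), "b": (1.0,)}
outputs = [(-1.1,), (-0.9,), (0.2,), (1.8,)]
labels = ["a", "a", "b", "b"]
spreads = fit_spreads(outputs, labels, constellation)
decision = classify_scaled((-0.3,), constellation, spreads)
```

The output at -0.3 is closer to the point for "a", but because "b" is far noisier in training, the scaled rule assigns it to "b".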
MP3 FILES INSTEAD OF WAV FILES:
Given the popularity of the MP3
format for digital audio, a system that would take MP3 files as input instead
of WAV files is desired. The system presented in this paper can be easily
converted to take MP3 files as input by prepending an MP3-to-WAV converter
to Figure 1. This approach is valid and may be the best choice, but currently
converting MP3 files to WAV files is a computationally intense procedure that
requires a somewhat significant amount of execution time. However, as computer
performance continues to advance, this problem will become negligible. An
alternate approach is to design a system that works exclusively with MP3 files,
that is, extracts features directly from files in the MP3 format. The drawback
of this approach is that new methods for extracting features from highly
compressed data would have to be researched, and most of the current feature
extraction research would become irrelevant. However, highly compressed data
may contain valuable features not obvious in the uncompressed data, making such
research worthwhile. This leads to the idea of creating a hybrid system that
extracts features from both the WAV and MP3 versions of the file, thus using
the best of both worlds.
MORE CLASSIFICATION CATEGORIES:
To make a music classification tool useful,
the number of classification categories needs to be increased to more
than four. Implementing this improvement would involve work in
several different areas. For instance, one would need to find a way of
determining the constellation dimensionality needed to provide enough distance
between points to provide acceptable system performance. Another area that will
need work is the area of feature extraction, for more advanced feature extraction
may be necessary to provide a sufficient set of features for the neural network
to have enough information to discern more than four classifications. Also, a
more advanced decision rule will be needed to provide a clever partitioning
strategy of the output space so that categories experiencing more noise will be
given more room for error. One alternate approach to increasing the number of
categories is to set up a categorization system in the form of a tree
structure. If each node in the tree has a maximum of four children, then the
four-category classification system presented in this paper could be used to
move down the tree from node to node until a category at the bottom of the tree
is reached. Such a system would require a separate trained neural network for
each node in the tree, but it would avoid many of the issues discussed above
involved in implementing a flat categorization system.
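The tree-structured approach can be sketched in Python as follows. The `Node` class, the category names, and the threshold rules standing in for trained neural networks are all hypothetical.

```python
class Node:
    """One node of a category tree; leaf nodes have no children."""

    def __init__(self, name, classifier=None, children=None):
        self.name = name
        self.classifier = classifier      # callable: features -> child name
        self.children = children or {}    # child name -> Node


def classify_in_tree(root, features):
    """Walk down the tree, letting each node's classifier pick the next child."""
    node = root
    while node.children:
        node = node.children[node.classifier(features)]
    return node.name


# Stand-in classifiers: simple threshold rules in place of trained networks.
rock = Node("rock",
            classifier=lambda f: "artist A" if f[1] > 0 else "artist B",
            children={"artist A": Node("artist A"), "artist B": Node("artist B")})
root = Node("music",
            classifier=lambda f: "rock" if f[0] > 0 else "classical",
            children={"rock": rock, "classical": Node("classical")})

leaf = classify_in_tree(root, [1.0, -1.0])
```

Each internal node needs its own trained classifier, but no single classifier ever has to separate more than a handful of categories.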
REFERENCES
[1] B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.
[2] B. Widrow and M. Lehr, “30 Years of Adaptive Neural Networks,” Proceedings
of the IEEE, Vol. 78, No. 9, September 1990.
[3] M. Slaney, Auditory Toolbox for Matlab, Version 2, Interval Research
Corporation.
[4] H. Demuth and M. Beale, Neural Network Toolbox User’s Guide, Version 4, The
Mathworks, 2001.
[5] S. Haykin, Neural Networks, 2nd Edition, Prentice Hall, 1998.
[6] T. Zhang and J. Kuo, “Content Based Classification and Retrieval of Audio,”
SPIE's 43rd Annual Meeting - Conference on Advanced Signal Processing
Algorithms, Architectures, and Implementations VIII, SPIE Vol. 3461,
pp. 432-443, July 1998.