Machine learning, computer vision, and big data are hot topics in 2016. Clarifai offers an API that gives clients access to several pretrained models of tagged pictures. A user may also train a new model with their own images and tags ("concepts"). We wanted to give the Clarifai API a spin.
But there's an important question: How much of the Clarifai algorithm is baloney?
Suppose you train a new model with several dog and cat pictures. Predictions on new dog or cat photos are surprisingly accurate! Might Clarifai secretly use its large, well-trained general model to influence the training of a new one?
Surely, there's a way to check! Right?
The first test was random pixel data. It turns out the API handles random pixel data very well, tagging the image as abstract, a pattern, texture, design, decoration, color, art, and so on.
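For reference, a noise image like the one used in this test takes only a few lines of Python. This is a sketch, not the exact script we used; it writes a binary PPM so that nothing beyond the standard library is needed.

```python
import random

# Write a 256x256 image of uniformly random RGB pixels as a binary
# PPM (P6) file -- "pure noise" input of the kind used in this test.
# PPM is chosen only so the sketch needs no third-party libraries;
# any raster format the API accepts would do.
WIDTH = HEIGHT = 256
random.seed(0)  # reproducible noise

header = f"P6 {WIDTH} {HEIGHT} 255\n".encode("ascii")
body = bytes(random.randrange(256) for _ in range(WIDTH * HEIGHT * 3))

with open("random_noise.ppm", "wb") as f:
    f.write(header + body)
```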
We then proposed non-random visual data that Clarifai had a very low likelihood of ever having seen: music. What is digital music? Ones and zeros. What is a digital image? Ones and zeros.
What it does
Visual, spectral representations of music are generated with ARSS (the Analysis & Resynthesis Sound Spectrograph). These pictures, which show the frequency content of a song at each moment of playback, are used to train a model. Each picture is associated with a musical genre: rap, rock, classical, or jazz. The model can then predict the most likely genre of a previously unseen visual song.
How I built it
- Compile ARSS (C)
- Install Python and Clarifai Python module
- Download and categorize music samples
- Write shell scripts to automate conversion between formats and batch operations
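The batch-conversion step above can be sketched in Python. The directory layout (`samples/<genre>/*.wav`) is hypothetical, and the ARSS flag names are assumptions drawn from its manual (`--analysis` for sound-to-image mode, `--min-freq`/`--max-freq` for the frequency range, `--pps` for pixels per second); check them against your build. Commands are collected rather than executed so the sketch stays self-contained; in a real run, each would go to `subprocess.run`.

```python
from pathlib import Path

# Hypothetical genre folders of WAV samples to convert.
GENRES = ["rap", "rock", "classical", "jazz"]

def arss_command(wav: Path, bmp: Path) -> list[str]:
    # Build one ARSS invocation that renders a spectrogram BMP.
    # Flag names are assumed from the ARSS manual, not verified here.
    return [
        "arss", "--analysis", str(wav), str(bmp),
        "--min-freq", "27.5",   # lowest piano note
        "--max-freq", "20000",
        "--pps", "100",         # horizontal resolution: 100 px/second
    ]

commands = [
    arss_command(wav, Path("spectrograms", genre, wav.stem + ".bmp"))
    for genre in GENRES
    for wav in sorted(Path("samples", genre).glob("*.wav"))
]
```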
Challenges I ran into
Initially, the trained model returned no results through the Clarifai web-based API. Tagging image data by hand is also time-consuming.
Accomplishments that I'm proud of
What's next for Convoluted Aural Net / Beep Learning
It might be possible to improve the accuracy of the model by splitting each song into 10-second chunks. (This may or may not require a window function.) Additionally, the spectral representation uses equal horizontal spacing between octaves, so the images could be split by octave and averaged. A larger training set and more categories should also produce more robust results.
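Both ideas above can be sketched with NumPy. The sketch assumes the spectrogram is a 2-D array whose frequency axis (taken as axis 0 here) covers a whole number of equally spaced octaves; the array below is a toy stand-in for an actual ARSS image.

```python
import numpy as np

def average_octaves(spec: np.ndarray, n_octaves: int) -> np.ndarray:
    """Cut the frequency axis into equal per-octave bands and average
    them into a single octave-sized image."""
    bands = np.array_split(spec, n_octaves, axis=0)
    return np.mean(bands, axis=0)

def ten_second_chunks(spec: np.ndarray, px_per_second: int) -> list:
    """Split along the time axis into 10-second pieces (the last piece
    may be shorter); a window function could be applied to each here."""
    step = 10 * px_per_second
    return [spec[:, i:i + step] for i in range(0, spec.shape[1], step)]

# Toy spectrogram: 8 frequency rows (4 octaves x 2 rows) by 25 columns.
spec = np.arange(8 * 25, dtype=float).reshape(8, 25)
folded = average_octaves(spec, 4)     # collapses 4 octaves into one band
chunks = ten_second_chunks(spec, 1)   # at 1 px/s: 10-, 10-, 5-column pieces
```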
The ARSS developer had good success recreating voice audio by freehand-copying spectral data in Photoshop. It might be possible to generate random music in the spectral format through a combination of rule-based MIDI design (a la David Cope) and an LSTM RNN, while verifying that the generated song matches the Clarifai model for the target genre.
A web-based interface and better scripting would let end users run this software and view their music in the piano-roll visual format provided by ARSS. There are opportunities for better data visualization, including histograms of the model output and enhancements applied to the pictorial music. Classification accuracy could also be refined by crowdsourcing the tagging process, with an implementation similar to RoboBrain.