miMic is an augmented microphone: an ordinary microphone embedding an Inertial Measurement Unit (IMU) and two buttons. It senses voice and gesture under four modes of use. miMic is the "pencil" for sketching sound.
A need and an idea: February 13, 2015, on a plane to Marseille.
Sketching sound by voice and gesture: An activity that needs a tool as immediate as a pencil is for sketching on paper.
It would be a microphone, with two buttons and a camera. Two buttons would afford four modes, but only two are described here: Select and Play.
The camera captures gestural actions that accompany vocal sketching.
Bodystorming with StefanoDM performing vocal sketching while manipulating a fake microphone and a fake iron. StefanoB plays the role of the synthesizer. First, the sound model is selected. Second, sound synthesis from the selected model is controlled.
A camera is probably not a suitable means to detect the gestures associated with sound sketching, for the following reasons:
Instead, an Inertial Measurement Unit (IMU) would allow detecting movement and orientation of the microphone.
The following papers propose sensor-augmented microphones. They serve as background for the design of miMic, the sketching microphone.
A prototype developed by Sennheiser and presented at NIME 2012. It was also presented by Veronique Larcher at the SMC Summer School in Copenhagen, in the same year.
A sensorized microphone stand presented at NIME 2003.
Microphone for vocal augmentation and modifications, presented at NIME 2012.
The main questions here are:
To give an answer to these questions, we decide that the microphone should be graspable with a single hand (like a pencil is) and actuated with a couple of fingers. Given this requirement the attention is focused on stage microphone, thus excluding studio configurations that are not supposed to be manipulated. The two possible shapes are
In the initial idea and sketch, the buttons were put on the stand or on a separate button pad. However, we soon realized that it is much better to have everything in one hand. For the gelato shape, there is ample possibility to accommodate buttons on the stick, as in the Sennheiser prototype described in the literature research. However, the classic roundish shape is preferred as
Microphone (to be hacked): http://www.soundsationmusic.com/?p=25891
Pushbutton, latching with light (one white, one blue): https://www.sparkfun.com/products/11975
IMU Adafruit LSM9DS0: http://www.adafruit.com/products/2021
Arduino Nano: http://arduino.cc/en/Main/arduinoBoardNano
Two 220 Ω resistors.
Jumpers and wires.
Segments of metal tube.
The two buttons have been put on top of the frontal shell. Two holes have been drilled and pieces of metal tube have been used to raise the buttons a bit (construction, Silvano Rocchesso).
The microcontroller+IMU+button combination has been tested with breadboard+clip wiring.
The Adafruit tutorial contains all information for wiring: https://learn.adafruit.com/adafruit-lsm9ds0-accelerometer-gyro-magnetometer-9-dof-breakouts/overview. The modification to Arduino code to handle the buttons is trivial.
In the microphone shell there is just enough space to host the two buttons, the IMU, and the Arduino Nano. For the latter, it is convenient to use one of the two holes of the plastic board that keeps the microphone capsule suspended. The Nano can be embedded in that hole, perpendicular to the plastic board (see picture).
The buttons have five pins each, numbered 0 to 4 in the depicted schematics. Pins 0 and 1 are short-circuited (little red wire) and connected to a digital input of the Nano, pin 2 goes to +5V, and pin 0 is grounded.
It is necessary to drill a rectangular hole on the bottom part of the back shell to plug the USB cable.
To keep the parts easily removable, soldering has been limited to a minimum, and jumper wires have been used, although this causes quite a bit of clutter.
In order to test that buttons and motion sensor are working properly when the microphone is manipulated, I used the Processing sample code from the tutorial on how to make an Attitude and Heading Reference System (AHRS): https://learn.adafruit.com/ahrs-for-adafruits-9-dof-10-dof-breakout/introduction
I associated a white light to the white button and a blue light to the blue button to illuminate the rabbit.
For this test, the AHRS has not been calibrated, and the IMU board was just put into the microphone shells with no care about its orientation and firm positioning. That is why the rabbit is not axis-aligned with the microphone.
Selection of one or more sound models is operated by automatic classification of vocal imitations. There are two different approaches to the design of this function:
To demonstrate the "Select" mode of miMic we implement a basic model selector based on a classification tree. The construction of such a classifier is based on the following steps:
Consider the following classes of sounds (models):
Extract 1329 examples of these classes from the Ircam imitation database (ref.).
Condition the extracted examples so that they are normalized to -1 dB FS and are at least 4 seconds long.
Consider the following set of feature extractors that are part of the Sound Design Toolkit (sdt.spectralfeats~, sdt.pitch~, and sdt.envelope~):
For each feature, computed on windows of 4096 samples with an overlap of 75%, compute the median and IQR (interquartile range) over the 4-second length. For the sdt.spectralfeats~ object, the parameters minFreq and maxFreq are set to 50 Hz and 5000 Hz. sdt.pitch~ has an additional tolerance parameter set to 0.2, while sdt.envelope~ has attack and release set to 10 msec and 1000 msec, respectively. The ratio between IQR and median is computed for the Envelope and RMS features, since they depend on the signal level. Values are sampled every 20 msec. The Max/MSP patch produces a line of text for each imitation example, which includes a label and the sequence of feature values.
A Matlab script then derives the binary classification tree.
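The steps above can be sketched in Python as a rough analogue; the actual feature extraction runs in Max/MSP with the SDT objects, and the tree is derived in Matlab, so the padding strategy, the feature dimensionality, the placeholder class labels, and the sklearn substitute for the Matlab tree are all assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def condition(x, sr, peak_db=-1.0, min_len_s=4.0):
    """Normalize the peak to -1 dB FS and zero-pad to at least 4 s
    (zero-padding is an assumed strategy for short imitations)."""
    peak = np.max(np.abs(x))
    if peak > 0:
        x = x * (10 ** (peak_db / 20.0)) / peak
    n_min = int(min_len_s * sr)
    if len(x) < n_min:
        x = np.pad(x, (0, n_min - len(x)))
    return x

def summarize(track, level_dependent=False):
    """Median and IQR of one feature track; level-dependent features
    (Envelope, RMS) are summarized by the IQR/median ratio instead."""
    med = float(np.median(track))
    q1, q3 = np.percentile(track, [25, 75])
    iqr = float(q3 - q1)
    if level_dependent:
        return (iqr / med,) if med else (0.0,)
    return (med, iqr)

# Hypothetical training set: one row of Median/IQR summaries per
# imitation, five placeholder labels standing in for the real classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3 * i, scale=0.3, size=(20, 6)) for i in range(5)])
y = np.repeat(["classA", "classB", "classC", "classD", "classE"], 20)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
```

The sklearn tree is only a stand-in for the Matlab-derived one; the point is the shape of the pipeline, not the specific induction algorithm.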
To demonstrate the "Play" mode of miMic we use a Max patch that collects the five used sound models.
For each sound model, a control layer has been built and tuned by Stefano Delle Monache, in such a way that the vocalizations get immediately interpreted as control signals. The assumption is that if a user selects a sound model (e.g., wind) then she will start controlling the model by producing wind-like sounds with the voice. So, the control layer associated with the model must be ready to interpret that kind of control sound. Only at a later stage might the user want to explore different vocal emissions and tune parameters by hand. Both of these actions are made possible by the graphical interface, where detailed maps between vocal features and parameters can be drawn, and each individual model parameter can be manually set.
In the construction of the control layer, we must consider the limits of humans in controlling the dimensions of timbre, as shown by the following study:
169th Meeting Acoustical Society of America
18–22 May 2015
Vocal imitations of basic auditory features.
Guillaume Lemaitre, Ali Jabbari, Olivier Houix, Nicolas Misdariis, and
We recently showed that vocal imitations are effective descriptions of a variety of sounds (Lemaitre and Rocchesso, 2014). The current study investigated the mechanisms of effective vocal imitations by studying if speakers could accurately reproduce basic auditory features. It focused on four features: pitch, tempo (basic musical features), sharpness, and onset (basic dimensions of timbre). It used two sets of 16 referent sounds (modulated narrow-band noises and pure tones), each crossing two of the four features. Dissimilarity rating experiments and multidimensional scaling analyses confirmed that listeners could easily discriminate the 16 sounds based on the four features. Two expert and two lay participants recorded vocal imitations of the 32 sounds. Individual analyses highlighted that participants could accurately reproduce the pitch and tempo of the referent sounds (experts being more accurate). There were larger differences of strategy for sharpness and onset. Participants matched the sharpness of the referent sounds either to the frequency of one particular formant or to the overall spectral balance of their voice. Onsets were ignored or imitated with crescendos. Overall, these results show that speakers may not imitate accurately absolute dimensions of timbre, hence suggesting that other features (such as dynamic patterns) may be more effective for sound recognition.
Models for sound synthesis:
Voice-driven sound synthesis in miMic is achieved through a subset of the sound models palette available in the Sound Design Toolkit (SDT). The SDT is a software package providing advanced, perception-oriented, and physically consistent sound synthesis models that cover a range of acoustic phenomena: basic mechanical interactions (i.e., everyday sounds) and machines.
The software package and documentation can be downloaded at https://github.com/SkAT-VG/SDT/releases
Control and Mapping:
Each sound model is provided with a customizable control layer that connects the vocal and gestural descriptors to the interactive parameters. While one-to-one and one-to-many mappings are trivial to accomplish, many-to-one associations are achieved through specific JavaScript (js) functions, which can be recalled and edited directly in the control layer. Finally, a control module per parameter allows smoothing, scaling, and possibly distorting the audio feature value into the range meaningful for the control parameter.
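The per-parameter control module can be sketched as follows; this is a minimal Python analogue of what the Max control layer does, assuming one-pole smoothing, linear range scaling, and power-law distortion (the class name and coefficient names are invented for illustration):

```python
class ParamControl:
    """Sketch of a per-parameter control module: smooth, scale, and
    optionally distort a normalized audio feature into the range
    meaningful for one synthesis parameter."""

    def __init__(self, lo, hi, smooth=0.9, curve=1.0):
        self.lo, self.hi = lo, hi      # target parameter range
        self.smooth = smooth           # one-pole smoothing coefficient
        self.curve = curve             # 1.0 = linear; != 1.0 distorts
        self.state = 0.0

    def __call__(self, x):
        """x is the incoming feature value, assumed normalized to [0, 1]."""
        self.state = self.smooth * self.state + (1.0 - self.smooth) * x
        shaped = self.state ** self.curve
        return self.lo + (self.hi - self.lo) * shaped
```

For example, a pitch feature could be mapped to a model's rate parameter with `ParamControl(0.1, 20.0, smooth=0.8, curve=2.0)`, where the curve compresses small vocal variations.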
The opposite approach is to train a classifier with examples provided by a specific user. We tested this approach using MuBu objects for content-based real-time interactive audio processing, under Cycling’74 Max.
The MuBu.gmm object extracts Mel-Frequency Cepstral Coefficients (MFCC) from audio examples (at least one example per sound class), and models each sound class as a mixture of Gaussian distributions. In the recognition phase, the likelihood for each class is estimated, and it can be used as a mixing weight for the corresponding sound model.
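The same idea can be sketched outside Max; here is a hedged Python analogue using one Gaussian mixture per class over MFCC frames, with per-class likelihoods softmaxed into mixing weights (sklearn stands in for MuBu, and the function names are invented):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(examples, n_components=1):
    """One GMM per sound class; examples maps a class label to an
    array of MFCC frames, shape (n_frames, n_mfcc)."""
    return {label: GaussianMixture(n_components, random_state=0).fit(frames)
            for label, frames in examples.items()}

def mixing_weights(gmms, frames):
    """Average log-likelihood of the incoming MFCC frames under each
    class GMM, softmaxed into weights for the corresponding models."""
    labels = list(gmms)
    ll = np.array([gmms[l].score(frames) for l in labels])
    w = np.exp(ll - ll.max())
    return dict(zip(labels, w / w.sum()))
```

In use, the weight of each class would continuously scale the output of its sound model, so that the recognition stays soft rather than a hard selection.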
The classification tree is implemented in a Max/MSP patch for online recognition.
The vocal input undergoes the same processing adopted for the offline training, so each imitation is analyzed and its median and IQR values are fed to the classification tree. The classifier selects one of the classes, and a synthetic example of that class is then played back to the user.
The individual-centered selection of sound models, when included in miMic, implies a further sub-mode of use, namely the Train procedure. In our realization, the Train procedure is activated when both buttons of miMic are pushed. To avoid using the GUI for this personalization stage, the user is asked to produce one vocal imitation for each of the five sound classes in a precise order, with an audible feedback marking the start and end of each imitation.
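The guided Train procedure amounts to a small fixed-order loop; as a sketch, with `record` and `beep` as hypothetical stand-ins for the audio I/O of the real patch:

```python
def train_procedure(classes, record, beep):
    """Guided Train sub-mode: for each class, in a fixed order, an
    audible cue marks the start and the end of one recorded imitation.
    `record` and `beep` are placeholders for the actual audio I/O."""
    examples = {}
    for c in classes:
        beep()                 # cue: start imitating class c
        examples[c] = record() # one vocal imitation for this class
        beep()                 # cue: imitation captured
    return examples
```

The collected examples would then feed the user-specific classifier (e.g., one GMM per class), with no GUI interaction required.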