SynVAE :: personads

Translating Visual Works of Art into Music

Art is experienced as a flow of information between an artist and an observer. If the observer is impaired in the principal sense at which the artwork is aimed, however, a barrier appears. Such is the case for visually impaired people and paintings, for instance. The Synesthetic Variational Autoencoder attempts to overcome this obstacle by translating a painting's visual information into a more accessible sensory modality, namely music.

Read Thesis ICCVW Paper View Code

Synesthetic Variational Autoencoder

Architecture of the Synesthetic Variational Autoencoder

The Synesthetic Variational Autoencoder (SynVAE) is an unsupervised machine learning model which leverages properties of single-modality VAEs ^[1] in order to perform cross-modal translations. In this project, a visual VAE and MusicVAE ^[2] are combined in order to translate images into audio. This involves encoding an input image into music and then reconstructing the image from that audio alone. The reconstruction needs to stay as close as possible to the original while maintaining consistency across sensory modalities. Similar-looking images should therefore translate into similar-sounding music. Below, you are welcome to explore image-music pairs from this project's experiments.
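The data flow described above can be illustrated with a minimal sketch. It is not the project's implementation: the random linear maps stand in for trained networks, the exact wiring of the encoders and decoders is an assumption based on the description, and all names and dimensions (`image_encoder`, `music_decoder`, `LATENT_DIM`, `MUSIC_DIM`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

IMAGE_DIM = 28 * 28  # flattened MNIST-sized image (assumption)
LATENT_DIM = 16      # hypothetical latent size
MUSIC_DIM = 32       # hypothetical flattened note-sequence representation


def linear(in_dim, out_dim):
    """Return a random linear map standing in for a trained network."""
    weights = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ weights


# Hypothetical stand-ins for the four components of the pipeline.
image_encoder = linear(IMAGE_DIM, 2 * LATENT_DIM)  # outputs mean and log-variance
music_decoder = linear(LATENT_DIM, MUSIC_DIM)      # placeholder for a MusicVAE-style decoder
music_encoder = linear(MUSIC_DIM, LATENT_DIM)      # placeholder for a MusicVAE-style encoder
image_decoder = linear(LATENT_DIM, IMAGE_DIM)


def sample_latent(stats):
    """Reparameterization trick: z = mean + std * noise."""
    mean, log_var = np.split(stats, 2, axis=-1)
    noise = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * noise


def translate_and_reconstruct(images):
    """Image -> latent -> music -> latent -> reconstructed image."""
    z_visual = sample_latent(image_encoder(images))
    music = music_decoder(z_visual)          # the cross-modal translation
    z_audio = music_encoder(music)           # only the music is available here
    reconstruction = image_decoder(z_audio)  # image rebuilt from the audio alone
    return music, reconstruction


images = rng.random((4, IMAGE_DIM))
music, reconstruction = translate_and_reconstruct(images)

# A training objective would reward a low reconstruction error.
reconstruction_error = np.mean((images - reconstruction) ** 2)
```

Because the image decoder only ever sees what the music encoder recovers, any visual detail that is to survive reconstruction has to be carried by the music itself. In this sketch the maps are untrained, so the reconstruction error is only meaningful as an illustration of what would be minimized.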

MNIST

The classic MNIST dataset ^[3] contains monochrome images of handwritten digits and is used in many baseline machine learning tasks. Musical representations of the same digit sound noticeably more similar to each other than to those of different digits. A qualitative study confirmed that human listeners are able to distinguish between the digits "0", "1" and "4" with 73% accuracy by ear alone.

CIFAR-10

CIFAR-10 ^[4] contains small photographs of 10 distinct object classes. In contrast to the simpler MNIST data, SynVAE learns to prioritize higher-level features such as object placement and colour. A red car may therefore share more similarities with other red objects than with a car of a different colour.

BAM

Pink watercolour flower blossoms on white

Mountain in front of evening sky with moon

Continuous pattern of pink flowers with stem and leaves

Blue watercolour landscape with two humans and a deer

Ocean shore with houses in front of flaming sky

Grassy landscape with three houses and a lake

A smiling blue fish made from puzzle pieces

Mostly monochrome person in calavera makeup

Strongly stylized face of a person in dark grey and red tones

Two flowers with purple hue in tall grass

Dark skyscrapers turning into hounds and chasing people

The Behance Artistic Media dataset (BAM) ^[5] contains ∼2.3 million annotated contemporary works of visual art, of which we focus specifically on oil and watercolour paintings. Musical representations tend to encode higher-level information such as colour and overall structure. In a qualitative study, human evaluators were able to distinguish between images with three different emotion labels with 71% accuracy.

Van Gogh

Carafe and Dish with Citrus Fruit (1887)

The Old Church Tower at Nuenen ('The Peasants' Churchyard') (1885)

Congregation Leaving the Reformed Church in Nuenen (1884 - 1885)

Tree and Bushes in the Garden of the Asylum (1889)

Head of a Skeleton with a Burning Cigarette (1886)

Flowering Plum Orchard (after Hiroshige) (1887)

Bridge in the Rain (after Hiroshige) (1887)

Vincent van Gogh's extensive and varied body of work offers a unique environment for SynVAE. As a proof of concept, ∼1000 musically translated artworks from the Van Gogh Museum's permanent collection are available to download below. With a pre-trained model, this translation process takes less than one minute to complete. The archive contains artwork thumbnails and associated MIDI files as well as metadata and is available for non-commercial use.

Download

References

[1] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
[2] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. CoRR abs/1803.05428 (2018). arXiv:1803.05428 http://arxiv.org/abs/1803.05428
[3] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[4] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
[5] Michael J. Wilber, Chen Fang, Hailin Jin, Aaron Hertzmann, John Collomosse, and Serge Belongie. 2017. BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography. In The IEEE International Conference on Computer Vision (ICCV).