Translating Visual Works of Art into Music

Art is experienced as a flow of information between an artist and an observer. If the observer is impaired in the principal sense an artwork is aimed at, however, a barrier appears. This is the case for visually impaired people and paintings, for instance. The Synesthetic Variational Autoencoder attempts to overcome this obstacle by translating a painting's visual information into a more accessible sensory modality, namely music.

Read Thesis ICCVW Paper View Code

Synesthetic Variational Autoencoder

Architecture of the Synesthetic Variational Autoencoder

The Synesthetic Variational Autoencoder (SynVAE) is an unsupervised machine learning model that leverages properties of single-modality VAEs [1] to perform cross-modal translations. In this project, a visual VAE and MusicVAE [2] are combined to translate images into audio. This involves encoding an input image into music and then reconstructing the image from that audio alone. The reconstruction needs to stay as close as possible to the original while maintaining consistency across sensory modalities. Similar-looking images should therefore translate into similar-sounding music. Below, you are welcome to explore image-music pairs from this project's experiments.
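The round trip described above (image to music, then music back to image) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's implementation: untrained linear maps stand in for the visual encoder and decoder and for MusicVAE's decoder and encoder, the sampling step and KL regularisation of a real VAE are omitted, and all names and dimensions (`image_to_music`, `music_to_image`, `IMAGE_DIM`, and so on) are hypothetical.

```python
import random

random.seed(0)

IMAGE_DIM, LATENT_DIM, MUSIC_DIM = 16, 4, 8  # toy sizes, chosen arbitrarily


def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights."""
    return [[random.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def apply(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Hypothetical stand-ins for the four networks (untrained linear maps).
W_img_enc = random_matrix(LATENT_DIM, IMAGE_DIM)  # visual encoder
W_mus_dec = random_matrix(MUSIC_DIM, LATENT_DIM)  # music decoder
W_mus_enc = random_matrix(LATENT_DIM, MUSIC_DIM)  # music encoder
W_img_dec = random_matrix(IMAGE_DIM, LATENT_DIM)  # visual decoder


def image_to_music(image):
    """Encode an image into a latent code, then decode that code as 'music'."""
    return apply(W_mus_dec, apply(W_img_enc, image))


def music_to_image(music):
    """Re-encode the 'music' into a latent code, then decode an image from it."""
    return apply(W_img_dec, apply(W_mus_enc, music))


def squared_error(a, b):
    """Sum of squared differences between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


image = [random.random() for _ in range(IMAGE_DIM)]
other_image = [random.random() for _ in range(IMAGE_DIM)]

# Round trip: the reconstruction only ever sees the musical representation.
music = image_to_music(image)
reconstruction = music_to_image(music)
reconstruction_loss = squared_error(image, reconstruction)

# Consistency: an image moved 1% of the way toward another image should
# yield music that is closer to the original's than the other image's music.
near_image = [x + 0.01 * (y - x) for x, y in zip(image, other_image)]
near_gap = squared_error(music, image_to_music(near_image))
far_gap = squared_error(music, image_to_music(other_image))
```

In this sketch, `reconstruction_loss` plays the role of the quantity a training procedure would minimise, while comparing `near_gap` with `far_gap` illustrates the consistency requirement. With linear maps that property holds by construction; a trained nonlinear model has to learn it.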



The classic MNIST dataset [3] contains monochrome images of handwritten digits and is used in many baseline machine learning tasks. Musical representations of the same digit sound noticeably more similar to each other than to those of different digits. A qualitative study confirmed that human listeners are able to distinguish between the digits "0", "1" and "4" with 73% accuracy by ear alone.


Small bird sitting on branch
Old red car
Two fighter jets flying
Small kitten
Horse on a field
Red fire engine
Small red bird on a branch
Blue car
Airborne airplane
Deer in high grass
Frog in grass
Head of a horse
Red truck

CIFAR-10 [4] contains small photographs of objects from 10 distinct classes. On this data, unlike on the simpler MNIST images, SynVAE learns to prioritize higher-level features such as object placement and colour. The music for a red car may therefore share more similarities with that of other red objects than with that of a car of a different colour.


Orange bird on a branch
Pink watercolour flower blossoms on white
Swirls of blue watercolour on white
Summer hat with ribbon on light green
Mountain in front of evening sky with moon
Woman surrounded by flowers
Continuous pattern of pink flowers with stem and leaves
Blue watercolour landscape with two humans and a deer
Two bears hugging in a field
Ocean shore with houses in front of flaming sky
Light purple flowers on grass
Grassy landscape with three houses and a lake
A smiling blue fish made from puzzle pieces
Mostly monochrome person in calavera makeup
Strongly stylized face of a person in dark grey and red tones
Two flowers with purple hue in tall grass
Forest landscape with a lake
Colourful fish swimming in blue
Dark skyscrapers turning into hounds and chasing people
Stylized face of a person in dark grey

The Behance Artistic Media dataset (BAM) [5] contains approximately 2.3 million annotated contemporary works of visual art, of which we focus specifically on oil and watercolour paintings. Musical representations tend to encode higher-level information such as colour and overall structure. In a qualitative study, human evaluators were able to distinguish between images with three different emotion labels with 71% accuracy.

Van Gogh

Vincent van Gogh's extensive body of work offers a unique environment for SynVAE. As a proof of concept, approximately 1000 musically translated artworks from the Van Gogh Museum’s permanent collection are available to download below. With a pre-trained model, this translation process takes less than one minute to complete. The archive contains artwork thumbnails and associated MIDI files as well as metadata and is available for non-commercial use.
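For readers who want to inspect the MIDI files programmatically, the following minimal sketch parses the header chunk of a Standard MIDI File using only Python's standard library. The function name `read_midi_header` is hypothetical, and the example bytes are constructed in memory for illustration rather than taken from the archive, whose internal layout is not described here.

```python
import struct


def read_midi_header(data: bytes) -> dict:
    """Parse the 14-byte header chunk at the start of a Standard MIDI File."""
    if len(data) < 14:
        raise ValueError("too short to contain a MIDI header")
    # Big-endian: 4-byte chunk id, 4-byte chunk length, then three 16-bit fields.
    chunk_id, length, fmt, n_tracks, division = struct.unpack(">4sIHHH", data[:14])
    if chunk_id != b"MThd" or length < 6:
        raise ValueError("not a Standard MIDI File")
    # If the top bit of `division` is 0, it counts ticks per quarter note.
    return {"format": fmt, "tracks": n_tracks, "division": division}


# Illustrative header: format 1, 2 tracks, 480 ticks per quarter note.
example = b"MThd" + struct.pack(">IHHH", 6, 1, 2, 480)
header = read_midi_header(example)
```

For a file on disk, the same function can be applied to the bytes returned by reading the file in binary mode.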



  1. Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
  2. Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. CoRR abs/1803.05428 (2018). arXiv:1803.05428
  3. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  4. Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
  5. Michael J. Wilber, Chen Fang, Hailin Jin, Aaron Hertzmann, John Collomosse, and Serge Belongie. 2017. BAM! The Behance Artistic Media Dataset for Recognition Beyond Photography. In The IEEE International Conference on Computer Vision (ICCV).