Elephants, Rainbows and Scrambled Eggs at NeurIPS 2020

This post is part of the ELLIS CPH NeurIPS 2020 Reading Group. Many thanks to the organizers and group members for the interesting discussions.

NeurIPS 2020 has come to a close, and although this year's virtual edition was spent in the company of browser tabs instead of peers, ELLIS Copenhagen kindly organized a reading group for home-bound PhDs. Here, I would like to summarize personal impressions from our discussions by placing some of the conference's papers into an overarching natural language processing context.

The 175 Billion Parameter Elephant in the Room

Within the NLP community, "Language Models are Few-Shot Learners" (Brown et al., 2020) [1] certainly provided a topic for debate. Weighing in at 175 billion trainable parameters, GPT-3 demonstrates how far language models can be pushed with the data and computational resources currently available. It does not introduce any new architectural elements, but competes exclusively on scale. For comparison, Azevedo et al. (2009) [2] estimate the number of neurons in the human brain at ~86 billion.

GPT-3 hands over the "Nuuk" token when prompted for Greenland's capital city

Despite only being trained to predict tokens in a sequence based on their surrounding context, GPT-3 performs surprisingly well on a large number of tasks. At inference time, it merely receives a variable number of generic task examples in the format "Context [...] Target Completion [...] Correct/Incorrect Answer [...]" plus the target it should predict:
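To make the few-shot setup a little more concrete, here is a minimal sketch of how such a prompt could be assembled. The function name `build_few_shot_prompt` and the simplified "Context:" / "Completion:" labels are my own assumptions for illustration, not the exact template used by Brown et al.

```python
def build_few_shot_prompt(examples, query_context):
    """Concatenate K solved examples followed by one unsolved query.

    The "Context:" / "Completion:" labels are a hypothetical, simplified
    layout and not the exact format from the GPT-3 paper.
    """
    blocks = []
    for context, completion in examples:
        # Each demonstration shows a context together with its answer.
        blocks.append(f"Context: {context}\nCompletion: {completion}")
    # The final block leaves the completion empty for the model to fill in.
    blocks.append(f"Context: {query_context}\nCompletion:")
    return "\n\n".join(blocks)


examples = [
    ("The capital of Denmark is", "Copenhagen"),
    ("The capital of Iceland is", "Reykjavik"),
]
prompt = build_few_shot_prompt(examples, "The capital of Greenland is")
print(prompt)
```

Passing an empty list of examples gives the zero-shot case, where the model only sees the query itself. No weights are updated at any point; the examples live purely in the input text.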

It achieves competitive results compared to task-specific, fine-tuned models on a diverse set of problems including:

Given the size of the training corpus, the authors make an effort to prevent rote memorization. They scan for and remove sequences which co-occur in both the training and testing data. As stated in Appendix C, however:

Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model.

Further analyses do show that this accidental overlap has a negligible impact on performance, but the infeasibility of retraining the model provides a first look at the special considerations necessary for this extremely scale-reliant approach.
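The overlap scan described above can be sketched as an n-gram intersection between training and test documents. This is only a toy illustration under my own assumptions: the helper names, whitespace tokenization, the window size `n=5` and the choice to drop whole documents are all hypothetical and not the authors' pipeline.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_overlaps(train_docs, test_docs, n=5):
    """Map each training document index to the n-grams it shares with the test set."""
    test_ngrams = set()
    for doc in test_docs:
        test_ngrams |= ngrams(doc.split(), n)
    overlaps = {}
    for idx, doc in enumerate(train_docs):
        shared = ngrams(doc.split(), n) & test_ngrams
        if shared:
            overlaps[idx] = shared
    return overlaps


train = [
    "the quick brown fox jumps over the lazy dog",
    "an unrelated sentence about scrambled eggs and toast",
]
test = ["yesterday the quick brown fox jumps over a fence"]

flagged = find_overlaps(train, test, n=5)
# Simplification: drop every flagged training document entirely.
clean_train = [doc for i, doc in enumerate(train) if i not in flagged]
print(sorted(flagged), clean_train)
```

Even in this toy form, the check needs a full pass over the training corpus, which hints at why a bug at this stage is expensive to fix after the fact.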

From a purely methodological standpoint, the main contribution of GPT-3 is to demonstrate the current upper bound of a purely data-driven approach to NLP. The broader implications of this model will be discussed later (see Broader Impact), but let us first take a look at alternative approaches to advancing the field.

See No Latent, Hear No Latent, Speak No Latent

One avenue for exploration, also mentioned by Brown et al., is using not just more data, but different data. Within the broad spectrum of multimodality, the "Self-Supervised MultiModal Versatile Networks" of Alayrac et al. (2020) [7] provided an interesting angle on how to combine language with information from visual and auditory modalities.

In experiments on action classification as well as audio and video retrieval from text, they show that it is not sufficient to feed the model low-level, audio-visual input concatenated with abstract, higher-level language information. Instead, their MultiModal Versatile Network learns joint representations for audio and video in addition to a separate, aligned language space. This hierarchical setup achieves competitive performance on a variety of tasks, such as retrieving video of Gordon Ramsay cooking scrambled eggs from the YouCook2 dataset (Zhou, Xu and Corso, 2017) [8].

The next logical step is of course to build a robot which does the scrambling for you. Luckily, Stepputtis et al. make some headway in this area with "Language-Conditioned Imitation Learning for Robot Manipulation Tasks" [9]. They extend the standard reinforcement learning setting of having a (simulated) robot arm move around objects of various shapes and colours by defining the target through user-friendly natural language descriptions.

As with Alayrac et al., this multimodal architecture is not just a concatenation of visual and textual inputs: visual features describing potential target objects in the environment are weighted as a function of the task description's sentence embedding. The resulting "semantic model" provides crucial information for the RL agent to complete its tasks successfully.
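One generic way to weight object features by a sentence embedding is dot-product attention. The sketch below uses that as a stand-in; it is an assumption on my part and not necessarily how the authors' semantic model is built. The toy feature layout and the function `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(object_features, sentence_embedding):
    """Weight candidate object features by their match with the sentence embedding."""
    weights = softmax([dot(f, sentence_embedding) for f in object_features])
    dim = len(object_features[0])
    # Weighted sum of the object features, one value per feature dimension.
    summary = [sum(w * f[d] for w, f in zip(weights, object_features))
               for d in range(dim)]
    return weights, summary


# Hypothetical toy features per object: [is_red, is_blue, is_round]
objects = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
# Stand-in embedding for a command such as "pick up the red bowl"
command = [2.0, 0.0, 2.0]

weights, summary = attend(objects, command)
print(weights, summary)
```

The first object matches both "red" and "round" and therefore receives the largest weight, so the downstream controller mostly sees the features of the intended target.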

The authors take care to collect a diverse set of verbal commands from different people and to also test their agent in a variety of adversarial conditions. Of course, the one GRU cell (Cho et al., 2014) [10] used to recurrently construct sentence embeddings from GloVe vectors (Pennington, Socher and Manning, 2014) [11] might not feel like the state-of-the-art in NLP terms, but somewhere in between this and GPT-3, there definitely lies a middle ground for future work. I for one am all in on the Gordon Ramsay NLP robot arm.

Robot arm says: "Finally some good frickin' food!"

Language Oscillates

Although the aforementioned papers in one way or another advocate for "more", it is equally important to take a step back and reduce existing methods to their essentials. Tamkin, Jurafsky and Goodman do exactly this in "Language Through a Prism" [12].

Research into information stored in contextual language models often looks at embeddings from layers at different depths. Tamkin et al. take an almost orthogonal approach by horizontally looking at frequency variations in sequential neuron activations. The idea comes from audio processing where Fourier transforms are applied to a signal in order to decompose it into simpler wave functions (e.g. this used to be how Shazam fingerprinted music).

The method applied here is a real-valued equivalent named the discrete cosine transform (Ahmed, Natarajan and Rao, 1974) [13]. Sequences of individual neuron activations are represented as a sum of cosine waves with different periods and amplitudes. Tamkin et al. identify that higher-frequency (i.e. faster changing) values in BERT embeddings (Devlin et al., 2018) [14] correspond to word-level information while lower-frequency (i.e. slower changing) values correspond to longer term structures such as documents.
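For a feel of what the transform does, here is a small pure-Python sketch of the orthonormal DCT-II applied to the activations of a single neuron across a token sequence. The two activation sequences are made up for illustration, and `dominant_frequency` is a hypothetical helper of mine rather than anything from the paper.

```python
import math


def dct(signal):
    """Orthonormal DCT-II: coefficient k measures a cosine wave with k half-cycles."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coeffs.append(scale * total)
    return coeffs


def dominant_frequency(signal):
    """Index of the DCT coefficient with the largest magnitude."""
    coeffs = dct(signal)
    return max(range(len(coeffs)), key=lambda k: abs(coeffs[k]))


# Made-up activations of one neuron over 8 consecutive tokens
slow = [1.0] * 8          # never changes: document-like, lowest frequency
fast = [1.0, -1.0] * 4    # flips at every token: word-like, highest frequency

print(dominant_frequency(slow), dominant_frequency(fast))
```

The constant sequence puts all of its energy into coefficient 0, while the alternating one peaks at the highest index, which mirrors the slow versus fast distinction described above.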

Frequency decomposition is like jumping rope at different lengths

For word-level POS-tagging, performance is best when all but the highest frequencies are filtered out (similarly to a high-pass audio filter). On the flip side, mid to low frequency embeddings even outperform the original, non-filtered embeddings on document-level topic classification and sentence-level dialogue speech act classification.
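Filtering of this kind can be sketched as: transform a neuron's activation sequence with the discrete cosine transform, zero out the unwanted coefficients and transform back. The code below is a self-contained toy version under my own assumptions; the band boundaries and the example activations are hypothetical and do not correspond to the bands used in the paper.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        out.append(total * (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)))
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1 / n)
        total += sum(c * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k, c in enumerate(coeffs) if k > 0)
        out.append(total)
    return out


def band_filter(signal, low, high):
    """Keep DCT coefficients with index in [low, high) and zero out the rest."""
    kept = [c if low <= k < high else 0.0 for k, c in enumerate(dct(signal))]
    return idct(kept)


# Made-up activations: a slow upward trend plus a token-level flip
activations = [i / 8 + 0.5 * (-1) ** i for i in range(8)]
low_band = band_filter(activations, 0, 4)    # low-pass: slow trend
high_band = band_filter(activations, 4, 8)   # high-pass: fast, word-level changes
print(low_band, high_band)
```

Because the transform is linear, the two bands add back up to the original activations, so filtering only selects which scale of information a downstream classifier gets to see.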

Additionally, by enforcing frequency decomposition at training time using a single "prism" layer on top of BERT-Base [14], performance improves by 18.8% on topic and by 6.9% on speech act classification. There is a 1.5% decrease on the POS-tagging task, but these results nonetheless highlight how the fairly local masked language modeling objective can be improved to capture information at all scales.

Overall, this paper was a pleasant reminder that reducing current methods to their core can reveal interesting insights which can be applied to improve the very same models without a single point of additional data.

Broader Impact

This year's NeurIPS papers included so-called "broader impact" statements. At first glance, these might seem like another hoop to jump through in the submission process, yet they were mostly well thought out and guided many of our reading group discussions. Theoretical research lends itself less readily to such analyses; Tamkin et al. [12], for instance, close their paper by restating the double-edged risk of misuse and labour displacement that comes with the increased efficiency of any ML model.

Stepputtis et al. [9] correctly argue that before their language-driven robotics system can be deployed on, say, a voice-controlled wheelchair, much more research must go into adversarial scenarios. Alayrac et al. [7] argue for the benefits their multimodal model can provide for retrieving relevant information and filtering out potentially harmful content. However, they also state the risk that biases in the training data are likely to be perpetuated and to influence both processes.

Brown et al. [1] dedicate a large portion of their paper to analysing the broader impact of GPT-3. Considering the large amounts of internet data it ingested, biases from these sources are reflected in the probabilities which the language model assigns to co-occurring terms.

For instance, asking the model to predict a gendered identifier in relation to most occupations leads to masculine-leaning results (e.g. "The competent [occupation] is a [prediction]."). Conversely, the presence of female identifiers induced a higher probability for appearance-related words in the surrounding context. Data-induced biases were also observed for sentiment regarding racial attributes and for high-probability words which are predicted for religious affiliations.

On the scale of documents, the authors also outline the potential for misuse of GPT-3, as it can generate realistic but potentially misinformative news articles much faster than any human. Using the as yet unreleased 175 billion parameter model, ~80 participants correctly distinguished generated from real articles only about 52% of the time, which is barely better than guessing.

As a final note, training GPT-3's largest variant truly represents an upper bound of what is currently possible. Anthony, Kanding and Selvan (2020) [15] break down the energy usage with their carbontracker tool and estimate that training all 175 billion parameters would take roughly 188,701.92 kWh of electricity, corresponding to the CO2 emissions of a 2018 EU car driven for 703,808.01 km. With a minimum distance of ~356,500 km, you could take a trip to the Moon and almost make it back (running out of gas close to the GPS satellites' orbits - drive safely). Alternatively, you could use a green energy provider which in my case would only rack up a bill of around €43,000 (incl. tax).
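For those who want to check the road trip, the arithmetic can be spelled out as follows. All inputs are the figures quoted above; the derived values (trips, shortfall and implied electricity price) are just my own back-of-the-envelope calculations from them.

```python
energy_kwh = 188_701.92    # estimated training energy quoted above
car_km = 703_808.01        # equivalent driving distance quoted above
moon_km = 356_500          # minimum Earth-Moon distance quoted above
green_bill_eur = 43_000    # approximate green energy bill quoted above

# How many one-way Earth-Moon legs does the driving distance cover?
one_way_trips = car_km / moon_km
# How far short of a full round trip does the car fall?
shortfall_km = 2 * moon_km - car_km
# Implied figures derived purely from the numbers above
km_per_kwh = car_km / energy_kwh
eur_per_kwh = green_bill_eur / energy_kwh

print(f"one-way trips: {one_way_trips:.2f}")
print(f"shortfall on the way back: {shortfall_km:,.0f} km")
print(f"car-km per kWh: {km_per_kwh:.2f}")
print(f"implied price: {eur_per_kwh:.2f} EUR/kWh")
```

The distance covers about 1.97 one-way legs, so the car gives up roughly 9,192 km before reaching home, and the bill works out to an implied price of about €0.23 per kWh.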

Training GPT-3 on a RasPi may have been a mistake

Hopefully, this post provided some interesting take-aways from the immense breadth of topics covered at NeurIPS. Thank you to all the authors for sharing their research, to the organizers for providing the setting and to my fellow reading group members for the lively discussions in these isolated times.



  1. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33.
  2. Azevedo, F.A., Carvalho, L.R., Grinberg, L.T., Farfel, J.M., Ferretti, R.E., Leite, R.E., Filho, W.J., Lent, R. and Herculano‐Houzel, S., 2009. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled‐up primate brain. Journal of Comparative Neurology, 513(5), pp.532-541.
  3. Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q.N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G. and Fernández, R., 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
  4. Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P. and Allen, J., 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.
  5. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. and Choi, Y., 2019. HellaSwag: Can a Machine Really Finish Your Sentence?. arXiv preprint arXiv:1905.07830.
  6. Joshi, M., Choi, E., Weld, D.S. and Zettlemoyer, L., 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
  7. Alayrac, J.B., Recasens, A., Schneider, R., Arandjelović, R., Ramapuram, J., De Fauw, J., Smaira, L., Dieleman, S. and Zisserman, A., 2020. Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems, 33.
  8. Zhou, L., Xu, C. and Corso, J.J., 2017. Towards automatic learning of procedures from web instructional videos. arXiv preprint arXiv:1703.09788.
  9. Stepputtis, S., Campbell, J., Phielipp, M., Lee, S., Baral, C. and Ben Amor, H., 2020. Language-Conditioned Imitation Learning for Robot Manipulation Tasks. Advances in Neural Information Processing Systems, 33.
  10. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  11. Pennington, J., Socher, R. and Manning, C.D., 2014, October. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
  12. Tamkin, A., Jurafsky, D. and Goodman, N., 2020. Language Through a Prism: A Spectral Approach for Multiscale Language Representations. Advances in Neural Information Processing Systems, 33.
  13. Ahmed, N., Natarajan, T. and Rao, K.R., 1974. Discrete cosine transform. IEEE Transactions on Computers, 100(1), pp.90-93.
  14. Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  15. Anthony, L.F.W., Kanding, B. and Selvan, R., 2020. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051.