Publications By LightOn

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: Teven Le Scao and 300+ authors

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Authors: Teven Le Scao and 300+ authors

RITA: a Study on Scaling Up Generative Protein Sequence Models

Authors: Daniel Hesslow, Niccoló Zanichelli, Pascal Notin, Iacopo Poli, Debora Marks

Technical Reports and Preprints – Machine Learning, LLMs for Biology
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.

A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model

Authors: Imad Lakim, Ebtesam Almazrouei, Merouane Debbah, Julien Launay

ACL 2022 Workshop BigScience – LLMs – April 2022
As ever-larger language models grow more ubiquitous, it is crucial to consider their environmental impact. Characterised by extreme size and resource use, recent generations of models have been criticised for their voracious appetite for compute, and thus significant carbon footprint. Although reporting of carbon impact has grown more common in machine learning papers, this reporting is usually limited to compute resources used strictly for training. In this work, we propose a holistic assessment of the footprint of an extreme-scale language model, Noor. Noor is an ongoing project aiming to develop the largest multi-task Arabic language models, with up to 13B parameters, leveraging zero-shot generalisation to enable a wide range of downstream tasks via natural language instructions. We assess the total carbon bill of the entire project: starting with data collection and storage costs, including research and development budgets, pretraining costs, future serving estimates, and other exogenous costs necessary for this international cooperation. Notably, we find that inference costs and exogenous factors can have a significant impact on the total budget. Finally, we discuss pathways to reduce the carbon footprint of extreme-scale models.

What Language Model to Train if You Have One Million GPU Hours?

Authors: Daniel Hesslow, Teven Le Scao, Lucile Saulnier, Thomas Wang, M Saiful Bari, Stas Bekman, Stella Biderman, Hady Elsahar, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, Iz Beltagy

ACL 2022 Workshop BigScience – LLMs – April 2022
As the size of language models continues to grow, they become increasingly powerful and lead to better results, but they also become more expensive to design and train. Given a compute budget that’s enough to train a multilingual transformer language model at the 100B+ parameter scale, our goal is to choose the architecture and the training setup of such a model. Specifically, we perform an ablation study comparing different modelling architectures, which can significantly impact the performance of the resulting models. We focus on the 1.3B parameter scale, providing a compromise between the compute cost of the architecture search and the probability that our conclusions hold for the target 100B+ model. In addition, we study the impact of various popular pretraining corpora on the quality of the model. We also study the performance of training a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of transformer models to choose the target model size, its shape and its training setup.

PAGnol: An Extra-Large French Generative Model

Authors: Julien Launay, Elena Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli, Djamé Seddah

LREC 2022 – LLMs – Initially published: October 2021
Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing them to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstractive summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.
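
As an illustration of what fitting a scaling law for compute can look like, the sketch below fits a power law of the form loss ≈ a · compute^(−alpha) by least squares in log-log space. It is not the PAGnol fitting procedure: the functional form, the function name `fit_power_law`, and the data points are all assumptions, and the numbers are synthetic values generated only to exercise the code.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-alpha) by ordinary least squares in log-log space.

    Returns (a, alpha). The power-law form itself is an assumption of this sketch.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the best-fit line log(loss) = intercept + slope * log(compute).
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic (made-up) measurements generated from a known power law.
synthetic_compute = [1.0, 10.0, 100.0, 1000.0]
synthetic_loss = [5.0 * c ** -0.05 for c in synthetic_compute]

a, alpha = fit_power_law(synthetic_compute, synthetic_loss)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Comparing two such fits (for instance one per language) then amounts to comparing the fitted exponents and prefactors.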

Scaling Laws Beyond Backpropagation

Authors: Matthew J. Filipovich, Alessandro Cappelli, Daniel Hesslow, Julien Launay

NeurIPS 2022 – Workshop: I Can’t Believe It’s Not Better, December 2022
Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment (DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding is at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.

Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances

Authors: Ruben Ohana, Kimia Nadjahi, Alain Rakotomamonjy, Liva Ralaivola

Technical Reports and Preprints – Machine Learning
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
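
For readers unfamiliar with SW, the following minimal sketch shows the standard Monte Carlo estimator with slices drawn from the uniform measure on the sphere: project both samples onto random directions, compute the one-dimensional Wasserstein distance by sorting, and average. It does not implement the adaptive, learned slice distributions studied in the paper; swapping `random_direction` for another sampler is where such a distribution would enter. All function names are illustrative, and equal sample sizes are assumed.

```python
import math
import random


def wasserstein_1d_pow(xs, ys, p=2):
    """p-th power of the 1D Wasserstein-p distance between two equal-size samples."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)


def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere (normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(X, Y, n_slices=200, p=2, seed=0):
    """Monte Carlo estimate of SW_p between point clouds X and Y (lists of vectors)."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_slices):
        theta = random_direction(dim, rng)
        # Project every point onto the slice direction theta.
        proj_x = [sum(a * b for a, b in zip(x, theta)) for x in X]
        proj_y = [sum(a * b for a, b in zip(y, theta)) for y in Y]
        total += wasserstein_1d_pow(proj_x, proj_y, p)
    return (total / n_slices) ** (1.0 / p)


# Toy data: a 2D Gaussian cloud and a copy shifted by 3 along the first axis.
data_rng = random.Random(1)
cloud = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(100)]
shifted = [[x + 3.0, y] for x, y in cloud]

print(sliced_wasserstein(cloud, cloud))    # identical clouds give 0.0
print(sliced_wasserstein(cloud, shifted))  # close to 3 / sqrt(2) in 2D
```

For a pure translation by a vector of norm 3 in two dimensions, each slice contributes (3·cos φ)², whose average over uniform directions is 9/2, which is why the second estimate should land near 3/√2.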

A high-fidelity and large-scale reconfigurable photonic processor for NISQ applications

Authors: A. Cavaillès, P. Boucher, L. Daudet, I. Carron, S. Gigan, K. Müller

Technical Reports and Preprints – Machine Learning
Reconfigurable linear optical networks are a key component for the development of optical quantum information processing platforms in the NISQ era and beyond. We report the implementation of such a device based on an innovative design that uses the mode mixing of a multimode fiber in combination with the programmable wavefront shaping of an SLM. The capabilities of the platform are explored in the classical regime. For up to a record number of 8 inputs and 38 outputs we achieve fidelities in excess of 93%, week-long stability and losses below 6.5 dB. The device was built inside a standard server rack to allow for real-world use.

Binarization for Optical Processing Units via REINFORCE

Authors: B. Kozyrskiy, I. Poli, R. Ohana, L. Daudet, I. Carron, M. Filippone

Conference proceedings – Machine Learning – November 2021
Optical Processing Units (OPUs) are computing devices that perform random projections of input vectors by exploiting the physical phenomenon of scattering a light source through an opaque medium. OPUs have successfully been proposed to carry out approximate kernel ridge regression at scale and with low power consumption by means of optical random features. OPUs require input vectors to be binary, and this work proposes a novel way to perform supervised data binarization. The main difficulty in developing a solution is that the OPU projection matrices are unknown, which poses a challenge in deriving a binarization approach in an end-to-end fashion. Our approach is based on the REINFORCE gradient estimator, which allows us to estimate the gradient of the loss function with respect to binarization parameters by treating the OPU as a black box. Through experiments on several UCI classification and regression problems, we show that our method outperforms alternative unsupervised and supervised binarization techniques.
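
The idea of estimating gradients through a black box can be sketched as follows. This is a generic REINFORCE (score-function) estimator, not the paper's exact method: the parameterization (one threshold per feature, bits sampled from a Bernoulli with probability sigmoid(x − threshold)), the toy `black_box_loss`, and all names are assumptions made for illustration. The key point is that only loss values returned by the black box are used, never its internals.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def black_box_loss(bits):
    """Hypothetical stand-in for 'OPU + downstream model + loss'.

    Only the returned value is observed; here the loss is lowest for the pattern (1, 0).
    """
    target = (1, 0)
    return float(sum((b - t) ** 2 for b, t in zip(bits, target)))


def reinforce_gradient(x, thresholds, loss_fn, n_samples, rng):
    """Estimate d E[loss] / d thresholds using sampled binarizations only."""
    probs = [sigmoid(xi - ti) for xi, ti in zip(x, thresholds)]
    samples = []
    for _ in range(n_samples):
        bits = [1 if rng.random() < p else 0 for p in probs]
        samples.append((bits, loss_fn(bits)))
    # Mean loss used as a baseline to reduce the variance of the estimator.
    baseline = sum(loss for _, loss in samples) / n_samples
    grad = [0.0] * len(thresholds)
    for bits, loss in samples:
        for i, (b, p) in enumerate(zip(bits, probs)):
            # d log P(b_i) / d threshold_i = p_i - b_i for this parameterization.
            grad[i] += (loss - baseline) * (p - b)
    # Dividing by n - 1 compensates for computing the baseline from the same samples.
    return [g / (n_samples - 1) for g in grad]


x = [0.3, -0.2]
thresholds = [0.0, 0.0]
estimate = reinforce_gradient(x, thresholds, black_box_loss, 20000, random.Random(0))

# For this toy loss the exact gradient is known in closed form, which allows a sanity check.
p0, p1 = sigmoid(x[0] - thresholds[0]), sigmoid(x[1] - thresholds[1])
exact = [p0 * (1.0 - p0), -p1 * (1.0 - p1)]
print(estimate, exact)
```

A gradient-descent step on the thresholds with this estimate lowers the first threshold and raises the second, making the low-loss pattern (1, 0) more likely.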