BLOOM: A Text Generative Large Language Model for 46+ languages!

BLOOM was recently released to the public. It is a transformer-based language model used as a generative model for text.

Anitya Gangurde
Jul 12, 2022
Photo by Sharon Pittaway on Unsplash

What is BLOOM?

BLOOM, short for BigScience Large Open-science Open-access Multilingual Language Model, is an autoregressive Large Language Model (LLM). What does that mean? An autoregressive model relies on past values to predict the present one. In classical time-series modelling, the current value is a weighted sum of past outcomes; in a language model, each next token is predicted from the tokens that came before it.
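To make the autoregressive idea concrete, here is a minimal toy sketch. It is not how BLOOM works internally: it assumes a simple bigram counter that looks only at the last word, whereas BLOOM is a transformer that conditions on the whole preceding context. The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Greedy autoregressive loop: each new token is predicted from the
    text produced so far (here, only its last token) and appended."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing was ever seen after this token
        out.append(followers.most_common(1)[0][0])
    return out


# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat and the cat sat on a rug".split()
model = train_bigram(corpus)
continuation = generate(model, ["the"], 3)
print(" ".join(continuation))
```

The loop is the part that carries over to real LLMs: predict one token, append it, and feed the longer sequence back in.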

BLOOM can generate fluent text that is often hard to tell apart from human writing, in 46 natural languages and 13 programming languages. Given an input text, BLOOM looks at the previous words and writes a relevant continuation.

BLOOM is the culmination of a year of work involving over 1000 researchers from 70+ countries and 250+ institutions, leading to a final run of 117 days (March 11 to July 6) training the BLOOM model on the Jean Zay supercomputer in the south of Paris, France, thanks to a compute grant worth an estimated €3M from French research agencies CNRS and GENCI.

Technical details

BLOOM was developed from a modified version of Megatron-LM GPT-2, a codebase that considerably cut the training time of large models when it was published. It has around 176 billion parameters, which makes it one of the largest openly available language models. The architecture has 70 layers and 112 attention heads, with 14336-dimensional hidden layers!
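As a rough sanity check, the layer count and hidden size quoted above are enough to recover the parameter count to within a fraction of a percent. The sketch below rests on assumptions that are not stated in this post: a standard transformer block with a 4x feed-forward expansion, a vocabulary of 250,880 tokens (taken from the published model configuration), and biases and layer norms ignored.

```python
N_LAYERS = 70
N_HEADS = 112
HIDDEN = 14336
VOCAB = 250_880  # assumption: from the published model config, not this post

# Each attention head works on an equal slice of the hidden vector.
head_dim = HIDDEN // N_HEADS

# Attention: query, key, value and output projections, each HIDDEN x HIDDEN.
attention_params = 4 * HIDDEN * HIDDEN
# Feed-forward: HIDDEN -> 4*HIDDEN -> HIDDEN (assumed 4x expansion).
mlp_params = 8 * HIDDEN * HIDDEN

per_layer = attention_params + mlp_params
embedding_params = VOCAB * HIDDEN
total = N_LAYERS * per_layer + embedding_params

print(f"head dimension: {head_dim}")
print(f"estimated parameters: {total / 1e9:.1f}B")
```

Almost all of the parameters sit in the 70 transformer blocks; the embedding table accounts for only a few billion.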

For training, text from 46 human languages and 13 programming languages was used. The pre-processed text came to 1.6TB in total.

Training took almost 4 months (from 11th March 2022 to 5 July 2022), and the model was released today (12 July 2022) after some finishing touches. The estimated cost of training was equivalent to around $2–5M in cloud computing!

The environmental impact was also taken into account: the training supercomputer runs mostly on nuclear energy, and the heat generated by the GPUs was reused for heating campus housing.

Implementations

The model can be downloaded from the Hugging Face Hub using the transformers library, with the accelerate library also installed.

Credits: bigscience/bloom · Hugging Face
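The download step can be sketched roughly as follows. This is not the original snippet from the model page. It assumes the smaller `bigscience/bloom-560m` checkpoint so that it fits on an ordinary machine, and `continue_text` is a hypothetical helper name. For the full `bigscience/bloom` checkpoint, one would typically also pass `device_map="auto"` to `from_pretrained`, which is where accelerate comes in.

```python
# Assumption: a small BLOOM checkpoint is used instead of the full 176B model.
MODEL_ID = "bigscience/bloom-560m"


def continue_text(prompt, max_new_tokens=30, model_id=MODEL_ID):
    """Download the tokenizer and model (cached after the first call)
    and return the prompt followed by a generated continuation."""
    # Imported lazily so that the module loads even without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `continue_text("BLOOM is a multilingual language model that")` downloads the checkpoint on first use and returns the prompt with up to 30 new tokens appended.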

For people with no supercomputers at home, a Hugging Face Inference API has also been made available and can be found here: https://huggingface.co/bigscience/bloom
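The "no supercomputers at home" remark can be made concrete with a back-of-the-envelope estimate of how much memory the weights alone need. The sketch assumes roughly 176 billion parameters and counts only the weights, leaving out activations and any other runtime overhead; the helper name is made up for illustration.

```python
N_PARAMS = 176e9  # approximate parameter count

# Storage cost of a single parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weights_gb(n_params, dtype):
    """Memory needed just to hold the weights, in gigabytes (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weights_gb(N_PARAMS, dtype):.0f} GB")
```

Even at one byte per parameter, the weights are far larger than the memory of a single consumer GPU, which is why a hosted API is the practical route for most readers.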

Direct use is mostly limited to text generation and to studying the language generated by such models for further research. However, the model can be adapted for tasks such as information extraction, question answering and text summarization.

The researchers have, however, warned about the reliability of the generated text and suggest that factual content (such as maths or history) should not be trusted directly. The model is also not recommended for use in biomedical, political and legal domains.

Also, intentionally generating text or using the model in a way that harms people or leads to a violation of human rights counts as misuse of the model. Activities such as spam creation, defamation, harassment, deception, disinformation and even generating content without attribution to the model are also listed as misuse.

Limitations

Like other such models, BLOOM was trained on a real-world data set, so it may generate biased text. This can lead to over-representation of some viewpoints and under-representation of others, and can also reinforce stereotypes.

It can also generate hateful, discriminatory or otherwise inappropriate (e.g. sexual) content. In addition, the model can present content as factually correct when it is not, and it can produce repetitive phrases.

The authors also warn that the model might lead users to attribute human traits to it, such as sentience or consciousness, so users should be aware of this.

BLOOM authors also suggest that the users or the consumers of the text generated by the model should know that the content they’re working with was created by an LLM.

Things I tried with BLOOM

I used the Hugging Face interface to generate a continuation for a Wikipedia article. I chose the page on Grand Theft Auto 5 and copy-pasted a random sentence into the input box. Here are some of the screenshots:

Text generation for the GTA 5 article on Wikipedia. Generated text in blue.

I also tried one text generation instance for the Hindi language. Here are the results:

Hindi Fairy tale text continuation. Generated text in blue.

It is also possible to ask BLOOM to translate code between programming languages. Below is one example:

A Python “print” statement being translated to R, PHP and Java.
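Since BLOOM only continues text, a translation request like the one in the screenshot has to be phrased as a prompt whose natural continuation is the translated code. The sketch below shows one way such a prompt could be assembled. The template and the helper name are hypothetical and are not the exact prompt used for the screenshot.

```python
def build_translation_prompt(source_lang, source_code, target_lang):
    """Build a prompt that ends where the translated code should begin,
    so that the model's continuation is the translation."""
    return (
        f"{source_lang}:\n{source_code}\n\n"
        f"The same code in {target_lang}:\n"
    )


# Hypothetical example input: a Python print statement to be rewritten in R.
prompt = build_translation_prompt("Python", 'print("Hello, world!")', "R")
print(prompt)
```

The resulting string can be pasted into the hosted input box as-is; changing `target_lang` to "PHP" or "Java" gives the other two variants.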

Conclusion

Here, we looked at a quick summary of the recently released BLOOM LLM and saw how it can be used. We also looked at the limitations that such models still have.

For almost all of the languages included in training, such as Spanish, French and Arabic, BLOOM is the first language model with over 100B parameters.

Such open-source models, built through a worldwide effort, truly show the world we are living in now and the possibilities in front of us.

Thanks for reading! I’m Anitya Gangurde; follow me for more such blogs in the future.
