Mastering Falcon-40B: A Guide to Fine-Tuning on E2E Networks

June 22, 2023

What is Falcon 40B?

Falcon-40B is an open-source large language model (LLM) released by the Technology Innovation Institute (TII) in the UAE. At the time of writing, it holds the top position on the Hugging Face Open LLM Leaderboard and is the top-ranked royalty-free LLM. The "40B" in its name refers to its 40 billion parameters; the model was trained on a dataset of 1 trillion tokens. A smaller version, Falcon-7B, is also available.

The Leaderboard ( Source )

Challenges Falcon-40B Solves

Falcon 40-B was specifically developed by the TII to tackle several challenges, including:

1. The need for an open-source large language model.

2. The need for a versatile model that can be trained and tested on different languages. According to its model card, Falcon-40B was trained mostly on English, German, Spanish, and French data.

Falcon 40-B is openly accessible to all for research and academic purposes, making it a valuable contribution to the academic community.

Capabilities

Falcon-40B can perform many tasks, including but not limited to the following:

Natural Language Generation

Falcon-40B can create a wide range of contextually accurate content. It can generate high-quality natural language outputs such as blog articles and persuasive cold emails, and it can also translate text.

Natural Language Understanding

With its robust language processing capabilities, Falcon-40B can take context into account: it can evaluate entity relationships and draw relevant insights from the input.

Code Generation

The Falcon 40-B can help developers by automatically creating code snippets for specific tasks, saving time and effort in the development process.

Data Analysis

The extensive language modeling skills of Falcon-40B allow it to extract significant insights from complex information. Users can gain a better understanding through Falcon-40B's ability to identify patterns, trends, and correlations.

Machine Learning

The Falcon 40-B is well-suited for various machine learning tasks, such as training and fine-tuning models on multiple datasets, permitting academics and practitioners to explore and develop the area of machine learning.

Latest Developments

As a royalty-free solution, Falcon 40-B has gained widespread adoption across government and private sectors, with notable emphasis on its utilization by the UAE Government. Academicians have actively experimented with the model, conducting rigorous tests on leading cloud platforms such as E2E, to explore its capabilities and potential applications.

Model and Architecture

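The architecture of Falcon-40B is broadly adapted from GPT-3. Both are decoder-only transformers; according to the model card, the main differences are:

FlashAttention (together with multi-query attention)
Rotary positional embeddings, and
Decoder blocks that run attention and the MLP in parallel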

The FlashAttention approach

FlashAttention is faster and more memory-efficient than standard attention, and it is exact, i.e., it involves no approximation. Below is the architecture of FlashAttention:

Source

FlashAttention exploits the GPU's memory hierarchy. On-chip SRAM (static random-access memory) is much faster, but also much smaller, than the GPU's HBM (high-bandwidth memory). FlashAttention therefore computes attention block by block, a technique called tiling, so that each block fits in SRAM and the full attention matrix is never written to HBM. This saves a lot of memory and speeds up training, and because the result is exact, model quality is unaffected.
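The block-by-block idea can be illustrated with a small NumPy sketch. This is not the FlashAttention kernel itself, which is a fused GPU implementation that also tiles the queries; it only shows, under simplified assumptions, how a running softmax lets attention be computed one key/value block at a time while still matching the standard result. The function names and `block_size` are illustrative.

```python
import numpy as np

def reference_attention(q, k, v):
    """Standard attention: materializes the full (n_queries x n_keys) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Same result, but K and V are visited one block at a time."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running maximum score per query
    row_sum = np.zeros((n, 1))          # running softmax denominator per query
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = q @ k_blk.T / np.sqrt(d)  # only (n x block_size) at a time
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)  # rescales the partial results so far
        p = np.exp(scores - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_blk
        row_max = new_max
    return out / row_sum
```

Comparing `tiled_attention(q, k, v)` with `reference_attention(q, k, v)` on random inputs is a quick way to confirm that tiling changes only the order of computation, not the output.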

Positional Embeddings

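Positional embeddings tell the model where each token sits in the sequence, since attention by itself is insensitive to token order. Falcon-40B uses rotary positional embeddings, which encode the relative distance between tokens and help the model learn long-range dependencies.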

Source
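As a rough illustration of rotary positional embeddings, the NumPy sketch below rotates pairs of feature dimensions by an angle that grows with the token position. It uses the common "split in half" pairing and a base of 10000; both are assumptions made for illustration and are not taken from Falcon's own code.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # one rotation frequency per pair
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because each pair is only rotated, vector lengths are preserved, and the dot product between a rotated query and a rotated key depends on how far apart the two positions are rather than on their absolute values.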

Decoder Blocks

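Falcon-40B is a decoder-only transformer: a stack of decoder blocks processes the input tokens and predicts the next token. In each block, a token can attend only to itself and to earlier tokens.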

Source
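The defining property of a decoder block is that its self-attention is causal. The following NumPy sketch shows a single attention head with a causal mask; it leaves out the learned projections, multiple heads, and the MLP, so it is a minimal illustration rather than a full decoder block.

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head attention where position i sees only positions 0..i."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With this mask, the first position attends only to itself, and changing a later token never alters the output at an earlier position, which is what allows the model to generate text one token at a time.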

Block Structure of Falcon 40-B

Source

The diagram above shows the block structure of Falcon-40B. The inputs are the Query (Q), Key (K), and Value (V) matrices. Q and K are multiplied to produce attention scores, a mask is applied, and the softmax of the result is multiplied by V to produce the output. The attention is computed block by block, which speeds up training and saves memory. The model can be run on a single A100 with 80 GB of memory when the weights are quantized (for example, to 8-bit); in bfloat16 the weights alone take roughly 80 GB.
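The memory requirement can be estimated with simple arithmetic. The sketch below assumes a nominal 40 billion parameters and counts only the weights; activations and the attention cache need additional memory, so treat the numbers as lower bounds. The helper name is illustrative.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(40e9, dtype):.0f} GB")
```

By this estimate, bfloat16 weights already fill an 80 GB GPU, which is why quantization or several GPUs are usually needed in practice.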

Why is Falcon 40-B so powerful?

Falcon-40B is trained on a very large text corpus. The training data, sourced from the RefinedWeb dataset and Reddit conversations, is of exceptionally high quality. RefinedWeb is built upon the vast archives of CommonCrawl, which has been collecting petabytes of web data since 2008. The scale and quality of this data contribute to Falcon-40B's strength and effectiveness.

Comparison of Falcon 40B with other LLM Models

Launching Falcon-40B on E2E Cloud

To successfully launch Falcon-40B on the E2E Cloud platform, follow the step-by-step guide below.

Generate an SSH key pair on your local system using the following command:

$ ssh-keygen

A public and a private key will be generated on your local system. Never share your private key with anyone. Add the public SSH key on E2E Cloud under Settings > SSH Keys > Add New Key.

After you have added the key, log in to E2E Cloud, create a node, and connect to it from your local machine via SSH:

$ ssh username@ip_address

Enter the password when prompted.

It's always a good practice to update and upgrade the machine.

$ sudo apt update && sudo apt upgrade -y

Install Git LFS, which is needed to download the large model files:

$ sudo apt install -y git-lfs
$ git lfs install

And then clone the Falcon 40-B repository:

$ git clone https://huggingface.co/tiiuae/falcon-40b

Also, download the dataset:

$ git clone https://huggingface.co/datasets/tiiuae/falcon-refinedweb
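Both repositories are large, so it can be worth checking the free disk space before cloning. The helper below is a small sketch using only the Python standard library; the function name and the 100 GB threshold are placeholders, not measured repository sizes.

```python
import shutil

def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# 100 is a placeholder threshold; adjust it to the size of what you download.
print(has_free_space(".", required_gb=100))
```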

Install the necessary packages:


$ sudo apt -y install -qq aria2 
$ pip install -q -U torch torchvision torchaudio torchtext torchdata --extra-index-url https://download.pytorch.org/
$ pip install -q -U bitsandbytes sentencepiece fsspec gradio einops xformers 
$ pip install -q -U git+https://github.com/huggingface/transformers.git 
$ pip install -q -U git+https://github.com/huggingface/accelerate.git
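After installation, a quick check that the main packages can be found may save a failed run later. This sketch uses `importlib.util.find_spec` from the standard library; the list of package names is an assumption based on the install commands above.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level import names of the packages installed above (assumed).
required = ["torch", "transformers", "accelerate", "bitsandbytes", "einops"]
print("Missing packages:", missing_packages(required) or "none")
```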

To get started with the model, create a Python script, say “script.py”:

$ touch script.py

Edit the “script.py” file and add the following to load the model and generate text:


from transformers import AutoTokenizer
import transformers
import torch

# Hugging Face Hub ID of the model.
model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model)
# device_map="auto" places the weights on the available GPUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Adjacent string literals are joined into a single prompt.
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the "
    "face of this Earth. Girafatron believes all other animals are irrelevant "
    "when compared to the glorious majesty of the giraffe.\n"
    "Daniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Run:

$ python script.py

This will load the model weights and will generate text.
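The script sets `do_sample=True` and `top_k=10`, which means each new token is drawn at random from the ten most likely candidates instead of always taking the single most likely one. The NumPy sketch below illustrates that idea on a made-up score vector; it is a simplified stand-in, not the implementation used inside the transformers library.

```python
import numpy as np

def sample_top_k(logits, k, rng):
    """Sample one token ID from the k highest-scoring entries of `logits`."""
    top_ids = np.argsort(logits)[-k:]              # indices of the k largest logits
    top_logits = logits[top_ids]
    probs = np.exp(top_logits - top_logits.max())  # softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

rng = np.random.default_rng(0)
logits = rng.normal(size=50)  # stand-in for a model's next-token scores
token_id = sample_top_k(logits, k=10, rng=rng)
```

A smaller `k` makes the output more predictable, and `k=1` reduces to always picking the highest-scoring token.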

Conclusion

Falcon 40-B has opened up new opportunities for academics and researchers to examine and use online data for their own developments. The distribution of the free dataset by TII is an admirable example of open-source community action. As researchers, we are excited about the possibilities that this model may offer in the near future. In addition, the user-friendly E2E Networks cloud platform offers an affordable option for training large-scale models. We're excited about using our own data to train LLMs at scale on the E2E cloud infrastructure.


Surya Remanan
