A Step-by-Step Guide to Fine-Tuning the Mistral 7B LLM

April 2, 2025

Fine-tuning adapts a pre-trained language model to a specific application. In this guide, we'll walk through the process of fine-tuning the Mistral 7B LLM, first covering the concepts that drive the adaptation and then working through a hands-on tutorial.

Understanding Mistral 7B LLM

The Mistral 7B LLM is a decoder-only transformer model from Mistral AI, known for strong natural language processing performance relative to its size. Its 7 billion parameters give it the capacity to understand and generate text across a wide range of language-based tasks, while remaining small enough to fine-tune on a single high-memory GPU using parameter-efficient methods such as QLoRA.

At its core, Mistral 7B LLM is a deep learning model with the following key attributes:

  1. Pre-Trained Foundation: Before embarking on fine-tuning, the model undergoes a pre-training phase. During this stage, it's exposed to an enormous corpus of text data. This immersion enables the model to capture the nuances of language, including syntactic and semantic structures. Consequently, it acquires a broad understanding of natural language, transforming it into a robust and versatile language model.
  2. Self-Attention Mechanism: Mistral 7B LLM employs the self-attention mechanism, a key feature of the Transformer architecture. Self-attention lets the model weigh the relationships between tokens in a sequence, which helps it both track context and generate coherent, contextually relevant text.
  3. Transfer Learning Paradigm: Mistral 7B epitomizes the concept of transfer learning in the realm of deep learning. It leverages knowledge acquired during pre-training to excel at a myriad of downstream tasks. Fine-tuning is the bridge that connects the model's general language understanding to specific applications.

A Theoretical Exploration of Fine-Tuning the Mistral 7B LLM

Step 1: Set Up Your Environment

Before diving into fine-tuning, it is crucial to prepare the requisite environment. This involves ensuring access to the Mistral 7B model and creating a computational environment suitable for fine-tuning.

  1. Computational Power: The size of Mistral 7B LLM necessitates substantial computational resources. For efficient training, GPUs or TPUs are recommended (a quick availability check is sketched after this list).
  2. Deep Learning Frameworks: Popular deep learning frameworks such as PyTorch and TensorFlow serve as the foundation for implementing the fine-tuning process.
  3. Model Access: Access to the Mistral 7B model weights or a pre-trained version of the model is essential to get started.
  4. Domain-Specific Data: Fine-tuning mandates the availability of a significant dataset relevant to your target domain. The quality and quantity of this data significantly impact the success of the fine-tuning process.
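
As a quick way to verify the first two prerequisites, the short check below (an illustrative sketch, not part of the original setup, assuming PyTorch is installed) confirms that a GPU is visible and reports its memory:

import torch

# Illustrative environment check: confirm a GPU is visible and report its memory
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU available: {name} ({total_gb:.1f} GB)")
else:
    print("No GPU detected -- fine-tuning a 7B model without one is impractical.")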

Step 2: Preparing Data for Fine-Tuning

Data preparation forms a critical preliminary step for fine-tuning:

  1. Data Collection: Gather text data that is specific to your application or domain. This data forms the foundation for fine-tuning the model.
  2. Data Cleaning: Pre-process the data by removing noise, correcting errors, and ensuring a uniform format. Clean data is fundamental to a successful fine-tuning process.
  3. Data Splitting: Divide the dataset into training, validation, and test sets, adhering to the customary split of 80% for training, 10% for validation, and 10% for testing.
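
As a minimal sketch of such a split using the Hugging Face datasets library (the file name your_data.csv is a placeholder for your own domain-specific data):

from datasets import load_dataset

# Illustrative 80/10/10 split; "your_data.csv" is a placeholder file name
raw = load_dataset("csv", data_files="your_data.csv", split="train")
split = raw.train_test_split(test_size=0.2, seed=42)                # 80% train, 20% held out
held_out = split["test"].train_test_split(test_size=0.5, seed=42)   # split the held-out 20% into validation and test
train_ds, val_ds, test_ds = split["train"], held_out["train"], held_out["test"]
print(len(train_ds), len(val_ds), len(test_ds))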

Step 3: Fine-Tuning the Model - The Theory

Fine-tuning is a multi-faceted process, and the theoretical underpinnings include:

  1. Loading a Pre-trained Model: The Mistral 7B model is loaded into the chosen deep learning framework. This model comes equipped with an extensive understanding of language structures, thanks to its pre-training phase.
  2. Tokenization: Tokenization is a critical process that converts the text data into a format suitable for the model. This ensures compatibility with the pre-trained architecture, allowing for smooth integration of your domain-specific data.
  3. Defining the Fine-Tuning Task: In the theoretical realm, this step involves specifying the task you want to address, whether it's text classification, text generation, or any other language-related task. This step ensures the model understands the target objective.
  4. Data Loaders: Create data loaders for training, validation, and testing. These loaders facilitate efficient model training by feeding data in batches, enabling the model to learn from the dataset effectively.
  5. Fine-Tuning Configuration: Theoretical considerations here involve setting hyperparameters such as learning rate, batch size, and the number of training epochs. These parameters govern how the model adapts to your specific task and can be optimized to enhance performance.
  6. Fine-Tuning Loop: At the heart of fine-tuning is the theoretical concept of minimizing a loss function. This function measures the difference between the model's predictions and the actual results. By iteratively adjusting model parameters, the model progressively aligns itself with the target task.
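
The Hugging Face Trainer used later in this tutorial wraps this loop for you, but stripped to its essentials it looks roughly like the sketch below (assuming model is the pre-trained model from item 1, train_loader is the training data loader from item 4, and num_epochs is the epoch count chosen in item 5; the values shown are illustrative):

import torch

# Minimal sketch of the fine-tuning loop: forward pass, loss, backward pass, parameter update
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["labels"])   # causal LMs from transformers return the loss when labels are provided
        loss = outputs.loss
        loss.backward()        # compute gradients of the loss w.r.t. the trainable parameters
        optimizer.step()       # adjust parameters in the direction that reduces the loss
        optimizer.zero_grad()  # reset gradients for the next batch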

Step 4: Evaluation and Validation - Theoretical Insights

After fine-tuning, the model's performance must be rigorously evaluated:

  • Test Set: Use the test set prepared in Step 2 to assess the model's real-world performance. Metrics such as accuracy, precision, recall, and F1-score provide insight into its effectiveness and generalization capabilities.

Iterate through the fine-tuning process, adjusting hyperparameters and data as needed, guided by the theoretical knowledge gained from evaluating model performance.
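
For classification-style tasks, the metrics mentioned above can be computed with scikit-learn. The snippet below is only an illustration with placeholder labels, not output from this guide:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder predictions and ground-truth labels, for illustration only
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision:.2f}  recall: {recall:.2f}  F1: {f1:.2f}")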

Step 5: Deployment - A Theoretical Perspective

Once the fine-tuned model meets your performance criteria, it's ready for deployment. The infrastructure that serves model predictions should be efficient, scalable, and responsive enough to meet the needs of your application or service.
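
One possible shape for such serving infrastructure is a small HTTP endpoint wrapped around the model. The sketch below uses FastAPI; the framework choice, route name, and model path are assumptions for illustration, not part of the original guide:

# Minimal serving sketch using FastAPI; "path/to/fine-tuned-model" is a placeholder
import torch
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("path/to/fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model", device_map="auto")

@app.post("/generate")
def generate(prompt: str, max_new_tokens: int = 128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return {"completion": tokenizer.decode(output_ids[0], skip_special_tokens=True)}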

Tutorial: Fine-Tuning Mistral 7B using QLoRA 

In this tutorial, we will walk you through the process of fine-tuning the Mistral 7B model using QLoRA (Quantized Low-Rank Adaptation). This approach loads the base model in 4-bit precision and trains lightweight LoRA adapters on top of it, which makes fine-tuning feasible on a single GPU with limited memory. We will also use the PEFT library from Hugging Face to facilitate the fine-tuning process.

Note: Before we begin, ensure that you have access to a GPU environment with sufficient memory (at least 24GB GPU memory) and the necessary dependencies installed.

If you require extra GPU resources for the tutorials ahead, you can explore the offerings on E2E CLOUD. They provide a diverse selection of GPUs, making them a suitable choice for more advanced LLM-based applications as well.

0. Install necessary dependencies


# You only need to run this once per machine
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets

1. Accelerator

First, we set up the accelerator using the FullyShardedDataParallelPlugin and Accelerator. This step may not be necessary for QLoRA but is included for future reference. You can comment it out if you prefer to proceed without an accelerator.


from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

2. Load Dataset

We load the GEM/viggo meaning-representation dataset for fine-tuning Mistral 7B. Each example pairs a video-game-related sentence with a structured meaning representation, which teaches the model to produce this specific form of output. You can replace this dataset with your own if needed.


from datasets import load_dataset


train_dataset = load_dataset('gem/viggo', split='train')
eval_dataset = load_dataset('gem/viggo', split='validation')
test_dataset = load_dataset('gem/viggo', split='test')

print(train_dataset)
print(eval_dataset)
print(test_dataset)

3. Load Base Model

Now, we load the Mistral 7B base model using 4-bit quantization.


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


base_model_id = "mistralai/Mistral-7B-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)

4. Tokenization

Set up the tokenizer and create functions for tokenization. Because this is self-supervised (causal language modeling) fine-tuning, the labels are simply a copy of the input_ids.


tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    model_max_length=512,
    padding_side="left",
    add_eos_token=True)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no dedicated pad token, so reuse the EOS token for padding

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']


### Target sentence:
{data_point["target"]}


### Meaning representation:
{data_point["meaning_representation"]}
"""
    return tokenize(full_prompt)
    


tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

print(tokenized_train_dataset[4]['input_ids'])

print(len(tokenized_train_dataset[4]['input_ids']))

print("Target Sentence: " + test_dataset[1]['target'])
print("Meaning Representation: " + test_dataset[1]['meaning_representation'] + "\n")

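# Before fine-tuning, see how the base model handles the task on a sample prompt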
eval_prompt = """Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values.
This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute'].
The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']


### Target sentence:
Earlier, you stated that you didn't have strong feelings about PlayStation's Little Big Adventure. Is your opinion true for all games which don't have multiplayer?


### Meaning representation:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=256, pad_token_id=2)[0], skip_special_tokens=True))
    

5. Set Up LoRA

Now, we prepare the model for fine-tuning by applying LoRA adapters to the linear layers of the model.


from peft import prepare_model_for_kbit_training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
    

from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
# Apply the accelerator. You can comment this out to remove the accelerator.
model = accelerator.prepare_model(model)

print(model)

6. Run Training

In this step, we start training the fine-tuned model. You can adjust the training parameters according to your needs.


if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

import transformers
from datetime import datetime


project = "viggo-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name


tokenizer.pad_token = tokenizer.eos_token


trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5, # Want about 10x smaller than the Mistral learning rate
        logging_steps=50,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save the model checkpoint every logging step
        save_steps=50,                # Save checkpoints every 50 steps
        evaluation_strategy="steps", # Evaluate the model every logging step
        eval_steps=50,               # Evaluate and save checkpoints every 50 steps
        do_eval=True,                # Run evaluation during training
        report_to="wandb",           # Comment this out if you don't want to use Weights & Biases
        run_name=f"{run_name}-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"          # Name of the W&B run (optional)
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

7. Try the Trained Model

After training, you can use the fine-tuned model for inference. Reload the base Mistral model from the Hugging Face Hub (as in the block above) and then load the QLoRA adapters from the checkpoint you want to use (here, checkpoint-1000).


from peft import PeftModel
ft_model = PeftModel.from_pretrained(base_model, "mistral-viggo-finetune/checkpoint-1000")

ft_model.eval()
with torch.no_grad():
    print(tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=100, pad_token_id=2)[0], skip_special_tokens=True))
    

Conclusion

Fine-tuning the Mistral 7B LLM combines theoretical concepts with practical steps. Understanding the framework behind the process shows how much customization is possible with such a capable language model. Remember that fine-tuning often demands experimentation and refinement to achieve peak performance. This guide equips you with the knowledge to make Mistral 7B your own, tailored to your specific linguistic needs.
