LLM Fine-Tuning—Overview with Code Example


Large language models are everywhere; new models are released publicly almost every day. However, if you have worked with LLMs on a specific use case, you have likely noticed that most models are generic and rarely perform exceptionally well in real-world scenarios. Instead, you can use LLM training to improve a model’s performance and generate better results for a specific use case. The most common LLM training approach is fine-tuning.

In simple terms, fine-tuning is taking a pre-trained foundation model and training it on a given dataset, which helps the model perform better on data similar to that dataset. Foundation models cover a wide spectrum of modalities, including image, video, and multimodal models; however, this article focuses solely on text models, namely large language models.

Summary of key LLM fine-tuning concepts

  • Pre-training vs. fine-tuning – Pre-training is building an LLM from scratch; fine-tuning further trains the pre-trained LLM on a curated knowledge base to increase its capabilities.
  • Components of LLM fine-tuning – Datasets, model architecture, hyperparameters, and evaluation metrics.
  • Self-supervised fine-tuning – The fine-tuning dataset is unlabeled but organized so the model can learn from patterns within the data.
  • Supervised fine-tuning – The fine-tuning dataset is labeled and organized.
  • Chat fine-tuning – Adapts a pre-trained model specifically for conversational AI or chatbot applications.
  • Instruction fine-tuning – Curates datasets where each example pairs an instruction with the corresponding input and desired output.
  • Full fine-tuning – Trains the entire model on new, task-specific data to make it more specialized and effective for particular applications.
  • Hands-on LLM fine-tuning – Step-by-step tutorial with code snippets.

Pre-training vs. fine-tuning

Pre-training is building an LLM from scratch: a deep learning model is trained on a large textual dataset, after which it is called a large language model because of capabilities such as predicting the next token. However, pre-training is more complex than simply taking a dataset and training a model on it. It demands significant compute, time, and resources because the model is trained on thousands of gigabytes of data.

Fine-tuning, on the other hand, takes the model a step further. Since pre-training usually involves a generic dataset, the pre-trained model performs well with generic tasks. However, when dealing with domain-specific tasks, its performance can drastically deteriorate. That’s where fine-tuning comes in. 

Fine-tuning further trains the pre-trained model on a curated knowledge base, increasing its ability to handle tasks similar to those exemplified in the curated dataset. The model learns additional patterns, features, and representations that build upon the generic knowledge it already possesses. It is also worth noting that fine-tuning is always carried out on a pre-trained model, never the other way around.

Pre-training vs. fine-tuning


Components of LLM fine-tuning

The three key components of LLM fine-tuning are the model to be fine-tuned, the model hyperparameters, and the dataset on which it is tuned or trained. 

Datasets

Certain aspects of the dataset are often overlooked in the fine-tuning process, which quietly degrades the process and, ultimately, the fine-tuned model. High-quality fine-tuning requires high-quality data: the results your fine-tuning yields are directly proportional to the quality of your data.

Model architectures

Your model’s architecture determines how it captures, structures, and accesses what it learns. Various architectures, such as BERT and GPT, offer rich representations. Each architecture has its advantages and disadvantages, but the ability to extend a model based on its strengths can be a key factor in the efficacy of the fine-tuning process. For example, convolutional neural networks (CNNs) are better suited to image classification, while generative adversarial networks (GANs) are better for image generation; you can enhance a GAN by appending more layers to its generator network to produce higher-quality images.

Hyperparameters 

Hyperparameters are set and configured before the fine-tuning process begins, unlike model parameters, which are learned during training. They must be predefined because they significantly influence the performance and efficiency of the model. Some common ones are:

  • Learning rate – One of the core tuning parameters that determines the step size while moving toward optimal weights
  • Batch size – The number of training samples used in one forward and backward pass
  • Number of epochs – The number of times the fine-tuning dataset passes through the entire model

These are the most common hyperparameters, but the main challenge lies in choosing the right hyperparameters to tweak and assigning the right values to them. We have covered the subject in depth in our article on model tuning techniques.

Basic types of LLM fine-tuning

There are two main types of fine-tuning:

Self-supervised fine-tuning

In self-supervised fine-tuning, the fine-tuning dataset is unlabeled but organized to allow the model to learn from patterns or properties within the data. There are two main types:

  • Causal language modeling predicts the next word given the preceding words (illustrated in the sketch below)
  • Masked language modeling predicts masked tokens in a given sequence of words.
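
As a quick sketch of both objectives using Hugging Face’s pipeline API (bert-base-uncased and gpt2 are arbitrary example checkpoints, not models this guide prescribes):

from transformers import pipeline

# Masked language modeling: predict the hidden token in a sequence
fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # example checkpoint
print(fill_mask("Fine-tuning adapts a [MASK] model to a new task."))

# Causal language modeling: predict the words that follow a prompt
generator = pipeline("text-generation", model="gpt2")  # example checkpoint
print(generator("Fine-tuning adapts a pre-trained model", max_new_tokens=10))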

Supervised fine-tuning

A labeled dataset is used to adapt a pre-trained model for a specific task. This data has a set of features and labels that have been curated and validated beforehand, which enables the model to cater to niche categories. Earlier, training supervision could only be done by humans; with the increased capabilities of AI models, one can now experiment with instructing a model to supervise another model’s training.
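
For instance, a supervised fine-tuning record for sentiment classification might look like the following (an illustrative record, not from a real dataset):

# One labeled record in a supervised fine-tuning dataset (illustrative)
example = {
    "text": "The onboarding flow was smooth and well documented.",
    "label": "positive",  # curated and validated beforehand
}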

LLM fine-tuning techniques

There are different approaches to LLM training and fine-tuning.

Chat fine-tuning 

One of LLMs’ most common use cases is in chatbots and other conversational settings. Chat fine-tuning is a process that adapts a pre-trained model specifically for conversations. It involves training the model on dialog data to improve its ability to generate human-like responses in a conversation. The model improves in understanding context, maintaining coherence, question-answering, and following instructions.
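
Chat fine-tuning data is typically stored as multi-turn conversations in which every message carries a role. The structure below is a common convention; the exact field names vary by framework:

# One training conversation in a role-based chat format (illustrative)
conversation = {
    "messages": [
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Security and choose 'Reset password'."},
    ]
}

During training, each conversation is rendered through the model’s chat template so the model learns to produce the assistant turns.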

Instruction tuning 

When dealing with LLMs, most of our interactions are instruction-based. We provide the LLM with context when required and instruct it to perform a certain task. In such scenarios, traditional fine-tuning, although quite effective, might reach some bottlenecks. A more recent and innovative approach called instruction tuning has a distinct advantage when dealing with instruct LLMs. It involves curating datasets where each example pairs an instruction with the corresponding input and desired output.
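
A widely used convention for such datasets is the Alpaca-style record, where each example carries an instruction, an optional input, and the desired output, rendered into a single training prompt. The field names and template below are illustrative:

# An instruction-tuning record (Alpaca-style; illustrative)
record = {
    "instruction": "Summarize the text in one sentence.",
    "input": "Large language models are pre-trained on generic corpora and often need adaptation.",
    "output": "LLMs learn general language skills from generic corpora and may need adaptation.",
}

# Render the record into the prompt the model is trained on
prompt = (
    f"### Instruction:\n{record['instruction']}\n\n"
    f"### Input:\n{record['input']}\n\n"
    f"### Response:\n{record['output']}"
)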

Embedding fine-tuning

Embedding fine-tuning is adjusting the word or token embeddings of a pre-trained model. The embeddings, vector representations of words or tokens, are updated to better suit your specific task or data. The dataset contains domain-specific terms, which improves the representation of those words in the domain’s context. A next-gen data platform like Nexla lets you bring data of any structure to any vector database without coding. It has hundreds of out-of-the-box bidirectional connectors to quickly get your data to your LLM, no matter where it resides.
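
As a minimal sketch of embedding fine-tuning, assuming the sentence-transformers library, an example base model (all-MiniLM-L6-v2), and made-up domain text pairs:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # example base model

# Made-up domain pairs: the label is the target similarity of the two texts
train_examples = [
    InputExample(texts=["data pipeline throughput", "pipeline data throughput"], label=0.9),
    InputExample(texts=["data pipeline throughput", "quarterly revenue report"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# Updates the embedding weights so domain terms cluster appropriately
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)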

Full fine-tuning

Full fine-tuning builds upon feature extraction and takes it a step further. Unlike feature extraction, where the pre-trained weights stay frozen and only the model’s output representations are used, full fine-tuning updates all the weights and parameters of the pre-trained model. It involves training the entire model on new, task-specific data to make it more specialized and effective for particular applications, and it ensures the model’s internal representations are finely tuned to the specific nuances of the target dataset.
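
A minimal, self-contained sketch of full fine-tuning with Hugging Face’s Trainer follows; the model choice (gpt2) and the two-sentence corpus are purely illustrative:

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "gpt2"  # illustrative small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny illustrative corpus; real full fine-tuning needs far more data
texts = [
    "Fine-tuning adapts a pre-trained model to a domain.",
    "Parquet is a column-oriented data file format.",
]

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: predict the input itself
    return tokens

train_dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

# No adapters and no frozen layers: every weight of the model is updated
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./full-ft", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_dataset,
)
trainer.train()

Contrast this with the parameter-efficient techniques in the hands-on example below, which update only a small fraction of these weights.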

Hands-on LLM fine-tuning example

Parameter-efficient fine-tuning (PEFT) is a set of techniques that focuses on updating only a small subset of the model’s parameters during fine-tuning. It drastically reduces computational costs and memory requirements while still achieving good performance.

Low-Rank Adaptation (LoRA) is a specific PEFT technique that adds small trainable matrices to the model’s existing weights instead of changing all its original parameters. This enables efficient fine-tuning by updating only these newly added low-rank matrices.

Quantized Low-Rank Adaptation (QLoRA) combines quantization techniques with LoRA. It uses 4-bit quantization for the base model’s parameters and keeps the LoRA parameters in 16-bit precision.

The fine-tuning example below uses Hugging Face’s PEFT library in Python. We also use the Nexla API to create and manage data resources. Nexla is a data integration platform that provides both code and no-code methods to move data from any source to any vector database, making it readily available for fine-tuning.

Step 1—Setup

First, we set up and install the required libraries. 

!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb

This step installs:

  • trl: Transformer Reinforcement Learning library
  • transformers: Hugging Face’s Transformers library for state-of-the-art NLP models
  • accelerate: Library for easy mixed precision training
  • peft: Parameter-efficient fine-tuning methods
  • datasets: Hugging Face’s datasets library
  • bitsandbytes: Quantization library
  • einops: Library for tensor operations
  • wandb: Weights & Biases for experiment tracking

Then, we can perform Nexla authentication and session setup as shown.  

from nexla import nexla_auth
from nexla import nexla_sink
import httpx
from fastapi import HTTPException  # HTTPException implies a FastAPI context

async def get_token(service_key: str) -> str:
    # Exchange a Nexla service key for a short-lived access token
    headers = {
        "Authorization": f"Basic {service_key}",
    }
    url = DATAOPS_BASE_URL + "token"
    async with httpx.AsyncClient() as client:
        response = await client.post(url, headers=headers)
    if response.status_code == 200:
        return response.json().get("access_token")
    raise HTTPException(status_code=401, detail="Failed to get access token")

auth = nexla_auth.Auth(api_base_url=nexla_api_url, access_token=access_token)

The above code imports Nexla libraries for authentication and data sink operations.
DATAOPS_BASE_URL can be https://dataops.nexla.io/nexla-api/. You can also refer to the Nexla docs for details.

Step 2—Data retrieval

Next, we retrieve a list of data files from the Nexla sink. As an example, we use Parquet files; Apache Parquet is a column-oriented, open-source data file format designed for efficient data storage and retrieval. The code below creates a local ‘tmp’ directory and downloads the Parquet files from S3 storage into it.

data_files_list = nexla_sink.DataSink.get_sink_filesList(auth, id=sink_id, days=10)
print(data_files_list)

import os
download_path = os.path.join(os.getcwd(), 'tmp') 
if not os.path.exists(download_path):
    os.mkdir(download_path)

nexla_sink.DataSink.download_files_from_s3(file_paths=data_files_list, download_path=download_path)

Step 3—Data processing

Now, we read the downloaded Parquet files into Pandas DataFrames. A Pandas DataFrame is a two-dimensional data structure with labeled rows and columns. We concatenate the DataFrames into a single DataFrame and convert it to a Hugging Face Dataset, which is what the trainer consumes in the later steps.

import pandas as pd
from datasets import Dataset
import os

li = []
file_list = os.listdir(download_path)

for file in file_list:
    df = pd.read_parquet(os.path.join(download_path, file))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
nexla_dataset = Dataset.from_pandas(frame)
print(nexla_dataset)

# Dataset.from_pandas already returns a Hugging Face Dataset,
# so there is no need to call load_dataset on it
dataset = nexla_dataset


Step 4—Model and tokenizer configuration

Next, we configure 4-bit quantization for efficient model loading. We load the pre-trained causal language model and the corresponding tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = ""

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token

Remember to set the pad token to the end-of-sequence token; many causal language models do not define a pad token by default.

Step 5—PEFT configuration

Now, you can configure LoRA for efficient fine-tuning. Set LoRA hyperparameters like alpha, dropout, and rank. Add target_modules to specify which layers to fine-tune and use prepare_model_for_kbit_training to prepare the model for quantized training.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj"],
)

model = prepare_model_for_kbit_training(model)

Step 6—Training arguments configuration

Configure training arguments, including batch size, learning rate, and optimization settings. Use fp16 (half-precision) for faster training and set up a constant learning rate scheduler.

from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

Step 7—SFTTrainer configuration and training

Finally, you can import the supervised fine-tuning trainer (SFTTrainer) from the trl library. Set up the trainer with the model, dataset, and configurations, specify the text field in the dataset and the maximum sequence length, and initiate the fine-tuning process.

from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

trainer.train()
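
After training completes, you typically persist just the adapter weights so they can be reloaded for inference in the next step. A short sketch (the output path is whatever directory you choose, matching peft_model_id below):

# Save the small LoRA adapter weights and the tokenizer for later reuse
trainer.model.save_pretrained("path/to/your/peft/model")
tokenizer.save_pretrained("path/to/your/peft/model")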

Step 8—Model inference

Now, you can load the fine-tuned LoRA adapters and prepare a prompt for inference. Then, generate text using the fine-tuned model, decode it, and print the generated output.

from peft import PeftModel

peft_model_id = "path/to/your/peft/model"  # directory the adapters were saved to
model = PeftModel.from_pretrained(model, peft_model_id)

text = ""  # fill in a prompt similar to your fine-tuning data
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

LLM fine-tuning best practices

When fine-tuning your models, keep the following best practices in mind.

Try prompt engineering before fine-tuning

Always try prompt engineering first; it can tell you whether fine-tuning is needed at all. Experimenting with different prompt templates teaches you the model’s strengths and weaknesses: the scenarios where it performs well and where it doesn’t. It is one of the best places to start before committing to fine-tuning.

Avoid overfitting

Overfitting happens when the model learns the training data too well, including its noise and other peculiarities, so that it performs well on the training data but not on unseen data. Various methods are used to prevent it. One is early stopping: monitor the model on a validation set during training and halt the process as soon as validation performance degrades. This preserves the model’s ability to generalize.
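
A sketch of early stopping with the Transformers EarlyStoppingCallback follows; it assumes a model plus train_set and val_set splits (a way to produce such splits with a fixed seed is sketched in the next section):

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",      # score the validation set periodically
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                      # assumes the model defined earlier
    args=args,
    train_dataset=train_set,          # assumed training split
    eval_dataset=val_set,             # assumed held-out validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)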

Data and model management

Your fine-tuned model may not meet your requirements in some instances. In such cases, you must experiment with the dataset and different fine-tuning techniques (instruction tuning for language modeling, full fine-tuning for text classification, etc.). 

Iterating through multiple fine-tuning runs produces different versions of your datasets and, more importantly, different model versions. Split your dataset into training, validation, and test sets using a fixed random seed so the splits stay consistent across runs, and keep track of your datasets and models with version control.
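
Using the Hugging Face datasets API, a reproducible three-way split of the nexla_dataset from Step 3 could look like this (the 80/10/10 ratio is just an example):

# A fixed seed makes the split reproducible across fine-tuning iterations
splits = nexla_dataset.train_test_split(test_size=0.2, seed=42)
holdout = splits["test"].train_test_split(test_size=0.5, seed=42)
train_set = splits["train"]   # 80% for training
val_set = holdout["train"]    # 10% for validation (early stopping, tuning)
test_set = holdout["test"]    # 10% held back for final evaluation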

Evaluation metrics 

After fine-tuning, evaluate the model on a held-out evaluation dataset to confirm that it behaves as intended. A good measure of progress is to compare the model against its pre-fine-tuning baseline using scoring frameworks such as BLEU and ROUGE.
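
As a small sketch with the Hugging Face evaluate library (the prediction and reference strings are made up), you can score the model before and after fine-tuning and compare the two:

import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")
predictions = ["the model answered the support question correctly"]  # made-up output
references = ["the model answered the customer question correctly"]  # made-up gold text
print(rouge.compute(predictions=predictions, references=references))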


Conclusion

Fine-tuning is a powerful technique applied to pre-trained models to increase performance on domain-specific tasks. Multiple aspects affect the fine-tuning process, such as datasets, architectures, hyperparameters, and the choice of fine-tuning technique, all of which can yield substantial results when chosen and implemented wisely.
