LLM Fine-Tuning—Overview with Code Example
Large language models are everywhere, with new models released publicly almost every day. However, if you have worked with LLMs on a specific use case, you have likely noticed that most models are generic and rarely perform exceptionally well on real-world, domain-specific tasks. Instead, you can use LLM training to improve a model's performance and generate better results for your specific use case. The most common LLM training approach is fine-tuning.
In simple terms, fine-tuning means taking a pre-trained foundation model and training it further on a given dataset, which helps the model perform better on data similar to that dataset. Foundation models cover a wide spectrum of modalities, including image, video, and multimodal models. However, this article on training and fine-tuning focuses solely on text models, namely large language models.
Summary of key LLM fine-tuning concepts
| Concept | Description |
| --- | --- |
| Pre-training vs fine-tuning | Pre-training is building an LLM from scratch. Fine-tuning further trains the pre-trained LLM on a curated knowledge base to increase its capabilities. |
| Components of LLM fine-tuning | Datasets, model architecture, hyperparameters, evaluation metrics |
| Self-supervised fine-tuning | The fine-tuning dataset is unlabeled but organized so the model can learn from patterns within the data. |
| Supervised fine-tuning | The fine-tuning dataset is labeled and organized. |
| Chat fine-tuning | Adapts a pre-trained model specifically for conversational AI or chatbot applications. |
| Instruction fine-tuning | Involves curating datasets where each example pairs an instruction with the corresponding input and desired output. |
| Full fine-tuning | Involves training the entire model on new, task-specific data to make it more specialized and effective for particular applications. |
| Hands-on LLM fine-tuning | Step-by-step tutorial with code snippets |
Pre-training vs. fine-tuning
Pre-training is building an LLM from scratch: a deep learning model is trained on a massive textual corpus until it acquires capabilities such as predicting the next token, at which point it is called a large language model. However, pre-training is more complex than just taking a dataset and training the model on it. It requires significant compute, time, and resources because the model is trained on thousands of gigabytes of data.
Fine-tuning, on the other hand, takes the model a step further. Since pre-training usually involves a generic dataset, the pre-trained model performs well with generic tasks. However, when dealing with domain-specific tasks, its performance can drastically deteriorate. That’s where fine-tuning comes in.
Fine-tuning further trains the pre-trained model on a curated knowledge base to improve its ability to handle tasks similar to those exemplified in the curated dataset. The model learns additional patterns, features, and representations that build upon the generic knowledge it already possesses. It is also worth noting that fine-tuning is always carried out on a pre-trained model, never the other way around.
Components of LLM fine-tuning
The three key components of LLM fine-tuning are the model to be fine-tuned, the model hyperparameters, and the dataset on which it is tuned or trained.
Datasets
Certain aspects of the dataset are often overlooked in the fine-tuning process, which quietly degrades both the process and, ultimately, the fine-tuned model. High-quality fine-tuning requires high-quality data: the quality of your data directly determines the quality of the results your fine-tuning yields.
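As a minimal, hypothetical illustration of basic dataset hygiene before fine-tuning, the snippet below uses pandas to drop empty and duplicate records. The file name and the "text" column name are assumptions about your dataset's schema.

import pandas as pd

# Hypothetical raw fine-tuning data; the file and column names are assumptions.
df = pd.read_parquet("fine_tuning_data.parquet")

df = df.dropna(subset=["text"])                   # remove records with missing text
df = df[df["text"].str.strip().str.len() > 0]     # remove empty strings
df = df.drop_duplicates(subset=["text"])          # remove exact duplicates

print(f"{len(df)} records remain after cleaning")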
Model architectures
Your model's architecture determines how it captures, structures, and accesses what it learns. Various architectures, such as BERT, GPT, and others, offer rich representations. Each architecture has its advantages and disadvantages, and the ability to extend a model based on its strengths can be a key factor in the efficacy of the fine-tuning process. For example, convolutional neural networks (CNNs) are more suitable for image classification tasks, while generative adversarial networks (GANs) are better for image generation. You can enhance a GAN by appending more layers to its generator network to produce higher-quality images.
Hyperparameters
Unlike model parameters, which are learned during training, hyperparameters are set and configured before the fine-tuning process begins. They need to be predefined because they significantly influence the performance and efficiency of the model. Some of them are:
- Learning rate – One of the core tuning parameters that determines the step size while moving toward optimal weights
- Batch size – The number of training samples used in one forward and backward pass
- Number of epochs – The number of complete passes the model makes over the entire fine-tuning dataset
These are the most common hyperparameters, but the main challenge lies in choosing the right hyperparameters to tweak and assigning them the correct values. We have covered the subject in depth in our article on model tuning techniques.
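As a minimal sketch of how these hyperparameters are typically declared up front, here they are expressed with Hugging Face's TrainingArguments (also used in the tutorial below); the specific values are illustrative assumptions, not recommendations.

from transformers import TrainingArguments

# Illustrative values only; tune these for your model and dataset.
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,                 # step size toward optimal weights
    per_device_train_batch_size=4,      # samples per forward/backward pass
    num_train_epochs=3,                 # full passes over the fine-tuning dataset
)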
Basic types of LLM fine-tuning
There are two main types of fine-tuning:
Self-supervised fine-tuning
In self-supervised fine-tuning, the fine-tuning dataset is unlabeled but organized so that the model can learn from patterns or properties within the data. There are two main types, contrasted in the short sketch after this list:
- Causal language modeling is a self-supervised learning approach that predicts the next word after a given set of words.
- Masked language modeling predicts masked tokens in a given sequence of words.
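The snippet below is a minimal, illustrative contrast of the two objectives using Hugging Face pipelines; the model names (gpt2 and bert-base-uncased) are assumptions chosen only for illustration.

from transformers import pipeline

# Causal language modeling: predict the next tokens after a prefix.
causal_lm = pipeline("text-generation", model="gpt2")
print(causal_lm("Fine-tuning a language model helps it", max_new_tokens=10))

# Masked language modeling: predict the token hidden behind [MASK].
masked_lm = pipeline("fill-mask", model="bert-base-uncased")
print(masked_lm("Fine-tuning a language [MASK] helps it specialize."))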
Supervised fine-tuning
A labeled dataset is used to adapt a pre-trained model to a specific task. This data has a set of features and labels that have been curated and validated beforehand, which enables the model to cater to niche categories. Earlier, training supervision could only be done by humans. However, with the increased capabilities of AI models, one can now experiment with instructing one model to supervise another model's training.
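Below are two small, hypothetical examples of what supervised fine-tuning records might look like; the field names are assumptions, and real schemas vary by task.

# A labeled example for a sentiment classification task.
labeled_example = {
    "text": "The onboarding flow was confusing and slow.",
    "label": "negative",
}

# A labeled example for a question-answering task.
qa_example = {
    "question": "What is the default session timeout?",
    "context": "Sessions expire after 30 minutes of inactivity.",
    "answer": "30 minutes",
}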
LLM fine-tuning techniques
There are different approaches to LLM training and fine-tuning.
Chat fine-tuning
One of LLMs’ most common use cases is in chatbots and other conversational settings. Chat fine-tuning is a process that adapts a pre-trained model specifically for conversations. It involves training the model on dialog data to improve its ability to generate human-like responses in a conversation. The model improves in understanding context, maintaining coherence, question-answering, and following instructions.
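Below is a hypothetical dialog-formatted training record in the widely used role/content message style; the exact schema depends on the model and training library you use.

# One conversation sample for chat fine-tuning (illustrative format).
chat_example = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my API key?"},
        {"role": "assistant", "content": "Go to Settings > API Keys and click Regenerate."},
    ]
}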
Instruction tuning
When dealing with LLMs, most of our interactions are instruction-based. We provide the LLM with context when required and instruct it to perform a certain task. In such scenarios, traditional fine-tuning, although quite effective, might reach some bottlenecks. A more recent and innovative approach called instruction tuning has a distinct advantage when dealing with instruct LLMs. It involves curating datasets where each example pairs an instruction with the corresponding input and desired output.
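A hypothetical instruction-tuning record, following the common instruction/input/output layout; the field names and content are illustrative assumptions.

instruction_example = {
    "instruction": "Summarize the following support ticket in one sentence.",
    "input": "Customer reports that exports fail with a timeout after 10 minutes...",
    "output": "The customer cannot complete large exports because they time out.",
}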
Embedding fine-tuning
Embedding fine-tuning is adjusting the word or token embeddings of a pre-trained model. The embeddings, vector representations of words or tokens, are updated to better suit your specific task or data. The dataset would contain domain-specific terms, which improves the representation of these words in the domain’s context. A next-gen data platform like Nexla allows you to bring data to any vector database without structure or coding. It has hundreds of out-of-the-box bidirectional connectors to quickly get your data to your LLM, no matter where it resides.
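One common way to fine-tune embeddings is with the sentence-transformers library. The sketch below is a minimal example under stated assumptions: the all-MiniLM-L6-v2 base model and the two in-domain sentence pairs are illustrative choices, not a prescribed setup.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative in-domain pairs that should map to similar embeddings.
train_examples = [
    InputExample(texts=["data sink", "destination for processed records"]),
    InputExample(texts=["data flow", "pipeline moving records between systems"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# One short pass is enough to illustrate the mechanics.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)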
Full fine-tuning
Full fine-tuning is a concept that builds upon feature extraction and takes it a step further. Unlike feature extraction, where only the outputs are extracted, this process updates all the weights and parameters of the pre-trained model. It involves training the entire model on new, task-specific data to make it more specialized and effective for particular applications. This also ensures the model’s internal representations are finely tuned to the specific nuances of the target dataset.
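As a quick illustration of what "updating all the weights" means in practice, the snippet below loads a small model (gpt2, an illustrative choice) and confirms that every parameter is trainable, which is what distinguishes full fine-tuning from parameter-efficient approaches.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable} of {total} parameters are trainable")  # all of them, in full fine-tuning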
Hands-on LLM fine-tuning example
Parameter-efficient fine-tuning (PEFT) is a set of techniques that focuses on updating only a small subset of the model’s parameters during fine-tuning. It drastically reduces computational costs and memory requirements while still achieving good performance.
Low-Rank Adaptation (LoRA) is a specific PEFT technique that adds small trainable matrices to the model's existing weights instead of changing all of its original parameters. This enables efficient fine-tuning by updating only these newly added low-rank matrices.
Quantized Low-Rank Adaptation (QLoRA) combines quantization techniques with LoRA. It uses 4-bit quantization for the base model's parameters and keeps the LoRA parameters in 16-bit precision.
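To see why LoRA is so much cheaper, consider a single weight matrix; the dimensions and rank below are illustrative assumptions.

# For a hypothetical 4096 x 4096 weight matrix, compare trainable values.
d, k, r = 4096, 4096, 8

full_update = d * k              # updating the full matrix: 16,777,216 values
lora_update = d * r + r * k      # LoRA matrices A (d x r) and B (r x k): 65,536 values

print(f"full: {full_update:,}  lora: {lora_update:,}")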
The fine-tuning example below uses the Hugging Face PEFT library in Python. We also use the Nexla API to create and manage data resources. Nexla is a data integration platform that provides both code and no-code methods to move data from any source to any vector database, making it readily available for fine-tuning.
Step 1—Setup
First, we set up and install the required libraries.
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
This step installs:
- trl: Transformer Reinforcement Learning library
- transformers: Hugging Face’s Transformers library for state-of-the-art NLP models
- accelerate: Library for easy mixed precision training
- peft: Parameter-efficient fine-tuning methods
- datasets: Hugging Face’s datasets library
- bitsandbytes: Quantization library
- einops: Library for tensor operations
- wandb: Weights & Biases for experiment tracking
Then, we can perform Nexla authentication and session setup as shown.
from nexla import nexla_auth
from nexla import nexla_sink
import httpx
from fastapi import HTTPException  # assumed source of the HTTPException used below

# DATAOPS_BASE_URL, nexla_api_url, and access_token are defined elsewhere (see note below).
async def get_token(service_key: str) -> str:
    # Exchange the Nexla service key for an access token.
    headers = {
        "Authorization": f"Basic {service_key}",
    }
    url = DATAOPS_BASE_URL + "token"
    async with httpx.AsyncClient() as client:
        response = await client.post(url, headers=headers)
    if response.status_code == 200:
        return response.json().get("access_token", None)
    else:
        raise HTTPException(status_code=401, detail="Failed to get access token")

auth = nexla_auth.Auth(api_base_url=nexla_api_url, access_token=access_token)
The above code imports Nexla libraries for authentication and data sink operations.
DATAOPS_BASE_URL can be https://dataops.nexla.io/nexla-api/. You can also refer to the Nexla docs for details.
Step 2—Data retrieval
Next, we retrieve a list of data files from the Nexla sink. As an example, we use Parquet files. Apache Parquet is a column-oriented, open-source data file format designed for efficient data storage and retrieval. The code below creates a local ‘tmp’ directory and downloads the Parquet files from S3 storage into it.
data_files_list = nexla_sink.DataSink.get_sink_filesList(auth, id=sink_id, days=10)
print(data_files_list)

import os

download_path = os.path.join(os.getcwd(), 'tmp')
if not os.path.exists(download_path):
    os.mkdir(download_path)

nexla_sink.DataSink.download_files_from_s3(file_paths=data_files_list, download_path=download_path)
Step 3—Data processing
Now, we read the downloaded Parquet files into Pandas DataFrames. A Pandas DataFrame is a 2-dimensional array data structure with rows and columns. We concatenate the DataFrames into a single DataFrame and convert it to a Hugging Face Dataset using Dataset.from_pandas.
import pandas as pd
from datasets import Dataset
import os

li = []
file_list = os.listdir(download_path)
for file in file_list:
    df = pd.read_parquet(os.path.join(download_path, file))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

# Convert the combined DataFrame into a Hugging Face Dataset for training.
dataset = Dataset.from_pandas(frame)
print(dataset)
Step 4—Model and tokenizer configuration
Next, we configure 4-bit quantization for efficient model loading. We load the pre-trained causal language model and the corresponding tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = ""  # name or path of the pre-trained model to fine-tune

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
Remember to set the pad token to the same as the end-of-sequence token.
Step 5—PEFT configuration
Now, you can configure LoRA for efficient fine-tuning. Set LoRA hyperparameters like alpha, dropout, and rank. Add target_modules to specify which layers to fine-tune and use prepare_model_for_kbit_training to prepare the model for quantized training.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj"]
)

model = prepare_model_for_kbit_training(model)
Step 6—Training arguments configuration
Configure training arguments, including batch size, learning rate, and optimization settings. Use fp16 (half-precision) for faster training and set up a constant learning rate scheduler.
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 100
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 100
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)
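Note that with these settings, gradients are accumulated over several small batches before each optimizer update; a quick sanity check of the resulting effective batch size:

# Effective batch size per optimizer update with the settings above.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 4 * 4 = 16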
Step 7—SFTTrainer configuration and training
Finally, you can import the supervised fine-tuning trainer (SFTTrainer) from the trl library. Set up the trainer with the model, dataset, and configurations. Specify the text field in the dataset and the maximum sequence length, then initiate the fine-tuning process.
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

trainer.train()
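After training, you will typically want to persist the LoRA adapters so they can be reloaded for inference in the next step. A minimal sketch, reusing the placeholder path from Step 8 below:

# Save only the trained LoRA adapter weights (not the full base model).
trainer.model.save_pretrained("path/to/your/peft/model")
tokenizer.save_pretrained("path/to/your/peft/model")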
Step 8—Model inference
Now, you can load the fine-tuned LoRA adapters and prepare a prompt for inference. Then, generate text using the fine-tuned model, decode it, and print the generated output.
from peft import PeftModel

peft_model_id = "path/to/your/peft/model"
model = PeftModel.from_pretrained(model, peft_model_id)

text = ""  # prompt to run inference on
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
LLM fine-tuning best practices
When fine-tuning your models, remember to follow the best practices below.
Try prompt engineering before fine-tuning
Always try prompt engineering first. Experimenting with prompts can tell you whether fine-tuning is needed at all. As you try different prompt templates, you learn the model's strengths and weaknesses: the scenarios where it performs well and where it does not. It is one of the best places to start before fine-tuning.
Avoid overfitting
Overfitting happens when the model learns the training data too well, including its noise and other peculiarities, so it performs well only on the training data and not on new or unseen data. Various methods can prevent overfitting. One is early stopping, a technique in which you monitor the model on a validation set during training and halt the process as soon as validation performance degrades. This helps preserve the model's ability to generalize.
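As a minimal sketch of early stopping with the Hugging Face Trainer, assuming model, train_dataset, and eval_dataset are already defined as in the tutorial above; the patience value and evaluation settings are illustrative assumptions.

from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",       # evaluate on the validation set periodically
    eval_steps=50,
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Stop after 3 consecutive evaluations without improvement.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)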
Data and model management
Your fine-tuned model may not meet your requirements in some instances. In such cases, you must experiment with the dataset and different fine-tuning techniques (instruction tuning for language modeling, full fine-tuning for text classification, etc.).
Iterating through multiple fine-tuning runs produces different versions of your datasets and, more importantly, different model versions. You should split your dataset into training, validation, and test sets using a fixed random seed to ensure consistency across runs. It is also highly recommended that you keep track of your datasets using version control.
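A minimal sketch of a reproducible split using the Hugging Face datasets library, assuming dataset is the Dataset built earlier; the split sizes and seed are illustrative.

# Reproducible train/validation/test split from a Hugging Face Dataset.
split = dataset.train_test_split(test_size=0.2, seed=42)
train_set = split["train"]

holdout = split["test"].train_test_split(test_size=0.5, seed=42)
validation_set = holdout["train"]
test_set = holdout["test"]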
Evaluation metrics
After fine-tuning, evaluate the model on an evaluation dataset to ensure it behaves as intended. A good measure of progress is comparing the model's performance before and after fine-tuning using scoring frameworks such as BLEU and ROUGE.
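A minimal sketch using the Hugging Face evaluate library; the predictions and references are made-up placeholders.

import evaluate

rouge = evaluate.load("rouge")

# Hypothetical model outputs and reference answers.
predictions = ["The export failed because the job timed out."]
references = ["Exports fail due to a job timeout."]

print(rouge.compute(predictions=predictions, references=references))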
Conclusion
Fine-tuning is a powerful technique applied to pre-trained models to increase performance on domain-specific tasks. Multiple aspects affect the fine-tuning process, such as datasets, architectures, hyperparameters, and the choice of fine-tuning technique, and choosing and implementing them wisely can yield substantial results.