Low-rank Adaptation of Large Language Models—Implementation Guide
Low-rank adaptation, or LoRA, is an advanced fine-tuning technique designed to reduce the number of trainable parameters in large language models without significantly compromising performance. By decomposing weight updates into low-rank matrices, LoRA enables LLMs to adapt to specific tasks while minimizing computational requirements.
This article aims to provide an understanding of LoRA for LLMs, with a focus on implementation details and best practices. We’ll explore the technical principles behind LoRA, discuss its advantages over full fine-tuning, and provide practical guidance for fine-tuning models with LoRA.
This article assumes background knowledge of fine-tuning concepts and matrix operations. Please read our articles on prompt tuning vs. fine-tuning and model tuning as prerequisites to better understand the concepts discussed here.
Summary of key low-rank adaptation LLM concepts
Concept | Description |
---|---|
Low-rank adaptation | A technique that approximates weight updates in neural networks using low-rank matrix factorization |
Why use LoRA? | LoRA significantly reduces memory requirements for training LLMs by adjusting only a small fraction of parameters. It maintains performance comparable to full fine-tuning with minimal impact on inference speed, allowing for efficient creation of multiple task-specific model versions without excessive storage needs. This makes LoRA valuable for various text, dialogue, and image generation applications. |
Parameter-efficient fine-tuning (PEFT) | A class of methods that adapt pre-trained models by updating only a small subset of parameters. Low-rank adaptation is a PEFT technique. |
Hyperparameters for LoRA | Key variables like rank and alpha that control LoRA’s behavior and effectiveness |
QLoRA | An extension of LoRA that incorporates 4-bit quantization for further memory optimization |
Low-rank adaptation of large language models explained
LoRA operates on the principle of matrix decomposition. Instead of updating the entire weight matrix W during fine-tuning, LoRA introduces two smaller matrices A and B:
W’ = W + BA
Where:
- W ∈ ℝ^(d×k) is the original weight matrix (frozen during training), where d and k are the number of rows and columns of W, respectively.
- W’ is the updated weight matrix after fine-tuning.
- B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the LoRA matrices. Note that B has the same number of rows as the original weight matrix W, and A has the same number of columns as W.
- r is the inner dimension, or rank, of the decomposition matrices and is a hyperparameter.
The key to LoRA’s efficiency is setting r << min(d,k), which dramatically reduces the number of trainable parameters. This rank reduction is based on the empirical observation that weight updates during fine-tuning often have a low intrinsic rank.
During inference, the LoRA updates are merged with the original weights to get the output, h, as below:
h = (W + BA)x
This reconstruction approximates the effect of full fine-tuning while training far fewer parameters.
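To make the shapes concrete, here is a minimal PyTorch sketch of the update and the output computation; the dimensions and initialization values are illustrative only.

```python
# Minimal LoRA sketch; dimensions are illustrative.
import torch

d, k, r = 1024, 1024, 8        # r << min(d, k)
W = torch.randn(d, k)          # frozen pre-trained weight
A = torch.randn(r, k) * 0.01   # trainable, Gaussian-initialized
B = torch.zeros(d, r)          # trainable, zero-initialized so BA = 0 before training
x = torch.randn(k)             # input

h = (W + B @ A) @ x            # same result whether BA is merged into W or applied separately

full_params = d * k            # 1,048,576 parameters in W
lora_params = d * r + r * k    # 16,384 trainable parameters in A and B
print(f"Trainable fraction: {lora_params / full_params:.2%}")  # ~1.56%
```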
Application to the transformer architecture
LoRA is shown to be more effective when applied to the self-attention mechanism in transformer-based models. In a standard transformer, the attention function is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the key vectors. Understanding these matrices is important when applying LoRA.
Q, K, and V are derived from the input X through learnable weight matrices W_Q, W_K, and W_V:

Q = XW_Q,  K = XW_K,  V = XW_V
LoRA modifies these weight matrices by adding low-rank updates:
W’_Q = W_Q + B_Q A_Q
W’_K = W_K + B_K A_K
W’_V = W_V + B_V A_V
where B_i ∈ ℝ^(d×r) and A_i ∈ ℝ^(r×k) are the LoRA matrices for each projection, and r is the rank (typically r << d, k). Here, d and k are the number of rows and columns of the projections W_Q, W_K, and W_V.
This modification allows the model to adapt its attention patterns with minimal additional parameters. Empirically, applying LoRA to just the query and value projections (W_Q and W_V) often yields good results, further reducing the parameter count. We’ll see this in action in the next section.
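To illustrate how this looks in code, the sketch below wraps a frozen linear projection (such as W_Q or W_V) with a LoRA update. The class and attribute names are hypothetical; in practice, libraries such as Hugging Face PEFT perform this wrapping for you.

```python
# Illustrative LoRA wrapper around a frozen projection layer (names are hypothetical).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained projection
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init, so BA = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # h = Wx + (alpha / r) * BAx
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Wrapping only the query and value projections of one attention block might look like:
# block.q_proj = LoRALinear(block.q_proj, r=8, alpha=16)
# block.v_proj = LoRALinear(block.v_proj, r=8, alpha=16)
```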
Why use low-rank adaptation for LLMs?
The advantage of LoRA lies in its significant reduction of memory requirements during training and storage.
Reduce memory requirements in training
How much memory does LoRA save? It depends on the rank r, which is a hyperparameter. For example, if W has 1,000 rows and 5,000 columns, it stores 5,000,000 parameters. If we choose r=8, then B has 1,000 rows and 8 columns, and A has 8 rows and 5,000 columns. That’s 1000×8 + 8×5000 = 48,000 parameters, which is about 0.96% of the original count, i.e., you only need to train 0.96% of the weights in that matrix.
This efficiency allows for fine-tuning of large models on limited hardware resources. For instance, a 7 billion parameter model can be adapted using LoRA on a single GPU with just 14 GB of RAM—a task typically requiring multiple high-end GPUs for full fine-tuning.
Additional memory optimization can be achieved through Quantized LoRA (QLoRA), which quantizes pre-trained weights to 4-bit precision. Comparative studies have shown that while QLoRA increases training time, it significantly reduces memory usage.
Improve model performance
Despite this reduction in trainable parameters, LoRA maintains performance levels comparable to full fine-tuning. An important consideration for production deployments is inference latency. Applying LoRA has minimal impact on the new model’s inference speed. This is because the low-rank matrices can be merged with the original weights post-training, resulting in a model with the same architecture and parameter count as the original.
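With the PEFT library, this merge can be performed explicitly after training. Below is a minimal sketch, assuming model_lora is a trained PeftModel whose base weights were loaded in full precision (merging into 4-bit quantized weights is more involved).

```python
# Fold the LoRA matrices back into the base weights so inference runs on a standard model
# with the original architecture and parameter count.
merged_model = model_lora.merge_and_unload()
merged_model.save_pretrained("flan-t5-merged")  # output directory name is illustrative
```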
Empirical studies on tasks such as dialogue summarization have shown minimal performance drop-off. You can also combine LoRA with other parameter-efficient methods, like adapter layers or prompt tuning for improved performance.
Create multiple model versions with minimal overheads
LoRA has shown good results in text-generation tasks, dialogue systems, and image-generation models. Its ability to create small, task-specific adaptations that can be easily swapped out during inference makes it a good option for maintaining multiple fine-tuned versions of a large model for different tasks without the storage overhead of multiple full-sized models.
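For example, PEFT allows several adapters to be attached to one base model and switched at runtime. The sketch below assumes base_model is an already-loaded pre-trained model; the adapter repository names are hypothetical.

```python
# Serve multiple task-specific LoRA adapters on a single base model (adapter names are hypothetical).
from peft import PeftModel

model = PeftModel.from_pretrained(
    base_model, "your-org/flan-t5-lora-summarization", adapter_name="summarization"
)
model.load_adapter("your-org/flan-t5-lora-dialogue", adapter_name="dialogue")

model.set_adapter("dialogue")       # route requests through the dialogue adapter
model.set_adapter("summarization")  # switch back without reloading the base model
```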
LoRA implementation
Select a suitable dataset and establish a baseline. The dataset’s preprocessing steps depend on the task and the base model you select. For the purpose of this article, we won’t be diving deep into the data processing steps.
Install the required libraries
Key libraries include bitsandbytes for quantization, datasets for loading datasets, accelerate for handling training distribution, and transformers and peft from Hugging Face.
```bash
!pip install -q datasets accelerate bitsandbytes
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
```
Load the pre-trained model
Choose a pre-trained model appropriate for your task. For this example, we use the flan-t5-base model from Google, which is suitable for sequence-to-sequence tasks like summarization.
Use the BitsAndBytesConfig to load the model with quantized (4-bit) weights. This allows us to reduce the needed memory for flan-t5-base by about 8x.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
```
Apply LoRA
Specify LoRA parameters using LoraConfig in PEFT. Key parameters include:
- r: Rank of the low-rank matrices.
- lora_alpha: Scaling factor for the low-rank matrices.
- target_modules: The model components to apply LoRA (e.g., query and value matrices in attention layers).
- lora_dropout: Dropout rate for regularization.
- bias: How to handle biases (e.g., “none” to exclude biases).
- task_type: The type of model task (e.g., SEQ_2_SEQ_LM for sequence-to-sequence language modeling, CAUSAL_LM for causal language modeling).
Use get_peft_model function to wrap the pre-trained flan-t5-base model with LoRA configuration.
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # alpha scaling factor
    target_modules=["q", "v"],  # apply LoRA to the query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

# prepare the 4-bit quantized model for training
model = prepare_model_for_kbit_training(model)
model_lora = get_peft_model(model, config)
model_lora.print_trainable_parameters()
```
As the printed summary shows, the trainable parameters are only about 0.5% of the original model’s parameters.
Train the model
Tokenize and prepare your dataset for training. We won’t go into the data preprocessing details, as they depend on the domain task and the base model. For this experiment, we used the quotes dataset.
```python
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
```
Define arguments to configure training parameters like batch size, learning rate, and number of epochs. Our model tuning techniques article covers how to select these arguments in detail.
```python
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="flan-t5-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=500,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    gradient_checkpointing=True,
)
```
Use the Trainer class to handle the training loop. In our run, training took about 2 hours and cost roughly $2.
```python
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model_lora,
    train_dataset=data["train"],
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # required with gradient checkpointing; re-enable for inference
trainer.train()
```
We can save the adapters to use for inference. Here, we are saving them to the Hugging Face Hub. The adapter file is about 30 MB.
```python
model_lora.push_to_hub("asanthosh/flan-t5-lora", use_auth_token=True, commit_message="lora 100", private=True)
```
You can load the LoRA adapters and wrap them around the base model using the PeftModel class. Now, the LoRA fine-tuned flan-t5-base model can be used for inference.
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

peft_model_id = "asanthosh/flan-t5-lora"
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)
```
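As a quick sanity check, you can run generation through the adapted model; the prompt below is illustrative.

```python
# Run a quick inference check with the LoRA-adapted model (prompt is illustrative).
prompt = "Summarize: LoRA adapts large models by training small low-rank matrices instead of all weights."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```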
Best practices
When defining LoRA parameters, the trade-off is between computational efficiency and model performance. Adjusting the rank (r) and alpha values can be tricky: higher ranks improve performance but increase computational cost. A common heuristic is setting the alpha value at twice the rank value as a starting point.
The layers targeted for LoRA application are also important. While attention layers typically provide significant benefits, other layers can be included based on specific needs.
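As a starting point, these two recommendations can be expressed in a LoraConfig like the sketch below; the rank, alpha value, and target modules are illustrative and should be tuned for your model and task (module names also differ between architectures, e.g., q_proj and v_proj in Llama-style models).

```python
# Illustrative starting configurations; tune r, lora_alpha, and target_modules for your model.
from peft import LoraConfig

# alpha ≈ 2 × rank as a starting heuristic, applied to the attention query/value projections
attention_only = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

# Broader coverage: also adapt the key and output projections if the task benefits from it
wider_coverage = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "k", "v", "o"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
```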
Utilizing quantized configurations, such as loading model weights with 4-bit precision, substantially reduces memory usage. However, this approach increases training time due to the overhead of quantization and dequantization steps.
Hyperparameter tuning is critical for successful fine-tuning. Key hyperparameters include learning rate, batch size, and the number of epochs. Fine-tuning these settings through experimentation can help achieve better performance. Additionally, the choice of optimizer, though less critical, should still be considered. Options like AdamW and SGD with schedulers are commonly used.
Conclusion
LoRA offers a computationally efficient method for fine-tuning LLMs. You can maintain performance while significantly reducing trainable parameters and, hence, memory requirements. Its effectiveness, combined with techniques like QLoRA, makes it a good option for adapting large models to specific tasks with limited computational resources.