Low-rank Adaptation of Large Language Models—Implementation Guide
Low-rank adaptation, or LoRA, is an advanced fine-tuning technique designed to reduce the number of trainable parameters in large language models without significantly compromising performance. By decomposing weight updates into low-rank matrices, LoRA enables LLMs to adapt to specific tasks while minimizing computational requirements.
This article aims to provide an understanding of LoRA for LLMs, with a focus on implementation details and best practices. We’ll explore the technical principles behind LoRA, discuss its advantages over full fine-tuning, and provide practical guidance for fine-tuning models with LoRA.
This article assumes background knowledge of fine-tuning concepts and matrix operations. Please read our articles on prompt tuning vs. fine-tuning and model tuning as prerequisites to better understand the concepts discussed here.
Summary of key low-rank adaptation LLM concepts
Concept | Description |
---|---|
Low-rank adaptation | A technique that approximates weight updates in neural networks using low-rank matrix factorization |
Why use LoRA? | LoRA significantly reduces memory requirements for training LLMs by adjusting only a small fraction of parameters. It maintains performance comparable to full fine-tuning with minimal impact on inference speed, allowing for efficient creation of multiple task-specific model versions without excessive storage needs. This makes LoRA valuable for various text, dialogue, and image generation applications. |
Parameter-efficient fine-tuning (PEFT) | A class of methods that adapt pre-trained models by updating only a small subset of parameters. Low-rank adaptation is a PEFT technique. |
Hyperparameters for LoRA | Key variables like rank and alpha that control LoRA’s behavior and effectiveness |
QLoRA | An extension of LoRA that incorporates 4-bit quantization for further memory optimization |
Low-rank adaptation of large language models explained
LoRA operates on the principle of matrix decomposition. Instead of updating the entire weight matrix W during fine-tuning, LoRA introduces two smaller matrices A and B:
W’ = W + BA
Where:
- W ∈ ℝ^(d×k) is the original weight matrix (frozen during training), where d and k are the number of rows and columns of W, respectively.
- W’ is the updated weight matrix after fine-tuning.
- B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are the LoRA matrices. Note that B has the same number of rows as the original weight matrix W, and A has the same number of columns as W.
- r is the inner dimension, or rank, of the decomposition matrices and is a hyperparameter.
The key to LoRA’s efficiency is setting r << min(d,k), which dramatically reduces the number of trainable parameters. This rank reduction is based on the empirical observation that weight updates during fine-tuning often have a low intrinsic rank.
During inference, the LoRA updates are merged with the original weights to get the output, h, as below:
h = (W + BA)x
This reconstruction approximates the effect of full fine-tuning while training far fewer parameters.
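To make the shapes concrete, here is a minimal PyTorch sketch of the update and the output computation; the dimensions and initialization values are illustrative only.

```python
# Minimal LoRA sketch; dimensions are illustrative.
import torch

d, k, r = 1024, 1024, 8        # r << min(d, k)
W = torch.randn(d, k)          # frozen pre-trained weight
A = torch.randn(r, k) * 0.01   # trainable, Gaussian-initialized
B = torch.zeros(d, r)          # trainable, zero-initialized so BA = 0 before training
x = torch.randn(k)             # input

h = (W + B @ A) @ x            # same result whether BA is merged into W or applied separately

full_params = d * k            # 1,048,576 parameters in W
lora_params = d * r + r * k    # 16,384 trainable parameters in A and B
print(f"Trainable fraction: {lora_params / full_params:.2%}")  # ~1.56%
```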
Application to the transformer architecture
LoRA is shown to be more effective when applied to the self-attention mechanism in transformer-based models. In a standard transformer, the attention function is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the key vectors. Understanding these matrices is important when applying LoRA.
Q, K, and V are derived from the input X through learnable weight matrices W_Q, W_K, and W_V:

Q = XW_Q,  K = XW_K,  V = XW_V
LoRA modifies these weight matrices by adding low-rank updates:
W’_Q = W_Q + B_Q A_Q
W’_K = W_K + B_K A_K
W’_V = W_V + B_V A_V
where B_i ∈ ℝ^(d×r) and A_i ∈ ℝ^(r×k) are the LoRA matrices for each projection, and r is the rank (typically r << d, k). Here, d and k are the number of rows and columns of the projections W_Q, W_K, and W_V.
This modification allows the model to adapt its attention patterns with minimal additional parameters. Empirically, applying LoRA to just the query and value projections (W_Q and W_V) often yields good results, further reducing the parameter count. We’ll see this in action in the next section.
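To illustrate how this looks in code, the sketch below wraps a frozen linear projection (such as W_Q or W_V) with a LoRA update. The class and attribute names are hypothetical; in practice, libraries such as Hugging Face PEFT perform this wrapping for you.

```python
# Illustrative LoRA wrapper around a frozen projection layer (names are hypothetical).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained projection
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init, so BA = 0 at start
        self.scaling = alpha / r

    def forward(self, x):
        # h = Wx + (alpha / r) * BAx
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Wrapping only the query and value projections of one attention block might look like:
# block.q_proj = LoRALinear(block.q_proj, r=8, alpha=16)
# block.v_proj = LoRALinear(block.v_proj, r=8, alpha=16)
```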
Why use low-rank adaptation for LLMs?
The advantage of LoRA lies in its significant reduction of memory requirements during training and storage.
Reduce memory requirements in training
How much memory does LoRA save? It depends on the rank r, which is a hyperparameter. For example, if W has 1,000 rows and 5,000 columns, it stores 5,000,000 parameters. If we choose r=8, then B has 1,000 rows and 8 columns, and A has 8 rows and 5,000 columns. That’s 1000×8 + 8×5000 = 48,000 parameters, which is about 0.96% of the original count, i.e., you only need to train 0.96% of the weights in that matrix.
This efficiency allows for fine-tuning of large models on limited hardware resources. For instance, a 7 billion parameter model can be adapted using LoRA on a single GPU with just 14 GB of RAM—a task typically requiring multiple high-end GPUs for full fine-tuning.
Additional memory optimization can be achieved through Quantized LoRA (QLoRA), which quantizes pre-trained weights to 4-bit precision. Comparative studies have shown that while QLoRA increases training time, it significantly reduces memory usage.
Improve model performance
Despite this reduction in trainable parameters, LoRA maintains performance levels comparable to full fine-tuning. An important consideration for production deployments is inference latency. Applying LoRA has minimal impact on the new model’s inference speed. This is because the low-rank matrices can be merged with the original weights post-training, resulting in a model with the same architecture and parameter count as the original.
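With the PEFT library, this merge can be performed explicitly after training. Below is a minimal sketch, assuming model_lora is a trained PeftModel whose base weights were loaded in full precision (merging into 4-bit quantized weights is more involved).

```python
# Fold the LoRA matrices back into the base weights so inference runs on a standard model
# with the original architecture and parameter count.
merged_model = model_lora.merge_and_unload()
merged_model.save_pretrained("flan-t5-merged")  # output directory name is illustrative
```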
Empirical studies on tasks such as dialogue summarization have shown minimal performance drop-off. You can also combine LoRA with other parameter-efficient methods, like adapter layers or prompt tuning for improved performance.
Create multiple model versions with minimal overheads
LoRA has shown good results in text-generation tasks, dialogue systems, and image-generation models. Its ability to create small, task-specific adaptations that can be easily swapped out during inference makes it a good option for maintaining multiple fine-tuned versions of a large model for different tasks without the storage overhead of multiple full-sized models.
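For example, PEFT allows several adapters to be attached to one base model and switched at runtime. The sketch below assumes base_model is an already-loaded pre-trained model; the adapter repository names are hypothetical.

```python
# Serve multiple task-specific LoRA adapters on a single base model (adapter names are hypothetical).
from peft import PeftModel

model = PeftModel.from_pretrained(
    base_model, "your-org/flan-t5-lora-summarization", adapter_name="summarization"
)
model.load_adapter("your-org/flan-t5-lora-dialogue", adapter_name="dialogue")

model.set_adapter("dialogue")       # route requests through the dialogue adapter
model.set_adapter("summarization")  # switch back without reloading the base model
```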
LoRA implementation
Select a suitable dataset and establish a baseline. The dataset’s preprocessing steps depend on the task and the base model you select. For the purpose of this article, we won’t be diving deep into the data processing steps.
Install the required libraries
Key libraries include bitsandbytes for quantization, datasets for loading datasets, accelerate for handling training distribution, and transformers and peft from Hugging Face.
```bash
!pip install -q datasets accelerate bitsandbytes
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git
```
Load the pre-trained model
Choose a pre-trained model appropriate for your task. For this example, we use the flan-t5-base model from Google, which is suitable for sequence-to-sequence tasks like summarization.
Use the BitsAndBytesConfig to load the model with quantized (4-bit) weights. This allows us to reduce the needed memory for flan-t5-base by about 8x.
```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
```
Apply LoRA
Specify LoRA parameters using LoraConfig in PEFT. Key parameters include:
- r: Rank of the low-rank matrices.
- lora_alpha: Scaling factor for the low-rank matrices.
- target_modules: The model components to apply LoRA (e.g., query and value matrices in attention layers).
- lora_dropout: Dropout rate for regularization.
- bias: How to handle biases (e.g., “none” to exclude biases).
- task_type: The type of model task (e.g., SEQ_2_SEQ_LM for sequence-to-sequence language modeling, CAUSAL_LM for causal language modeling).
Use get_peft_model function to wrap the pre-trained flan-t5-base model with LoRA configuration.
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # alpha scaling factor
    target_modules=["q", "v"],  # apply LoRA to the query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

# prepare the 4-bit quantized model for training
model = prepare_model_for_kbit_training(model)
model_lora = get_peft_model(model, config)
model_lora.print_trainable_parameters()
```
As the printed summary shows, the trainable parameters are only about 0.5% of the original model’s parameters.
Train the model
Tokenize and prepare your dataset for training. We won’t go into the data preprocessing details, as they depend on the domain task and the base model. For this experiment, we used the quotes dataset.
```python
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
```
Define arguments to configure training parameters like batch size, learning rate, and number of epochs. Our model tuning techniques article covers how to select these arguments in detail.
```python
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="flan-t5-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=500,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    gradient_checkpointing=True,
)
```
Use the Trainer class to handle the training loop. In our run, training took about 2 hours and cost roughly $2.
```python
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model_lora,
    train_dataset=data["train"],
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # required with gradient checkpointing; re-enable for inference
trainer.train()
```
We can save the adapters to use for inference. Here, we are saving them to the Hugging Face Hub. The adapter file is about 30 MB.
```python
model_lora.push_to_hub("asanthosh/flan-t5-lora", use_auth_token=True, commit_message="lora 100", private=True)
```
You can load the LoRA adapters and wrap them around the base model using the PeftModel class. Now, the LoRA fine-tuned flan-t5-base model can be used for inference.
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

peft_model_id = "asanthosh/flan-t5-lora"
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)
```
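As a quick sanity check, you can run generation through the adapted model; the prompt below is illustrative.

```python
# Run a quick inference check with the LoRA-adapted model (prompt is illustrative).
prompt = "Summarize: LoRA adapts large models by training small low-rank matrices instead of all weights."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```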
Best practices
When defining LoRA parameters, the trade-off is between computational efficiency and model performance. Adjusting the rank (r) and alpha values can be tricky: higher ranks improve performance but increase computational cost. A common heuristic is setting the alpha value at twice the rank value as a starting point.
The layers targeted for LoRA application are also important. While attention layers typically provide significant benefits, other layers can be included based on specific needs.
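As a starting point, these two recommendations can be expressed in a LoraConfig like the sketch below; the rank, alpha value, and target modules are illustrative and should be tuned for your model and task (module names also differ between architectures, e.g., q_proj and v_proj in Llama-style models).

```python
# Illustrative starting configurations; tune r, lora_alpha, and target_modules for your model.
from peft import LoraConfig

# alpha ≈ 2 × rank as a starting heuristic, applied to the attention query/value projections
attention_only = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

# Broader coverage: also adapt the key and output projections if the task benefits from it
wider_coverage = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "k", "v", "o"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
```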
Utilizing quantized configurations, such as loading model weights with 4-bit precision, substantially reduces memory usage. However, this approach increases training time due to the overhead of quantization and dequantization steps.
Hyperparameter tuning is critical for successful fine-tuning. Key hyperparameters include learning rate, batch size, and the number of epochs. Fine-tuning these settings through experimentation can help achieve better performance. Additionally, the choice of optimizer, though less critical, should still be considered. Options like AdamW and SGD with schedulers are commonly used.
Conclusion
LoRA offers a computationally efficient method for fine-tuning LLMs. You can maintain performance while significantly reducing trainable parameters and, hence, memory requirements. Its effectiveness, combined with techniques like QLoRA, makes it a good option for adapting large models to specific tasks with limited computational resources.