You're Invited!

Please join us 4/16 for a virtual event to hear speakers experiences at Doordash, LiveRamp and Clearwater Analytics.

Register now

Model Tuning—Key Techniques and Alternatives

Your Guide to Generative AI Infrastructure

Model tuning, also known as hyperparameter tuning, refers to configuring settings of a machine learning or large language model to improve the model’s performance during training. It aims to find the optimal combination of hyperparameters that maximizes the model’s accuracy, generation quality, and any other performance metric relevant to the specific task. 

Most AI projects require engineers to develop an accurate and effective ML model to solve specific problems. Model tuning is an iterative process that requires both experimentation and fine-tuning to obtain the best results.

This article discusses the different techniques, alternatives, and best practices to improve model performance with model tuning

Summary of key model tuning concepts

Concept Description 
Model tuning Configuring settings of a machine learning or large language model to improve the model’s performance during training
Epoch One complete pass through the entire training data
Learning rate The speed at which the model updates its parameters during training.
Batch size The number of training examples utilized in one iteration of model training
Other hyperparameters
  • Optimizer
  • Gradient checkpointing
  • Gradient accumulation
  • Warm-up steps
Prompt engineering Model tuning alternative that involves changing the input to help the model perform better on the overall task.
Architectural optimization Model tuning alternative involves changing the model’s structure to be better optimized for the task you are interested in training your model for.

 

Hyperparameters in model tuning

Some important hyperparameters that require tuning are given below.

Epoch

The most crucial hyperparameter is the number of epochs. An epoch is defined as one complete pass through the entire training data. In general, the longer you train, the better the model’s performance, but eventually, this leads to diminishing returns due to overfitting.

But what is overfitting? Overfitting is the phenomenon in which a model has learned the training data so well it increasingly starts to find random cues that have nothing to do with actually understanding the task. The result is that the model performs well on the training data but poorly on any unseen data. For example, if you are doing a sentiment analysis task and it just so happens that in the training data, every time the word “dog” shows up, the sentiment is always positive, that would result in overfitting since the trained model would always end up predicting “positive” for “dog,” even for a sentence like “The dog was sad.”

Learning rate

After that, the next most important hyperparameter is the learning rate. A training step simply refers to a single update of the model. A “loss” is calculated between the model’s prediction and the correct (target) answer every time a training step is completed. This is just a programmatic way to automatically grade the model’s performance. From there, the model’s parameters are updated depending on how significant the loss is – the larger the loss (or more technically, its gradient), the more the model’s parameters change. The goal of the training process is to minimize the loss.

However, you also want to control how quickly the model changes its parameters. Training will take forever if it changes by too small of an amount, but if it changes by too large of an amount, it will quickly start confusing itself and learn nothing. You can use the learning rate parameter to control the process. The learning rate is multiplied by the loss (or more precisely, its gradient) at each update to help the model learn the data faster.

What is the impact of GenAI on Data
Engineering?

WATCH EXPERT PANEL

Batch size

Next, you can move to using the batch size. The beauty of models is that they can handle multiple inputs simultaneously due to their batched matrix multiplication ability. For example, instead of multiplying two matrices 16 x 512 and 512 x 128, you can multiply two matrices 4 x 16 x 512 and 512 x 128, where 4 denotes four inputs you would like to pass through the model. 

This is helpful because NVIDIA GPUs found in ML processors are specifically optimized to multiply many different inputs together without significantly increasing the time it would take relative to run one input through the model. In other words, it is much faster to get model outputs for 256 inputs if you pass the 256 inputs into the model all at once than if you gave the model each input separately. You can leverage this fact to speed up training.

 

Finally, you also have the opportunity to parallelize across multiple GPUs at once—for example, if you have a cluster of eight GPUs, you can train your model on all eight at once to speed up training. This also has the added advantage that if you don’t have a lot of memory per GPU (and thus can’t fit a large batch size in each GPU), you can scale across multiple GPUs and make more stable updates to the model.

You should generally use the biggest batch size to fit within your GPU memory*. Nevertheless, it is always valuable to experiment to see what works best for your use case.

*Note that there are some theoretical caveats to this rule. For instance, if you split apart batch sizes such that you are training a single model on multiple GPUs, you can run into memory bottlenecks. In addition, very large batch sizes do have a risk of falling into something called the “over-smoothing phenomenon,” in which the model ends up using its capacity less efficiently – though this is still an area of active research (and in any case, given that pre-trained models already use very large batch sizes, there isn’t much you can do about that anyway). In any case, these situations will not matter in most model tuning use cases.

Other hyperparameters

Apart from the above three, there are dozens of additional hyperparameters like regularization, dropout, and learning rate schedules that you can tune. However, these hyperparameter changes are usually only applicable in niche cases. While the scope of this discussion is beyond this article, you can read the paper on the impact of dropout in model tuning to learn more. 

For example, you can change:

  • The optimizer determines how much the learning rate changes over time; generally default to the AdamW optimizer. 
  • Whether you are using gradient checkpointing and gradient accumulation, which are settings used to handle limited memory resources. 
  • The number of warm-up steps. 

To improve model training stability, you can choose not to start with the initial learning rate all at once but rather slowly increase the learning rate from 0 to the default value across the warm-up step period. For example, if you have a learning rate of 0.0001 and 100 warm-up steps, the first training iteration may start at a learning rate of 0.000001 (0.0001 divided by 100) and gradually increase to 0.0001 over the first 100 training steps. 

Typically, the increase is done linearly, but you can choose several other options, including using the cosine function. 

Model tuning platforms

Several platforms are available that enable engineers and data scientists to conduct model tuning cheaply and effectively. OpenAI offers model tuning capabilities for three models: Babbage-002, DaVinci-002, and GPT-3.5-Turbo, though the former two are largely obsolete as of this writing.

OpenAI model tuning example 

With OpenAI, model tuning can be done in two ways: write code to do so or simply use OpenAI’s fine-tuning web interface. OpenAI expects training data in this format (if your data doesn’t already look like this, then using Nexla’s data pipelines should be very helpful in changing it):

{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}

Then, to fine-tune while having control of hyperparameters, you can use the following code:

from openai import OpenAI
client = OpenAI()

client.fine_tuning.jobs.create(
  training_file="file-abc123", 
  model="gpt-3.5-turbo", 
  hyperparameters={
    "n_epochs": 2,
   "batch_size": 4,
   "learning_rate_multiplier": 0.001,
  }
)

Note that the choices for the number of epochs, batch size, and learning rate multiplier were randomly selected for demonstration purposes. For more examples, feel free to take a closer look at some examples from the OpenAI Cookbook, including for classification, function calling, and chat models

Alternatively, you can use Hugging Face to conduct model tuning with open-source models. Hugging Face has tutorials to show how model tuning works for various use cases, including token classification, sequence classification, translation, language modeling, summarization, and more. Hugging Face also has example notebooks geared explicitly toward training large language models.

Model tuning alternatives

Some model tuning alternatives are prompt engineering and architectural optimization. 

Prompt engineering

Prompt engineering does not involve changing the underlying model – it just involves changing the input to help the model perform better on the overall task. This is advantageous when you have limited computational resources – training a model uses up far more GPU memory than simply testing a model out with a new prompt. However, the lack of ability to change the model’s parameters also means that prompt engineering has far more limited ability to improve the model’s performance.

However, it is worth noting that you can use a variant of prompt engineering alongside model tuning. Specifically, you can train a model with different hyperparameter settings while changing the prompt. This can help to reduce the overall prompt length.

For example, if your original prompt from prompt engineering was well over 3,000 words, you can use model tuning to recover the model’s performance with a far shorter prompt. This video goes into further depth on how you can fine-tune Mistral-7B (an open-source language model) using together.ai to emulate the behavior of a model that has been prompted to act like an unhelpful assistant.

The cutting-edge data engineering platform Nexla is a very helpful tool for prompt engineering. Typically, ML models are compatible with inputs in JSON format that give pairs of questions and answers or prompts and completions, which are then used during model training to teach the model additional information. Generating these files requires ingesting information from many places and converting it into the question-answer / prompt-completion format. Nexla, which has had years of experience supporting file-generation pipelines through various eras of machine learning, can do this very effectively.

Architectural optimization

Architectural optimization involves changing parts of the model to be better optimized for the task for which you want to train your model. It allows you even more flexibility to improve performance. 

For example, most modern language models struggle with large inputs (for example, prompts and questions in the thousands of words). This is because they use a mechanism called attention, which involves seeing how every word in the input relates to each other. This process inherently scales quadratically for the input length – if you have n words, you must compare each word to the other n-1, which leads to a total of (n)(n-1) or n2 – n comparisons. Consequently, if you want your model to now cover 2,000 words instead of 1,000, you will need four times the amount of memory instead of 2 (and the model is expected to be four times slower), meaning it quickly becomes computationally expensive to cover large sequences.

As a result, you might elect to change up a language model’s attention mechanism to be more efficient. For example, you can choose from linear attention, flash attention, BigBird attention, grouped query attention (GQA), and many others. This would be an example of an architectural optimization — where you change a component of the model to be better suited for a task you are interested in. Other examples of where model architectures can be optimized include changing the position embedding layer and using mixture-of-experts.

The advantage of architectural optimization relative to model tuning is that you have more control over the model’s performance and can likely improve it. However, architectural optimization is an order of magnitude more complex than model tuning. While model tuning requires very little code (mainly, it just involves changing a few numbers/variables), architectural optimization may require changing tens, hundreds, or even thousands of lines of code. Despite the time spent making changes, it could fail to work correctly due to hardware limitations, bugs, and poor compatibility with other parts of the code. Due to the large amount of effort needed for architectural optimization, it is generally best to stick to model tuning first and see if that can get the requisite model performance you are looking for.

Best practices

Here are a few best practices to help improve your success with model tuning.

Batch size optimization

Use the biggest batch size that fits on your processor. In general, this allows the model to see more data in less time, which should improve performance. When using specialized hardware like GPUs and TPUs, increasing the number of examples processed per step correlates to a much smaller increase in overall time processing those examples (particularly when using larger batch sizes), which should also speed up training overall. In addition, it is also possible to use an adaptive batch size (in which the amount of training examples the model processes per step varies during training). This is especially valuable with Vision Transformers.

Early stopping

When setting the number of epochs, either use early stopping or monitor performance by epoch (or number of steps) to see whether it makes sense to keep training. Predicting how much time a model will need to train for the best performance is difficult. Early stopping allows you to programmatically set criteria for when to stop a model. 

For example, if the model cannot beat a checkpoint’s accuracy for two epochs, you can set the training algorithm to stop training at that point automatically. If you believe the picture is more nuanced than that (as it frequently ends up being), you can manually review your results until you feel it no longer makes sense to train the model.

In addition, you can combine early stopping with learning rate schedulers so that if you see no improvement in model performance after a given time period of training, you can try adjusting the learning rate instead before stopping training.

Random data sampling

Make sure to sample your training data randomly. This generally improves performance by a reasonable amount. Your options include simple sampling (where every example in your data has an equal probability of being selected), stratified sampling (where you sample from each possible label in your data – though this only works for classification tasks, not generative), bootstrap sampling (where you can sample data points more than once – especially ones that you want to the model to focus on), and weighted sampling (where you assign a weight to each example or group of examples and sample based on those weights).

Powering data engineering automation for AI and ML applications

Learn how Nexla helps enhance LLM models

Enhance LLM models like GPT and LaMDA with your own data

Connect to any vector database like Pinecone

Build retrieval-augmented generation (RAG) with no code

Conclusion

Overall, model tuning is a very effective tool for improving the performance of large language models. The number of epochs, the learning rate, and the batch size are the three hyperparameters most correlated with enhancing performance. However, dozens of hyperparameters are available for more specialized use cases.

Like this article?

Subscribe to our LinkedIn Newsletter to receive more educational content

Subscribe now