2 easy ways to fine-tune LLaMA-2 and other open-source LLMs

Robert John
12 min read · Sep 4, 2023


Are you ready to embark on a journey to create your very own custom Large Language Model based on an open-source LLM? If you’ve ever wondered about the fascinating world of fine-tuning open-source LLMs, then this article is your gateway. Before we dive into the code and techniques for fine-tuning, let’s demystify key concepts such as prompt engineering, Retrieval Augmented Generation (RAG) and vector embedding. These can be confusing, even for experienced developers, but fear not, we’re here to unravel them. Get ready for an exciting adventure where we’ll make the complex world of fine-tuning LLMs accessible and captivating.

Fine-Tuning vs Prompt Engineering vs Retrieval Augmented Generation (RAG) and Embedding

Before we dive into writing code to fine-tune an LLM, we first need to understand the differences between some concepts. Let’s break down the differences between fine-tuning, prompt engineering, retrieval augmented generation (RAG) and embedding in simpler terms:

Prompt Engineering: This involves making the instructions (prompt) given to LLMs more detailed, clearer and more specific to improve their output. For instance, you can add examples and the format of the expected result to a prompt to make it more understandable. Prompt engineering doesn’t update the parameters of the model; it’s about refining how you talk to the model to get the best out of it. To learn more about prompt engineering techniques, check out the free course “ChatGPT Prompt Engineering for Developers” offered by DeepLearning.AI.

Retrieval Augmented Generation (RAG) and Embedding: RAG is a process that gives your LLM access to external knowledge or data sources. The external information is converted into vector embeddings, which are stored in a vector database. The LLM can then query this vector database using similarity search to retrieve relevant information. Giving the LLM access to your data doesn’t mean it trains on that data; it simply allows the model to fetch relevant information when needed. RAG combines information retrieval with text generation, and you’ve likely seen it in action when you encounter features like “chat with your data” or “chat with your document.”

Image source: DeepLearning.ai course on “Finetuning Large Language Models” — https://learn.deeplearning.ai/finetuning-large-language-models/lesson/2/why-finetune

Fine-Tuning: This process involves training an LLM on specific data for a particular task. Unlike prompt engineering, it does change the model itself: the model’s parameters (weights) are updated. Fine-tuning takes a general-purpose LLM and tailors it for a specific job. For instance, you can fine-tune a GPT-3 base model to excel at conversation tasks, creating a specialized version like ChatGPT. Think of fine-tuning as a way to teach the model further for a specific purpose.

Image source: DeepLearning.ai course on “Finetuning Large Language Models” — https://learn.deeplearning.ai/finetuning-large-language-models/lesson/2/why-finetune

I won’t delve into the decision-making process between using RAG, prompt engineering, or fine-tuning in this article, as our focus here is on methods for fine-tuning Large Language Models (LLMs).

Benefits of fine-tuning an LLM

  1. Increased Reliability: Fine-tuning makes your LLM more dependable for your specific task. It tailors the model’s capabilities to precisely match your requirements.
  2. Enhanced Privacy: By training the model yourself, you keep your data secure and private, without sharing it with any third parties.
  3. Improved Performance: Fine-tuned LLMs excel in the tasks they are trained for, delivering better performance compared to generic models.
  4. Consistency: Fine-tuned models offer more consistent results, ensuring reliable outcomes for your applications.

Demystifying Fine-Tuning Concepts

Before we get into the nitty-gritty of fine-tuning LLMs and understanding the complex codes and settings, let’s break down some essential terms that will help us grasp the process better.

Parameter-Efficient Fine-Tuning: Taming the LLM Giants

Large Language Models live up to their name; they are massive and continually growing in size. However, fine-tuning these behemoths can be a resource-intensive endeavor, demanding substantial GPU power and time. Enter “Parameter-Efficient Fine-Tuning” (PEFT), a clever solution. Instead of tinkering with all the model’s parameters, this approach focuses on updating only a select few while keeping the rest frozen. It’s like making precise adjustments to a handful of parameters without having to overhaul the entire model, making fine-tuning more manageable, efficient, and cost-effective. For more details, you can check out this paper and this blog from Hugging Face.

Reinforcement Learning with Human Feedback

Imagine taking a smart but not quite perfect language model and giving it a boost of human-powered wisdom. That’s exactly what Reinforcement Learning with Human Feedback (RLHF) does. This technique has gained fame thanks to ChatGPT and its predecessor, InstructGPT.

Here’s how it works in simpler terms: RLHF gets feedback from humans by asking them to rank or rate different responses from the model, a bit like giving grades. These grades become like little prizes, forming a reward system. We then train a special “reward model” to understand these prizes. This reward model guides the language model’s adaptation to better match human preferences. So, it’s like teaching the model to be more friendly and useful, just the way we like it.

Representing a 32-bit floating point number as an 8-bit integer reduces memory at the cost of precision [Image credit]

Quantization

Think of quantization as a smart way to make data leaner and faster without losing too much brainpower. When we say “precision,” we mean how detailed and exact numbers are. Quantization trims this precision, making data take up less space and move faster.

But, like a trade-off, there’s a catch. When we trim down precision, we also lose some accuracy. It’s a bit like using fewer words to describe something. For example, imagine going from using a 32-bit number with lots of detail to an 8-bit number with fewer details. You can still get the idea across, but you might miss some of the finer points. So, quantization is like finding the right balance between speed and smarts for your data.
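
As a rough illustration of the idea (this is a minimal sketch, not the exact scheme any particular library uses), here is what symmetric 8-bit quantization of a few float32 values looks like:

import numpy as np

weights = np.array([0.03, -1.27, 0.95, 2.48], dtype=np.float32)

# Map the float range onto the int8 range [-127, 127] with a single scale factor.
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per value instead of 4

# Dequantizing recovers approximate values; the small error is the precision we traded away.
dequantized = quantized.astype(np.float32) * scale
print(quantized, dequantized)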

Preparing Your Data for Fine-Tuning

Since fine-tuning involves updating the parameters of the model, the dataset structure decides which type of fine-tuning can be done. Remember, a model is a representation of its data.

Now, let’s explore different datasets used for fine-tuning Language Models (LLMs):

1. Token-Based Dataset

Think of this dataset as a massive, jumbled collection of words and sentences. When we use it for training, we’re teaching the model to speak in a way that’s similar to this data. Imagine if we fed it Shakespeare’s writings — the model would start talking like the Bard himself. It’s like teaching it to mimic a certain style or tone.

2. Human Feedback Dataset

This dataset is the most sophisticated. It involves people comparing two responses — one they like and one they don’t. It’s a bit like a popularity contest. Using this data, a special framework called RLHF (Reinforcement Learning with Human Feedback) trains a reward model. This reward model then coaches the base language model through reinforcement learning. It’s like the model is learning to be more likable and useful based on what people prefer.

3. Instruction Dataset

In this dataset, we provide clear instructions, an input, and the expected output. It’s a bit like a recipe — we tell the model what to do and what the result should be. This is similar to how ChatGPT learns to respond to your questions or prompts.

So, when fine-tuning our language models, we pick the dataset that suits the transformation we want — whether it’s sounding like Shakespeare, following a recipe, or winning in a popularity contest.

In this article, we fine-tune our model using an instruction dataset; this is called instruction tuning.

When it comes to building your own instruction dataset, you certainly can roll up your sleeves and craft one from scratch. But to save some precious time, you can also tap into open-source data repositories like Hugging Face.

Now, let’s talk about the secret language of instruction datasets. They use special tokens, kind of like secret code words, to communicate with the model:

  • ### Instruction {instruction}: This token is like the title of the game. It tells the model what it needs to do or learn.
  • ### Input {Input}: Think of this as the setup for the challenge. It’s like providing all the ingredients for a recipe.
  • ### Output {output}: This token is the answer key. It’s the expected result, like the delicious dish you’re trying to cook.

These special tokens act like variable names in a computer program. You can even give them different names if you want to keep things interesting. Sometimes, you might skip the ### Input token, and that’s perfectly fine.

Now, let’s say you’re interested in the Alpaca-gpt4 dataset from Hugging Face. It uses slightly different tokens:

  • ### Instruction: Telling the model what to do.
  • ### Input: This token is like the mission briefing, setting the stage for the task.
  • ### Response: Here’s the model’s answer. It’s called “Response” in this dataset, which is a bit like having a chat with your model.

So, whether you’re creating your own dataset or exploring open-source treasures like Hugging Face, these special tokens are your trusty guides, helping the model understand what’s expected. It’s like speaking the secret language of AI instruction.

Another dataset is the Openassistant-guanaco dataset. Its special tokens are ### Human and ### Assistant: ### Human marks the instruction, while ### Assistant marks the output.
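
To make this concrete, here is a minimal sketch of how a record from an instruction dataset can be flattened into a single training prompt. The field names ("instruction", "input", "output") are illustrative assumptions; adjust them to your dataset’s schema:

def format_instruction(sample):
    # Build an Alpaca-style prompt from one dataset record.
    prompt = f"### Instruction:\n{sample['instruction']}\n\n"
    if sample.get("input"):
        prompt += f"### Input:\n{sample['input']}\n\n"
    prompt += f"### Response:\n{sample['output']}"
    return prompt

example = {
    "instruction": "Summarize the text below in one sentence.",
    "input": "Large language models are trained on massive text corpora...",
    "output": "LLMs are models trained on large amounts of text.",
}
print(format_instruction(example))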

Different ways to fine-tune LLAMA-2 and other open-source models

Prerequisites

  • Python 3.8: Think of it as the engine that drives the whole process. Make sure you have Python version 3.8 or newer.
  • GPU: To speed things up, you can harness the power of a Graphics Processing Unit (GPU). Don’t worry; you can even use a free GPU on platforms like Colab.
  • Huggingface Account with Token: You’ll need this to access Hugging Face’s fantastic resources.

Fine-tuning using autotrain-advanced

AutoTrain is a no-code tool for training state-of-the-art models for Natural Language Processing (NLP), Computer Vision (CV), Speech, and even Tabular tasks. We only need a single command to fine-tune our LLM. Link to the full code on fine-tuning LLMs using autotrain.

Step-by-step process to fine-tune with autotrain-advanced

  1. Install the autotrain-advanced and huggingface_hub packages. We need huggingface_hub to download our dataset.
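
A typical install command for this step (exact package versions may change over time) is:

!pip install -q autotrain-advanced huggingface_hub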

2. Update the packages that will be used by autotrain-advanced
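
autotrain-advanced ships a setup command that refreshes its own dependencies (most importantly PyTorch) inside the Colab runtime; the exact flags may vary between versions, so treat this as a sketch:

!autotrain setup --update-torch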

3. Login to Huggingface
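
In a notebook you can log in with your Hugging Face token using notebook_login from huggingface_hub:

from huggingface_hub import notebook_login

notebook_login()  # paste your Hugging Face access token when prompted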

4. Command to fine-tune the model

! autotrain llm \
--train \
--model {MODEL_NAME} \
--project-name {PROJECT_NAME} \
--data-path data/ \
--text-column text \
--lr {LEARNING_RATE} \
--batch-size {BATCH_SIZE} \
--epochs {NUM_EPOCHS} \
--block-size {BLOCK_SIZE} \
--warmup-ratio {WARMUP_RATIO} \
--lora-r {LORA_R} \
--lora-alpha {LORA_ALPHA} \
--lora-dropout {LORA_DROPOUT} \
--weight-decay {WEIGHT_DECAY} \
--gradient-accumulation {GRADIENT_ACCUMULATION}

Here are the key settings and values you’ll need:

llm: This is like choosing the type of model you want to fine-tune. Think of it as selecting your canvas.

--project-name: Give your project a name, something memorable. It’s like labeling a masterpiece.

--model: Specify the model you want to fine-tune. It’s like picking the right tool for the job.

--data-path: This is the path to your data. The path can also point to a Hugging Face dataset. It’s like telling your model where to find the ingredients.

--text-column: If you’re working with a table of data, tell the model which column contains the instructions and responses. It’s like pointing out the recipe in a cookbook.

--use-peft: This option lets you use an efficient adaptation method called PEFT (Parameter-Efficient Fine-Tuning). It’s like choosing an energy-saving mode.

--use-int4: Adjust the precision of the model (quantization). It’s like fine-tuning the focus on your camera.

--lr: Control how fast your model learns (the learning rate). Smaller numbers mean slower learning.

--batch-size: Decide how many data chunks your model studies at once. It’s like breaking a big task into smaller steps.

--epochs: Specify how many training cycles your model should go through. It’s like setting the duration of your workout.

--trainer: Choose the type of trainer to use, like picking a coaching style.

--model-max-length: Set the context window for your model. It’s like choosing the size of the canvas for your painting.

--push-to-hub: Want to store your fine-tuned model on Hugging Face? Use this option.

--repo-id: If you’re pushing your model to Hugging Face, provide a unique repository ID. It’s like naming your art gallery.

--block-size: This setting controls the size of text chunks your model processes. Think of it as slicing a cake into manageable pieces.

Fine-tuning using SFTTrainer from TRL and QLoRA

TRL is a full-stack library that provides a set of tools to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning (SFT) step and the Reward Modeling (RM) step to the Proximal Policy Optimization (PPO) step. Link to the full code on fine-tuning an LLM using SFTTrainer from TRL and QLoRA.

  1. Install the requirements
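
The exact requirements are in the linked notebook; a typical set of packages for this recipe (versions are an assumption) is:

!pip install -q trl transformers accelerate peft bitsandbytes datasets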

2. Install “wandb” to monitor your fine-tuning process. This step is optional.
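
If you do want experiment tracking, install and log in to Weights & Biases:

!pip install -q wandb

import wandb
wandb.login()  # paste your wandb API key to log training metrics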

3. Load the openassistant-guanaco dataset from Hugging Face
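
A minimal way to pull the dataset from the Hugging Face Hub (the dataset id below is the commonly used mirror and is an assumption about the exact id used in the notebook):

from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
print(dataset[0]["text"][:200])  # each record is a "### Human: ... ### Assistant: ..." conversation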

4. Create the quantization config and download the model from Hugging Face
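
Here is a sketch using bitsandbytes 4-bit (QLoRA-style) loading; the model id is an assumption and requires that your account has been granted access to Llama-2 on the Hub:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; gated, requires approved access

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load weights quantized to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the quantization type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed and stability
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)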

5. Download Tokenizer
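
The tokenizer comes from the same checkpoint; Llama-2 does not define a padding token, so a common workaround is to reuse the end-of-sequence token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token, so reuse EOS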

6. Create PEFT configuration
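
A typical LoRA configuration looks like the sketch below; the r, alpha and dropout values are illustrative defaults, not the notebook’s exact settings:

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                 # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)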

7. Create Fine-tuning and training configuration
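
Training hyperparameters go into a standard TrainingArguments object; the values here are reasonable starting points rather than the notebook’s exact configuration:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,                     # or bf16=True on Ampere and newer GPUs
    optim="paged_adamw_32bit",     # paged optimizer that plays well with 4-bit training
)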

8. Create SFTTrainer configuration
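
The SFTTrainer ties everything together. The arguments below match the TRL API roughly as it was at the time of writing; newer TRL releases have moved some of them (e.g. dataset_text_field, max_seq_length) into an SFTConfig, so check the version you have installed:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",   # column in the guanaco dataset that holds the conversation
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)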

9. Pre-process the model by upcasting the layer norms to float32 for more stable training
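
This mirrors the trick used in the reference QLoRA script: walk over the model’s modules and upcast anything that looks like a normalization layer:

import torch

# Keeping LayerNorm weights in float32 improves numerical stability for 4-bit training.
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)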

10. Fine-tune your model.
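
Training is then a single call:

trainer.train()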

11. Save fine-tuned model
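
Saving writes the (small) LoRA adapter weights rather than a full copy of the base model:

trainer.save_model("output")  # writes the adapter weights and config to ./output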

Your fine-tuned model will be saved in the folder “output”.

12. Load fine-tuned model
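
One way to reload it is to attach the saved adapter on top of the base model with PEFT; this sketch assumes the base model from step 4 is still in memory (otherwise reload it first):

from peft import PeftModel

model = PeftModel.from_pretrained(model, "output")  # base model + LoRA adapter from ./output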

13. Test the fine-tuned model
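
Finally, prompt the model in the same ### Human / ### Assistant format it was trained on; the question below is just an example:

prompt = "### Human: What is fine-tuning of a language model?### Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))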

Here is a Colab link to fine-tuning your model using TRL.

Conclusion

In our journey through the world of language model fine-tuning, we’ve explored the key differences between fine-tuning, prompt engineering, retrieval augmented generation, and embedding. We’ve delved into the essential concepts behind fine-tuning and learned how to create datasets for this purpose. Moreover, we’ve discovered two powerful methods for fine-tuning LLAMA-2 and other open-source LLMs.

I hope this article has shed light on this complex topic and provided you with valuable insights. Your feedback matters, and I encourage you to share your thoughts on how I can improve and suggest topics you’d like me to explore in the future.

The landscape of fine-tuning LLMs is ever-evolving, with new methods and approaches emerging regularly. For instance, you can watch a video explaining “Efficient Fine-Tuning for Llama-v2-7b on a Single GPU” with insights from Ludwig at Uber. If you’re hungry for more knowledge, DeepLearning.AI offers a dedicated course on fine-tuning LLMs using Lamini.

As we continue to unlock the potential of language models, there’s always more to explore, discover, and learn. Stay curious, and keep pushing the boundaries of what these remarkable models can achieve. Together, we’ll write the next chapter in the fascinating world of fine-tuning language models.

References

DeepLearning.AI course on “ChatGPT Prompt Engineering for Developers”

DeepLearning.AI course on “Finetuning Large Language Models”

Fine-tuning vs prompt engineering Large Language Models

Finetuning Large Language Models

Fine-tune Llama v2 model on Guanaco using QLoRA (younesbelkada/finetune_llama_v2.py)
