# Fine-tuning on custom data
## Setup environment

In [1]:
%%capture
!pip install -r requirements.txt

In [2]:
# Choose your base model to fine-tune
model_name = "NousResearch/Hermes-2-Pro-Mistral-7B"

# Output model name
finetuned_model_name = "kuhess/hermes-2-pro-mistral-7b-metropole"

# Set the Weights & Biases project name
wandb_project = "sft-hermes-2-pro-mistral-7b-metropole"

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096
dtype = None # None for auto detection
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth: Fast Mistral patching release 2024.3
   \\   /|    GPU: NVIDIA RTX A5000. Max memory: 23.679 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.24. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
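
As a sanity check on the adapter configuration, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes the layer shapes from the published Mistral-7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of size 128, 32 layers); these values are not stated in this notebook. Each adapted linear layer gains two small matrices, so it contributes `r * (in_features + out_features)` parameters.

```python
# Layer shapes assumed from the published Mistral-7B configuration
# (not stated in this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # grouped-query attention: k_proj and v_proj project to 1024
NUM_LAYERS = 32

# (in_features, out_features) of every module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted linear layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


print(f"{lora_param_count(64):,}")  # 167,772,160
```

With `r = 64` this gives the same figure that the training banner reports as the number of trainable parameters. Halving the rank halves the count.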


## Load and split custom dataset

The code below splits the data into `train`, `valid` and `test` sets (roughly 80% / 10% / 10%).

I built the dataset by having another LLM analyze my documents and generate specialized question/answer pairs.

In [5]:
# Dataset path
datapath = "data/metropole.jsonl"

In [6]:
from datasets import load_dataset, DatasetDict

def train_valid_test_split(dataset):
    dataset_train_validtest = dataset["train"].train_test_split(test_size=0.2, shuffle=True)
    dataset_valid_test = dataset_train_validtest["test"].train_test_split(test_size=0.5)
    
    return DatasetDict({
        "train": dataset_train_validtest["train"],
        "valid": dataset_valid_test["train"],
        "test": dataset_valid_test["test"],
    })

dataset = load_dataset("json", data_files=datapath)
dataset = train_valid_test_split(dataset)

dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'title'],
        num_rows: 4081
    })
    valid: Dataset({
        features: ['question', 'answer', 'title'],
        num_rows: 510
    })
    test: Dataset({
        features: ['question', 'answer', 'title'],
        num_rows: 511
    })
})
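
The row counts above follow from the two-stage split: 20% of the data is held out, then that holdout is cut in half. The sketch below reproduces the arithmetic, assuming a fractional `test_size` is rounded up and the remainder goes to the other side (the scikit-learn convention). The helper `split_sizes` is hypothetical and only illustrates the rounding.

```python
import math


def split_sizes(n_rows, holdout_fraction=0.2, test_share=0.5):
    """Return (train, valid, test) row counts for the two-stage split.

    Assumption: a float test_size is rounded up and the remainder
    is assigned to the other side of the split.
    """
    n_holdout = math.ceil(holdout_fraction * n_rows)
    n_train = n_rows - n_holdout
    n_test = math.ceil(test_share * n_holdout)
    n_valid = n_holdout - n_test
    return n_train, n_valid, n_test


print(split_sizes(4081 + 510 + 511))  # (4081, 510, 511)
```

Note that neither `train_test_split` call in the notebook passes a `seed`, so the rows that land in each split will probably differ between runs.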

## Chat formatting
Here, each entry of the raw dataset is rendered with the `chatml` chat template, and the result is stored in a new `text` column of the dataset.

In [7]:
from unsloth.chat_templates import get_chat_template
import json

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml",
)

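def formatting_prompts_func(examples):
    # Build one two-turn conversation per Q/A pair. The French prefixes are part
    # of the training data: "According to the Angers metropolitan magazine, ..."
    # and "According to an article in the Métropole magazine:".
    convos = [
        [
            {
                "role": "user",
                # Lower-case the first letter so the question follows the prefix
                # (slicing keeps this safe for an empty question).
                "content": f"D'après le magazine de la métropole d'Angers, {question[:1].lower() + question[1:]}",
            },
            {
                "role": "assistant",
                "content": f"Selon un article dans le magazine Métropole:\n\n{answer}",
            },
        ]
        for question, answer in zip(examples["question"], examples["answer"])
    ]
    # Render each conversation as a single ChatML string (no generation prompt).
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in convos
    ]
    return {"text": texts}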

dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Will map <|im_end|> to EOS = <|im_end|>.


In [8]:
print(dataset["train"]["text"][22])

<|im_start|>user
D'après le magazine de la métropole d'Angers, quels déchets déposer dans les bornes d'apport volontaire ?<|im_end|>
<|im_start|>assistant
Selon un article dans le magazine Métropole:

Les bornes d'apport volontaire acceptent l'ensemble des déchets alimentaires, y compris les restes de viande, arêtes de poisson ou encore coquilles de fruits de mer. En revanche, les végétaux (tontes de pelouse et tailles de haie) ne doivent pas être déposés dans les bornes de collecte, car ils sont admis dans les composteurs.<|im_end|>
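
To make the layout of the printed example explicit, here is a hand-written approximation of what the ChatML template produces for a conversation. It is only an illustration: `render_chatml` and `lowercase_first` are hypothetical helpers, not the tokenizer's implementation, and the message contents are placeholders.

```python
def lowercase_first(text):
    """Lower-case only the first character; safe on empty strings."""
    return text[:1].lower() + text[1:]


def render_chatml(messages):
    """Approximate the ChatML layout: one <|im_start|>...<|im_end|> block per turn."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Placeholder conversation, shaped like one dataset entry.
convo = [
    {"role": "user", "content": "Prefix, " + lowercase_first("Quels déchets ?")},
    {"role": "assistant", "content": "Réponse."},
]
print(render_chatml(convo))
```

Each turn ends with `<|im_end|>`, which is the token mapped to EOS, so the model learns to stop after the assistant's answer.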



## Fine-tune the model
We use the supervised fine-tuning (SFT) trainer from the Hugging Face TRL library ([TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer)).
We train for 3 epochs (typical values are 1 to 3).

In [9]:
import os
from trl import SFTTrainer
from transformers import TrainingArguments

os.environ["WANDB_PROJECT"] = wandb_project
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset["train"],
    eval_dataset = dataset["valid"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Setting this to True can make training up to 5x faster for short sequences.
    args = TrainingArguments(
        output_dir = "outputs",
        report_to="wandb",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        gradient_checkpointing=True,
        evaluation_strategy="epoch", # log validation loss at each epoch
        # evaluation_strategy="steps", # log validation loss at each step -> slower
        num_train_epochs=3,
        warmup_steps = 20,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        # logging strategies 
        logging_strategy="steps",
        logging_steps=1,
        save_strategy="epoch", # saving is done at the end of each epoch
    ),
)
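
The batch settings above determine how many optimizer steps the run takes. The sketch below estimates them, assuming that a trailing partial accumulation window is dropped at the end of each epoch; other trainer versions may round up instead. The helper `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Estimate (effective batch size, steps per epoch, total steps).

    Assumption: steps_per_epoch = floor(ceil(n / batch) / grad_accum),
    i.e. an incomplete accumulation window does not count as a step.
    """
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_gpus))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    effective_batch = per_device_batch * n_gpus * grad_accum
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


print(optimizer_steps(4081, per_device_batch=2, grad_accum=4, epochs=3))
# (8, 510, 1530)
```

These values agree with the total batch size of 8 and the 1,530 total steps reported in the training banner, and with one checkpoint every 510 steps.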

### Let's train the model
Here I use https://wandb.ai to track the training run live. The `wandb` library reads my API token from an environment variable that I expose in the notebook runtime (https://docs.wandb.ai/guides/track/environment-variables).

In [10]:
# Set the key inside the kernel process; `!export ...` runs in a subshell and would not persist.
# %env WANDB_API_KEY=your_api_key
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 4,081 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 1,530
 "-____-"     Number of trainable parameters = 167,772,160
[34m[1mwandb[0m: Currently logged in as: [33mqsuire[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0     | 0.7038        | 0.769915        |
| 1     | 0.7015        | 0.705171        |
| 2     | 0.3351        | 0.71538         |
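
In the loss table, the validation loss is lowest after the second epoch and rises slightly after the third, while the training loss keeps falling. The notebook saves the final adapter, but since one checkpoint is written per epoch (at steps 510, 1020 and 1530), a simple alternative is to pick the checkpoint with the lowest validation loss. A minimal sketch, with a hypothetical helper `best_checkpoint`:

```python
# Validation loss per epoch, copied from the loss table.
eval_loss_by_epoch = [0.769915, 0.705171, 0.71538]
STEPS_PER_EPOCH = 510  # checkpoints were saved at steps 510, 1020 and 1530


def best_checkpoint(eval_losses, steps_per_epoch, output_dir="outputs"):
    """Return the checkpoint path of the epoch with the lowest validation loss."""
    best_epoch = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return f"{output_dir}/checkpoint-{(best_epoch + 1) * steps_per_epoch}"


print(best_checkpoint(eval_loss_by_epoch, STEPS_PER_EPOCH))  # outputs/checkpoint-1020
```

The gap between the two candidates is small (about 0.01), so this is a judgment call and not a clear sign that the third epoch hurt.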


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-510)... Done. 1.8s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1020)... Done. 1.9s
[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1530)... Done. 2.3s


In [11]:
import wandb
wandb.finish()


Run history:
eval/loss,█▁▂
eval/runtime,█▃▁
eval/samples_per_second,▁▆█
eval/steps_per_second,▁▆█
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/grad_norm,█▂▁▁▁▁▁▁▁▂▁▂▁▁▂▂▂▂▂▂▂▂▃▂▂▃▄▃▂▃▃▃▃▃▃▃▄▄▄▄
train/learning_rate,▇███▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
train/loss,█▄▄▆▃▄▃▅▃▄▄▃▂▂▃▃▃▃▃▂▂▂▂▂▃▃▂▂▁▂▁▂▁▂▂▂▂▂▂▂

Run summary:
eval/loss,0.71538
eval/runtime,25.063
eval/samples_per_second,20.349
eval/steps_per_second,2.554
total_flos,6.885483059758694e+16
train/epoch,3.0
train/global_step,1530.0
train/grad_norm,4.15339
train/learning_rate,0.0
train/loss,0.3351
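
The `train/learning_rate` trace (a short ramp, then a steady decline to a final value of 0.0) is what `warmup_steps = 20` combined with `lr_scheduler_type = "linear"` should produce. The function below re-implements the usual linear-with-warmup rule for illustration only; it is not the trainer's own scheduler object, and `linear_lr` is a hypothetical name.

```python
def linear_lr(step, base_lr=2e-5, warmup_steps=20, total_steps=1530):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Start, mid-warmup, peak, halfway through the decay, and the final step.
for step in (0, 10, 20, 775, 1530):
    print(step, linear_lr(step))
```

The peak of `2e-5` is reached at step 20, and the rate is back to half of that by step 775.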


In [12]:
model.save_pretrained(finetuned_model_name + "_lora")

## Export to Hugging Face Model Hub

Next, we export the model to the Hugging Face Hub in two ways: as merged 16-bit weights and as a 4-bit GGUF file (more info on `quantization_method` at https://github.com/unslothai/unsloth/wiki#gguf-quantization-options).
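
Both export calls read the access token with `os.getenv("HF_TOKEN")`, which silently returns `None` when the variable is missing. A small guard, sketched below, surfaces that early. `require_env` is a hypothetical helper, and `EXAMPLE_TOKEN` with its value is a placeholder so that the snippet runs anywhere.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder variable for the demonstration; in the notebook this would be HF_TOKEN.
os.environ.setdefault("EXAMPLE_TOKEN", "hf_xxx")
token = require_env("EXAMPLE_TOKEN")
```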

### Basic f16 model

In [13]:
import os
model.push_to_hub_merged(finetuned_model_name, tokenizer, token=os.getenv("HF_TOKEN"))

Unsloth: You are pushing to hub, but you passed your HF username = kuhess.
We shall truncate kuhess/hermes-2-pro-mistral-7b-metropole to hermes-2-pro-mistral-7b-metropole


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 382.53 out of 503.52 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 56.04it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...


model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

Done.
Saved merged model to https://huggingface.co/kuhess/hermes-2-pro-mistral-7b-metropole


### Quantized 4-bit model in GGUF format (llama.cpp)

In [14]:
model.push_to_hub_gguf(finetuned_model_name + "-4bit-gguf", tokenizer, quantization_method = "q4_k_m", token = os.getenv("HF_TOKEN"))

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 384.67 out of 503.52 RAM for saving.


100%|██████████| 32/32 [00:00<00:00, 61.65it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting mistral model. Can use fast conversion = True.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GUUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to q4_k_m will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at kuhess/hermes-2-pro-mistral-7b-metropole-4bit-gguf into f16 GGUF format.
The output location will be ./kuhess/hermes-2-pro-mistral-7b-metropole-4bit-gguf-unsloth.F16.gguf
This will take 3 minutes...
Loading model file kuhess/hermes-2-pro-mistral-7b-metropole-4bit-gguf/model-00001-of-00003.safetensors
Loading model file kuhess/hermes-2-pro-mistral-7b-metropole-4bit-gguf/model-00001-of-00003.safetensors
Loading model file kuhess/hermes-2-pro-mistral-7b-metropole-4bit-gguf/model-00002-of-00003.safetensors
Loading model file kuhess/hermes-2-pro-mistral-7b-metropole-4bit

hermes-2-pro-mistral-7b-metropole-4bit-gguf-unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/kuhess/hermes-2-pro-mistral-7b-metropole-4bit-gguf
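
The file sizes shown by the upload progress bars give a rough idea of what `q4_k_m` buys. The sketch below assumes the sizes are decimal gigabytes and that the merged shards hold 2 bytes per weight; both are assumptions, and the results are estimates only.

```python
GB = 1e9  # assumption: the progress bars report decimal gigabytes

f16_shards_gb = [4.94, 5.00, 4.54]  # merged 16-bit safetensors shards
q4_k_m_gb = 4.37                    # quantized GGUF file

# 16-bit weights take 2 bytes each, so the shard sizes give a parameter estimate.
n_params = sum(f16_shards_gb) * GB / 2
bits_per_weight = q4_k_m_gb * GB * 8 / n_params
compression = sum(f16_shards_gb) / q4_k_m_gb

print(f"~{n_params / 1e9:.2f}B parameters")
print(f"~{bits_per_weight:.2f} bits per weight, {compression:.1f}x smaller than f16")
```

The estimate lands a little under 5 bits per weight, not exactly 4, which is consistent with `q4_k_m` being a mixed-precision scheme that also stores scaling metadata.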
