Following my previous GPT-2 pretraining project, this is a clean pipeline for fine-tuning with a “from scratch” implementation of LoRA that can be reused for other projects. I’m fine-tuning Qwen2 because I’m renting a small GPU with limited memory for this experiment, but the pipeline can be reproduced with any other transformer model.

Code Repository

Qwen2 benchmark comparison

Dataset

I used the Alpaca-GPT4 dataset, which contains 52K instruction-following examples whose outputs were generated by GPT-4 from the Stanford Alpaca prompts. Each row of the dataset contains an instruction, an optional input that provides context, and an output:

| # | Instruction | Input | Output |
|---|-------------|-------|--------|
| 1 | Classify the following fruit according to its color. | Orange | Orange is a yellow-orange color. |
| 2 | How many bytes are there in a kilobyte? | | There are 1,024 bytes in a kilobyte. |
| 3 | Translate the following sentence from English to French. | I am happy to meet you. | Je suis heureux de te rencontrer. |

The datapoints are preprocessed before being fed to the LLM (a sketch of these steps follows the list):

  • Prepend a different prompt template for datapoints with and without an input:
    • “Below is an instruction that describes a task …”
    • “Below is an instruction that describes a task, paired with an input that provides further context …”
  • Append the end-of-string token ("</s>" for Qwen)
  • Tokenize the text (vocab size 151936)
  • Pack the examples by concatenating them to fill the model’s context window during training (max sequence length 1024 for Qwen2)
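A minimal sketch of these steps, assuming the Hugging Face transformers API, the standard Alpaca prompt templates, and the usual Alpaca field names (instruction, input, output); the checkpoint name is only illustrative:

from transformers import AutoTokenizer

# Illustrative checkpoint; use the tokenizer of the model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def to_token_ids(example):
    # Pick the template depending on whether the datapoint has an input,
    # append the answer and the end-of-string token, then tokenize.
    template = PROMPT_WITH_INPUT if example["input"] else PROMPT_NO_INPUT
    text = template.format(**example) + example["output"] + tokenizer.eos_token
    return tokenizer(text, add_special_tokens=False)["input_ids"]

def pack(tokenized_examples, max_len=1024):
    # Packing: concatenate the tokenized examples into one stream and cut it
    # into fixed-length blocks that fill the model's context window.
    stream = [tok for ids in tokenized_examples for tok in ids]
    return [stream[i:i + max_len] for i in range(0, len(stream) - max_len + 1, max_len)]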

Training

The change in model weights $\Delta W$ has a low intrinsic dimension (related to the rank of the matrix), so the LoRA paper suggests fine-tuning through lower-rank matrices:

\[ W_{0} + \Delta{W} = W_{0} + BA\]

with $W_{0} \in \mathbb{R}^{d\times k}$ the original model weights, $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, and $r \ll \min\left( d, k \right)$. We then have to fine-tune only $r(d+k)$ parameters instead of $d\times k$.
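As a quick illustration of the savings (the dimensions below are arbitrary, not those of any particular model), take $d = k = 1024$ and $r = 2$:

\[ d \times k = 1024 \times 1024 \approx 1.05\text{M} \qquad \text{vs.} \qquad r(d + k) = 2 \times 2048 = 4096 \]

i.e. about 256 times fewer trainable parameters for that single weight matrix.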

We then define the LoRA layer as follows:

import torch

class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A starts with small random values and B starts at zero, so the
        # low-rank update is zero before fine-tuning begins.
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # Low-rank update x A B, scaled by alpha.
        x = self.alpha * (x @ self.A @ self.B)
        return x

To apply LoRA, we replace the existing Linear layers in a neural network with the LinearWithLoRA layers that combine both the original Linear layer and the LoRALayer:

class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        # Keep the original (frozen) Linear layer and attach a LoRA layer
        # with matching input/output dimensions.
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Original output plus the low-rank correction.
        return self.linear(x) + self.lora(x)
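The replacement step itself could look like the sketch below: freeze every pretrained parameter, then recursively swap each torch.nn.Linear for a LinearWithLoRA wrapper. The helper names and the choice to wrap every Linear layer are assumptions for illustration, not necessarily what the repository does.

def replace_linear_with_lora(module, rank, alpha):
    # Recursively swap each nn.Linear child for a LinearWithLoRA wrapper.
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            replace_linear_with_lora(child, rank, alpha)

def apply_lora(model, rank, alpha):
    # Freeze the pretrained weights; only the LoRA parameters (created with
    # requires_grad=True by default) will be updated during fine-tuning.
    for param in model.parameters():
        param.requires_grad = False
    replace_linear_with_lora(model, rank, alpha)
    return model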

According to the paper, a rank as low as 1 already gives good performance, and the more the fine-tuning dataset differs from the data the model was originally trained on, the higher the rank should be. In this experiment, $r = 2$.
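Putting the pieces together, a usage sketch (loading the model through Hugging Face transformers is an assumption; the checkpoint name and the alpha value are illustrative, only $r = 2$ comes from the text):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")  # illustrative checkpoint
model = apply_lora(model, rank=2, alpha=8)

# Only the LoRA matrices should remain trainable after the swap.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")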

Results

TBD.