Following my previous GPT-2 pretraining project, this is a clean pipeline for fine-tuning with a “from scratch” implementation of LoRA that can be reused for other projects. I’m fine-tuning Qwen2 because I’m renting a small GPU with limited memory for this experiment, but the pipeline can be reproduced with any other transformer model.

Code Repository

Qwen2 benchmark comparison

Dataset

I used the Alpaca-GPT4 dataset, which contains 52K instruction-following examples whose outputs were generated by GPT-4 from the Stanford Alpaca prompts. Each row of the dataset contains an instruction, an optional input that provides context, and an output:

| # | Instruction | Input | Output |
|---|-------------|-------|--------|
| 1 | Classify the following fruit according to its color. | Orange | Orange is a yellow-orange color. |
| 2 | How many bytes are there in a kilobyte? | | There are 1,024 bytes in a kilobyte. |
| 3 | Translate the following sentence from English to French. | I am happy to meet you. | Je suis heureux de te rencontrer. |

The datapoints are preprocessed before being fed to the LLM (a sketch of these steps follows the list):

  • Prepend a different prompt template for datapoints with and without an input:
    • “Below is an instruction that describes a task …”
    • “Below is an instruction that describes a task, paired with an input that provides further context …”
  • Append the end-of-string token ("</s>" for Qwen)
  • Tokenize the text (vocab size 151936)
  • Pack the examples by concatenating them to fill the model’s context window during training (max sequence length 1024 for Qwen2)
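A minimal sketch of these steps, assuming the Hugging Face transformers API, the standard Alpaca prompt templates, and the usual Alpaca field names (instruction, input, output); the checkpoint name is only illustrative:

from transformers import AutoTokenizer

# Illustrative checkpoint; use the tokenizer of the model you fine-tune.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def to_token_ids(example):
    # Pick the template depending on whether the datapoint has an input,
    # append the answer and the end-of-string token, then tokenize.
    template = PROMPT_WITH_INPUT if example["input"] else PROMPT_NO_INPUT
    text = template.format(**example) + example["output"] + tokenizer.eos_token
    return tokenizer(text, add_special_tokens=False)["input_ids"]

def pack(tokenized_examples, max_len=1024):
    # Packing: concatenate the tokenized examples into one stream and cut it
    # into fixed-length blocks that fill the model's context window.
    stream = [tok for ids in tokenized_examples for tok in ids]
    return [stream[i:i + max_len] for i in range(0, len(stream) - max_len + 1, max_len)]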

Training

The change in model weights $\Delta W$ has a low intrinsic dimension (related to the rank of the matrix), so the LoRA paper suggests fine-tuning through lower-rank matrices:

\[ W_{0} + \Delta{W} = W_{0} + BA\]

with $W_{0} \in \mathbb{R}^{d\times k}$ the original model weights, $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, and $r \ll \min\left( d, k \right)$. We then have to fine-tune only $r(d+k)$ parameters instead of $d\times k$.
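As a quick illustration of the savings (the dimensions below are arbitrary, not those of any particular model), take $d = k = 1024$ and $r = 2$:

\[ d \times k = 1024 \times 1024 \approx 1.05\text{M} \qquad \text{vs.} \qquad r(d + k) = 2 \times 2048 = 4096 \]

i.e. about 256 times fewer trainable parameters for that single weight matrix.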

We then define the LoRA layer as follows:

import torch

class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A starts with small random values and B starts at zero, so the
        # low-rank update is zero before fine-tuning begins.
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # Low-rank update x A B, scaled by alpha.
        x = self.alpha * (x @ self.A @ self.B)
        return x

To apply LoRA, we replace the existing Linear layers in a neural network with the LinearWithLoRA layers that combine both the original Linear layer and the LoRALayer:

class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        # Keep the original (frozen) Linear layer and attach a LoRA layer
        # with matching input/output dimensions.
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Original output plus the low-rank correction.
        return self.linear(x) + self.lora(x)
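The replacement step itself could look like the sketch below: freeze every pretrained parameter, then recursively swap each torch.nn.Linear for a LinearWithLoRA wrapper. The helper names and the choice to wrap every Linear layer are assumptions for illustration, not necessarily what the repository does.

def replace_linear_with_lora(module, rank, alpha):
    # Recursively swap each nn.Linear child for a LinearWithLoRA wrapper.
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            replace_linear_with_lora(child, rank, alpha)

def apply_lora(model, rank, alpha):
    # Freeze the pretrained weights; only the LoRA parameters (created with
    # requires_grad=True by default) will be updated during fine-tuning.
    for param in model.parameters():
        param.requires_grad = False
    replace_linear_with_lora(model, rank, alpha)
    return model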

According to the paper, a rank as low as 1 already gives good performance, and the more the fine-tuning dataset differs from the data the model was originally trained on, the higher the rank should be. In this experiment, $r = 2$.
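Putting the pieces together, a usage sketch (loading the model through Hugging Face transformers is an assumption; the checkpoint name and the alpha value are illustrative, only $r = 2$ comes from the text):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")  # illustrative checkpoint
model = apply_lora(model, rank=2, alpha=8)

# Only the LoRA matrices should remain trainable after the swap.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")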

Results

TBD.