Following my previous GPT-2 pretraining project, this is a clean fine-tuning pipeline with a “from scratch” implementation of LoRA that can be reused for other projects. I’m choosing GPT-2 because I’m renting a small GPU with limited memory for this experiment, but this can be reproduced with any other transformer model.
Dataset
I used the Alpaca-GPT4 dataset, which contains 52K instruction-following examples generated by GPT-4 from the Stanford Alpaca prompts. Each row of the dataset contains an instruction, an optional input that provides context, and an output:
| # | Instruction | Input | Output |
|---|---|---|---|
| 1 | Classify the following fruit according to its color. | Orange | Orange is a yellow-orange color. |
| 2 | How many bytes are there in a kilobyte? | | There are 1,024 bytes in a kilobyte. |
| 3 | Translate the following sentence from English to French. | I am happy to meet you. | Je suis heureux de te rencontrer. |
The datapoints are preprocessed before being fed to the LLM (a sketch of these steps follows the list):
- Add different instructions for datapoints with and without input:
- “Below is an instruction that describes a task …”
- “Below is an instruction that describes a task, paired with an input that provides further context …”
- Add an End-of-String token ("</s>" for Qwen)
- Tokenize the text (vocab size 151936)
- Packing: combine examples to fill the model’s context window during training (max sequence length 1024 for Qwen2)
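A minimal sketch of these steps, assuming the tokenizer is loaded elsewhere; the function names `format_example` and `pack` and the exact Alpaca-style templates are illustrative, not the project’s actual code:

```python
def format_example(example, eos_token):
    # Two Alpaca-style templates: one for datapoints with an input, one without.
    if example["input"]:
        prompt = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            "### Response:\n"
        )
    # Append the End-of-String token so the model learns where answers stop.
    return prompt + example["output"] + eos_token

def pack(tokenized_examples, max_len=1024):
    # Packing: concatenate all tokenized examples and slice the stream
    # into fixed-length blocks that fill the context window.
    flat = [tok for ids in tokenized_examples for tok in ids]
    return [flat[i : i + max_len] for i in range(0, len(flat) - max_len + 1, max_len)]
```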
Training
The change in model weights $\Delta{W}$ during fine-tuning has a low intrinsic rank, so the LoRA paper suggests learning it through a product of low-rank matrices:
\[ W_{0} + \Delta{W} = W_{0} + BA \]
with $W_{0} \in \mathbb{R}^{d\times k}$ the original model weights, $B \in \mathbb{R}^{d\times r}$, $A \in \mathbb{R}^{r\times k}$, and $r \ll \min(d, k)$. We then have to fine-tune only $r(d+k)$ parameters instead of $d\times k$.
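To make the saving concrete, here is a back-of-the-envelope count for a single square projection matrix; the dimensions are illustrative (GPT-2 small’s hidden size), not taken from the fine-tuned model:
\[ d = k = 768,\; r = 2 \;\Rightarrow\; r(d+k) = 2 \times 1536 = 3072 \quad \text{vs.} \quad d \times k = 768 \times 768 = 589824, \]
i.e. roughly 0.5% of that matrix’s parameters are trained.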
We then define the LoRA layer as follows:
```python
import torch

class LoRALayer(torch.nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A gets a scaled random init, B starts at zero, so the LoRA
        # update is zero at the beginning of fine-tuning.
        std_dev = 1 / torch.sqrt(torch.tensor(rank).float())
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * std_dev)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        # Low-rank update, scaled by alpha.
        x = self.alpha * (x @ self.A @ self.B)
        return x
```
To apply LoRA, we replace the existing Linear layers in the network with LinearWithLoRA layers that combine the original Linear layer with a LoRALayer:
```python
class LinearWithLoRA(torch.nn.Module):
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(
            linear.in_features, linear.out_features, rank, alpha
        )

    def forward(self, x):
        # Frozen pre-trained weights plus the trainable low-rank update.
        return self.linear(x) + self.lora(x)
```
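Wiring this into the model can be sketched as follows; the helper `replace_linear_with_lora`, the freezing loop, and the `alpha=16` value are my own illustrative choices, with `model` standing for the pre-trained transformer loaded elsewhere:

```python
def replace_linear_with_lora(module, rank, alpha):
    # Recursively swap every nn.Linear for a LinearWithLoRA wrapper.
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, LinearWithLoRA(child, rank, alpha))
        else:
            replace_linear_with_lora(child, rank, alpha)

# model: the pre-trained transformer, loaded elsewhere.
# Freeze the pre-trained weights first, then inject LoRA:
# only the newly created A and B matrices remain trainable.
for param in model.parameters():
    param.requires_grad = False
replace_linear_with_lora(model, rank=2, alpha=16)
```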
According to the paper, a rank as low as 1 already leads to good performance, and the more the fine-tuning dataset differs from the original training data, the higher the rank should be. In this experiment, $r = 2$.
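A quick, illustrative sanity check to confirm how few parameters are actually trained with this rank:

```python
# Count trainable vs. total parameters after injecting LoRA;
# only the A and B matrices should require gradients.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```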
Results
TBD..