I came across this paper on long-range sequence modelling a few days ago and it looks interesting, so I decided to give it a shot.
Unsurprisingly, Lucidrains already has a repo on this paper 😅.
I decided to use the C4 dataset from my previous experiment, but this time I made a gist for it.
For training, I used an A100-40GB server. The model configuration is the default, except that I went with an embedding size of 768 and 12 layers. Since I am using the GPT-2 tokenizer, the vocabulary size is 50257.
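For reference, this is roughly the setup. The tokenizer call is the real `transformers` API; the config field names below are just my own shorthand, not the repo's actual arguments:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257

# Hypothetical config sketch -- field names are placeholders, not the repo's API.
# The only deviations from the defaults are the embedding size and depth.
config = dict(
    num_tokens = 50257,  # GPT-2 BPE vocabulary
    dim        = 768,    # embedding size (same as GPT-2 base, which pays off later)
    depth      = 12,     # number of layers
)
```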
The authors report that they trained the model at a sequence length of 4096 but evaluated on longer sequences, which sounds very interesting if it generalizes well. So to see how well it generalizes, I started training at a sequence length of 512 and generating text of length 1024.
Training starts as usual. Loss curve looks pretty normal:
But can I speed up the convergence? In Lucidrains' implementation, there is no LayerNorm before the LM head, so I decided to patch one in to see if it helps. After restarting the training, I could not see any significant difference:
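The patch itself is tiny. A minimal sketch of what I mean (my own wrapper with made-up names, not the repo's code):

```python
import torch.nn as nn

# Minimal sketch of the patch; `NormedLMHead` and its field names are my own.
# The idea: normalize hidden states before projecting to logits, the same way
# GPT-2 applies a final LayerNorm (ln_f) before its LM head.
class NormedLMHead(nn.Module):
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                       # the missing pre-head LayerNorm
        self.to_logits = nn.Linear(dim, num_tokens, bias=False)

    def forward(self, hidden):
        return self.to_logits(self.norm(hidden))
```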
One idea I've had from the past few weeks of experimenting with language models is that good embeddings take a long time to converge. So why not use pretrained GPT-2 embeddings?
I removed the LayerNorm at the end and injected the pretrained GPT-2 embeddings into both the input embedding layer and the final language model head, and froze both layers. Since I am using an embedding size of 768, it worked out of the box:
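Roughly, the injection looks like this (a sketch; `model.token_emb` and `model.to_logits` are assumed names for my model's input embedding and LM head, and will differ per implementation):

```python
import torch
from transformers import GPT2LMHeadModel

# Grab the pretrained token embeddings from GPT-2 base (shape: 50257 x 768).
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
pretrained = gpt2.transformer.wte.weight.data

with torch.no_grad():
    model.token_emb.weight.copy_(pretrained)   # input embedding layer
    model.to_logits.weight.copy_(pretrained)   # LM head (same matrix, GPT-2 style weight tying)

# Freeze both so only the transformer body gets trained.
model.token_emb.weight.requires_grad_(False)
model.to_logits.weight.requires_grad_(False)
```

This only works "out of the box" because my embedding size of 768 matches GPT-2 base's hidden size.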
However, when I started training the model, the loss started out very high, LOL. I realized it is because I removed the LayerNorm before the last layer:
So after adding the final LayerNorm back, convergence is super fast 🎉
The loss is around 6.something right now, so the output is pretty much garbage. At around 3 we will probably have something useful. Since this model is ~500MB, about the same size as base GPT-2, the quality of the results will probably be only OK-ish even when converged.
Next is to investigate the results beyond the training sequence length. This time I will use GPT-2 XL embeddings of size 1600, reduce the max sequence length to 64, and sample 256 tokens. This should converge faster, and I can use a pretty big batch size.
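The only reason 1600 is the magic number here is that it is GPT-2 XL's hidden size, which is easy to check:

```python
from transformers import GPT2Config

# GPT-2 XL's hidden size fixes the new embedding dimension for my model.
print(GPT2Config.from_pretrained("gpt2-xl").n_embd)  # 1600
```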
I also increased the number of layers from 12 to 24, and now the model is ~2.8GB (~700M parameters). I also want to investigate whether fine-tuning can improve longer-context inference. If so, it would be a good idea to first train on shorter sequences and slowly increase the sequence length for faster convergence.
T+1h20m: Loss going down pretty smoothly. We are at ~5.4:
T+5h40m:
Ok something is coming out @ loss=4.5.
It does look like it is able to connect the information at the start with the info at the end. But since the sentences are not making any sense yet, I'll let it train more.
T+16hr: The shape of the curve is the same at 60k iterations, which means it has not yet converged. With a batch size of 64 and a sequence length of 64, the model has seen ~245 million tokens so far.
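Quick sanity check on that number:

```python
# iterations x batch size x sequence length
iters, batch_size, seq_len = 60_000, 64, 64
print(iters * batch_size * seq_len)  # 245760000 -> ~245M tokens
```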
Further developments will be posted to the Hugging Face repo.