Our Approach
We worked on three directions to improve training efficiency within the allocated 3 hours of training: hyperparameter search and model scaling, model architecture, and data selection.
We tried various tricks, but most of them did not result in a significant improvement in final performance. Some looked promising at the beginning but did not converge to a better model in the end.
The tricks that did help us improve performance are:
- Enabling dropout
- Using scaling laws to find the optimal model size
- Using a smaller batch size for more iterations
Scaling Law with fixed compute
The setting of the challenge is to train a model within a fixed 3-hour compute budget. The peak throughput of an A100 GPU is about 3.12 * 10^14 FLOP/s, so the total budget is roughly 3.12 * 10^14 * 3 * 3600 ≈ 3.4 * 10^18 FLOPs.
However, we may not be able to fully utilize this peak throughput.
We interpolate the scaling law from DeepMind (Chinchilla), and the optimal configuration would be:
- 170M parameters
- 3.4B tokens
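For reference, this interpolation can be reproduced with the Chinchilla rules of thumb C ~ 6 * N * D and D ~ 20 * N. The sketch below assumes the peak-FLOPS figure above and perfect utilization, which is optimistic:

```python
import math

# Assumed A100 peak throughput (3.12 * 10^14 FLOP/s) and the 3 h budget.
PEAK_FLOPS = 3.12e14
BUDGET_SECONDS = 3 * 3600
C = PEAK_FLOPS * BUDGET_SECONDS      # total compute budget, ~3.4 * 10^18 FLOPs

# Chinchilla rules of thumb: C ~ 6 * N * D and D ~ 20 * N, hence C ~ 120 * N^2.
N_opt = math.sqrt(C / 120)           # compute-optimal parameter count
D_opt = 20 * N_opt                   # compute-optimal number of training tokens

print(f"Compute budget:     {C:.2e} FLOPs")
print(f"Optimal model size: {N_opt / 1e6:.0f}M parameters")
print(f"Optimal tokens:     {D_opt / 1e9:.1f}B tokens")
# -> roughly 170M parameters and 3.4B tokens
```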
Considering that we may not fully utilize the available FLOPs, the optimal model size may be a bit smaller.
However, as the scaling law was fit mostly on LLMs with more than 1B parameters, we decided to validate it empirically by training models of different sizes.
The plots are here
Takeaways:
- The smaller the model, the faster it converges, but the final performance is not guaranteed to be better.
- The 12-layer and 18-layer models have similar performance, but the 24-layer model (provided by the organizers) shows a roughly 2 ppl gap.
- Models with fewer than 8 layers perform significantly worse.
Larger batch size
There is a trade-off between batch size and the number of iterations.
LLMs are traditionally trained with large batch sizes such as 256 or 512. However, in our setting a smaller batch size is better because, under a fixed wall-clock budget, it allows more iterations.
This is why we removed gradient accumulation and used a batch size of 66.
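The intuition can be made concrete with a quick back-of-the-envelope: if the GPU stays saturated, a fixed time budget means a roughly fixed token budget, so a smaller effective batch simply turns that budget into more optimizer updates. The token budget and sequence length below are illustrative assumptions, not measurements.

```python
TOKEN_BUDGET = 3.4e9   # tokens we can process in 3 h (assumed, from the scaling-law estimate)
SEQ_LEN = 512          # sequence length (assumed)

for batch_size, acc_steps in [(66, 1), (66, 4), (256, 1)]:
    tokens_per_update = batch_size * acc_steps * SEQ_LEN
    num_updates = TOKEN_BUDGET / tokens_per_update
    print(f"batch={batch_size:3d}, acc_steps={acc_steps}: {num_updates:,.0f} optimizer steps")

# batch=66 without gradient accumulation gives roughly 4x more parameter
# updates than batch=66 with acc_steps=4, for the same token budget.
```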
Dropout
Previous work has shown that dropout is not necessary for LLM training. However, we found that dropout is beneficial in our setting, improving perplexity by 1.5 points.
Takeaways:
- Dropout is beneficial in our setting.
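For context, "enabling dropout" here just means giving the residual sub-blocks a non-zero dropout rate. The sketch below is a generic GPT-style MLP sub-block with an illustrative rate, not our exact configuration.

```python
import torch.nn as nn

class MLPBlock(nn.Module):
    """Feed-forward sub-block with dropout enabled (the 0.1 rate is illustrative)."""
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.fc_in = nn.Linear(d_model, 4 * d_model)
        self.act = nn.GELU()
        self.fc_out = nn.Linear(4 * d_model, d_model)
        self.drop = nn.Dropout(dropout)  # dropout=0.0 recovers the no-dropout baseline

    def forward(self, x):
        return self.drop(self.fc_out(self.act(self.fc_in(x))))
```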
Model architecture
After finding the optimal model size, we wanted to know whether using a wider, shallower model would help.
Given N = 12 * D^2 * L (with D the hidden width and L the number of layers), we increase the width of the model and reduce the depth to keep the number of parameters the same.
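Concretely, solving N = 12 * D^2 * L for D gives the width of a shallower model with the same parameter count. A small sketch (rounding to a multiple of 64 is our assumption for hardware-friendly widths):

```python
import math

def width_for_depth(n_params: float, n_layers: int, multiple_of: int = 64) -> int:
    """Solve N ~= 12 * D^2 * L for D, rounded to a hardware-friendly multiple."""
    d_model = math.sqrt(n_params / (12 * n_layers))
    return int(round(d_model / multiple_of) * multiple_of)

TARGET_PARAMS = 170e6  # from the scaling-law estimate above
for layers in (24, 18, 12, 8):
    d = width_for_depth(TARGET_PARAMS, layers)
    approx = 12 * d ** 2 * layers
    print(f"{layers:2d} layers -> D ~ {d:4d} (~{approx / 1e6:.0f}M params)")
```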
The plots are here
Takeaways:
- Wider models converge faster, but their final performance is slightly worse than that of the deeper models.
Large Learning Rate
As we had limited time, we wanted to know whether a larger learning rate would accelerate the training process.
We did not investigate this further, as the default learning rate is already large (0.001).
Brainformer architecture:
The Brainformer block is composed of self-attention + MoE layer + MLP layer + MoE layer + MLP layer + MoE layer, with small variations depending on the Brainformer version (see the Brainformer paper).
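To make the ordering concrete, here is a rough sketch of such a block in PyTorch; the MoE is a toy top-1 router and the attention is not causal, so this only illustrates the layer sequence, not a faithful Brainformer implementation.

```python
import torch
import torch.nn as nn

def mlp(d_model: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class ToyMoE(nn.Module):
    """Toy top-1 mixture-of-experts; no load balancing, only for illustration."""
    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([mlp(d_model) for _ in range(n_experts)])

    def forward(self, x):
        idx = self.router(x).argmax(dim=-1)          # (batch, seq): chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class BrainformerLikeBlock(nn.Module):
    """Self-attention -> MoE -> MLP -> MoE -> MLP -> MoE, each with a pre-norm residual."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sub_layers = nn.ModuleList([
            ToyMoE(d_model), mlp(d_model), ToyMoE(d_model), mlp(d_model), ToyMoE(d_model)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(6)])

    def forward(self, x):
        h = self.norms[0](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # no causal mask in this sketch
        for norm, layer in zip(self.norms[1:], self.sub_layers):
            x = x + layer(norm(x))
        return x

# Example: one block over a (batch=2, seq=16, d_model=128) activation
block = BrainformerLikeBlock(d_model=128, n_heads=4)
print(block(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 128])
```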
Architecture modification from BabyLM challenge winners:
We applied the BabyLM challenge winners' architecture modification to the LLaMA 2 architecture: allowing all layers within the architecture to have a weighted residual connection to all previous layers ("Not all layers are equally as important").
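A minimal sketch of that idea: each layer reads a learned weighted mixture of all previous layers' outputs instead of a plain residual stream. The softmax parameterization is our simplification; the winners' exact formulation may differ.

```python
import torch
import torch.nn as nn

class WeightedResidualStack(nn.Module):
    """Each layer's input is a learned weighted sum of the embeddings and all
    previous layer outputs (a simplified sketch of the BabyLM-winner idea)."""
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        # weights[i][j] scales the contribution of output j (0 = embeddings)
        # to the input of layer i.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.ones(i + 1) / (i + 1)) for i in range(len(layers))])

    def forward(self, x):
        outputs = [x]                                    # index 0: token embeddings
        for i, layer in enumerate(self.layers):
            w = torch.softmax(self.weights[i], dim=0)    # keep the mixture normalized
            mixed = sum(w[j] * outputs[j] for j in range(len(outputs)))
            outputs.append(layer(mixed))
        return outputs[-1]
```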
LR scheduling:
- Composing LR schedulers (e.g. warm restarts with a cosine schedule, reduced on plateau); see the sketch after this list
- Schedule-free optimizers (https://github.com/facebookresearch/schedule_free)
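A sketch of the first idea in plain PyTorch: cosine annealing with warm restarts stepped every iteration, with the base learning rates halved whenever the validation loss plateaus. The model, data, and patience values are placeholders, not what we ultimately ran.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Placeholder model/data; in practice this would be the llm-baselines training loop.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=200, T_mult=2)

total_steps, eval_every, patience = 1000, 100, 3
best_val, bad_evals = float("inf"), 0

for step in range(total_steps):
    x, y = torch.randn(32, 16), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                       # warm restarts, stepped per iteration

    if (step + 1) % eval_every == 0:
        val_loss = loss.item()             # placeholder for a real validation pass
        if val_loss < best_val:
            best_val, bad_evals = val_loss, 0
        else:
            bad_evals += 1
        if bad_evals >= patience:
            # "on plateau": shrink the base LRs the cosine schedule anneals back to
            scheduler.base_lrs = [lr * 0.5 for lr in scheduler.base_lrs]
            bad_evals = 0
```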
Data selection:
We considered data selection methods (papers considered: Dataset Cartography, SemDeDup, RETSIM), but all approaches required steps that turned out to be too expensive for us within this 2-day timeframe.
Other directions we considered:
- MLP-Mixer-inspired models like HyperMixer (HyperMixer paper), which would require a generalization to the autoregressive setting; with our naive extension this would have the same computational complexity as the transformer's self-attention.
- 8-bit optimizers (https://huggingface.co/docs/bitsandbytes/main/en/optimizers)
Find out more in our GitHub repo: https://github.com/Saibo-creator/llm-baselines
Built With
- llm-baselines
- pytorch