DeepMind researchers have published a paper titled “Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design”. This work presents a new way to optimize Vision Transformers (ViT) by focusing on the model’s shape, such as width and depth, rather than simply increasing its size.

Key Points

  1. The paper refines methods for inferring compute-optimal model shapes and applies them to vision transformers.
  2. The shape-optimized vision transformer, SoViT, achieves results comparable to models twice its size while using the same amount of compute for pre-training.
  3. SoViT-400m/14 reaches 90.3% fine-tuning accuracy on ILSVRC-2012 (ImageNet), outperforming the larger ViT-g/14 and approaching ViT-G/14 at less than half the inference cost.

Methodology

The researchers focus on optimizing the architecture’s shape rather than improving its training protocol. They represent a neural architecture as a tuple of shape dimensions such as width, depth, and MLP dimension.
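
To make the tuple representation concrete, here is a minimal Python sketch. The `ViTShape` class, its field names, and the rough parameter-count formula are illustrative assumptions for this article, not code or notation from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTShape:
    """Hypothetical container for the shape dimensions discussed in the paper:
    width (hidden size), depth (number of blocks), and MLP hidden dimension."""
    width: int    # model/hidden dimension
    depth: int    # number of transformer blocks
    mlp_dim: int  # hidden size of the MLP block

    def approx_params(self) -> int:
        """Rough parameter count of the transformer blocks only:
        attention projections (~4 * width^2 per block) plus the
        MLP (~2 * width * mlp_dim per block); embeddings and the
        classification head are ignored in this sketch."""
        per_block = 4 * self.width**2 + 2 * self.width * self.mlp_dim
        return self.depth * per_block

# Example: a ViT-B/16-like shape (illustrative values).
base = ViTShape(width=768, depth=12, mlp_dim=3072)
print(f"~{base.approx_params() / 1e6:.0f}M parameters in the blocks")
```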

They show that when the shape and training duration of a smaller ViT model are optimized jointly for the available compute, it can match the performance of much larger models.
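
The sketch below illustrates this idea under simplifying assumptions: downstream error is modeled as a generic additive power law in one shape dimension (e.g. width) and the number of training examples, and the compute budget is approximated as proportional to width times examples seen. The coefficients, exponents, and the compute model are made up for illustration; the paper fits such scaling exponents from small-scale sweeps, and none of the constants here come from it.

```python
import numpy as np

def predicted_error(x, t, a=2.0e2, alpha=0.6, b=5.0e3, beta=0.4, c=0.05):
    """Hypothetical additive power-law: error from limited shape dimension x
    plus error from limited training examples t, plus an irreducible floor."""
    return a * x**(-alpha) + b * t**(-beta) + c

def compute_optimal(compute_budget, flops_per_example_per_width=6.0):
    """Grid-search the shape dimension that minimizes predicted error under a
    fixed compute budget, assuming compute ~ width * examples (a simplification)."""
    best = None
    for x in np.geomspace(256, 4096, 64):                      # candidate widths
        t = compute_budget / (flops_per_example_per_width * x)  # examples affordable at this width
        err = predicted_error(x, t)
        if best is None or err < best[0]:
            best = (err, x, t)
    return best

err, width, examples = compute_optimal(compute_budget=1e13)
print(f"width≈{width:.0f}, examples≈{examples:.2e}, predicted error≈{err:.3f}")
```

The point of the toy search is only to show the trade-off: with the budget fixed, a wider model sees fewer examples, so the optimum lies at an intermediate shape rather than at the largest model the budget can hold.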

Results

The researchers test their predictions by optimizing the shape of ViT for the compute equivalent of ViT-g/14 pre-trained on 16 billion examples from JFT-3B. The resulting model, SoViT-400m/14, is evaluated in various settings to confirm whether it broadly matches ViT-g/14’s performance.

Impact

This research challenges the common practice of simply scaling up vision models. It shows that optimizing a model’s shape and training duration for the available compute can significantly improve the performance of vision transformers, and potentially other architectures, making them more efficient and effective.

The findings offer a new direction for future research and development in AI model design, potentially leading to more optimized and high-performing models.