No retraining needed: Sakana’s new AI model changes how machines learn




Researchers at Sakana AI, an AI research lab focusing on nature-inspired algorithms, have developed a self-adaptive language model that can learn new tasks without the need for fine-tuning. Called Transformer² (Transformer-squared), the model uses mathematical tricks to align its weights with user requests during inference. 

This is the latest in a series of techniques that aim to improve the abilities of large language models (LLMs) at inference time, making them increasingly useful for everyday applications across different domains.

Dynamically adjusting weights

Usually, configuring LLMs for new tasks requires a costly fine-tuning process, during which the model is exposed to new examples and its parameters are adjusted. A more cost-effective approach is “low-rank adaptation” (LoRA), which keeps the original weights frozen and trains a small number of additional low-rank parameters relevant to the target task.
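To make the contrast concrete, here is a minimal sketch of the LoRA idea for a single linear layer in PyTorch. It is illustrative only, with arbitrary layer sizes, and is not Sakana's code or any particular library's implementation.

```python
import torch

# Minimal LoRA-style sketch (illustrative only): the pretrained weight W stays
# frozen, and only two small low-rank matrices A and B are trained; their
# product is added to W at forward time.
d_out, d_in, rank = 1024, 1024, 8          # toy sizes, chosen for illustration

W = torch.randn(d_out, d_in)                             # frozen pretrained weight
A = torch.nn.Parameter(torch.randn(rank, d_in) * 0.01)   # trainable down-projection
B = torch.nn.Parameter(torch.zeros(d_out, rank))         # trainable up-projection (starts at zero)

def adapted_forward(x):
    # Effective weight is W + B @ A; gradients flow only into A and B.
    return x @ (W + B @ A).T
```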

After training and fine-tuning, the model’s parameters remain frozen, and the only way to repurpose it for new tasks is through techniques such as few-shot and many-shot learning. 

In contrast to classic fine-tuning, Transformer-squared uses a two-step approach to dynamically adjust its parameters during inference. First, it analyzes the incoming request to understand the task and its requirements, then it applies task-specific adjustments to the model’s weights to optimize its performance for that specific request.

“By selectively adjusting critical components of the model weights, our framework allows LLMs to dynamically adapt to new tasks in real time,” the researchers write in a blog post published on the company’s website.

How Sakana’s Transformer-squared works

The core ability of Transformer-squared is dynamically adjusting critical components of its weights at inference. 

To do this, it has to first identify the key components that can be tweaked during inference. Transformer-squared does this through singular-value decomposition (SVD), a linear algebra trick that breaks down a matrix into three other matrices that reveal its inner structure and geometry. SVD is often used to compress data or to simplify machine learning models.
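As a rough illustration, here is what that decomposition looks like in PyTorch for a toy matrix standing in for an LLM weight. This is a sketch of the general SVD operation, not the paper's code.

```python
import torch

# Toy illustration of the SVD step: factor a weight matrix W into U, singular
# values S, and V^T. These singular components are what Transformer-squared
# later re-weights.
W = torch.randn(512, 512)                      # stand-in for one LLM weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

# Sanity check: the three factors reconstruct the original matrix.
assert torch.allclose(U @ torch.diag(S) @ Vh, W, atol=1e-3)
```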

When applied to the LLM’s weight matrix, SVD obtains a set of components that roughly represent the model’s different abilities, such as math, language understanding or coding. In their experiments, the researchers found that these components could be tweaked to modify the model’s abilities in specific tasks.

To systematically leverage these findings, they developed a process called singular value fine-tuning (SVF). At training time, SVF learns a set of vectors from the SVD components of the model. These vectors, called z-vectors, are compact representations of individual skills and can be used as knobs to amplify or dampen the model’s ability in specific tasks. 
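Conceptually, a z-vector acts as a per-component scale on the singular values. Below is a minimal sketch of that rescaling for a single toy weight matrix, assuming the z-vector has already been learned; how SVF actually trains the z-vectors is omitted.

```python
import torch

# Sketch of the singular-value rescaling idea behind SVF (simplified, not
# Sakana's implementation): a learned z-vector scales the singular values of
# a weight matrix, amplifying or dampening the "skills" they encode, while
# U and V stay fixed.
W = torch.randn(512, 512)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

z = torch.nn.Parameter(torch.ones_like(S))   # one learnable scale per singular value

def adapted_weight():
    # Rebuild the weight with rescaled singular values: W' = U diag(S * z) V^T
    return U @ torch.diag(S * z) @ Vh
```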

At inference time, Transformer-squared uses a two-pass mechanism to adapt the LLM for unseen tasks. First, it examines the prompt to determine the skills required to tackle the problem (the researchers propose three different techniques for determining the required skills). In the second stage, Transformer-squared applies the z-vectors corresponding to the request and runs the prompt through the model with the updated weights. This enables the model to provide a tailored response to each prompt.
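A schematic version of that two-pass loop might look like the following. The task classifier, z-vector lookup and generation step here are simplified placeholders, not the paper's actual dispatch strategies or API.

```python
import torch

# Schematic two-pass inference loop. `classify_task` and `generate` are
# placeholders standing in for the model's own dispatch and decoding; they are
# assumptions made for this sketch, not Sakana's code.

def classify_task(prompt: str) -> str:
    # First pass: decide which skill the prompt needs. In the paper this is
    # done by the model itself or a learned strategy; a trivial heuristic
    # keeps the sketch runnable.
    return "math" if any(ch.isdigit() for ch in prompt) else "coding"

def apply_z_vector(U, S, Vh, z):
    # Second pass, step 1: rescale the singular values with the chosen z-vector.
    return U @ torch.diag(S * z) @ Vh

def generate(weights, prompt: str) -> str:
    # Placeholder for running the prompt through the adapted model.
    return f"<response generated with adapted weights for: {prompt!r}>"

# Toy weights and per-skill z-vectors (random here; learned via SVF in practice).
W = torch.randn(256, 256)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
z_vectors = {"math": torch.rand_like(S), "coding": torch.rand_like(S)}

prompt = "What is 17 * 24?"
task = classify_task(prompt)                            # first pass
W_adapted = apply_z_vector(U, S, Vh, z_vectors[task])   # adapt the weights
print(generate(W_adapted, prompt))                      # second pass
```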

Transformer-squared training and inference (source: arXiv)

Transformer-squared in action

The researchers applied Transformer-squared to Llama-3 and Mistral LLMs and compared it against LoRA on various tasks, including math, coding, reasoning and visual question-answering. Transformer-squared outperformed LoRA on all benchmarks while using fewer parameters. It is also notable that, unlike Transformer-squared, LoRA models cannot adapt their weights at inference time, which makes them less flexible.

Another intriguing finding is that knowledge extracted from one model can be transferred to another. For example, z-vectors obtained from Llama models could be applied to Mistral models. The results were not on par with creating z-vectors from scratch for the target model, and the transfer worked because the two models have similar architectures. Still, it suggests the possibility of learning generalized z-vectors that can be applied to a wide range of models.

Transformer-squared (SVF in the table) vs base models and LoRA (source: arXiv)

“The path forward lies in building models that dynamically adapt and collaborate with other systems, combining specialized capabilities to solve complex, multi-domain problems,” the researchers write. “Self-adaptive systems like Transformer² bridge the gap between static AI and living intelligence, paving the way for efficient, personalized and fully integrated AI tools that drive progress across industries and our daily lives.”

Sakana AI has released the code for training the components of Transformer-squared on GitHub.

Inference-time tricks

As enterprises explore different LLM applications, the past year has seen a noticeable shift toward developing inference-time techniques. Transformer-squared is one of several approaches that enable developers to customize LLMs for new tasks at inference time without the need to retrain or fine-tune them.

Titans, an architecture developed by researchers at Google, tackles the problem from a different angle, giving language models the ability to learn and memorize new information at inference time. Other techniques focus on enabling frontier LLMs to leverage their increasingly long context windows to learn new tasks without retraining.

With enterprises owning the data and knowledge specific to their applications, advances in inference-time customization techniques will make LLMs much more useful.


