Google’s new Trillium AI chip delivers 4X speed and powers Gemini 2.0


Google has just unveiled Trillium, its sixth-generation artificial intelligence accelerator chip, claiming performance improvements that could fundamentally alter the economics of AI development while pushing the boundaries of what’s possible in machine learning.

The custom processor, which powered the training of Google’s newly announced Gemini 2.0 AI model, delivers four times the training performance of its predecessor while using significantly less energy. This breakthrough comes at a crucial moment, as tech companies race to build increasingly sophisticated AI systems that require enormous computational resources.

“TPUs powered 100% of Gemini 2.0 training and inference,” Sundar Pichai, Google’s CEO, explained in an announcement post highlighting the chip’s central role in the company’s AI strategy. The scale of deployment is unprecedented: Google has connected more than 100,000 Trillium chips in a single network fabric, creating what amounts to one of the world’s most powerful AI supercomputers.

How Trillium’s 4X performance boost is transforming AI development

Trillium’s specifications represent significant advances across multiple dimensions. The chip delivers a 4.7x increase in peak compute performance per chip compared to its predecessor, while doubling both high-bandwidth memory capacity and interchip interconnect bandwidth. Perhaps most importantly, it achieves a 67% increase in energy efficiency — a crucial metric as data centers grapple with the enormous power demands of AI training.

“When training the Llama-2-70B model, our tests demonstrate that Trillium achieves near-linear scaling from a 4-slice Trillium-256 chip pod to a 36-slice Trillium-256 chip pod at a 99% scaling efficiency,” said Mark Lohmeyer, VP of Compute and AI Infrastructure at Google Cloud. This level of scaling efficiency is particularly remarkable given the challenges typically associated with distributed computing at this scale.
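To unpack that claim: scaling efficiency measures how close a job's actual speedup comes to the ideal linear speedup from adding more hardware. The short sketch below uses the standard definition with made-up throughput numbers (not Google's published methodology) to show the arithmetic behind a 99% figure.

```python
# Illustrative sketch: how near-linear scaling efficiency is typically computed.
# The throughput figures below are hypothetical placeholders, not Google's data.

def scaling_efficiency(base_units, base_throughput, big_units, big_throughput):
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal_speedup = big_units / base_units
    measured_speedup = big_throughput / base_throughput
    return measured_speedup / ideal_speedup

# Hypothetical run: going from 4 pods to 36 pods is 9x more hardware.
# 99% efficiency means the job runs ~8.91x faster instead of the ideal 9x.
print(scaling_efficiency(4, 1.0, 36, 8.91))  # ~0.99
```

At this scale, even a one-point drop in efficiency translates into hundreds of chips' worth of wasted compute, which is why near-linear scaling is the headline number.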

The economics of innovation: Why Trillium changes the game for AI startups

The business implications of Trillium extend beyond raw performance metrics. Google claims the chip provides up to 2.5x improvement in training performance per dollar compared to its previous generation, potentially reshaping the economics of AI development.

This cost efficiency could prove particularly significant for enterprises and startups developing large language models. AI21 Labs, an early Trillium customer, has already reported notable gains. “The advancements in scale, speed, and cost-efficiency are significant,” noted Barak Lenz, CTO of AI21 Labs, in the announcement.

Scaling new heights: Google’s 100,000-chip AI supernetwork

Google’s deployment of Trillium within its AI Hypercomputer architecture demonstrates the company’s integrated approach to AI infrastructure. The system combines over 100,000 Trillium chips with a Jupiter network fabric capable of 13 petabits per second of bisection bandwidth, enabling a single distributed training job to scale across hundreds of thousands of accelerators.
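In practice, a single job spans that many accelerators by sharding its arrays and computation across a device mesh, with the compiler inserting the cross-chip communication. The sketch below is a minimal, hypothetical illustration using JAX, the framework commonly used on Cloud TPUs; the mesh shape, array sizes, and forward function are assumptions for illustration, not Gemini’s actual training code.

```python
# Minimal sketch of data-parallel sharding across a device mesh in JAX.
# Mesh shape and matrix sizes are illustrative; real TPU pods expose
# thousands of devices through this same API.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
import numpy as np

devices = np.array(jax.devices())           # all accelerators visible to the job
mesh = Mesh(devices, axis_names=("data",))  # 1-D mesh; large jobs use 2-D/3-D meshes

# Shard the batch dimension across the "data" axis; replicate the weights.
batch = jax.device_put(
    jnp.ones((len(devices) * 8, 512)),
    NamedSharding(mesh, P("data", None)),
)
weights = jax.device_put(
    jnp.ones((512, 512)),
    NamedSharding(mesh, P(None, None)),
)

# jit compiles one program; the XLA compiler inserts cross-device communication.
@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)

out = forward(batch, weights)
print(out.sharding)  # the output stays sharded across the mesh
```

The same program runs unchanged on 8 chips or 100,000; the network fabric determines how cheaply the compiler-inserted communication can be hidden behind compute.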

“The growth of Flash usage has been more than 900%, which has been incredible to see,” noted Logan Kilpatrick, a product manager on Google’s AI Studio team, during the developer conference, highlighting the rapidly increasing demand for AI computing resources.

Beyond Nvidia: Google’s bold move in the AI chip wars

The release of Trillium intensifies the competition in AI hardware, where Nvidia has dominated with its GPU-based solutions. While Nvidia’s chips remain the industry standard for many AI applications, Google’s custom silicon approach could provide advantages for specific workloads, particularly in training very large models.

Industry analysts suggest that Google’s massive investment in custom chip development reflects a strategic bet on the growing importance of AI infrastructure. The company’s decision to make Trillium available to cloud customers indicates a desire to compete more aggressively in the cloud AI market, where it faces strong competition from Microsoft Azure and Amazon Web Services.

Powering the future: What Trillium means for tomorrow’s AI

The implications of Trillium’s capabilities extend beyond immediate performance gains. The chip’s ability to handle mixed workloads efficiently — from training massive models to running inference for production applications — suggests a future where AI computing becomes more accessible and cost-effective.

For the broader tech industry, Trillium’s release signals that the race for AI hardware supremacy is entering a new phase. As companies push the boundaries of what’s possible with artificial intelligence, the ability to design and deploy specialized hardware at scale could become an increasingly critical competitive advantage.

“We’re still in the early stages of what’s possible with AI,” Demis Hassabis, CEO of Google DeepMind, wrote in the company’s blog post. “Having the right infrastructure — both hardware and software — will be crucial as we continue to push the boundaries of what AI can do.”

As the industry moves toward more sophisticated AI models that can act autonomously and reason across multiple modes of information, the demands on underlying hardware will only increase. With Trillium, Google has demonstrated that it intends to remain at the forefront of this evolution, investing in the infrastructure that will power the next generation of AI advancement.


