DeepMind’s Generalist Agent: Gato

A breakthrough step: using a single transformer neural network to perform a wide range of tasks.

Renu Khandelwal
May 18, 2022


Imagine a single neural sequence transformer model with a single set of weights performing a variety of tasks: captioning images, chatting, stacking blocks with a robotic arm, outperforming humans at Atari games, navigating simulated 3D environments, and more.

DeepMind’s Gato is a significant step towards a generalist AI model, and it may prove to be a giant leap towards intelligent machines that perform intellectual tasks much like humans.

What is Gato?

Gato is a single generalist agent that works as a multi-modal, multi-task, multi-embodiment generalist policy. It currently uses the same network with the same weights, around 1.2B parameters, to play Atari, caption images, chat, stack blocks with a robotic arm, and much more, choosing what to output based on its context.

Gato is trained on 604 distinct tasks with varying modalities, observations, and action specifications.

Source: A Generalist Agent

Inspiration for Gato

Gato is inspired by works such as GPT-3 and Gopher, which push the limits of generalist language models. It is also inspired by work on multi-embodiment continuous control, which uses message-passing graph networks to build a single locomotor controller for many simulated 2D walkers.

Gato can be thought of as a “single-brain” style model, where one neural transformer with the same weights performs different tasks.

How does Gato work?

Gato uses a single transformer neural network with the same weights to sense and act with different embodiments across various environments. Gato is currently trained offline in a purely supervised manner with around 1.2B parameters.
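To make "purely supervised" concrete, here is a minimal sketch of a next-token prediction loss of the kind typically used to train sequence models offline. It is not Gato's actual training code: the function name `masked_next_token_loss`, the list-based inputs, and the idea of using a mask to score only some positions (for example, action or text tokens) are assumptions made for illustration.

```python
import math


def masked_next_token_loss(logits, targets, mask):
    """Average cross-entropy over the positions where mask is 1.

    logits:  one list of vocabulary scores per sequence position
    targets: the token id that should be predicted at each position
    mask:    1 to score a position, 0 to ignore it (an illustrative assumption)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue
        # Numerically stable log-sum-exp over the vocabulary scores.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the target token.
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


# Two positions over a 4-token vocabulary; only the first position is scored.
example_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
example_loss = masked_next_token_loss(example_logits, targets=[2, 1], mask=[1, 0])
```

Because the targets come from previously recorded sequences, no environment interaction is needed while training, which is what "offline" refers to here.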

Input Data Tokenization and Sequencing

The multi-modal input data is processed by serializing it into a flat sequence of tokens. After tokenization, a sequencing order is applied based on the context.
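The serialization step described above can be sketched roughly as follows. This is an illustrative sketch, not the published implementation: the vocabulary size, bin count, separator token, mu-law parameters, and all function names are assumptions chosen for the example. It assumes discrete values are already integer tokens, squashes continuous values and buckets them into bins, and then lays out each timestep as observation tokens, a separator, and action tokens.

```python
import math

# All constants below are illustrative assumptions, not values stated in this article.
TEXT_VOCAB_SIZE = 32000                       # hypothetical text vocabulary size
NUM_BINS = 1024                               # hypothetical bin count for continuous values
SEPARATOR_TOKEN = TEXT_VOCAB_SIZE + NUM_BINS  # hypothetical observation/action separator


def mu_law(x, mu=100.0, m=256.0):
    """Compress a real value so that large magnitudes are squashed toward [-1, 1]."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)


def tokenize_continuous(values):
    """Map floats to integer tokens placed after the text vocabulary."""
    tokens = []
    for v in values:
        squashed = max(-1.0, min(1.0, mu_law(v)))
        bin_index = min(NUM_BINS - 1, int((squashed + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def serialize_timestep(discrete_obs, continuous_obs, action):
    """Flatten one timestep: observation tokens, a separator, then action tokens."""
    obs_tokens = list(discrete_obs) + tokenize_continuous(continuous_obs)
    return obs_tokens + [SEPARATOR_TOKEN] + tokenize_continuous(action)


def serialize_episode(timesteps):
    """Concatenate timesteps in time order into one flat token sequence."""
    sequence = []
    for discrete_obs, continuous_obs, action in timesteps:
        sequence.extend(serialize_timestep(discrete_obs, continuous_obs, action))
    return sequence


# Two timesteps, each given as (discrete observation, continuous observation, action).
episode = [([3, 7], [0.0], [0.0]), ([1], [], [256.0])]
flat_tokens = serialize_episode(episode)
```

Whatever the exact scheme, the point is that every modality ends up in one shared integer vocabulary, so a single transformer can consume the whole episode as an ordinary token sequence.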

Embedding Input Tokens



