DeepMind’s Generalist Agent: Gato
A breakthrough step: using a single transformer neural network to perform many different tasks.
Imagine a single neural sequence transformer model with a single set of weights that can caption images, chat, stack blocks with a robotic arm, outperform humans at Atari games, navigate simulated 3D environments, and more.
DeepMind’s Gato is a significant step towards a generalist AI model, and potentially a giant leap towards intelligent machines that perform intellectual tasks much as humans do.
What is Gato?
Gato is a single generalist agent that works as a multi-modal, multi-task, multi-embodiment generalist policy. It currently uses the same network with the same weights (around 1.2B parameters) to play Atari, caption images, chat, stack blocks with a robotic arm, and much more, depending on its context.
Gato is trained on 604 distinct tasks with varying modalities, observations, and action specifications.
Inspiration for Gato
Gato is inspired by works such as GPT-3 and Gopher, which push the limits of generalist language models. It is also inspired by work on multi-embodiment continuous control that uses message-passing graph networks to build a single locomotor controller for many simulated 2D walkers.
Gato can be thought of as a “single-brain” style model, where one neural transformer with the same weights performs different tasks.
How does Gato work?
Gato uses a single transformer neural network with the same weights to sense and act with different embodiments across various environments. Gato is currently trained offline in a purely supervised manner with around 1.2B parameters.
Input Data Tokenization and Sequencing
The multi-modal input data is processed by serializing it into a flat sequence of tokens. After tokenization, a sequencing order is applied based on the context.
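To make the idea of serializing into a flat sequence of tokens more concrete, here is a minimal Python sketch. It is not Gato's actual implementation: the function names, the vocabulary size, the number of bins, the mu-law constants, the separator token, and the choice to order observation fields by key are all assumptions made for illustration. It shows one plausible way to turn continuous sensor readings and a discrete action into a single list of integer tokens.

```python
import math

# All constants below are illustrative assumptions, not confirmed values.
TEXT_VOCAB_SIZE = 32000  # assumed size of the text vocabulary
NUM_BINS = 1024          # assumed number of bins for continuous values


def mu_law(x, mu=100.0, m=256.0):
    """Compress a real value so that large magnitudes are squashed."""
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)


def tokenize_continuous(values):
    """Map real values to integer tokens placed after the text vocabulary."""
    tokens = []
    for v in values:
        squashed = max(-1.0, min(1.0, mu_law(v)))  # clip to [-1, 1]
        bin_index = min(NUM_BINS - 1, int((squashed + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokenize_discrete(values):
    """Discrete values (e.g. button presses) are used as integer tokens."""
    return [int(v) for v in values]


def serialize_timestep(observation, action, separator_token):
    """Flatten one timestep: observations, a separator, then the action.

    Observation fields are emitted in sorted key order so that the
    sequence layout is the same at every timestep (an assumed convention).
    """
    tokens = []
    for key in sorted(observation):
        tokens.extend(tokenize_continuous(observation[key]))
    tokens.append(separator_token)
    tokens.extend(tokenize_discrete(action))
    return tokens
```

For example, calling serialize_timestep({"joint": [0.0], "arm": [1000.0]}, [3], separator_token=33024) produces one flat list in which the "arm" reading comes first, then the "joint" reading, then the hypothetical separator id, then the action. The point of the design is that once every modality is a list of integers, a single sequence model can consume all of them in the same way.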