CLIP: OpenAI's Multi-Modal Model
Learn visual concepts from natural language supervision
--
When you flip through the photos on your phone, you look at an image and say, "This is a picture of my family at Cape Cod last year, where we watched whales." You use language to describe and classify an image.
Contrastive Language-Image Pretraining (CLIP) is a zero-shot multi-modal model that learns visual concepts directly from raw text paired with images. It efficiently learns these concepts from natural language supervision by combining three ideas: zero-shot transfer, natural language supervision, and multi-modal learning.
Zero-shot learning is the ability to generalize to unseen object categories: the model can classify objects it was never explicitly trained on (a short example of this follows below).
A contrastive objective learns better representations by pulling the embeddings of matching pairs close together and pushing the embeddings of non-matching pairs far apart.
CLIP leverages multiple modalities, natural language and images, as its sources of learning, much as we humans do.
The core idea of CLIP is to efficiently learn visual representations from text paired with images, associating each caption as a whole with its image rather than matching the exact words of that text.
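To make the zero-shot idea concrete, here is a minimal sketch using the open-source clip package released alongside the paper (https://github.com/openai/CLIP). The image path and candidate labels are illustrative placeholders; any set of natural-language class descriptions would work.

```python
# A minimal zero-shot classification sketch with the open-source `clip` package.
# The image file and label list below are placeholders, not from the original post.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate labels the model was never explicitly trained to classify.
class_names = ["a photo of a whale", "a photo of a dog", "a photo of a beach"]

image = preprocess(Image.open("family_trip.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(class_names).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # shape (1, d)
    text_features = model.encode_text(text)      # shape (3, d)

    # Cosine similarity: L2-normalize, then take dot products.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({name: float(p) for name, p in zip(class_names, probs[0])})
```

The label whose text embedding is most similar to the image embedding wins, even though none of these labels appeared as training classes.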
CLIP's Architecture and How It Works
CLIP takes (image, text) pairs as input to learn a multi-modal embedding space. It jointly trains an image encoder and a text encoder to maximize the cosine similarity between the image and text embeddings of the correct pairs while minimizing the cosine similarity between the embeddings of the incorrect pairings.
As part of the training, CLIP learns to recognize a wide variety of visual concepts in images and to associate them with their textual descriptions.
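A minimal PyTorch sketch of this symmetric contrastive objective is shown below. In the paper the temperature is a learned parameter; here it is a fixed value for simplicity, and the encoders are assumed to have already produced the batch of paired embeddings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_features, text_features: (N, d) tensors where row i of each
    tensor comes from the same image-text pair.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of cosine similarities, scaled by a temperature.
    logits = image_features @ text_features.T / temperature

    # The correct pairing for row i is column i; everything else is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.T, targets)
    return (loss_images + loss_texts) / 2
```

Maximizing the diagonal of the similarity matrix while suppressing the off-diagonal entries is exactly the "correct pairs close, incorrect pairs far apart" behavior described above.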
The image encoder computes the feature embedding of the image using a ResNet-50 or ResNet-101 as the base, with the global average pooling layer replaced by a transformer-style attention pooling mechanism. Alternatively, the…