CLIP: OpenAI's Multi-Modal Model

Learn visual concepts from natural language supervision

Renu Khandelwal
3 min read · Mar 28, 2022

When you flip through the photos on your mobile, you look at an image and say, "This is a pic with my family last year at Cape Cod, where we watched whales." You use language to describe and classify an image.

Contrastive Language-Image Pre-training (CLIP) is a zero-shot, multi-modal model that learns visual concepts directly from raw text about images.

CLIP efficiently learns visual concepts by combining zero-shot transfer, natural language supervision, and multi-modal learning.

Zero-shot learning is the ability to generalize to unseen object categories: the model can classify objects it was never explicitly trained on.
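As a quick illustration, here is a minimal sketch of zero-shot classification with the openai/clip Python package (assuming it and PyTorch are installed); the image path and candidate labels are placeholders you would replace with your own.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pretrained CLIP model and its matching image preprocessing pipeline
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate labels written as natural-language prompts
image = preprocess(Image.open("whale_watching.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a whale", "a photo of a dog", "a photo of a beach"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # Similarity logits between the image and each candidate caption
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(labels, probs[0])))  # the highest-probability prompt is the predicted label
```

Because the class names are just free-form prompts, adding a new category needs no retraining, only a new sentence.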

A contrastive objective learns better representations by pulling the representations of matching (image, text) pairs close together and pushing mismatched pairs far apart.
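A minimal sketch of such a symmetric contrastive loss in PyTorch, assuming image and text embeddings of the same dimension (the temperature value is only illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: entry (i, j) compares image i with caption j
    logits = image_emb @ text_emb.t() / temperature

    # The matching (image, caption) pairs lie on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_images = F.cross_entropy(logits, targets)      # image -> text direction
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_images + loss_texts) / 2

# Toy batch of 8 random image and text embeddings with 512 dimensions
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```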

CLIP leverages multiple modalities, natural language and images, as its sources for learning, much like we humans do.

The core idea of CLIP is to efficiently learn visual representations from text paired with images: the model learns to associate a caption as a whole with its image, rather than predicting the exact words of that text.
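Put differently, the training task is to predict which whole caption in a batch belongs to which image. The NumPy sketch below illustrates that pairing step; the random vectors stand in for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a batch of 4 image embeddings and their 4 whole-caption embeddings
image_emb = rng.normal(size=(4, 512))
text_emb = rng.normal(size=(4, 512))

# Normalize so dot products are cosine similarities
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# 4 x 4 similarity matrix: each image scores every caption in the batch
similarity = image_emb @ text_emb.T

# Each image picks the caption it matches best; training pushes this choice
# toward the true pairing (the diagonal), not toward the caption's exact words
predicted_caption = similarity.argmax(axis=1)
print(predicted_caption)
```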

