CLIP: OpenAI's Multi-Modal Model

Learn visual concepts from natural language supervision

3 min read · Mar 28, 2022

When you flip through the photos on your phone, you look at an image and say, "This is a picture of my family last year at Cape Cod, where we watched whales." You use language to describe and classify an image.

Contrastive Language-Image Pre-training (CLIP) is a zero-shot multi-modal model that learns directly from raw text about images. CLIP which efficiently…

Written by Renu Khandelwal