CLIP: OpenAI's Multi-Modal Model
Learn visual concepts from natural language supervision
When you flip through the photos on your phone, you look at an image and say, "This is a picture with my family last year at Cape Cod, where we watched whales." You use language to describe and classify an image.
Contrastive Language-Image Pre-training (CLIP) is a zero-shot, multi-modal model that learns visual concepts directly from raw text about images. CLIP, which efficiently…
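The core idea behind CLIP's zero-shot classification can be illustrated with a minimal sketch: embed an image and a set of candidate captions, then pick the caption whose embedding is most similar to the image embedding. The toy vectors below are stand-ins for the outputs of CLIP's image and text encoders (real CLIP embeddings are high-dimensional and learned); only the matching logic is what CLIP actually does at inference time.

```python
import numpy as np

def cosine_sim(a, b):
    # Normalize each row to unit length, then take dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy embeddings standing in for CLIP's encoders (hypothetical values).
image_emb = np.array([[0.9, 0.1, 0.0]])           # one image
text_embs = np.array([[1.0, 0.0, 0.0],            # "a photo of a whale"
                      [0.0, 1.0, 0.0],            # "a photo of a dog"
                      [0.0, 0.0, 1.0]])           # "a photo of a cat"

sims = cosine_sim(image_emb, text_embs)           # shape (1, 3)
probs = np.exp(sims) / np.exp(sims).sum()         # softmax over captions
labels = ["whale", "dog", "cat"]
prediction = labels[int(probs.argmax())]          # most similar caption wins
print(prediction)
```

Because the "classes" are just text prompts, you can swap in any set of captions at inference time without retraining; that is what makes the model zero-shot.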