CLIP: OpenAI's Multi-Modal Model

When you flip through the photos on your phone, you look at an image and say, "This is a picture with my family last year at Cape Cod, where we watched whales." You use language to describe and classify an image.

Contrastive Language-Image Pre-training (CLIP) is a zero-shot multi-modal model that learns directly from raw text about images. Rather than training on a fixed set of hand-labeled categories, CLIP efficiently learns visual concepts from natural language supervision, which lets it classify images against arbitrary captions at inference time.
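To make that concrete, here is a minimal sketch of zero-shot classification with a published CLIP checkpoint. It assumes the Hugging Face transformers library and the "openai/clip-vit-base-patch32" checkpoint; the image file name and candidate captions are hypothetical.

```python
# A minimal sketch of zero-shot image classification with CLIP,
# assuming Hugging Face `transformers` and the public
# "openai/clip-vit-base-patch32" checkpoint.
# "cape_cod.jpg" is a hypothetical local image file.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels expressed as natural-language captions --
# this is the "raw text about images" CLIP was trained to match.
captions = [
    "a photo of a family watching whales",
    "a photo of a city skyline",
    "a photo of a dog",
]
image = Image.open("cape_cod.jpg")

# Embed the image and all captions in the same space.
inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-caption similarity scores;
# softmax turns them into probabilities over the captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```

Because the labels are just text, you can swap in entirely different captions without retraining anything; that is what makes the model "zero-shot."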
