Multimodal

5 milestones in AI history

Research · The Transformer Era

DALL-E: Text to Image Generation

In January 2021, OpenAI unveiled DALL-E, a model that generated images from text descriptions; its 'armchair in the shape of an avocado' became iconic. Built on GPT-3's architecture adapted for images, it showed that language models could bridge the gap between text and visual creativity.

Research · Generative AI Revolution

GPT-4: Multimodal Intelligence

In March 2023, OpenAI released GPT-4, a multimodal model that could understand both text and images. It passed a simulated bar exam (around the 90th percentile), scored 1410 on the SAT, and demonstrated remarkably nuanced reasoning, a substantial leap from GPT-3.5 in accuracy, safety, and capability.

Product · Generative AI Revolution

Gemini: Google's Multimodal Response

In December 2023, Google launched Gemini, its most capable AI model family, natively multimodal across text, code, images, audio, and video. Gemini Ultra matched or exceeded GPT-4 on many benchmarks, marking Google DeepMind's full response to OpenAI's dominance.

Research · Generative AI Revolution

Sora: AI Video Generation

In February 2024, OpenAI previewed Sora, a model that could generate photorealistic videos up to a minute long from text descriptions. The quality stunned the world: plausible physics, complex camera movements, and coherent scenes that looked like professional cinematography.

Product · Generative AI Revolution

GPT-4o: Omni Model

In May 2024, OpenAI released GPT-4o ('omni'), a unified model that natively processed text, audio, images, and video with near-instant response times. It could hold natural voice conversations with emotional expression, sing, laugh, and respond to visual input in real time.
