Multimodal
5 milestones in AI history
DALL-E: Text-to-Image Generation
In January 2021, OpenAI unveiled DALL-E, a model that could generate images from text descriptions; the prompt 'an armchair in the shape of an avocado' became iconic. Built on GPT-3's architecture adapted to generate images, it showed that language models could bridge the gap between text and visual creativity.
GPT-4: Multimodal Intelligence
In March 2023, OpenAI released GPT-4, a multimodal model that could understand both text and images. It passed a simulated bar exam in the 90th percentile, scored 1410 on the SAT, and demonstrated remarkably nuanced reasoning. It was a massive leap over GPT-3.5 in accuracy, safety, and capability.
Gemini: Google's Multimodal Response
In December 2023, Google launched Gemini, its most capable AI model family to date, natively multimodal across text, code, images, audio, and video. Gemini Ultra matched or exceeded GPT-4 on many benchmarks, marking Google DeepMind's full response to OpenAI's dominance.
Sora: AI Video Generation
In February 2024, OpenAI previewed Sora, a model that could generate photorealistic videos up to a minute long from text descriptions. The quality stunned the world: realistic physics, complex camera movements, and coherent scenes that looked like professional cinematography.
GPT-4o: Omni Model
In May 2024, OpenAI released GPT-4o ('omni'), a unified model that natively processed text, audio, images, and video with near-instant response times. It could hold natural voice conversations with emotional expression, sing, laugh, and respond to visual input in real time.