{
  "title": "Articles/multimodal-models",
  "caption": "Multimodal Models",
  "slug": "multimodal-models",
  "tags": [
    "article",
    "choir-substack",
    "hermes-published",
    "imported-substack",
    "published"
  ],
  "canonical_url": "https://mosiah.org/articles/multimodal-models/",
  "interactive_url": "https://mosiah.org/#Articles%2Fmultimodal-models",
  "markdown_url": "https://mosiah.org/articles/multimodal-models.md",
  "json_url": "https://mosiah.org/json/multimodal-models.json",
  "fields": {
    "caption": "Multimodal Models",
    "created": "20260510152126284",
    "modified": "20260510152126284",
    "original-date": "2024-07-15T23:46:45.188Z",
    "original-url": "https://choir.substack.com/p/multimodal-models",
    "tags": "article hermes-published published imported-substack choir-substack",
    "title": "Articles/multimodal-models",
    "type": "text/vnd.tiddlywiki"
  },
  "text": "# Multimodal Models\n\n//The Key to Continued Gains from Scale//\n\n//Related:// [[sources|Article Sources/multimodal-models]] · [[notes|Article Notes/multimodal-models]] · [[metadata|Article Metadata/multimodal-models]] · [[Published Pieces]]\n\nWhile recent discussions have focused on the potential end of traditional scaling laws in AI development, a compelling counterargument emerges: multimodal learning could be the key to unlocking unprecedented levels of artificial general intelligence (AGI). This essay explores how integrating audio and video data into AI training could lead to a multiple order of magnitude (OOM) increase in practical scaling of general intelligence.\n\n### The Limitations of Text-Based Learning\n\nCurrent large language models (LLMs) have made remarkable strides in processing and generating text. However, they are inherently limited by the nature of their training data. Text alone cannot fully capture the rich, multidimensional nature of human experience and knowledge. This is where multimodal learning comes into play.\n\n### The Power of Multimodal Integration\n\nMultimodal learning involves training AI systems on diverse types of data, including text, audio, and video, as well as biological data like genomics, and data from analog/digital sensors. This approach more closely mimics how humans learn and understand the world. Two key modalities stand out for their potential to dramatically enhance AI capabilities:\n\n**Audio: The Gateway to Emotional Intelligence**\n\n**Video: The Teacher of 3D Physics and Visual Understanding**\n\n#### Audio: Cultivating Emotional Intelligence\n\nAudio data, particularly human speech, contains a wealth of information beyond mere words:\n\n- Tone and inflection convey emotion and intent\n\n- Rhythm and pacing indicate confidence and mental state\n\n- Non-verbal sounds (sighs, laughter, etc.) provide crucial context\n\nBy incorporating audio data, AI models can develop a much more nuanced understanding of human communication and emotional states. This could lead to:\n\n- Enhanced natural language processing that understands not just what is said, but how it's said\n\n- Improved sentiment analysis and emotion recognition\n\n- More natural and empathetic AI interactions\n\nThe potential improvement in emotional intelligence could imply a quantum leap in emotional understanding capabilities compared to what's possible with text alone.\n\n#### Video: Understanding the Physical World\n\nVideo data provides a dynamic, visual representation of the world that is rich in physical and spatial information:\n\n- Motion and interaction of objects demonstrate physics principles\n\n- Spatial relationships and perspective showcase 3D understanding\n\n- Temporal sequences illustrate cause and effect\n\nTraining on video data could enable AI to:\n\n- Develop intuitive understanding of physics and object permanence\n\n- Improve visual reasoning and scene understanding\n\n- Enhance predictive capabilities for physical events\n\nThis deep understanding of the physical world could represent another order of magnitude improvement in AI's general intelligence.\n\nNote that the first generation of video generation models such as OpenAI’s Sora generate videos with pronounced physical artifacts. They do not respect the physics of our shared reality. 
### Synergistic Effects of Multimodal Learning\n\nThe true power of multimodal learning lies not just in the individual contributions of each modality, but in their synergistic combination:\n\n- **Contextual Grounding:** Text descriptions become more meaningful when grounded in audiovisual context.\n\n- **Cross-Modal Inference:** Understanding gained from one modality can enhance learning in others.\n\n- **Multisensory Integration:** Combining inputs from different modalities can lead to more robust and accurate understanding.\n\nFor instance, multimodal models may outperform text-only LLMs at visual reasoning tasks.\n\nThis synergistic effect could potentially lead to a significant improvement in general intelligence.\n\n### Practical Implications of Multimodal Scaling\n\nThe integration of audio and video data into AI training could have far-reaching implications:\n\n1.  **More Human-Like Understanding:** AI systems could develop a more holistic, human-like understanding of the world.\n\n2.  **Enhanced Problem-Solving:** Improved physical and emotional understanding could lead to more creative and effective problem-solving.\n\n3.  **Advanced Robotics and Embodied AI:** Better understanding of the physical world could dramatically improve robotic systems and embodied AI.\n\n4.  **Improved Human-AI Interaction:** AI with enhanced emotional intelligence could interact more naturally and empathetically with humans.\n\n### GPT-4o and Multimodal Learning\n\nOpenAI’s upcoming GPT-4o Voice-to-Voice model is set to significantly advance multimodal learning. This model boasts faster processing and improved capabilities across text, voice, and vision. It can generate human-like audio in real time, smoothly handle interruptions, and speak with a wide vocal range, including whispering, singing, and more, making interactions “more natural and empathetic”. \\[Well, it may be in the uncanny valley. We’ll have to test it to find out.\\]\n\nIn addition, GPT-4o's vision capabilities allow it to understand and discuss images, for example to troubleshoot technical issues or analyze complex graphs, further enhancing its multimodal learning potential. These advancements in voice and vision integration align with the idea that multimodal learning can lead to significant improvements in AI's general intelligence.\n\nThat said, despite the fanfare about multimodal interaction, which does appear to be a step-change improvement in user experience, OpenAI has not claimed that GPT-4o’s multimodality improves its general reasoning abilities. Maybe it does, but we would expect OpenAI to hype it more if it did.\n\n### Challenges and Considerations\n\nWhile the potential of multimodal learning is enormous, it comes with its own set of challenges:\n\n1.  **Data Complexity:** Processing and integrating multimodal data is computationally intensive.\n\n2.  **Alignment Problems:** Ensuring proper alignment and synchronization across modalities is crucial (see the sketch after this list).\n\n3.  **Ethical Considerations:** Multimodal data, especially video, raises new privacy and ethical concerns.\n\n4.  **Interpretability:** Understanding decision-making in multimodal systems may be even more complex than in unimodal systems.\n\n
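Because cross-modal alignment appears above both as a source of synergy and as a challenge, a toy sketch may help (this is the sketch referenced in the list above). It assumes batches of already-computed, paired audio and text embeddings and shows a CLIP-style symmetric contrastive loss, one standard way to pull matched pairs together, not any specific lab’s actual training objective.\n\n```python\n# Toy CLIP-style contrastive alignment between paired audio and text\n# embeddings: matched pairs attract, mismatched pairs repel.\nimport torch\nimport torch.nn.functional as F\n\ndef contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):\n    # Normalize so the dot product is cosine similarity.\n    a = F.normalize(audio_emb, dim=-1)\n    t = F.normalize(text_emb, dim=-1)\n    # Similarity logits: row i should best match column i (its pair).\n    logits = a @ t.T / temperature\n    targets = torch.arange(logits.size(0))\n    # Symmetric InfoNCE: audio-to-text plus text-to-audio directions.\n    return (F.cross_entropy(logits, targets) +\n            F.cross_entropy(logits.T, targets)) / 2\n\n# Toy usage: random embeddings stand in for encoder outputs.\nloss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))\nprint(loss.item())\n```\n\nObjectives like this are one common way to ground one modality in another; the hard engineering lies in collecting well-synchronized pairs at scale, which is exactly the alignment challenge noted above.\n\n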
### Conclusion: A New Frontier in AI Scaling\n\nMultimodal learning may represent the next great frontier in AI development. By incorporating the rich, multidimensional data from audio and video alongside text, we may be on the cusp of unlocking a step-change improvement in practical general intelligence.\n\nThis leap forward could bridge the gap between narrow AI and artificial general intelligence, leading to AI systems that not only process information more effectively but also understand and interact with the world in a fundamentally more human-like way. As we venture into this new territory, the challenge will be not just to process more data, but to integrate these diverse streams of information into a coherent, general intelligence that can navigate the complexities of both the physical and emotional worlds with unprecedented sophistication.\n\nThat said, my personal take is a bit deflationary. Multimodality may not deliver much of a transfer-learning boost to general intelligence, but it will transform the user experience of generative AI, making it more accessible to people who are less literate and allowing everyone to use AI on the go, simply by talking into the mic on their headphones or car stereo.\n\n---\n\n//Originally published on Choir Substack: [[https://choir.substack.com/p/multimodal-models|https://choir.substack.com/p/multimodal-models]].//\n"
}