Friday, December 1, 2023

The Multimodal Phase: Generative AI Accelerates

Last updated on October 6th, 2023 at 09:47 pm

DYK? Your AI can get stressed too. Tell it to relax. Take a deep breath. A new paper unnervingly shows it can generate much better results. See: Telling AI model to “take a deep breath” causes math scores to soar in study

We’re getting to another weird part of AI evolution: the power of a sentence in a cognitive system that only knows words. Does anything really take a deep breath and feel better? No, but the model is advanced enough and has had enough training and data now to know what is supposed to happen when someone does. It reads the instruction and, drawing on patterns in its training data, predicts what usually follows after someone says that. And improves its performance. Possibly. (Why it’s not already going full tilt and how performance improves is a whole other matter.) That is the power of words in an AI trained exclusively on text.
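For anyone who wants to try this, here is a minimal sketch of what "telling it to relax" looks like in practice: a short instruction placed in front of the question. The helper name `with_calming_prefix` is hypothetical, the default phrase is the one widely reported from the study, and actually sending the prompt to a model is left out because that depends on whichever service you use.

```python
# Hypothetical helper: the name and structure are illustrative assumptions,
# not part of any library or of the study's own code.
DEFAULT_PREFIX = "Take a deep breath and work on this problem step-by-step."


def with_calming_prefix(question: str, prefix: str = DEFAULT_PREFIX) -> str:
    """Return a prompt with a short instruction placed before the question."""
    return f"{prefix}\n\n{question.strip()}"


prompt = with_calming_prefix(
    "  If a train travels 60 miles in 1.5 hours, what is its average speed?  "
)
print(prompt)
```

Whether the extra sentence helps on a given model and task is something to test rather than assume.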

And Now It’s Not Just Words. 

Everything OpenAI shows us leads to gasps of delight and disbelief, and its new ability to respond verbally, the image capabilities, and the early DALL-E 3 output are no exception.

Going through the new multimodal capabilities can, like the original, inspire some awe. Whoever is training this is doing a hell of a job. Some of the responses feel almost miraculous. You can now ask ChatGPT to create an image for you the way you’d previously ask it a question (this is called prompting and it’s a whole science in and of itself). You can ask it to explain a complex image, flow chart, or data sheet. And that is just the beginning.

What is Multimodal AI?

Multimodal is the next evolution, and it is set to change everything in generative AI. There are two main reasons why multimodal is a big step forward: one on the training side, and one on the functionality/UX side. It gives us a whole new set of super powerful tools to test and understand and figure out how to hack. What you can actually do with multimodal are the things many of us saw in sci-fi when we were kids:
Instant translation. Just check out what happens when you load a picture with kanji into ChatGPT.

Definitely at the top of the “feels miraculous” list and the kind of thing AI seems to have been invented to do. It will open up old worlds that were previously closed to common experience. There have been a few Twitter threads on the coolest new abilities and I’ve picked four more:

  1. The generation of this image: read the prompts! 
  2. The first front-end engineer agent (crude but effective and with enormous potential) 
  3. Picture to recipe! And not just for food. Distill almost anything to its core elements, then learn how to put it back together. I am definitely going to be able to repair all my Apple power cords by myself now.
  4. Translation from an image! Bye, old school.

On the training side, multimodal creates better datasets. 

The use of various data types (or modalities) can lead to better performance in AI systems because:

1. Complementary Information: Different data types can capture different facets of information. For instance, while a photo might reveal the appearance of an event, an audio clip can capture the sounds, and textual data might provide context or specific details. Combining these can give a fuller picture.

2. Redundancy: Having multiple types of data can serve as a check against errors or omissions in one modality. If one data type is unclear or ambiguous, another might provide clarity.

3. Robustness: Relying on multiple data types can make the system more resilient to noise or inaccuracies in any single data source. For instance, in a noisy environment, visual cues might help an AI system better understand spoken words.

4. Generalization: Training on diverse data types can help models generalize better to unseen data. They can learn underlying patterns and relationships that are consistent across different modalities.

5. Contextual Understanding: Different data types can provide context that might be missing from a single data source. For example, textual metadata about an image can provide background information that helps an AI interpret the image’s content.

6. Enriched Feature Representation: Multimodal systems can extract and combine features from different data types, leading to a richer representation that captures more nuances and complexities.

7. Improved Decision-Making: For tasks like anomaly detection, using multiple data types can provide multiple perspectives, leading to more accurate and reliable decision-making.

In essence, integrating information from various data types allows AI systems to leverage the strengths of each modality, offsetting the weaknesses or limitations of others, leading to more comprehensive and accurate outcomes.
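The redundancy and robustness points above can be sketched in a few lines. This is a toy illustration only, not how any production multimodal model works: it assumes each modality has already produced a confidence score between 0 and 1, and the function name `fuse_scores`, the weights, and the scores are all made up for the example.

```python
from typing import Dict, Optional


def fuse_scores(
    scores: Dict[str, Optional[float]], weights: Dict[str, float]
) -> float:
    """Weighted average of per-modality scores, skipping missing modalities."""
    # A modality that failed (for example, unusable audio) reports None.
    available = {name: s for name, s in scores.items() if s is not None}
    if not available:
        raise ValueError("no modality produced a score")
    # Renormalize over the modalities that did report a score.
    total_weight = sum(weights[name] for name in available)
    return sum(weights[name] * s for name, s in available.items()) / total_weight


# Illustrative weights and scores (assumptions, not measured values).
weights = {"image": 0.5, "audio": 0.3, "text": 0.2}
all_present = fuse_scores({"image": 0.9, "audio": 0.6, "text": 0.8}, weights)
audio_missing = fuse_scores({"image": 0.9, "audio": None, "text": 0.8}, weights)
print(round(all_present, 2), round(audio_missing, 2))
```

When the audio score is missing, the remaining modalities still produce an answer, which is the basic idea behind one data type covering for another.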

Initial take: Google, you’d better have some good tricks up your sleeve, because right now David is eating Goliath’s lunch. Some early testers are saying the Bing DALL-E functionality is less impressive, that hallucination-like issues are happening in images, and that video is really untested … but it’s still stunning as hell to be able to do … this.


The concerns are enormous. Integrating multiple modalities into AI systems introduces new complexities that can elevate certain risks: more complexity around security, privacy, and training; the need to manage conflicts between modalities; and even increased complexity around tokens, since audio and video are harder to break into discrete units than text. Because of this, it’s more resource intensive, and QA demands are higher while quality may be lower until the training wheels are off.

The biggest issue, however, is the potential for misuse. The technology has gotten so good it’s its own worst enemy. The risks that need to be mitigated exist because OpenAI has gotten so good at what it does. Managing those risks is going to feel like hunting at midnight for a while. We have no idea of the potential of any of these advances to help or harm us.


Jennifer Evans
principal, @patternpulseai. author, THE CEO GUIDE TO INDUSTRY AI. former chair @technationCA, founder @b2bnewsnetwork #basicincome activist. Machine learning since 2009.