
Text-to-Image Generation in Computer Vision

October 4, 2022


AI breakthroughs come in waves, and today we are riding an exponential wave of generative AI and text-to-image models. Most notably, models like DALL-E enable the creation of photo-realistic images from textual descriptions, but this is not the only exciting use case! At Passio we are focused on creating the most advanced domain-specific AI and computer vision technology, and we are extremely excited about the applications of text-to-image models in synthetic data generation and in training machine learning models at scale.

In this post we are excited to share some of our early experiments with generative AI in computer vision applications and the key lessons we’ve learned using DALL-E and Stable Diffusion.

Generative AI: the new force in the computer vision game

As the world of computer vision continues to evolve, so does the technology behind it. And this evolution is happening at an unprecedented pace. The technology stack is changing almost daily with new AI tools appearing overnight to solve challenges faced by earlier generations of AI. It feels like AI is starting to build AI tools to improve itself.

One of the recent developments in AI and computer vision is text-to-image generation, most notably represented by models like DALL-E, Stable Diffusion, and Midjourney. This technology enables the creation of photo-realistic images from textual descriptions, which can be used to train and improve visual recognition algorithms and to generate synthetic data for training and testing machine learning models at scale.

Quick overview of text-to-image generation

Text-to-image generation is a relatively new technology that has only recently begun to be used in computer vision applications. Early research systems from the 2000s could generate only simple, schematic images from textual descriptions. Since then, a series of advances in deep generative modeling has led to far more sophisticated systems that can generate photo-realistic images.

Text-to-image generation systems work by using a deep learning algorithm to learn the mapping between textual descriptions and images. The algorithm is trained on a dataset of images paired with their corresponding textual descriptions; once trained, it can generate images from new textual descriptions. The easiest and fastest way to experiment with text-to-image generation is via deployed models hosted on websites like https://www.craiyon.com/.

For more inspiration and an excellent overview of the business applications of generative AI, we encourage you to check out the recent post from Sequoia Capital: Generative AI: A Creative New World.

How can text-to-image generation improve real-world computer vision solutions?

Text-to-image generation has a number of potential applications. One application is in the training of visual recognition algorithms. By generating synthetic data, text-to-image generation can be used to train visual recognition algorithms more effectively. In addition, text-to-image generation can be used to generate images for testing and debugging computer vision applications. But how effective can this approach be? What are the limitations? Can synthetic text-to-image data replace the need for real-world data? Why not use model-to-model and conventional transfer learning instead?

To explore these questions we decided to test text-to-image generation in our most advanced use case: the recognition and analysis of foods. Over the past 4 years, our team at Passio has built, arguably, the most advanced and robust food recognition visual AI dataset, with millions of images representing thousands of classes structured in our unique visual food taxonomy (check out how we did it here). And we decided to test text-to-image across several food recognition use cases.

Generating training data

We started by generating training data using text-to-image. Below are several examples of the data we generated, which you can compare with real-world data collected by our team.

By analyzing this data we can make a number of interesting observations:

  • Running image generation repeatedly with the same prompt produces visually similar images.
  • Adding textual variety to prompts adds variety to the images, albeit to a limited extent.
  • Using a variety of text prompts allows us to create a reasonable dataset of tens or even hundreds of diverse images.
  • Text-to-image allows us to expand real-world datasets but cannot replace them.
  • Complex recognition cases and complex visual concepts require real-world data.
  • Text generation models like GPT-3 can be used together with text-to-image models to create diverse text prompts.
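To illustrate the prompt-variety points above, diverse prompts can also be generated programmatically from simple templates. The sketch below is a minimal example; the food labels, settings, and styles are invented for illustration, and in practice a language model like GPT-3 could propose much richer variations.

```python
from itertools import product

# Hypothetical label and modifier lists for illustration.
foods = ["scrambled eggs", "oatmeal", "white fish fillet"]
settings = ["on a white plate", "in a bowl", "on a wooden table"]
styles = ["studio photo", "overhead shot", "close-up photo"]

def build_prompts(foods, settings, styles):
    """Combine templates into a diverse set of text prompts."""
    return [f"{style} of {food} {setting}"
            for food, setting, style in product(foods, settings, styles)]

prompts = build_prompts(foods, settings, styles)
print(len(prompts))   # 3 x 3 x 3 = 27 distinct prompts
print(prompts[0])     # "studio photo of scrambled eggs on a white plate"
```

Even this trivial combinatorial scheme yields dozens of distinct prompts per class, which maps directly to the "tens or hundreds of diverse images" observation above.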

15-food classifier: real vs. DALL-E mini generated data

Key Facts:

  • We selected 15 food labels
  • We used our standard background dataset (faces, cars, flowers, clothing, etc.)
  • 100 test images per class; no DALL-E mini generated images in the test set.
  • 100 train images generated by DALL-E mini.
  • 100 train images consisting of real images.
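The training mix described above can be assembled with a few lines of code. This is a simplified sketch, assuming image paths are already grouped per class; the directory and file names are hypothetical.

```python
import random

def build_training_set(real_by_class, synthetic_by_class, seed=0):
    """Merge real and DALL-E-generated image paths into one labeled list."""
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    samples = []
    for label, paths in real_by_class.items():
        for path in paths:
            samples.append((path, label, "real"))
        for path in synthetic_by_class.get(label, []):
            samples.append((path, label, "synthetic"))
    rng.shuffle(samples)
    return samples

# Toy example with one of the 15 classes (paths are invented).
real = {"oatmeal": [f"real/oatmeal_{i}.jpg" for i in range(100)]}
synth = {"oatmeal": [f"dalle/oatmeal_{i}.png" for i in range(100)]}
train = build_training_set(real, synth)
print(len(train))  # 200 training images for this class
```

Tagging each sample with its source ("real" vs. "synthetic") makes it easy to later ablate the synthetic portion, as in the accuracy comparison below.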


Accuracy with DALL-E mini + real data: 94%

Accuracy with real data: 90%

Accuracy with DALL-E mini data: 75% 

  • DALL-E only: main confusions include generic egg carton and cooked white fish fillet.
  • Passio real-world Nutrition-AI data only: main issue is oatmeal patties.
  • DALL-E + Passio real data: levels out the issues observed in the real-world-only and DALL-E-only runs.
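The per-run confusions mentioned above can be surfaced with a simple tally over a model's predictions. Here is a minimal sketch; the labels and predictions are invented for illustration.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=2):
    """Count (true, predicted) pairs where the model was wrong."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

# Invented predictions over a few food classes.
y_true = ["egg carton"] * 4 + ["fish fillet"] * 3 + ["oatmeal"] * 3
y_pred = ["egg carton", "fish fillet", "fish fillet", "egg carton",
          "fish fillet", "egg carton", "fish fillet",
          "oatmeal", "oatmeal", "oatmeal"]
print(top_confusions(y_true, y_pred))
# [(('egg carton', 'fish fillet'), 2), (('fish fillet', 'egg carton'), 1)]
```

Comparing these tallies between the DALL-E-only, real-only, and mixed runs is what lets us say which data source fixes which confusion.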

General conclusions

Our general conclusion is that synthetic data generated using text-to-image models can be extremely helpful in building machine learning models. The quality of data generated with DALL-E and similar models is very high when the prompts are properly constructed. The use of this synthetic data still requires a high degree of supervision, especially when models trained with that data are intended for real-world, production-level applications. The use of synthetic data is likely to become a critical component of dataset development efforts, and we are excited to be integrating text-to-image data generation into the Passio Mobile AI platform.

Quick summary points:

  • DALL-E mini is more helpful for some labels than for others, and there is currently no easy way to predict this upfront.
  • Good prompts are important both for data quality and for introducing variety.
  • DALL-E + real data gives better accuracy than training on real-world data alone.

Additional examples of real-world and synthetic data

Real-world images used for training
DALL-E generated images
Test set images (real-world only)