Voice, text, images, videos - everything is a number for GenAI!

Dr. Rafał Jaworski
Mar 27, 2025

Updated: Sep 11, 2025

By Dr. Rafał Jaworski, AI Engineering Executive, Tech Lead at PwC

Generative AI solutions never cease to amaze us, especially when it comes to multimodal capabilities. It appears that large AI models are happy to ingest not only text but also images, audio files and videos. If you ask them nicely, they can generate any type of content for you, including voice.

How is this possible? Word of the day: embeddings. Embeddings are the modern approach to information modelling. Essentially, each chunk of information is assigned a list of several hundred numbers, known as a vector. This little chunk of information can be any of the following:

Written word or part of a word (aka "token")
Small patch of an image
Spoken syllable
Part of a video frame

The embeddings assigned to the information chunks provide a unique information fingerprint which is then used by the model during processing. Computers always think in numbers, so they love embeddings! If you want to know more, check out this article.
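
To make this a bit more concrete, here is a minimal Python sketch of the idea, not how any particular model does it. Everything in it is an assumption made for illustration: the tiny vocabulary, the 8-number embedding size (real models use several hundred numbers or more), the random values (a real model learns them during training) and the helper names make_embedding_table, embed_tokens and embed_patch. It shows a few text tokens and a small image patch both ending up as lists of numbers of the same length.

```python
import random

EMBEDDING_DIM = 8  # assumption: kept tiny for readability


def make_embedding_table(vocabulary, dim=EMBEDDING_DIM, seed=0):
    """Assign each token a list of `dim` numbers (random here, learned in a real model)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for token in vocabulary}


def embed_tokens(tokens, table):
    """Look up the list of numbers for each token, in order."""
    return [table[token] for token in tokens]


def embed_patch(pixels, weights):
    """Turn a flattened image patch into numbers via a weighted sum per output number."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


# Hypothetical vocabulary of five tokens.
vocabulary = ["every", "thing", "is", "a", "number"]
table = make_embedding_table(vocabulary)
text_vectors = embed_tokens(["every", "thing", "is", "a", "number"], table)

# Hypothetical 2x2 grayscale patch, flattened to 4 pixel intensities in [0, 1].
patch = [0.0, 0.5, 0.5, 1.0]
rng = random.Random(1)
patch_weights = [[rng.uniform(-1.0, 1.0) for _ in patch] for _ in range(EMBEDDING_DIM)]
patch_vector = embed_patch(patch, patch_weights)

# Five tokens, each with 8 numbers, and one image patch with 8 numbers.
print(len(text_vectors), len(text_vectors[0]), len(patch_vector))  # prints: 5 8 8
```

Once the word pieces and the image patch are both plain lists of numbers of the same size, the model can process them in the same way, which is the point of the whole trick.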

