
Voice, text, images, videos - everything is a number for GenAI!

Updated: Sep 11

By Dr. Rafał Jaworski, AI Engineering Executive, Tech Lead at PwC



Generative AI solutions never cease to amaze us, especially when it comes to multimodal capabilities. It appears that large AI models are happy to ingest not only text but also images, audio files and videos. If you ask them nicely, they can generate any type of content for you, including voice.


How is this possible? Word of the day: embeddings. Embeddings are the modern approach to information modelling. Essentially, each chunk of information is represented as a list (a vector) of numbers, typically several hundred or more depending on the model. This little chunk of information can be any of the following:


  • Written word or part of a word (aka "token")

  • Small patch of an image

  • Spoken syllable

  • Part of a video frame


The embeddings assigned to the information chunks provide a unique information fingerprint which is then used by the model during processing. Computers always think in numbers, so they love embeddings! If you want to know more, check out this article.
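To make the idea a bit more concrete, here is a minimal sketch of how a word and an image patch can both end up as vectors of the same length. Everything in it is an assumption made for illustration: the function names are hypothetical, the vectors have only 8 numbers instead of several hundred, and the random values stand in for weights that a real model would learn during training.

```python
import random

EMBED_DIM = 8  # toy size; real models use hundreds or thousands of numbers


def make_token_table(vocab, dim=EMBED_DIM, seed=0):
    """Build a lookup table: one vector per token (random stand-in for learned weights)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in vocab}


def embed_tokens(tokens, table):
    """Text path: each token is simply looked up in the table."""
    return [table[tok] for tok in tokens]


def make_patch_projection(patch_size, dim=EMBED_DIM, seed=1):
    """Build a dim x patch_size weight matrix (random stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(patch_size)] for _ in range(dim)]


def embed_patch(pixels, projection):
    """Image path: flattened pixel values are multiplied by the weight matrix."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in projection]


vocab = ["every", "thing", "is", "a", "number"]
table = make_token_table(vocab)
text_vectors = embed_tokens(["every", "thing"], table)

patch = [0.0, 0.5, 1.0, 0.25]  # a 2x2 grayscale patch, flattened
projection = make_patch_projection(len(patch))
patch_vector = embed_patch(patch, projection)

# Both kinds of input are now vectors of the same length
print(len(text_vectors[0]), len(patch_vector))  # prints "8 8"
```

Once text and image chunks share the same numeric format, the rest of the model can process them in the same way, which is the point of the sketch.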
