Google DeepMind recently announced Gemini, its new AI model designed to compete with OpenAI’s ChatGPT. While both models are examples of “generative AI”, which learns to find patterns in input training data in order to generate new data (images, words, or other media), ChatGPT is a large language model (LLM) that focuses on producing text.

Just as ChatGPT is a conversational web app based on the GPT neural network (trained on large amounts of text), Google has a conversational web app called Bard, which was based on a model called LaMDA (trained on dialogue). But Google is now upgrading it based on Gemini.

What sets Gemini apart from earlier generative AI models like LaMDA is that it’s a “multimodal model.” This means that it works directly with multiple input and output modes: in addition to text input and output, it also supports images, audio and video. Accordingly, a new acronym is emerging: LMM (large multimodal model), not to be confused with LLM.

In September, OpenAI announced a model called GPT-4Vision, which can also work with images, audio and text. However, it is not a fully multimodal model in the way that Gemini promises to be.

For example, while ChatGPT-4, powered by GPT-4V, can work with audio inputs and generate speech output, OpenAI has confirmed that it does this by converting speech to text on input using another deep learning model called Whisper. ChatGPT-4 also converts text to speech on output using a different model, meaning GPT-4V itself works purely with text.

Likewise, ChatGPT-4 can generate images, but it does so by generating text prompts that are passed to a separate deep learning model called Dall-E 2, which converts text descriptions into images.

In contrast, Google designed Gemini to be “natively multimodal.” This means that the core model directly handles a range of input types (audio, images, video and text) and can also output them directly.
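
For readers who think in code, the difference between the two designs can be pictured with a toy sketch. Everything here is a hypothetical stand-in: the function names, the `Audio` class and the `trace` list are invented for illustration and are not real OpenAI or Google APIs. The sketch only shows which components handle the data in each design, not how any real model works internally.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    # Stand-in for a waveform: we simply store the words that were spoken.
    transcript: str


# --- Pipeline design: separate models around a text-only core (hypothetical) ---

def speech_to_text(audio: Audio, trace: list) -> str:
    """Stand-in for a separate speech-recognition model."""
    trace.append("speech_to_text")
    return audio.transcript


def text_model(prompt: str, trace: list) -> str:
    """Stand-in for a core model that only ever sees and emits text."""
    trace.append("text_model")
    return f"reply to: {prompt}"


def text_to_speech(text: str, trace: list) -> Audio:
    """Stand-in for a separate speech-synthesis model."""
    trace.append("text_to_speech")
    return Audio(transcript=text)


def pipeline_assistant(audio_in: Audio, trace: list) -> Audio:
    """Audio goes in and out, but three models are chained to do it."""
    text_in = speech_to_text(audio_in, trace)
    text_out = text_model(text_in, trace)
    return text_to_speech(text_out, trace)


# --- Natively multimodal design: one model handles audio directly (hypothetical) ---

def native_multimodal_model(audio_in: Audio, trace: list) -> Audio:
    """A single model maps audio input to audio output with no text hand-off."""
    trace.append("multimodal_model")
    return Audio(transcript=f"reply to: {audio_in.transcript}")


if __name__ == "__main__":
    pipeline_trace, native_trace = [], []
    pipeline_assistant(Audio("hello"), pipeline_trace)
    native_multimodal_model(Audio("hello"), native_trace)
    print("pipeline stages:", pipeline_trace)
    print("native stages:  ", native_trace)
```

In the pipeline version, anything the speech-recognition step discards (tone of voice, background sounds) never reaches the core model, whereas a natively multimodal model could in principle make use of it.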

The verdict

The distinction between these two approaches might seem academic, but it is important. The general conclusion from Google’s technical report and other qualitative tests to date is that the current publicly available version of Gemini, called Gemini 1.0 Pro, is generally not as good as GPT-4 and is more similar to GPT 3.5 in its capabilities.

Google also announced a more powerful version of Gemini called Gemini 1.0 Ultra and presented some results showing that it is more powerful than GPT-4. However, this is difficult to assess for two reasons. The first is that Google has not released Ultra yet, so the results cannot be independently validated at present.

The second reason it is difficult to assess Google’s claims is that Google chose to release a somewhat misleading demonstration video. The video shows the Gemini model providing interactive and fluid commentary on a live video stream.

However, as originally reported by Bloomberg, the demonstration in the video was not carried out in real time. For example, the model had learned some specific tasks beforehand, such as the three-cups-and-ball trick, in which Gemini keeps track of which cup the ball is under. To do this, it had been provided with a sequence of still images in which the presenter’s hands are on the cups being swapped.

Promising future

Despite these issues, I believe that Gemini and large multimodal models represent a really exciting advance for generative AI. This is because of both their future capabilities and the competitive landscape of AI tools. As I discussed in a previous article, GPT-4 was trained on about 500 billion words, essentially all good-quality public text.

The performance of deep learning models is generally driven by increasing model complexity and the amount of training data. This has led to the question of how further improvements can be achieved, as we have almost run out of new training data for language models. However, multimodal models open up enormous new reserves of training data in the form of images, audio and videos.

AIs like Gemini that can be trained directly on all this data are likely to have much greater capabilities in the future. For example, I would expect models trained on video to develop sophisticated internal representations of so-called “naive physics”. This is the basic understanding that humans and animals have about causality, motion, gravity, and other physical phenomena.

I’m also excited to see what this means for the competitive landscape of AI. Over the past year, despite the emergence of many generative AI models, OpenAI’s GPT models have been dominant, showing a level of performance that other models have not been able to match.

Google’s Gemini signals the emergence of a serious competitor that will help advance the field. Of course, OpenAI is almost certainly working on GPT-5, and we can expect it to be multimodal as well and to show remarkable new capabilities.

That being said, I’m keen to see the emergence of very large multimodal models that are open source and non-commercial, which I hope are on the way in the coming years.

I also like some features of the Gemini implementation. For example, Google announced a version called Gemini Nano, which is much more lightweight and can run directly on mobile phones.

Lightweight models like this reduce the environmental impact of AI computing and offer many advantages from a privacy perspective, and I’m sure that this development will lead to competitors following suit.

This article was originally published at