There is no way Google Gemini is an AGI

Google Gemini

General AI has been defined as a type of artificial intelligence capable of performing any intellectual task a human can perform. The definition is somewhat vague, but let’s go with it. What people really mean when they talk of General AI is an AI that can reason, think, and probably even feel the way a human does. That latter definition comes close to a sentient or conscious AI. Whichever definition you prefer, the rumour is that the already hyped Google Gemini will be the first General AI mankind encounters.

Google Gemini is a generative pre-trained transformer (GPT) designed to be multimodal, meaning that it can handle multiple modes of input and output at the same time. For example, Gemini can not only generate text from text, but also generate text from images, images from text, or even images from images. Gemini can also answer questions based on text or images, summarize videos or podcasts, create graphics or layouts for magazines or websites, and much more. It is because of these multimodal capabilities that some have seen it as a type of General AI.

But I don’t think describing Google Gemini as General AI is justified. From where I sit, there are four types of AI: narrow AI, multi-narrow AI, General AI, and strong superintelligent AI (the science-fiction singularity kind).

Narrow AI is the type of AI we have been used to ever since we first encountered computers: systems so good at one task (or one closely related class of tasks) that no human can ever beat them at it. The Stockfish chess engine, for example, is a narrow AI that can play chess and nothing else. At times the algorithm behind such a system can be adapted to another field. We have seen this with Google’s DeepMind, whose algorithm that was extremely good at playing Go was adapted to play chess, then other games, and finally to tackle the protein-folding problem. But these algorithms are still narrowed to uniquely related tasks, so they still sit on the narrow-AI spectrum.

Recently we have started seeing an expandable type of narrow AI: systems built on generative pre-trained transformer technology. ChatGPT currently leads this pack of expanded AI, to the point where one may be tempted to classify it as AGI. ChatGPT and other GPT-powered systems can create artistic content such as music, images, and poetry, retrieve information, perform mathematical calculations, write code for computer programs, and even make human-like errors. The ability to perform such a wide range of intellectual tasks can’t simply be classified as narrow AI.

Though broad, these types of AI are still specific in the sense that they receive inputs (usually text, images, or video) and generate outputs tied to the domains of those inputs. AGI, on the other hand, is expected to learn and reason from intuition. This is why we cannot rightly expect Google Gemini to be a General AI.


What Google Gemini has been created to be is a type of expanded narrow AI that, at the back end, delegates specific tasks to specialized systems that can handle them… much like GPT-4 with its numerous plugins. The fact that ChatGPT can now perform a wide range of tasks thanks to those plugins doesn’t make it an artificial general intelligence – at least not in the sense of what we expect AGI to be.

That said, Google Gemini will be quite promising – especially now that I can confidently classify it as a multi-narrow AI, or what they prefer to call multimodal. Multimodal AI is a branch of AI that deals with multiple data modalities, such as text, images, audio, and video. Multimodal AI models can perform either cross-modal tasks, such as translating speech to text or captioning images, or multimodal tasks, such as generating a video from a text description or synthesizing a voice from an image.

Multimodal AI is challenging because it requires integrating different types of data that have different characteristics and structures. For example, text is discrete and sequential, while images are continuous and spatial. Audio and video are both temporal and spatial, but have different sampling rates and resolutions. Therefore, multimodal AI models need to learn how to represent, align, fuse, and transform different data modalities in a coherent and meaningful way. According to Google’s blog post, Google Gemini has the following capabilities:

  • Multimodal understanding: Google Gemini can understand the content and context of different data modalities, such as text, images, audio, and video. For example, it can answer questions based on a given image or video, or extract relevant information from a speech or text document.
  • Multimodal generation: Google Gemini can generate realistic and diverse data modalities, such as text, images, audio, and video. For example, it can create a story from a given image or video, or produce a song from a given text or image.
  • Multimodal transformation: Google Gemini can transform one data modality into another, such as text to image, image to audio, audio to video, or video to text. For example, it can draw a picture from a given text description, or narrate a video from a given image sequence.
  • Multimodal fusion: Google Gemini can fuse multiple data modalities into one output modality, such as text + image to video, audio + video to text, or image + audio to image. For example, it can create a video from a given text and image pair, or transcribe a speech from a given audio and video pair.
  • Multimodal editing: Google Gemini can edit existing data modalities based on user input or feedback. For example, it can modify an image based on a given text instruction, or change the tone of a voice based on a given emotion.


Google Gemini is based on a deep neural network architecture that consists of four main components: an encoder, a decoder, an attention mechanism, and a fusion module. The encoder and decoder are responsible for encoding and decoding different data modalities into latent representations. The attention mechanism is responsible for aligning and weighting different data modalities based on their relevance and importance. The fusion module is responsible for combining different data modalities into one output modality.
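The pipeline described above – encode each modality, align the modalities with attention, fuse them, then decode an output – can be made concrete with a toy sketch. Google has not published Gemini’s actual implementation, so the code below is only a conceptual NumPy illustration of how two modalities might flow through such components; all dimensions and projection matrices are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared latent dimension (assumed for illustration)

def encode(x, W):
    """Toy 'encoder': project raw modality features into a shared latent space."""
    return x @ W

def attend(query, context):
    """Scaled dot-product attention: weight context vectors by relevance to each query."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ context

def fuse(a, b):
    """Toy additive fusion: combine two aligned modalities into one representation."""
    return a + b

def decode(z, W):
    """Toy 'decoder': map the fused latent representation to an output space."""
    return z @ W

# Fake inputs: 4 text tokens (dim 16) and 6 image patches (dim 32)
text = rng.normal(size=(4, 16))
image = rng.normal(size=(6, 32))

t = encode(text, rng.normal(size=(16, D)))
v = encode(image, rng.normal(size=(32, D)))
attended = attend(t, v)   # align text tokens with image patches
fused = fuse(t, attended) # fuse the two modalities
out = decode(fused, rng.normal(size=(D, 10)))
print(out.shape)  # (4, 10)
```

Real systems replace each toy function with deep learned networks, but the division of labour – encode, align, fuse, decode – is the same one the paragraph above describes.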

Google Gemini uses two types of attention mechanisms: self-attention and cross-attention. Self-attention allows the model to capture the internal structure and dependencies within each data modality. Cross-attention allows the model to capture the external relationships and interactions between different data modalities. Google Gemini also uses two types of fusion modules: additive fusion and multiplicative fusion. Additive fusion allows the model to add up the features of different data modalities into one output feature. Multiplicative fusion allows the model to multiply the features of different data modalities into one output feature.
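The two distinctions above can also be shown in a few lines. In this conceptual NumPy sketch (again, not Gemini’s published code), self-attention attends within a single modality, cross-attention attends from one modality into another, and the two fusion styles differ only in the combining operator:

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(query, keys_values):
    """Scaled dot-product attention over a set of key/value vectors."""
    scores = query @ keys_values.T / np.sqrt(query.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ keys_values

text = rng.normal(size=(5, 8))   # 5 text-token vectors (made-up sizes)
image = rng.normal(size=(7, 8))  # 7 image-patch vectors

# Self-attention: each text token attends to the other text tokens,
# capturing structure *within* the modality.
self_out = attention(text, text)

# Cross-attention: each text token attends to the image patches,
# capturing relationships *between* modalities.
cross_out = attention(text, image)

# Additive fusion: element-wise sum of the per-modality features.
additive = self_out + cross_out

# Multiplicative fusion: element-wise product (a gating-style interaction).
multiplicative = self_out * cross_out

print(additive.shape, multiplicative.shape)  # (5, 8) (5, 8)
```

Note that both fusion outputs have the same shape as the inputs; the choice between adding and multiplying changes how strongly one modality can amplify or suppress features of the other.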

Google Gemini is trained on a large-scale multimodal dataset that contains millions of examples of different data modalities. The dataset covers various domains and topics, such as art, music, literature, science, sports, etc. The dataset also includes various types of labels and annotations, such as captions, keywords, categories, sentiments, etc. The dataset is used to train Google Gemini on various multimodal tasks using supervised learning or self-supervised learning methods. If Google Gemini works as promised, it will be possible to use it in one of the following areas:

  • Education: Google Gemini can be used as an interactive learning tool that can provide multimodal feedback and guidance to students and teachers. For example, it can generate quizzes or exercises based on a given text or image material, or explain concepts or phenomena using different data modalities.
  • Entertainment: Google Gemini can be used as a creative assistant that can produce multimodal content for entertainment purposes. For example, it can generate stories, poems, songs, jokes, or games based on a given text or image input, or create personalized content based on user preferences or emotions.
  • Healthcare: Google Gemini can be used as a diagnostic tool that can analyze multimodal data from patients and provide multimodal reports or recommendations. For example, it can process medical images, audio recordings, or text documents and generate diagnoses, prescriptions, or suggestions using text, images, audio, or video.
  • Communication: Google Gemini can be used as a communication tool that can facilitate multimodal interaction between humans and machines or humans and humans. For example, it can translate speech to text or text to speech in different languages, or convert text to image or image to text in different formats.


