Gemini’s data analytics capabilities are not as good as Google claims

In this photo illustration, a Gemini logo and a welcome message on the Gemini website are displayed on two screens.

One of the selling points of Google’s generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press conferences and demos, Google has repeatedly claimed that its models can perform previously impossible tasks thanks to their “long context,” such as summarizing documents spanning hundreds of pages or searching through scenes in movie footage.

But new research suggests the models aren’t actually very good at those tasks.

Two separate studies examined how well Google’s Gemini models and others make sense of enormous amounts of data – think ‘War and Peace’-length texts. Both found that Gemini 1.5 Pro and 1.5 Flash struggle to correctly answer questions about large data sets; in one series of document-based tests, the models gave the correct answer only 40% to 50% of the time.

“While models like Gemini 1.5 Pro can technically handle long contexts, we’ve seen many cases indicating that the models don’t really ‘understand’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told JS.

Gemini’s context window falls short

A model’s context, or context window, refers to the input data (e.g. text) that the model takes into account before generating output (e.g. additional text). A simple question such as “Who won the 2020 US presidential election?” can serve as context, just like a movie script, a show, or an audio clip. And as context windows grow, so does the size of the documents that fit into them.

The latest versions of Gemini can hold over 2 million tokens as context. (“Tokens” are subdivided pieces of raw data, such as the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That equates to about 1.4 million words, two hours of video or 22 hours of audio — the largest context of any commercially available model.
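As a rough back-of-the-envelope illustration of how those figures relate, here is a short Python sketch; the 0.7 words-per-token ratio is an assumed average for English prose, not a published Gemini tokenizer statistic.

```python
# Back-of-the-envelope conversion between tokens and words, using the figures
# cited above. WORDS_PER_TOKEN is an assumed average for English prose, not a
# published Gemini tokenizer statistic.
CONTEXT_WINDOW_TOKENS = 2_000_000
WORDS_PER_TOKEN = 0.7  # assumption: roughly 0.7 English words per token

approx_words = CONTEXT_WINDOW_TOKENS * WORDS_PER_TOKEN
print(f"A {CONTEXT_WINDOW_TOKENS:,}-token context holds roughly {approx_words:,.0f} words")
# -> A 2,000,000-token context holds roughly 1,400,000 words
```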

In a briefing earlier this year, Google showed off several pre-recorded demos intended to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing broadcast — about 402 pages — for quotes containing jokes, then find a scene in the broadcast that resembled a pencil sketch.

Oriol Vinyals, VP of research at Google DeepMind, who led the briefing, described the model as “magical.”

“[1.5 Pro] performs these kinds of reasoning tasks on every page and every word,” he said.

That might have been an exaggeration.

In one of the aforementioned studies benchmarking these capabilities, Karpinska, along with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to understand without reading the books in their entirety.

Given a statement like “Using her skills as Apoth, Nusis can reverse engineer the type of portal opened by the reagent key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash – after ingesting the relevant book – had to say whether the statement was true or false and explain their reasoning.
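A minimal sketch of what such a claim-verification query might look like, assuming a generic `complete(prompt) -> str` callable rather than any particular vendor API; the studies’ actual prompts and tooling may differ.

```python
def verify_claim(book_text: str, claim: str, complete) -> str:
    """Ask a long-context model to judge a true/false claim about a book.

    `complete` is any function that sends a prompt to a model and returns its
    text response; the prompt wording here is illustrative only.
    """
    prompt = (
        "Read the book below, then decide whether the claim that follows is "
        "TRUE or FALSE, and explain your reasoning.\n\n"
        f"<book>\n{book_text}\n</book>\n\n"
        f"Claim: {claim}\n"
        "Answer TRUE or FALSE, then give a short explanation."
    )
    return complete(prompt)
```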

Image credits: UMass Amherst

Tested on a book approximately 260,000 words long (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flipped at random would answer questions about the book significantly better than Google’s latest machine learning model. Averaging across all benchmark results, neither model managed to achieve higher than random chance in terms of question-answering accuracy.

“We noticed that the models have more difficulty verifying claims that require viewing larger portions of the book, or even the entire book, compared to claims that can be resolved by gathering sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models have difficulty verifying claims about implicit information that is obvious to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason” over videos – that is, to search through them and answer questions about their content.

The co-authors created a dataset of images (for example, a photo of a birthday cake) combined with questions for the model to answer about the objects depicted in the images (for example: “Which cartoon character is on this cake?”). To evaluate the models, they randomly chose one of the images and inserted “distractor” images before and after it to create slideshow-like footage.
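A minimal sketch of how such a distractor “slideshow” might be assembled, under the assumption that the target image is planted at a random position among the distractors; the study’s exact dataset construction may differ.

```python
import random

def build_slideshow(needle_image, distractor_images, total_frames=25):
    """Bury the target ("needle") image among distractor frames.

    Returns the frame sequence and the needle's position. Illustrative only;
    the paper's actual construction details may differ.
    """
    frames = random.sample(distractor_images, total_frames - 1)
    position = random.randrange(total_frames)
    frames.insert(position, needle_image)
    return frames, position
```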

Flash didn’t perform that well. In a test that required the model to transcribe six handwritten digits from a 25-image “slideshow,” Flash got about 50% of the transcriptions correct. Accuracy dropped to about 30% with eight digits.

“Answering real questions about images seems to be particularly difficult for all the models we tested,” Michael Saxon, a doctoral candidate at UC Santa Barbara and one of the co-authors of the study, told JS. “That small amount of reasoning – recognizing that a number is in a frame and reading it – could well be what breaks the model.”

Google overpromises with Gemini

Neither study has been peer-reviewed, and neither examines the Gemini 1.5 Pro and 1.5 Flash releases with 2-million-token contexts. (Both tested the 1-million-token context versions.) And Flash isn’t meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.

Nevertheless, both add fuel to the fire that Google has been overpromising – and underdelivering – with Gemini from the start. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider that gives the context window top billing in its ads.

“There is nothing wrong with simply stating, ‘Our model can use X number of tokens,’ based on the objective technical details,” Saxon said. “But the question is: what useful thing can you do with it?”

Generative AI, broadly speaking, is coming under increasing scrutiny as companies (and investors) grow frustrated with the technology’s limitations.

In a pair of recent surveys from the Boston Consulting Group, about half of respondents (all C-suite executives) said they don’t expect generative AI to deliver substantial productivity gains and that they are concerned about the potential for errors and data compromises arising from generative-AI-powered tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, falling 76% from its peak in Q3 2023.

Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that essentially amount to plagiarism generators, customers are looking for promising differentiators. Google – which has at times clumsily raced to catch up with its generative AI rivals – has been desperate to make Gemini’s context one of those differentiators.

But the bet was premature, it seems.

“We haven’t yet come up with a way to actually demonstrate that ‘reasoning’ or ‘understanding’ over long documents is taking place, and in fact every group that releases these models is putting together their own ad hoc evaluations to make these claims,” Karpinska said. “Without knowing how long-context processing has been implemented – and companies do not share these details – it is difficult to say how realistic these claims are.”

Google did not respond to a request for comment.

Both Saxon and Karpinska believe the antidote to inflated claims around generative AI is better benchmarks and, in the same vein, a greater emphasis on third-party criticism. Saxon notes that one of the most common tests for long context, “needle in the haystack” (cited liberally by Google in its marketing materials), only measures a model’s ability to retrieve particular pieces of information, such as names and numbers, from data sets – not to answer complex questions about that information.
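To make that distinction concrete, here is a hedged sketch of a classic needle-in-the-haystack setup: a single planted fact that a purely retrieval-style question can check without requiring any reasoning over the rest of the document. The names and values are hypothetical.

```python
def make_haystack(filler_paragraphs, needle_sentence, insert_at):
    """Bury one 'needle' sentence inside a long run of filler text.

    A question like "What is the magic number?" only tests whether the model
    can locate the needle, not whether it understands the document as a whole.
    """
    docs = list(filler_paragraphs)
    docs.insert(insert_at, needle_sentence)
    return "\n\n".join(docs)

# Hypothetical usage:
# haystack = make_haystack(filler, "The magic number is 7481.", insert_at=512)
# prompt = haystack + "\n\nQuestion: What is the magic number?"
```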

“All the scientists and most of the engineers who use these models essentially agree that our existing benchmark culture is broken,” Saxon said, “so it’s important that the public understands that these massive reports with numbers like ‘general intelligence across benchmarks’ should be taken with an enormous grain of salt.”