
Google’s Meena is impressive, but AI chatbots are still ‘cheap imitations’ of humans



This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

This week, Google introduced Meena, a chatbot that can “chat about… anything.” Meena is the latest of many efforts by large tech companies trying to solve one of the toughest challenges of artificial intelligence: language.

“Current open-domain chatbots have a critical flaw — they often don’t make sense. They sometimes say things that are inconsistent with what has been said so far, or lack common sense and basic knowledge about the world,” Google’s researchers wrote in a blog post.

They’re right. Making sense of language and engaging in conversations is one of the most complicated functions of the human brain. Until now, most efforts to create AI that can understand language, engage in meaningful conversations, and generate coherent excerpts of text have yielded poor-to-modest results.


In recent years, chatbots have found a niche in domains such as banking and news. Advances in natural language processing have also paved the way for the widespread use of AI assistants such as Alexa, Siri, and Cortana. But so far, AI can only handle language-related tasks as long as the problem domain remains narrow and limited, such as answering simple queries with clear meanings or carrying out simple commands.

Advanced language models such as OpenAI’s GPT-2 can produce remarkable text excerpts, but those excerpts quickly lose coherence as they grow in length. As for open-domain chatbots, the AI agents meant to discuss a wide range of topics, they either fail to generate relevant responses or fall back on vague answers that could apply to many different questions, like a politician dodging specifics at a press conference.

Read: Google claims its new chatbot Meena is the best in the world

Now, the question is, how much does Meena, Google’s massive chatbot, move the needle in conversational AI?

What’s under the hood?

Like many of the innovative language models introduced in the past few years, Google’s Meena has a few interesting details under the hood. According to the blog post and the paper published on the arXiv preprint server, Meena is based on the Evolved Transformer architecture.

The Transformer, first introduced in 2017, is a sequence-to-sequence (seq2seq) machine learning model: it takes a sequence of data (numbers, letters, words, pixels…) as input and outputs another sequence. Sequence-to-sequence models are especially good at language-related tasks such as translation and question answering.

Seq2seq models can also be built from recurrent architectures such as LSTM (long short-term memory) and GRU (gated recurrent unit) networks. But because they can be parallelized efficiently and therefore trained on much more data, Transformers have risen in popularity and have become the basic building block of most cutting-edge language models of the past couple of years (e.g., BERT, GPT-2).
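
To make the seq2seq idea concrete, here is a minimal sketch using PyTorch’s built-in Transformer module. The vocabulary and dimensions are made-up toy values, and this is a generic illustration of the architecture, not Meena’s code:

```python
# A minimal seq2seq sketch with PyTorch's built-in Transformer. The toy
# dimensions below are illustrative, not Meena's actual configuration.
import torch
import torch.nn as nn

vocab_size = 1000   # hypothetical toy vocabulary
d_model = 64        # embedding size per token

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2)
to_vocab = nn.Linear(d_model, vocab_size)

# Input sequence (e.g., the user's message) and the reply generated so far,
# as token IDs with shape (sequence_length, batch_size).
src = torch.randint(0, vocab_size, (10, 1))
tgt = torch.randint(0, vocab_size, (7, 1))

# Encode the input sequence and decode the output sequence against it.
decoded = transformer(embed(src), embed(tgt))
logits = to_vocab(decoded)   # per-position scores over the vocabulary
print(logits.shape)          # torch.Size([7, 1, 1000])
```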

The Evolved Transformer is a variant of the Transformer whose network design was discovered through algorithmic search rather than hand-crafted by engineers. One of the key challenges of developing neural networks is finding the right architecture and hyperparameters, and the Evolved Transformer automates that search using evolution-inspired techniques.
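
As a loose illustration of what evolution-based search means in practice (and emphatically not Google’s actual search code), here is a toy loop that mutates and selects candidate configurations, with a made-up fitness function standing in for the expensive train-and-evaluate step:

```python
# A toy evolutionary search over two hyperparameters. In real architecture
# search, fitness() would train each candidate model and measure its
# validation performance; here it's a made-up stand-in.
import random

def fitness(config):
    # Pretend a 6-layer, 8-head model is ideal.
    return -abs(config["layers"] - 6) - abs(config["heads"] - 8)

def mutate(config):
    child = dict(config)
    key = random.choice(["layers", "heads"])
    child[key] = max(1, child[key] + random.choice([-1, 1]))
    return child

# Random initial population of candidate configurations.
population = [{"layers": random.randint(1, 12), "heads": random.randint(1, 16)}
              for _ in range(8)]

for generation in range(20):
    # Keep the fittest half, refill with mutated copies of survivors.
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(4)]

print("best configuration found:", max(population, key=fitness))
```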

A bigger and costlier AI model


Like many other recent advances in AI, Meena owes at least part of its success to its huge size. “The Meena model has 2.6 billion parameters and is trained on 341 GB of text, filtered from public domain social media conversations,” Google’s AI researchers write. In comparison, OpenAI’s GPT-2 had 1.5 billion parameters and was trained on a 40-gigabyte corpus of text.
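
For a sense of what “parameters” means here: in a framework like PyTorch, a model’s parameter count is simply the total number of trainable weights in the network, which can be tallied directly. A default-sized PyTorch Transformer, for instance, has tens of millions of parameters, orders of magnitude fewer than Meena:

```python
import torch.nn as nn

# Count the trainable weights of a default-sized PyTorch Transformer --
# tens of millions, versus Meena's 2.6 billion and GPT-2's 1.5 billion.
model = nn.Transformer()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} trainable parameters")
```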

To be clear, we’re still far from creating AI models that match the complexity of the human brain, which has approximately 100 billion neurons (the rough equivalent of parameters in artificial neural networks) and more than 100 trillion synapses (connections between neurons). So size does matter. But it’s not everything. For one thing, no human can process 341 GB of text in their lifetime, let alone need that much to conduct coherent conversations.

And the obsession with creating bigger networks and throwing more compute and more data at the problem causes problems that are often overlooked. Among them is the cost and carbon footprint of developing such models.

According to the paper, training Meena took 30 days on a TPU v3 Pod composed of 2,048 TPU cores. Google doesn’t publish a price for the 2,048-core Pod, but a 32-core configuration costs $32 per hour. Scaling that linearly to 2,048 cores gives $2,048 per hour, which works out to $49,152 per day and $1,474,560 for the 30-day run. While it’s interesting that Google can allocate such resources to researching bigger AI models, most academic research labs don’t have those kinds of funds to spare, and such costs make it hard to develop models like this outside the commercial sector.
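
The arithmetic behind those figures, assuming the 32-core hourly rate simply scales linearly with core count:

```python
# Back-of-the-envelope estimate of Meena's training cost, assuming the
# $32/hour rate for 32 TPU v3 cores scales linearly to 2,048 cores.
cores = 2048
dollars_per_core_hour = 32 / 32            # $1 per core-hour
hourly = cores * dollars_per_core_hour     # $2,048 per hour
daily = hourly * 24                        # $49,152 per day
total = daily * 30                         # $1,474,560 for the 30-day run
print(f"${hourly:,.0f}/hour -> ${daily:,.0f}/day -> ${total:,.0f} total")
```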

More sensible and specific

Benchmarks play a very important role in ranking AI models and evaluating their accuracy and effectiveness. But as we’ve seen in these pages, most AI benchmarks can be gamed and provide misleading results.

To test Meena, Google’s engineers developed a new benchmark, the Sensibleness and Specificity Average (SSA). Sensibleness means that the chatbot must make sense when it’s engaging in conversation with a human. So if the AI produces an answer that in no way applies to the question, it scores poorly on sensibleness.

But providing a coherent answer is not enough. Responses like “Nice!,” “I don’t know,” or “Let me think about it” can be applied to many different questions without the AI necessarily understanding their meaning. This is where specificity comes into play. In addition to evaluating the sensibleness of the AI, human reviewers also assess whether the agent’s response is specific to the topic of the conversation.
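
Here is a rough sketch of how an SSA-style score could be tallied from such human judgments. The labels below are invented for illustration; the paper describes the actual crowdsourcing protocol:

```python
# Each chatbot response gets a binary "sensible" label, and only sensible
# responses can also be labeled "specific". SSA averages the two rates.
labels = [
    {"sensible": True,  "specific": True},   # on-topic, relevant answer
    {"sensible": True,  "specific": False},  # coherent but generic ("Nice!")
    {"sensible": False, "specific": False},  # makes no sense in context
]

sensibleness = sum(r["sensible"] for r in labels) / len(labels)
specificity = sum(r["specific"] for r in labels) / len(labels)
ssa = (sensibleness + specificity) / 2
print(f"sensibleness={sensibleness:.2f} specificity={specificity:.2f} SSA={ssa:.2f}")
```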

In comparison to other popular chatbot engines, Meena scored much better on SSA: 79 percent, according to the paper, against 86 percent for human conversations.

The Sensibleness and Specificity Average (SSA) benchmark determines the coherence and relevance of answers produced by AI chatbots.

This is a remarkable improvement. The paper contains several conversation examples showing that Meena stays on topic across multiple exchanges. It even comes up with witty responses in some cases, such as the joke shown below, which appears in both the paper and the blog post.

Google’s AI chatbot Meena sometimes engages in interesting conversations.

But Google has not yet released the model or a demo, so the conversations published in the paper are our only guide to how good the AI is. “We are evaluating the risks and benefits associated with externalizing the model checkpoint… and may choose to make it available in the coming months to help advance research in this area,” the researchers write.

Are we closer to AI that understands language?

While Meena showcases the remarkable progress of NLP research, has it brought us closer to AI that has “common sense and basic knowledge about the world”?

“Many of the gains we are seeing are akin to an advanced form of memorization rather than intelligent behavior,” says AI researcher Stephen Merity. “This type of approach will work well for background knowledge. Common sense and abstraction are far less likely. We don’t yet know how well these models can perform complex reasoning so it’s likely that if you do see common sense, it’ll again be through a fuzzy form of memorization.”

“There’s no demo available, and until there is, I don’t think we should take it seriously,” says cognitive scientist Gary Marcus. “We’ve seen this movie before; GPT was originally dubbed ‘too dangerous to release’; when it came out, it was amusing, and maybe even impressive, but ultimately quite superficial.”

Marcus recently published his observations on evaluating GPT-2.

“I suspect that essentially the same problems will apply: reasoning will be poor, and the system won’t develop cognitive models that are rich enough for it to truly understand what’s going on. It might pass the Turing Test, but it won’t be genuinely intelligent,” Marcus says.

ZDNet’s Tiernan Ray also makes an interesting observation on Meena’s conversations: “Meena is recreating a distribution of language that is startlingly accurate, but also merely a recreation of information, which is boring. Patterns of language in the Meena form are highly associative at a word level, which makes Meena’s best examples those of wordplay. Wordplay is interesting up to a point, and it feels like it reflects intelligence, in some fashion. But it also quickly becomes superficial and tiresome.”

While it’s cool for AI chatbots to come up with conversations like the following, also drawn from Meena’s paper, they serve little practical purpose.

Meena says, “I love Fridays!” Does it really love Fridays, or is this a cheap imitation of human behavior?

Clearly, Meena (or any other AI agent, for that matter) doesn’t have the same understanding of the world as we do. I don’t expect an AI that can churn through more text in a second than I have read in my entire life to understand what it’s like to take the weekend off after five days of hard work. Neither do I think it knows what it means to “love Fridays.” As long as AI agents don’t experience life as we do, any resemblance to humans in their behavior will be, at best, a cheap imitation.

On a more serious note, however, I’m looking forward to advances in AI that give us more utilitarian chatbots: agents that are flexible at solving specific problems, retrieving information from the web, and extracting implied meanings from text, as in the conversation Ray provides in his article.

A much more useful conversation with a chatbot. AI is not ready for this yet.

I doubt that Meena is ready to take on such a task. Google has given us another huge language model, but we still have a long way to go before we can claim that AI has come close to understanding human language.

“I think that these types of models will be quite important but that no matter how much we scale them we’re still missing components,” Merity says.

This story is republished from TechTalks, the blog that explores how technology is solving problems… and creating new ones.
