Last year, Netflix reportedly published a whopping 1,500 hours of original content. With Apple and Disney launching their own streaming services, the on-demand video market is getting very competitive. To keep up, media companies are already looking at the next way to produce content: AI avatars.
In November last year, Chinese state-run media company Xinhua debuted an AI anchor that looked exactly like its real-life counterpart, Zhang Zhao. The company said that the avatar speaks both Mandarin and English. Xinhua said at the time that AI anchors were now officially part of its team, aiming to provide “authoritative, timely and accurate news” round the clock through its apps and social channels like WeChat.
A report from Tencent News published in February stated that the first batch of AI anchors had produced more than 3,400 news reports, with a cumulative running time of more than 10,000 minutes. Xinhua also debuted a female AI anchor named Xin Xiaomeng in February. These numbers indicate that, at this rate, AI anchors could outwork their human counterparts very soon.
The news agency is already working with the Chinese search giant Sogou on a new male AI anchor named Xin Xiaohao, who’ll be able to gesture, stand, and move more naturally than the current versions.
In the future, news websites – which don’t produce videos with anchors – can use these models to produce a report from their articles, and compete for eyeballs with traditional TV outlets.
This January, Chinese television network CCTV produced its Network Spring Festival Gala, watched by nearly 1.4 billion people. It was the first time the program’s hosts – Beining Sa, Xun Zhu, Bo Gao, and Yang Long – were accompanied by their AI-generated avatars. CCTV worked with ObEN, a US-based AI company, to create these avatars for the hosts.
ObEN specializes in creating Personalised AIs (PAIs) using internally developed technology. To create celebrity AIs, the company scans people with a 3D camera to emulate their appearance. Next, it asks them to read a script (roughly 30-45 minutes long) to record their voice, and reproduces it through AI that tries to imitate the tonality and emotiveness of the original speaker’s voice.
The company’s technology can reproduce AI-avatar-based videos of celebrities in. Plus, the company can even make them sing, if the music studio provides them with a background track and vocal cues.
Last year, the company partnered with the Chinese music group SNH48 to produce a video starring its members alongside their avatars.
ObEN’s CEO, Nikhil Jain, said that the company’s technology can reproduce a PAI’s voice in multiple languages even if the person records the script only in English:
We’ve designed our algorithm in such a way that a PAI can speak English, Chinese, Korean, and Japanese fluently without losing the personality of its owner’s voice.
“One of the new things we’re working on is called expressive speech that allows us to generate a whole range of new emotions. Combination emotions like anger or sadness can make an individual recognizable,” said Mark Harvilla, the company’s chief technology specialist.
It’s important for avatar makers to keep in mind that they are essentially gunning to replace human entertainers, and they will have to make them emotionally appealing to viewers.
When I see art or entertainment, I think most of what I respond to is the feeling that someone else put a lot of care into creating it, and that by looking at it I can feel how much that other person’s mind must care. I can definitely get that same feeling looking at an AI avatar that seems really lovingly made, but I think at bottom it’s still the love of the person who made the avatar that I’m responding most to — so I hesitate to say the avatar is the source of anything.
Hardik Meisheri, a Natural Language Processing (NLP) researcher at TCS Research and Innovation, said that the current generation of AIs is good at reading information but not very emotive:
Regarding different situations, AIs are mostly equipped with situations which are common and more readily available, so they are very good at reading news about traffic, weather, etc. But the natural tragedy is a tricky one, although it can be done since these are rare events they are not yet trained properly to handle that.
Another major challenge from the psychological perspective is the lack of empathy. When a human talks to a human, more or less there is a sense of empathy or micro-emotions which drives the conversation. These micro-emotions although are studied from decades are still far from modeled correctly in some form where AI would be able to mimic it easily.
He added that it is difficult to make AIs hold emotionally challenging conversations, such as consoling someone or giving a pep talk.
At the moment, it seems that the models are ready to read basic news or information, but they’re not really good at any form of entertainment that requires them to emote.
“I think the most appealing AI avatar work will embrace its AI-ness. Instead of trying to make a human replica that fools people into thinking it’s a person, the AI work has fun with the parts of it that are spectacularly non-human,” Brew said.