Microsoft Claims To Reach Human Parity in Conversational Speech Recognition

Wednesday, October 19, 2016

Speech Recognition

Microsoft researchers say they’ve developed a speech recognition system that can understand a human conversation as well as people do. The work shows that when the speech recognition software “listened” to people talking, it transcribed the conversation with the same or fewer errors than professional human transcriptionists.

Microsoft claims it has made a major breakthrough in speech recognition, creating a technology that recognizes the words in a conversation as well as a person does. A team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than professional transcriptionists.

In the work detailed in a recently published paper, the researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent WER the team reported just last month.

The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the industry standard Switchboard speech recognition task.

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”

The milestone means that, for the first time, a computer can recognize the words in a conversation as well as a person would. In doing so, the team has beaten a goal it set less than a year ago, and greatly exceeded everyone else’s expectations as well.

“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, the executive vice president who heads the Microsoft Artificial Intelligence and Research group.

Microsoft researchers from the Speech & Dialog research group include, from back left, Wayne Xiong, Geoffrey Zweig, Xuedong Huang, Dong Yu, Frank Seide, Mike Seltzer, Jasha Droppo and Andreas Stolcke. (Photo by Dan DeLong) - Microsoft


“This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum said.

Parity with humans doesn’t mean the computer recognized every word perfectly. In fact, humans don’t do that, either. Instead, it means that the error rate – the rate at which the computer mishears a word, such as “have” for “is” or “a” for “the” – is the same as you’d expect from a person hearing the same conversation.
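The metric behind these numbers is word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the system’s transcript into the reference transcript, divided by the length of the reference. A minimal sketch of the computation (illustrative only, not Microsoft’s actual scoring code) uses word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed as word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("have" -> "is") in a five-word reference: WER = 0.2
print(word_error_rate("we have a small example", "we is a small example"))
```

On this scale, the reported 5.9 percent means roughly one word error per seventeen words of reference transcript.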

Geoffrey Zweig, who manages the Speech & Dialog research group, attributed the accomplishment to the systematic use of the latest deep neural network technology in all aspects of the system. “The key to our system’s performance is the systematic use of convolutional and LSTM [Long Short Term Memory] neural networks, combined with a novel spatial smoothing method and lattice-free MMI acoustic training,” the researchers wrote in their paper.

“This lets the models generalize very well from word to word,” Zweig said.
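To make the LSTM building block the researchers mention concrete, a single time step of a generic LSTM cell can be sketched in NumPy. This is an illustrative textbook formulation, not the paper’s actual acoustic model, which is far larger and trained with lattice-free MMI:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One generic LSTM time step (illustrative sketch).
    x: input vector; h_prev, c_prev: previous hidden and cell state.
    W, U, b hold the stacked input/forget/output/candidate parameters."""
    z = W @ x + U @ h_prev + b        # stacked pre-activations
    n = h_prev.shape[0]
    i = sigmoid(z[:n])                # input gate
    f = sigmoid(z[n:2 * n])           # forget gate
    o = sigmoid(z[2 * n:3 * n])       # output gate
    g = np.tanh(z[3 * n:])            # candidate cell update
    c = f * c_prev + i * g            # new cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Tiny usage example with hypothetical dimensions (4 inputs, 3 hidden units)
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = rng.standard_normal((4 * n_hid, n_in))
U = rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.standard_normal(n_in),
                 np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```

The gating structure is what lets the cell carry context across many frames of audio, which is why such networks generalize well across word contexts.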

For the project, the team used Microsoft’s Computational Network Toolkit (CNTK), a homegrown system for deep learning that the research team has made available on GitHub under an open-source license.

The next steps, according to Zweig, are working on ways to make sure that speech recognition works well in more real-life settings, such as places with a lot of background noise: at a party, or while driving on the highway. The researchers also intend to focus on better ways to help the technology assign names to individual speakers when multiple people are talking, and on making sure that it works well with a wide variety of voices, regardless of age, accent or ability.

“The next frontier is to move from recognition to understanding,” Zweig said.

Shum has noted that we are moving from a world where people must understand computers to a world in which computers must understand us. Still, he cautioned, true artificial intelligence remains on the distant horizon.

“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” Shum said.

SOURCE: Microsoft

By 33rd Square

