January 21, 2014
From touch and gesture interfaces to advanced facial recognition, our computers are communicating with us on an increasingly human level. One technology showing particular promise is a computer’s ability to recognize human speech, known as Speech-to-Text (STT). Applications such as Apple’s Siri, Google Now, and Nuance’s Dragon have brought voice-activated commands to the masses, while enterprise companies are employing the technology to uncover new insights from previously untapped audio and video data sources.
One of the greatest benefits of STT is the ability to bridge the gap between unstructured audio/video data and advanced analytics such as machine learning, natural language processing (NLP), and graph analysis. A company’s ability to understand its most vocal customers, whether within its call centers or on video sharing sites, can lead to a better view of customers and their experiences.
Call center logs can reveal interesting patterns and trends in the quality of customer agent call handling and (when combined with other data) call center operational costs. These insights could then be used to retrain customer service agents, identify and stop a poorly conceived marketing campaign, or quickly understand the root cause for a spike in call center volume.
For example, PwC’s Emerging Tech Lab is working with several companies, including a leading telecom company, to stem some troubling trends in their call volume and improve overall operational efficiency by applying STT, advanced machine learning, and NLP algorithms to their call center recordings.
While the potential is great, companies should be aware that STT technology is rarely accurate “out of the box” and requires customization for each specific use case. Factors such as background noise, regional accents, the number of speakers, and the diversity of vocabulary play a large role in determining the word error rate (WER), a measure of transcription accuracy. The three primary components of a Speech-to-Text engine — the acoustic model, dictionary, and language model — should be carefully configured for the particular use case. The general rule of thumb is that each component should be constrained to the smallest dataset that adequately represents the acoustic environment, words, and phrases used. Single speakers, controlled environments, and specific topic areas will always yield better results than broad or diverse use cases.
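To make the accuracy metric concrete: WER is conventionally computed as the word-level edit distance between a reference transcript and the engine’s hypothesis, divided by the number of words in the reference. A minimal sketch in Python (the function name and sample sentences are illustrative, not from any particular STT toolkit):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("weather" -> "whether") in a five-word reference:
print(word_error_rate("what is the weather today",
                      "what is the whether today"))  # 0.2
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions, which is one reason noisy, multi-speaker recordings are so much harder to score well on than controlled dictation.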
As the technology continues to mature, it opens the door to interesting possibilities just over the horizon. Advances in computing power could provide automated, real-time translation between foreign speakers, while emotional analysis of recorded audio could yield computers that are aware not only of what we are saying but of our emotional states as well. While we may be years away from having deep, thoughtful conversations with our computers, the ability to recognize and analyze rich media is here today. How are you capturing and extracting value from the audio/video your customers are creating? Tell us in the comments section below.
Niko Pipaloff contributed to this post.