A Brief History of ASR: Automatic Speech Recognition

Our friends at Descript begin a series on the evolution of ASR with the piece below. We at Speakeasy AI are excited about how our revolutionary approach to conversational AI via speech-to-intent™ will mold the ASR landscape and help enable the future of what can be done with voice. – Frank Schneider, CEO, Speakeasy AI

by Jason Kincaid, Descript

This moment has been a long time coming. The technology behind speech recognition has been in development for over half a century, going through several periods of intense promise and of disappointment. So what changed to make ASR viable in commercial applications? And what exactly could these systems accomplish, long before any of us had heard of Siri? The story of speech recognition is as much about the application of different approaches as the development of raw technology, though the two are inextricably linked. Over a period of decades, researchers would conceive of myriad ways to dissect language: by sounds, by structure, and with statistics.

Early Days

Human interest in recognizing and synthesizing speech dates back hundreds of years (at least!), but it wasn’t until the mid-20th century that our forebears built something recognizable as ASR.

(Image: IBM Shoebox, 1961)

Among the earliest projects was a “digit recognizer” called Audrey, created by researchers at Bell Laboratories in 1952. Audrey could recognize spoken numerical digits by looking for audio fingerprints called formants, the distilled essences of sounds. In the 1960s, IBM developed Shoebox, a system that could recognize digits and arithmetic commands like “plus” and “total”. Better yet, Shoebox could pass the math problem to an adding machine, which would calculate and print the answer. Meanwhile researchers in Japan built hardware that could recognize the constituent parts of speech like vowels; other systems could evaluate the structure of speech to figure out where a word might end. And a team at University College in England could recognize 4 vowels and 9 consonants by analyzing phonemes, the discrete sounds of a language. But while the field was taking incremental steps forward, it wasn’t necessarily clear where the path was heading. And then: disaster.

(Image: The Journal of the Acoustical Society of America, October 1969)

A Piercing Freeze

The turning point came in the form of a letter written by John R. Pierce in 1969. Pierce had long since established himself as an engineer of international renown; among other achievements he coined the word transistor (now ubiquitous in engineering) and helped launch Echo I, the first-ever communications satellite. By 1969 he was an executive at Bell Labs, which had invested extensively in the development of speech recognition. In an open letter³ published in The Journal of the Acoustical Society of America, Pierce laid out his concerns. Citing a “lush” funding environment in the aftermath of World War II and Sputnik, and the attendant lack of accountability, Pierce admonished the field for its lack of scientific rigor, asserting that there was too much wild experimentation going on: “We all believe that a science of speech is possible, despite the scarcity in the field of people who behave like scientists and of results that look like science.” (J.R. Pierce, 1969) Pierce put his employer’s money where his mouth was: he defunded Bell’s ASR programs, which wouldn’t be reinstated until after he resigned in 1971.

Progress Continues

Thankfully there was more optimism elsewhere. In the early 1970s, the U.S. Department of Defense’s ARPA (the agency now known as DARPA) funded a five-year program called Speech Understanding Research. This led to the creation of several new ASR systems, the most successful of which was Carnegie Mellon University’s Harpy, which could recognize just over 1000 words by 1976. Meanwhile efforts from IBM and AT&T’s Bell Laboratories pushed the technology toward possible commercial applications. IBM prioritized speech transcription in the context of office correspondence, and Bell was concerned with ‘command and control’ scenarios: the precursors to the voice dialing and automated phone trees we know today. Despite this progress, by the end of the 1970s ASR was still a long way from being viable for anything but highly specific use cases.

The ‘80s: Markovs and More

A key turning point came with the popularization of Hidden Markov Models (HMMs) in the mid-1980s. This approach represented a significant shift “from simple pattern recognition methods, based on templates and a spectral distance measure, to a statistical method for speech processing”, which translated to a leap forward in accuracy. A large part of the improvement in speech recognition systems since the late 1960s is due to the power of this statistical approach, coupled with the advances in computer technology necessary to implement HMMs. HMMs took the industry by storm, but they were no overnight success. Jim Baker first applied them to speech recognition in the early 1970s at CMU, and the models themselves had been described by Leonard E. Baum in the ‘60s. It wasn’t until 1980, when Jack Ferguson gave a set of illuminating lectures at the Institute for Defense Analyses, that the technique began to disseminate more widely. The success of HMMs validated the work of Frederick Jelinek at IBM’s Watson Research Center, who since the early 1970s had advocated for the use of statistical models to interpret speech, rather than trying to get computers to mimic the way humans digest language: through meaning, syntax, and grammar (a common approach at the time). As Jelinek later put it: “Airplanes don’t flap their wings.” These data-driven approaches also facilitated progress that had as much to do with industry collaboration and accountability as individual eureka moments. With the increasing popularity of statistical models, the ASR field began coalescing around a suite of tests that would provide a standardized benchmark for comparison. This was further encouraged by the release of shared data sets: large corpora of data that researchers could use to train and test their models. In other words: finally, there was an (imperfect) way to measure and compare success.

(Image: Infoworld, November 1990)

Consumer Availability — The ‘90s

For better and worse, the ’90s introduced consumers to automatic speech recognition in a form we’d recognize today. Dragon Dictate launched in 1990 for a staggering $9,000, touting a dictionary of 80,000 words and features like natural language processing (see the Infoworld article above). These tools were time-consuming (the article claims otherwise, but Dragon became known for prompting users to ‘train’ the dictation software to their own voice). The software also required that users speak in a stilted manner: Dragon could initially recognize only 30–40 words a minute; people typically talk around four times faster than that. But it worked well enough for Dragon to grow into a business with hundreds of employees, and customers spanning healthcare, law, and more. In 1997 the company introduced Dragon NaturallySpeaking, which could capture words at a more fluid pace and, at $150, came with a much lower price tag. Even so, there may have been as many grumbles as squeals of delight: to the degree that there is consumer skepticism around ASR today, some of the credit should go to the over-enthusiastic marketing of these early products. But without the efforts of industry pioneers James and Janet Baker (who founded Dragon Systems in 1982), the productization of ASR may have taken much longer.

(Image: IEEE Communications Magazine, November 1993)

Whither Speech Recognition: The Sequel

25 years after J.R. Pierce’s paper was published, the IEEE published a follow-up titled Whither Speech Recognition: the Next 25 Years⁵, authored by two senior employees of Bell Laboratories (the same institution where Pierce worked). The latter article surveys the state of the industry circa 1993, when the paper was published — and serves as a sort of rebuttal to the pessimism of the original. Among its takeaways:
  • The key issue with Pierce’s letter was his assumption that in order for speech recognition to become useful, computers would need to comprehend what words mean. Given the technology of the time, this was completely infeasible.
  • In a sense, Pierce was right: by 1993 computers had meager understanding of language—and in 2018, they’re still notoriously bad at discerning meaning.
  • Pierce’s mistake lay in his failure to anticipate the myriad ways speech recognition can be useful, even when the computer doesn’t know what the words actually mean.
The Whither sequel ends with a prognosis, forecasting where ASR would head in the years after 1993. The section is couched in cheeky hedges (“We confidently predict that at least one of these eight predictions will turn out to have been incorrect”) — but it’s intriguing all the same. Among their eight predictions:
  • “By the year 2000, more people will get remote information via voice dialogues than by typing commands on computer keyboards to access remote databases.”
  • “People will learn to modify their speech habits to use speech recognition devices, just as they have changed their speaking behavior to leave messages on answering machines. Even though they will learn how to use this technology, people will always complain about speech recognizers.”

The Dark Horse

In a forthcoming installment in this series, we’ll be exploring more recent developments and the current state of automatic speech recognition. Spoiler alert: neural networks have played a starring role. But neural networks are actually as old as most of the approaches described here: they were introduced in the 1950s! It wasn’t until the computational power of the modern era (along with much larger data sets) that they changed the landscape. But we’re getting ahead of ourselves. Stay tuned for our next post on Automatic Speech Recognition by following Descript on Medium, Twitter, or Facebook.

This article was originally published at Descript.

Speech Is Coming Back, Just Not With Agents

2018 is the year speech came back from the dark side and, surprisingly, it’s kids driving the renaissance. Kids do not know that you can press buttons on a remote control – they just talk to it. Need to know what clothes to wear to school – just ask Alexa. People are realizing that systems can understand what you say, and that this is a lot easier than typing, digging through menus, or searching. Voice UI is the original human UI. In 2018 IVRs are starting to catch up with these smart talking devices: they offer answers to questions, provide personalized information, and take action through SMS messages containing solutions and up-to-date interactive information. There are silos of information that are being released from the agent desktop to the AI system that sits on your IVR – solutions that can thread voice intelligence through these silos are transforming enterprises. What percentage of questions that your live agents answer today could be solved with AI? Based on the data we’ve collected with our active listening process, 30-40% of current live agent call center inquiries can be handled or improved by AI. In these use cases, our solution identifies what a user wants and provides an answer either to the customer or assistance to the agent to deliver the right answer. 2018 is the year speech came back and this renaissance can extend to IVRs with AI solutions like Speakeasy AI. Our Active Listening Pilot can kick off your IVR comeback by providing insights into what your users are asking for and what percentage of intents AI can answer – all in real time.

Step Four of a 12 Step Recovery Program for AI Narratives

Hi, I’m Frank, and I am an AI narrative addict on the fourth step of a 12 step recovery program. In any good 12 step program, group meetings and crowdsourcing of experiences and lessons learned are key. And this blog post is no different. Let me start by saying, Seth Godin has it right. The future of customer service is going to be AI enabling a hierarchy of brand and consumer relationships that have large-scale, frictionless, personalized support at its core. However, the future is not AI taking everyone’s jobs as many fear, and that’s where my recovery begins. You see, there is a messaging and narrative issue in the AI-as-a-solution universe that I and my peers have created. When MIT is publishing articles on how AI predictions need to be more pragmatic, it seems clear that a personal inventory, also known as Step 4, is required. So here is my own inventory of the noise that is distracting us from getting to where we need to be with AI solutions, and especially the narrative or marketing surrounding it. I’ve reduced this list to three for now after ruminating on my sins and our industry’s self-created narrative problems. But I know we may have more, and the more we engage in an open dialogue, the more we can work on recovery together. If we can admit that we need to find another way to connect a narrative of exciting possibilities and empirical evidence surrounding an approach to utilize AI within business NOW, we can then admit that the sensationalized, near-tabloid-level approach to AI marketing can be overcome.

First, I commit to never using this image or any of its friends again. The benevolent terminator. This is an easy one. Maybe too easy. We’ve all seen this or a version of it across every digital piece of content about AI. I could blame Steven Spielberg, but the reality is, this is what humans do to address our innate predisposition to use a picture to say 1,000 words. Oftentimes, if we are being honest, the one word encapsulated here is bullshit.
Right now, we shouldn’t be looking to connect the dots between a creepier C3PO and an easier way to have conversations with a brand. And if we are being honest with each other, we aren’t even trying to. Instead, how about an image of a customer smiling at their Alexa? Maybe the caption reads that Sofia is changing her hotel reservation to include a cot while she is helping her husband build a spice rack in the kitchen (where most Amazon Echos live). Not as sexy as a robot overlord that will let you live, but it is an AI win that can be delivered now.

Second, I promise not to insinuate AI requires training that humans don’t even require to operate daily. The hostage negotiating Mozart. This is a tough one for me… sales and narrative are in my DNA. I love telling stories. I love listening to stories even more. My best stories came from conducting conflict resolution meetings for high school kids who had committed serious crimes. I was humbled every day. I tried my best to get better, and I hope I showed incremental improvement in my skills to facilitate some level of understanding and peace in those kids’ lives. I’m not sure of the outcomes. But I am certain that this experience is not what should be used to “train” AI. Yet, in our space, I’ve seen sidebars about hostage negotiators being used for training sales AI models. One thing is clear: customers and hostages don’t comprise a Venn diagram that works for a business, unless it is a Dilbert cartoon on LinkedIn with extreme irony. Ignore this, of course, if your business is hostage negotiation training. Shout out to the NTOA. In this same vein, nothing makes me cringe more than hearing Common (the former hip-hop artist, now a cross between a romcom star and human activist) spitting verses about AI. Or better yet, Bob Dylan sighing over losing a songwriting conversation to that AI system that also won on Jeopardy.
Do we really want to connect the dots between a fictitious AI Mozart and using AI to just understand a human utterance and relate it to an intent? Instead, let’s talk about AI enabling personalization at scale. There are some brands that I simply love… Nike, for example. I am a sneakerhead. I can’t get enough of their app – the smooth, sleek design and the notifications of when they drop the next opportunity for me to hand over my money for a retro Air Jordan XI. But AI can enable me, one of the millions of sneakerheads, to achieve a personalized experience based on where I’ve been, what I’ve done, and someday soon, what I’ve said with my own voice. I don’t want Nike to sell me AI-designed sneakers better than Tinker Hatfield’s; I want to be able to buy and potentially return Tinker’s elegantly designed kicks. AI models need not be creative geniuses or hostage negotiators to help me. Stop it. I’m feeling better already. My first personal inventory item for AI was an appeal for a change of imagery. The second was an appeal for a change of emotion and tone. My final one is an appeal based on, ironically when it comes to AI, words.

Third, I will not call software, and in this case, AI, by terms that misguide, misinform, and violate the actual definitions of the words being used. The intellectual adjudicator. So I made up “intellectual adjudicator” to try and keep a certain AI brand nameless. I won’t be surprised if someone else takes it and runs with it. This aforementioned brand has built a new narrative to describe AI that can understand what you say, in your own words, and turn it into an intent for customer service. We at Speakeasy AI call attempts at understanding intent derived from unfiltered customer voice Speech-to-Intent™. {Hold for applause}.
Another company, describing a similar, yet different solution or approach, is utilizing a term that indicates a court proceeding – usually related to divorce disputes – and the ability to discern knowledge from such a proceeding. Maybe this is merely semantics, and I am being nitpicky, but the crux of AI that enables understanding at scale is words, spoken or written. If an AI needs to act as a divorce attorney to help me better get along with brands, much like the hostage negotiating narrative, I think we’ve lost. Instead, let’s choose simple language that focuses on “why should anyone care?” and real, useful outcomes. For example, Apple has not described the iPhone X’s new technology as Putative Investigative Cognition technology. Their narrative is: look at your phone, turn it on, and it just knows you. Lovely. Facial recognition AI. I get that. Let’s commit to describing the awesome, pragmatic ways AI can actually deliver something: voice technology you will use and love. AI that actually understands you. I can get behind those messages. There is no doubt that the future is bright for an AI-enhanced consumer and business world; however, let’s take personal inventory of the current narrative landscape, and work on a message that aligns with the outcomes we are delivering today and the future wins we are shooting for, in order to enable better, personalized service at scale with AI… now.

Note: This post was originally part of the Greenbook Blog Big Ideas series, a column highlighting the innovative thinking and thought leadership at IIeX events around the world. Frank will be speaking at IIeX North America (June 11-13 in Atlanta). If you liked this article, you’ll LOVE IIeX North America.

Don’t Let Security Drive Your Business (Unless Your Business is Security)

I get it. Rarely a week goes by that we don’t read about the latest data breach at a large corporation that has compromised thousands or millions of customers. As an executive or someone responsible for weighing the risks and making decisions that carry similar security concerns, it’s difficult not to immediately wonder who’s responsible and think of the wrong turn their career probably just took. No one wants to be THAT person. When we begin to manage our business from a point of fear rather than creativity or customer convenience, security shifts from being one of the top considerations to being THE gating factor deciding what projects can pass. After all, who is going to champion a project that was killed for being a security risk? This is how you, as a consumer, end up in an IVR that requires you to authenticate before allowing you to pay your own bill. Or when speaking to technical support, you are required to prove you are the account owner before getting help to troubleshoot your home internet connection (one you feel you are already paying too much for). But there are some practical steps a business can follow to ensure security, enable customer convenience, and empower business innovation.

Not all risks are created equal

The first step is to understand what exactly is at risk. Face it, every time one of us puts our credit card down at a restaurant, we take a security risk. You’re unlikely to ever launch a new customer-facing project that has zero risk. If you think you have, it probably means you missed something. That being said, you can get a good gauge by considering these basic points:
  1. Are there legal requirements or Government rules that apply?  PCI compliance and the handling of PII (Personally Identifiable Information) are two areas that stand out.  The important part here is that you really understand how these apply to your project.  PII rules, for instance, govern privacy and make sure you’re not giving this information out unless the proper credentials are supplied.  For most customer support and self-help projects, you aren’t giving this information out, so it doesn’t apply.
  2. What’s the scope of the risk?  Are we talking about exposing hundreds or thousands of accounts at once (such as exposing a large database) or is it a single account?  Going back to the IVR authentication issue above, while it’s possible someone would steal a credit card and then use it to pay for someone else’s cable bill, the risk (and subsequent damages) are small.  It doesn’t make sense to make all your customers go through this process.
  3. Can you “layer” your project?  When you start a project, especially a self-help application, it’s important to consider three layers: Generic, Identified, Authenticated.  Generic information is that which you can provide without knowing specifically who the customer is (or verifying that).  Things like where to find their balance on their bill or how to reboot their router are good examples. Identified information is account specific and requires some form of identification, usually the account phone number, but falls short of requiring a “secret”, pre-established means of authenticating. This information can often allow you to do something as simple as telling a customer what speed tier they have or as complex as allowing them to pay their bill.  Authenticated information is the most secure, but also the most difficult for customers to access. This should be used for things like making account changes or before releasing any PII. As a general rule in self-help, you should expect to lose about 25-33% of your users for each step you require, meaning you’ll only get 66-75% of your users when you force them to identify and 50-66% when you force them to authenticate.
  4. What does your legal and/or security team think?  While you should always engage these teams when available, it’s important to remember their job isn’t customer support or sales.  I’ve learned that their role is to rate your risk on a scale of 1 to 10, where 1 means “Too risky, you probably shouldn’t do it” and 10 means “Too risky, you DEFINITELY shouldn’t do it.”  Don’t use them as a scapegoat or an excuse. Work with them diligently to understand their concerns and find legitimate ways to meet both your needs.
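The layering idea in point 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the action names, the `REQUIRED_TIER` mapping, and both helper functions are hypothetical, and the drop-off helper assumes the 25-33% loss compounds at each step, which the rule of thumb above does not spell out.

```python
from enum import IntEnum


class AccessTier(IntEnum):
    """The three layers from the list above, ordered from least to most verified."""
    GENERIC = 0        # nothing is known about who the caller is
    IDENTIFIED = 1     # account located, e.g. by the account phone number
    AUTHENTICATED = 2  # caller proved identity with a pre-established secret


# Hypothetical mapping of self-help actions to the minimum tier each one needs.
REQUIRED_TIER = {
    "how_to_reboot_router": AccessTier.GENERIC,
    "where_is_balance_on_bill": AccessTier.GENERIC,
    "current_speed_tier": AccessTier.IDENTIFIED,
    "pay_bill": AccessTier.IDENTIFIED,
    "change_account": AccessTier.AUTHENTICATED,
    "release_pii": AccessTier.AUTHENTICATED,
}


def is_allowed(action: str, caller_tier: AccessTier) -> bool:
    """Allow an action only if the caller has reached the tier it requires."""
    return caller_tier >= REQUIRED_TIER[action]


def retained_fraction(steps: int, loss_per_step: float) -> float:
    """Fraction of users left after `steps` gates, assuming the loss compounds."""
    return (1.0 - loss_per_step) ** steps


if __name__ == "__main__":
    print(is_allowed("pay_bill", AccessTier.IDENTIFIED))
    print(is_allowed("change_account", AccessTier.IDENTIFIED))
    for loss in (0.25, 0.33):
        print(loss, retained_fraction(1, loss), retained_fraction(2, loss))
```

Read this way, one gate keeps roughly 67-75% of users and two gates keep roughly 45-56%, a little below the rounded range quoted for authentication. Either way the design point is the same: put each action at the lowest tier that its actual risk justifies.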
It’s here to stay, so you might as well get good at it!

As technology continues to expand and enhance contact centers, there is no doubt that security concerns will continue to grow at an equal (or faster!) pace.  The better you get at managing these concerns, working with the correct internal teams, and continuing to roll out effective enhancements, the better off your company (and you as a sought-after resource) will be.

Stop Touting CSAT Scores as Real Customer Satisfaction

I think we need to change how we are calculating our customer satisfaction scores in the call center, because if our scores were really 90%, we wouldn’t be getting such poor reviews in third party reports. I put my head in my hands as the executive in the front of the room continued his presentation to leadership. Don’t get me wrong, voice calls still get the highest scores of all our contact channels, but I think we need to review how we calculate the scores across them all.


© SpeakEasyAI. All rights reserved.