- 17th May 2018
- Posted by: Manolis
- Category: Blockchain
In the past few years, automatic speech recognition (ASR) has become common practice, with billions of voice-enabled products and services. A wide variety of ASR technologies exists, each suited for different use cases. Undeniably though, the holy grail of ASR is natural language processing (NLP), which lets users speak freely, as if they were talking to another person. A simple example is that you can say “Set a reminder for 9AM the day after tomorrow” to any of the leading virtual assistants like Alexa, Google Assistant, Siri or Cortana, and they would understand the intent. There is no specific order or magic word that you have to say. You could also say “remind me on Wednesday at 9 in the morning” or “set a reminder on May 16th at 9 AM” and get the same result. The bottom line in NLP is extracting the meaning, regardless of the phrasing.
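To make the idea concrete, here is a minimal, hand-written sketch of intent extraction: several different phrasings of the reminder request all normalize to the same intent. This is purely illustrative — real NLP engines use statistical models trained on large corpora, not a pair of regular expressions — and the pattern names are our own invention.

```python
import re

# Illustrative patterns only -- a real assistant learns these mappings,
# it does not hard-code them. Phrasings are taken from the examples above.
PATTERNS = [
    r"set a reminder (?:for|on) (?P<when>.+)",
    r"remind me (?:on|at) (?P<when>.+)",
]

def extract_intent(utterance: str) -> dict:
    """Map different phrasings of the same request onto one intent."""
    text = utterance.lower().strip()
    for pattern in PATTERNS:
        match = re.match(pattern, text)
        if match:
            return {"intent": "set_reminder", "when": match.group("when")}
    return {"intent": "unknown"}

print(extract_intent("Set a reminder for 9AM the day after tomorrow"))
print(extract_intent("remind me on Wednesday at 9 in the morning"))
print(extract_intent("set a reminder on May 16th at 9 AM"))
```

All three utterances resolve to the same `set_reminder` intent, which is the bottom line of NLP: the meaning survives, the phrasing doesn't matter.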
Are NLP-enabled chatbots better served in edge devices or in the cloud?
The recent advances in NLP have been achieved thanks to artificial intelligence (AI), and more specifically, deep learning. At Google I/O 2018, we caught a glimpse of how far this technology has come with the unveiling of Google Duplex. Google Duplex is a feature that enables Google Assistant to place calls on behalf of the user to schedule appointments like haircuts and restaurant reservations. In the demos, the assistant handles these calls as naturally as a human caller would.
The technological challenge here is understanding the nuances of speech and adapting to unexpected situations. The deep neural networks used to achieve these feats rely on extremely complex calculations whose processing and power demands are, for now, only met by remote cloud servers.
On the other hand, many portable devices, like cameras and Bluetooth speakers, only enable certain predefined voice commands, like “on,” “off,” “record,” “play,” and “stop.” The main reason for this disparity between practically unlimited conversations with virtual assistants and very restricted voice commands for non-connected portable devices is that the processing is done on the edge device rather than in the cloud. The appeal of edge processing is huge because the cloud is not available in all situations, and is either unnecessary or undesirable in many others.
One example of such voice commands is found on most Android smartphones. You can snap a picture by saying “Cheese” or “Smile” when the camera app is open. If you had to depend on the availability of the cloud and wait for the command to be processed remotely, you’d probably miss many priceless moments. Therefore, edge processing is a must in this case.
Different vendors can modify or add to the basic commands. For example, on LG phones, you can also say “Whisky” or “Kimchi” to take a photo. There is no NLP involved in these voice commands. The ASR engine identifies any of the specified words and triggers the shutter. So, this feature is only useful if the user knows the commands. If you’re used to saying “Kimchi” to take pictures and you switch phones, it might not work. Any alternative way of telling the camera to shoot won’t work unless it’s specified. That’s a big compromise on user experience and may lead users to abandon the technology for lack of flexibility and ease-of-use.
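The logic behind such a fixed-vocabulary interface can be sketched in a few lines. This is not LG's actual implementation — just an illustration of why any phrasing outside the hard-coded list does nothing:

```python
# Fixed-vocabulary keyword spotting: the recognizer only fires on
# words from a predefined list; there is no NLP fallback.
SHUTTER_WORDS = {"cheese", "smile", "whisky", "kimchi"}

def on_speech(transcript: str) -> str:
    """React to a recognized word; ignore everything else."""
    word = transcript.lower().strip()
    if word in SHUTTER_WORDS:
        return "shutter triggered"
    return "ignored"  # "take a photo" is a perfectly clear request, but unlisted

print(on_speech("Kimchi"))        # shutter triggered
print(on_speech("take a photo"))  # ignored
```

The contrast with NLP is exactly this `ignored` branch: the burden of knowing the magic words falls entirely on the user.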
Another example is a cool feature in GoPro’s latest Hero action cameras that lets you tag special moments while filming. Later, you can go directly to the tags, making it easy to share and edit the best parts of your videos. The voice command for this is “GoPro HiLight.” But let’s say you’re snowboarding down a slope at Mammoth Mountain, and you see an awesome sight but don’t remember to say the command. The GoPro team thought of this, so they added the option to trigger the HiLight tag by saying “that was sick.” While that is a cool way to talk to your camera, it’s still not NLP. You need to know the command to use it. This type of interface forces the user to study the system’s spec, instead of the system adapting to the user’s manner of speaking.
Hmmm, what was that command?
According to the Google engineers behind Duplex, they achieved their impressive results by limiting the chatbot’s context to a specific task. They stated in a blog post that a key insight in their research was that Duplex could work best in closed, narrow domains. In other words, the Duplex chatbots can only function for a specific task, and can’t conduct general conversations.
Similarly, Sensory, a company specializing in edge AI, created a barista chatbot that uses NLP to take coffee and tea orders. The big achievement here is that all the processing is done on the edge device, so no cloud connection is needed. The barista chatbot can be seen in this video.
Realistically, it isn’t feasible for an embedded processor powered by a small, lightweight battery to perform the same speech analysis as a cloud service. However, by confining the context and reducing the complexity of the interaction, NLP can be made slim enough to run on the edge. The scope that an edge chatbot could have is determined by the efficiency of the software and the engine that runs it.
From a user experience perspective, the important thing is that the chatbot covers the entire domain of its task. If we go back to the camera example, in addition to snapping photos and recording video, the user might want to play back a video, look at photos, show a slideshow, delete files and so on. Handling all these functions with NLP would generate a seamless and natural interface, even if questions about the weather or restaurant recommendations aren’t covered.
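What a closed-domain edge chatbot amounts to can be sketched as a small intent table: full coverage of the camera's task domain, with anything outside it explicitly rejected. The intent names and patterns below are hypothetical, chosen only to illustrate the design:

```python
import re

# Hypothetical closed-domain intent table for a camera. The goal is to
# cover the whole task domain, not open-ended conversation.
CAMERA_INTENTS = {
    "take_photo":   [r"(take|snap) (a )?(photo|picture)"],
    "record_video": [r"(record|start) (a )?video"],
    "play_video":   [r"play( back)? (the )?video"],
    "delete_file":  [r"delete (the )?(photo|video|file)"],
}

def classify(utterance: str) -> str:
    """Return the matching camera intent, or reject out-of-domain requests."""
    text = utterance.lower()
    for intent, patterns in CAMERA_INTENTS.items():
        if any(re.search(p, text) for p in patterns):
            return intent
    return "out_of_domain"  # e.g. weather questions are simply rejected

print(classify("snap a picture"))
print(classify("what's the weather like?"))
```

A narrow table like this is what makes the computation light enough for an embedded processor: the model only has to disambiguate within one small domain, not all of human conversation.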
While fully featured on-device NLP remains an unsolved challenge, we can still expect to see a significant improvement in user experience in the near future. Advances in specialized AI architectures for edge devices and new techniques for reducing memory utilization of deep neural networks are showing exciting results. We will surely see multi-faceted NLP capabilities sans-cloud very soon.