Over the last decade there has been a dramatic shift towards conversational interfaces. People have hit 'peak screen' and are even beginning to scale back their device usage, helped by the digital wellbeing features now embedded in most operating systems.
To combat screen fatigue, voice assistants have entered the market and become a preferred option for quickly retrieving information. A well-repeated statistic states that 50 per cent of searches will be carried out by voice in 2020. As adoption grows, it falls to developers to add "Conversational Interfaces" and "Voice Assistants" to their tool belt.
What Is A Conversational Interface?
A Conversational Interface (sometimes shortened to CUI) is any interface that uses human language. It is tipped to be a more natural interface for the general public than the Graphical User Interface (GUI), which front-end developers are used to building. A GUI requires people to learn its specific syntax (think buttons, sliders, and drop-downs).
This key difference of using human language makes a CUI more natural for people; it requires little prior knowledge and puts the burden of understanding on the device.
There are two popular forms of CUIs: chatbots and voice assistants. Both have seen a huge increase in uptake over the last decade thanks to advances in NLP.
UNDERSTANDING VOICE JARGON
What Is A Voice Assistant?
A voice assistant is a piece of software capable of NLP (Natural Language Processing). It receives a voice command and returns an audio response. The ways in which you can communicate with an assistant have grown and evolved in recent years, but the crux of the technology is natural language in, lots of computation, natural language out.
For those seeking a little more detail:
1. The software receives an audio request from a user and converts the sound into phonemes, the building blocks of language.
2. By the magic of AI (Speech-To-Text, in particular), these phonemes are converted into a string of the approximated request, which is held within a JSON file along with additional information about the user, the request, and the session.
3. The JSON is then processed (usually in the cloud) to work out the context and intent of the request.
4. Based on the intent, a response is returned, again inside a larger JSON document, either as a string or as SSML (more on that later).
5. The response is processed back using AI (naturally the reverse: Text-To-Speech) and then returned to the user.
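To make step 2 a little more concrete, here is a heavily abridged sketch of the kind of request JSON a platform might hand to your code. The field names follow Alexa's request envelope; the identifiers and values are placeholders for illustration only.

```json
{
  "version": "1.0",
  "session": {
    "sessionId": "amzn1.echo-api.session.<session-id>",
    "user": { "userId": "amzn1.ask.account.<user-id>" }
  },
  "request": {
    "type": "IntentRequest",
    "timestamp": "2020-01-01T12:00:00Z",
    "intent": { "name": "HelloWorldIntent" }
  }
}
```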
There is a lot going on there, much of which requires no second thought. But each platform does this a little differently, and it is those platform nuances that need a bit more understanding.
Voice-Enabled Devices
The requirements for a device to have a baked-in voice assistant are fairly low: a microphone, an internet connection, and a speaker. Smart speakers like the Nest Mini and Echo Dot offer this kind of low-fi voice control.
Next up the ranks is voice + screen, known as a multimodal device (more on these later): devices such as the Nest Hub and the Echo Show. Since smartphones have both capabilities, they can also be considered a form of multimodal voice-enabled device.
Voice Skills
First of all, each platform has a different name for its 'voice skills': Amazon goes with Skills, which I will stick with as the most widely understood term; Google opts for 'Actions'; and Samsung goes for 'Capsules'.
Each platform has its own baked-in skills, such as asking for the time, the weather, and sports scores. Developer-made (third-party) skills can be invoked with a specific phrase or, if the platform favours them, invoked implicitly without a key phrase.
EXPLICIT INVOCATION: "Hey Google, Talk to <app name>."
It is explicitly stated which skill is being asked for.
What Voice Assistants Are There?
In the western market, voice assistants are very much a three-horse race. Apple, Google and Amazon have very different approaches to their assistants, and as such cater to different types of developers and customers.
APPLE’S SIRI
DEVICE NAME: ”Siri”
WAKE PHRASE: ”Hey Siri”
Siri has more than 375 million active users, but for the sake of brevity, I am not going into too much detail about it. While it may be well adopted globally and baked into most Apple devices, it requires developers to already have an app on one of Apple's platforms and to write in Swift (whereas the others can be written in everyone's favourite: JavaScript). Unless you are an app developer looking to expand your app's offering, you can skip past Apple until it opens up its platform.
GOOGLE ASSISTANT
DEVICE NAMES: ”Google Home, Nest”
WAKE PHRASE: ”Hey Google”
Google has the largest spread of devices of the big three, with more than 1 billion worldwide; this is mostly due to the mass of Android devices that have Google Assistant baked in. In terms of dedicated smart speakers, the numbers are a little smaller. Google's overall mission with its assistant is to delight customers, and they have always been very good at delivering light and intuitive interfaces.
Their primary goal on the platform is time spent, with the aim of becoming a regular part of customers' daily routines. As such, they focus mainly on utility, family fun, and delightful experiences.
Skills built for Google work best as engagement pieces and games, with an emphasis on family-friendly fun. Their recent addition of canvas for games is testament to this approach. Google's platform is much stricter about skill submissions, and as such their directory is much smaller.
AMAZON ALEXA
DEVICE NAMES: “Amazon Fire, Amazon Echo”
WAKE PHRASE: “Alexa”
In 2019, Amazon hit 100 million devices sold, mainly from sales of its smart speakers and smart displays, as well as its Fire range of tablets and streaming devices.
Skills built for Amazon tend to be geared towards in-skill purchasing. If you are looking for a platform to extend your e-commerce offering or service, or to offer a subscription, Amazon is for you. That being said, ISP (in-skill purchasing) is not a prerequisite for Alexa Skills; they support all kinds of uses and are far more open to submissions.
Building On Amazon Alexa
Amazon's voice ecosystem has evolved to allow developers to build all of their skills inside the Alexa console, so I will use its built-in features as a simple example.
Alexa handles the natural language processing and then works out a suitable Intent, which is passed to our Lambda function to deal with the logic. This returns some conversational bits (SSML, text, cards, and so on) to Alexa, which converts those bits into audio and visuals to show on the device.
Working with Amazon is fairly simple, because it lets you create all parts of your skill inside the Alexa Developer Console. There is the flexibility to use AWS or an HTTPS endpoint, but for basic skills, running everything inside the Dev Console should be enough.
LET'S BUILD A SIMPLE ALEXA SKILL
Head over to the Amazon Alexa console, create an account if you don't have one, and log in,
Click Create Skill and then give it a name,
Choose Custom as your model,
And choose Alexa-Hosted (Node.js) for your backend resource.
Once provisioning has finished, you will have a basic Alexa skill; your intents will have been built for you, along with some backend code to get you started.
If you click on HelloWorldIntent in your Intents, you will see some sample utterances already set up for you; let's add a new one at the top. Our skill is called Hello World, so add Hello World as a sample utterance. The idea is to capture anything the user might say to trigger this intent. This could be "Hello World", "Howdy World", and so on.
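For context, those sample utterances live in the skill's interaction model. A trimmed-down sketch of that JSON, using only the invocation name and samples from this example, looks something like this:

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "hello world",
      "intents": [
        {
          "name": "HelloWorldIntent",
          "slots": [],
          "samples": ["hello world", "howdy world", "hi world"]
        }
      ]
    }
  }
}
```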
What Happens In The JS Fulfillment?
The fulfillment uses the ask-sdk-core and essentially builds the JSON for us. canHandle lets the SDK know whether this handler can deal with the incoming intent, in this case 'HelloWorldIntent'. handle takes the input and constructs the response. A sketch of the handler, and the JSON it produces, is below.
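The code below is a minimal sketch based on the standard ask-sdk-core handler pattern; the boilerplate the console generates for you may differ slightly.

```javascript
// index.js (abridged): a HelloWorldIntent handler using ask-sdk-core
const Alexa = require('ask-sdk-core');

const HelloWorldIntentHandler = {
  // canHandle tells the SDK whether this handler can deal with the incoming request
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'HelloWorldIntent';
  },
  // handle takes the input and builds the spoken response
  handle(handlerInput) {
    const speakOutput = 'Hello World!';
    return handlerInput.responseBuilder
      .speak(speakOutput)
      .getResponse();
  }
};

// Register the handler with the skill
exports.handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(HelloWorldIntentHandler)
  .lambda();
```

The response it builds is, roughly, JSON along these lines (abridged):

```json
{
  "version": "1.0",
  "response": {
    "outputSpeech": {
      "type": "SSML",
      "ssml": "<speak>Hello World!</speak>"
    }
  },
  "sessionAttributes": {}
}
```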
Building For Google Assistant
The easiest way to build Actions on Google is to use their AoG console in combination with Dialogflow. You can extend your skills with Firebase, but as with the Amazon Alexa tutorial, let's keep things simple.
Google Assistant uses three main parts: AoG, which deals with the NLP; Dialogflow, which works out your intents; and Firebase, which fulfills the request and produces the response that is sent back to AoG.
As with Alexa, Dialogflow enables you to create your functions directly within the platform.
LET'S BUILD AN ACTION ON GOOGLE
With Google's solution, there are three platforms to juggle at once, accessed from three different consoles, so tab up!
Set up Dialogflow
1. Let's start by logging into the Dialogflow console. Once logged in, create a new agent from the dropdown just below the Dialogflow logo.
2. Give your agent a name and add it to the Google Project dropdown, selecting "Create a new Google project".
3. Click the Create button and let it do its magic; it will take a little time to set up the agent, so be patient.
Firebase Functions Setup
1. Now we can start plugging in our fulfillment logic.
2. Head over to the Fulfillment tab, tick to enable the inline editor, and use the two JS snippets below, index.js and package.json:
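Here is a minimal sketch of what those two files can look like, assuming the actions-on-google and firebase-functions npm packages and Dialogflow's default export name of dialogflowFirebaseFulfillment (the welcome message and version numbers are indicative only):

index.js

```javascript
// index.js: Dialogflow inline-editor fulfillment using actions-on-google
'use strict';

const functions = require('firebase-functions');
const { dialogflow } = require('actions-on-google');

// 'app' is the object that will contain all of our intents
const app = dialogflow({ debug: true });

// Each intent handler is passed 'conv', the Actions on Google conversation object
app.intent('Default Welcome Intent', (conv) => {
  // conv.ask keeps the conversation open; conv.close would end it after this message
  conv.ask('Hello from our fulfillment!');
});

// Wrap everything in an HTTPS Firebase function that Dialogflow calls as its webhook
exports.dialogflowFirebaseFulfillment = functions.https.onRequest(app);
```

package.json

```json
{
  "name": "dialogflowFirebaseFulfillment",
  "description": "Fulfillment for a simple Action on Google",
  "version": "0.0.1",
  "engines": { "node": "10" },
  "dependencies": {
    "actions-on-google": "^2.12.0",
    "firebase-functions": "^3.3.0"
  }
}
```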
3. Now go back to your intents, click on the Default Welcome Intent, and scroll down to Fulfillment; make sure 'Enable webhook call for this intent' is checked for any intents you want to fulfill with JavaScript. Hit Save.
Update AoG
We are now reaching the finish line. Head over to the Integrations tab and click Integration Settings in the Google Assistant option at the top. This will open a modal, so let's click Test, which will connect your Dialogflow with Google and open up a test window on Actions on Google.
In the test window, we can click Talk to my test app (we will change this in a second), and voilà, the message from our JavaScript appears in a Google Assistant test.
We can change the assistant's name in the Develop tab, up at the top.
WHAT'S IN THE FULFILLMENT JS?
First, we use two npm packages: actions-on-google, which provides all the fulfillment that both AoG and Dialogflow need, and firebase-functions, which, you guessed it, contains helpers for Firebase.
We then create the 'app', which is an object that contains all of our intents.
Each intent that is created gets passed 'conv', the conversation object that Actions On Google sends. We can use the content of conv to detect information about previous interactions with the user (such as their ID and their session with us).
We return a 'conv.ask' object containing our return message to the user, ready for them to respond with another intent. If we wanted to close the conversation there, we would use 'conv.close' instead.
Finally, we wrap everything up in a Firebase HTTPS function, which handles the server-side request-response logic for us.
Once again, if we look at the produced response:
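The webhook response for Actions on Google nests the speech inside a 'google' payload. A trimmed-down sketch, reusing the hypothetical welcome message from the earlier snippet, looks roughly like this:

```json
{
  "payload": {
    "google": {
      "expectUserResponse": true,
      "richResponse": {
        "items": [
          {
            "simpleResponse": {
              "textToSpeech": "Hello from our fulfillment!"
            }
          }
        ]
      }
    }
  }
}
```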
We can see that conv.ask has had its text inserted into the textToSpeech field. If we had chosen conv.close, expectUserResponse would be set to false and the conversation would close after the message was delivered.
Third-Party Voice Platforms
Much like the rest of the software industry, third-party platforms have started popping up as voice gains momentum, in an effort to ease the burden on developers by letting them build once and launch twice.
Jovo and Voiceflow are currently the two most popular, especially since Apple acquired PullString. Each platform offers a different level of abstraction, so it really comes down to how simplified you like your interface to be.
Extending Your Skills
Now that you have got your head around building a basic 'Hello World' skill, there are plenty of bells and whistles you can add. These are the cherry on top of the voice assistant cake and can add a lot of extra value for your users, leading to repeat custom and potential business opportunities.
SSML
SSML stands for Speech Synthesis Markup Language and uses a syntax similar to HTML; the key difference is that you are building up a spoken response, not content on a web page.
'SSML' as a term is slightly misleading, as it can do so much more than speech synthesis! You can have voices running in parallel, you can include ambient sounds, speechcons (worth a listen in their own right; think emojis for well-known phrases), and music.
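As a flavour of what that looks like, here is a small illustrative snippet; the audio URL is a placeholder, and exact tag and speechcon support varies slightly between platforms:

```xml
<speak>
  Welcome back!
  <say-as interpret-as="interjection">boom</say-as>
  <break time="500ms"/>
  <audio src="https://example.com/jingle.mp3"/>
  <prosody rate="slow" pitch="-2st">And now, in a slower, lower voice.</prosody>
</speak>
```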
When Should I Use SSML?
SSML is great; it makes for a far more engaging experience for the user, but it also reduces the flexibility of the audio output. I recommend using it for the more static areas of speech. You can use variables in it for things like names, but unless you intend to build an SSML generator, most SSML is going to be fairly static.
Start off with simple speech in your skill and, once it is complete, enhance the more static areas with SSML, but get your core right before moving on to the bells and whistles. That said, a recent report found that 71% of users prefer a human (real) voice over a synthesized one, so if you have the means to record real audio, go out and do it!
IN-SKILL PURCHASES
In-skill purchasing (or ISP) is similar to the in-app purchasing model. Skills tend to be free, but some allow 'premium' content or subscriptions to be bought within the skill; these can enhance the experience for the user, unlock new levels in games, or grant access to paywalled content.
MULTIMODAL
Multimodal responses cover so much more than voice; this is where voice assistants can really shine, on devices that provide complementary visuals. The concept of multimodal interaction is much broader, essentially meaning multiple modes of input and output.
Multimodal skills are intended to complement the core voice experience, providing extra complementary information to boost the UX. When building a multimodal experience, remember that voice is the primary carrier of information. Many devices don't have a screen, so your skill still has to work without one; be sure to test with multiple device types, either real or simulated.
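As a small sketch of that "check for a screen" advice, the earlier Default Welcome Intent handler could be adapted using the actions-on-google conv object; the spoken lines here are placeholders:

```javascript
// Only send visuals when the device actually has a screen
app.intent('Default Welcome Intent', (conv) => {
  if (conv.screen) {
    // Multimodal device: safe to attach cards, images, or other visuals here
    conv.ask('Here is something to look at as well as listen to.');
  } else {
    // Voice-only device: the experience still has to work with speech alone
    conv.ask('Here is the spoken version of the same information.');
  }
});
```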
MULTILINGUAL
Multilingual skills are skills that work in multiple languages, opening your skills up to new markets.
The complexity of making your skill multilingual comes down to how dynamic your responses are. Skills with relatively static responses, e.g. returning the same phrase each time, or drawing from a small bucket of phrases, are much easier to make multilingual than sprawling dynamic skills.
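A "small bucket of phrases" can be as simple as a locale-keyed object in your fulfillment; a tiny hypothetical sketch (locales and strings are illustrative only):

```javascript
// Phrases keyed by locale, falling back to en-US when a locale is missing
const phrases = {
  'en-US': { welcome: 'Hello World!', goodbye: 'Goodbye!' },
  'de-DE': { welcome: 'Hallo Welt!', goodbye: 'Auf Wiedersehen!' }
};

function localise(locale, key) {
  const bucket = phrases[locale] || phrases['en-US'];
  return bucket[key];
}
```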
The trick with multilingual skills is to have a translation partner you trust, whether that is an agency or a translator on Fiverr. You need to be able to rely on the translations provided, especially if you don't understand the language being translated into.
Summary
If there was ever a time to get into the voice industry, it is now. The field is both in its prime and in its infancy, and the big nine are ploughing billions into growing it and bringing voice assistants into everyone's homes and daily routines.
Choosing which platform to use can be tough, but based on what you intend to build, the right platform should shine through; failing that, use a third-party tool to hedge your bets and build for multiple platforms at once, especially if your skill is simpler with fewer moving parts.
This article was contributed by Ujjainee. She is currently pursuing a Bachelor of Technology in Computer Science. What she likes most about computer engineering is that it allows her to learn and be creative, and she believes in sharing knowledge, so she is very fond of writing technical content. Coding, analysing, and blogging are things she can keep on doing!