GMartin asked:

How does Amazon Echo (Alexa) know what is said to it?

Hello and Good Afternoon Everyone,

I am wondering how Amazon Echo (Alexa) understands human speech and is able to respond with answers.


William Fulks
William is correct; an explanation and an update on the types of models are at:
1. Basic Explanation -
2. The Latest -

Since the "horsepower" of computing keeps increasing as the costs decrease (the trend usually associated with Moore's Law), more difficult, resource-intensive tasks like voice commands and robotics get processed much more quickly, and some are now near-instantaneous, when they used to be impossible, very slow, or only possible on very expensive computers.
It works the same way your smartphone can respond to voice commands or even identify music. If you think of the physical representation of sounds, it is done in waves with a complex set of measurements that differentiate tones and such, so that the number 4 and the letter X will look different when shown as a waveform. With this info you can create a database saying "word" matches whatever criteria, and then you're just doing a search.
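To make the "database plus search" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the feature numbers are made up, and the names `TEMPLATES`, `distance`, and `recognize` are hypothetical. Real recognizers use statistical or neural models rather than a plain lookup, but the nearest-match idea is the same at a basic level.

```python
import math

# Hypothetical feature vectors. In a real system these would be measured
# from the audio waveform (for example, energy in several frequency bands).
TEMPLATES = {
    "four": [0.9, 0.2, 0.1, 0.4],
    "x":    [0.1, 0.8, 0.7, 0.2],
    "stop": [0.5, 0.5, 0.9, 0.1],
}

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features, templates=TEMPLATES):
    """Return the stored word whose template is closest to the input."""
    return min(templates, key=lambda word: distance(features, templates[word]))

# A slightly noisy "four" is still closest to the stored "four" template.
print(recognize([0.85, 0.25, 0.15, 0.35]))
```

The further a speaker's measurements drift from the stored templates, the more likely the closest match is the wrong word.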

This is why people with strong accents will sometimes have issues with voice commands: it will hear certain words as they are actually spoken and not take the accent into account. Some programs, like Dragon NaturallySpeaking, will learn from you the more you use them, effectively building a custom database based on your own voice and manner of speech. I know some disabled folks who use this for writing, etc.
They also have a larger database of phrases now to help with more natural speech patterns.  Early Dragon NaturallySpeaking (back in the late 90s) had you... talk... one... word... at... a... time... and... required... a... break... between... each... word....
If you didn't leave pauses, it would mess up.  These days, our computers hold much more data and can process it faster as well.
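Here is a rough sketch of why those pauses mattered, assuming the software finds word boundaries by looking for stretches of silence. The function name `split_on_pauses`, the threshold, and the loudness numbers are all made up for illustration and are not taken from any actual product.

```python
def split_on_pauses(energies, threshold=0.1, min_pause=3):
    """Group frames into word segments separated by long-enough silence.

    energies  : per-frame loudness values (hypothetical, already measured)
    threshold : frames quieter than this count as silence
    min_pause : consecutive silent frames needed to end a word
    Returns a list of (first_frame, last_frame) index pairs.
    """
    segments = []
    start = None       # first loud frame of the current word
    last_loud = None   # most recent loud frame
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_loud = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                segments.append((start, last_loud))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, last_loud))
    return segments

# Two words with a clear break between them are found separately...
print(split_on_pauses([0.5, 0.6, 0, 0, 0, 0.7, 0.8, 0, 0, 0]))  # [(0, 1), (5, 6)]
# ...but run together, they come out as one blob.
print(split_on_pauses([0.5, 0.6, 0, 0.7, 0.8]))                 # [(0, 4)]
```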
GMartin


Hello and Good Afternoon Everyone,

         Thank you so very much for the enlightening feedback given in reply to my question.  I have to admit that I thoroughly enjoyed reading each person's shared thoughts and certainly did learn a great deal from this participation.

I hate to do this, but sorry William,  that answer is not correct for an Echo device.

An Amazon Echo only listens for its name (the wake word), which it can usually recognize by simple pattern matching. Until it hears its wake word, it throws away all other sounds.
When it hears its name, it records the voice that comes after that until a reasonable pause is heard, and streams that voice clip up to the Amazon servers.
Then the Amazon servers use very fast voice recognition and translation software to do a voice to "word" conversion, creating a string of words it heard. But that is not done on the Echo.
The string of words is then sent to a parser (in Amazon's cloud) to determine a best match to what you asked for.
Then, the Amazon servers send back a series of instructions and voice response info to do what you asked.

The actual voice to word conversion does not take place inside the Echo. This allows the processor to be lower speed and power, and allows the full power of high end servers to do the conversion.

Siri and Google do the same thing. The conversion of voice to words is typically not done on the phones. That is why an internet connection is required to use those services. It is also why Amazon Alexa will tell you it cannot understand you when the internet is down.
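The flow described above can be sketched as a small simulation. This is only an illustration under loud assumptions: audio is represented as a list of already-separated tokens, `"<pause>"` stands in for a detected pause, and `cloud_speech_to_text`, `cloud_parse`, and `device_loop` are hypothetical stand-ins, not Amazon's actual software or API.

```python
WAKE_WORD = "alexa"

class CloudUnavailable(Exception):
    """Raised by the stand-in cloud call when there is no network."""

def cloud_speech_to_text(audio_clip, online=True):
    # Stand-in for the server-side voice-to-word conversion.
    if not online:
        raise CloudUnavailable()
    return " ".join(audio_clip)

def cloud_parse(text):
    # Stand-in for the server-side parser that picks a best-match response.
    if "time" in text:
        return "It is 3 o'clock."
    return "Sorry, I don't know that one."

def device_loop(heard, online=True):
    """Simulate the device: discard sound until the wake word is heard,
    capture until a pause, then hand the clip to the (simulated) cloud."""
    responses = []
    clip = None  # None means "not recording"
    for token in heard:
        if clip is None:
            if token == WAKE_WORD:
                clip = []   # start recording after the wake word
            continue        # anything else is thrown away
        if token == "<pause>":
            try:
                text = cloud_speech_to_text(clip, online)
                responses.append(cloud_parse(text))
            except CloudUnavailable:
                responses.append("Sorry, I can't understand you right now.")
            clip = None
        else:
            clip.append(token)
    return responses

stream = ["turn", "up", "alexa", "what", "time", "is", "it", "<pause>", "ignore", "me"]
print(device_loop(stream))                # cloud reachable
print(device_loop(stream, online=False))  # network down
```

Note that the device itself never turns the clip into words here; when the simulated network is down, all it can do is report that it cannot understand.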
Owen, his question was how it understands human speech, not where the processing takes place. You are correct that it is a voice-activated interface for a cloud-based application. Same for phones and the like. My answer isn't incorrect, though.
Ok, I guess we read that differently. On a basic level, I agree, your answer explains speech to word conversion. But he asked how an Amazon Echo works, and to be clear, the Echo itself does not do the conversion.
To the layperson, it doesn't matter.  The modern computer eventually will evolve and we'll call cloud based systems a computer system.
On the other hand, I have been nitpicked to death at times here when not putting in full details and trying to give a simple answer. Sorry, but if you are going to answer a question here, why not be as accurate as possible?  There is a big difference NOW between a device doing all the work, and the device sending the work to the cloud. In this particular case, the answer does not explain why it stops working and understanding when the network is lost. Since points were already awarded I was simply trying to add more accurate details. Sorry if some of you are offended at trying to be more accurate with an answer.
I know this question has been marked as SOLVED (so I'm not adding further comment just to try and gain points or anything), but:

@serialband - "The modern computer eventually will evolve and we'll call cloud based systems a computer system."

That's not the modern computer evolving really, and I thought we already DID call 'cloud' systems a computer system.... because that's what they are!
Or have I misinterpreted the post?
@IT-Expert - I think what he's saying is, in the past, we have referred to the computer as the actual hardware, etcetera which is physically in our house. It did all computing without having to "go for assistance outside the room." Then we started offloading specific processes to math coprocessors, video cards, etcetera but they all still resided at the same address, on the same piece of motherboard with no outside assistance. When one says "computer system" today, most non-technical people still only think of the physical box at the physical address as the "system."

In the future, with Alexa being a good example, the systems at my physical address will be "smart enough" to get started and then offload the remaining computing process(es) to a more powerful system at another physical address via "the cloud" (which is nothing more than a consumerized name for remarketing the Internet) then bring the answers back to my physical address. As this becomes "more publicly understood", saying "computer system" will simply mean "the stuff that gets me answers / results regardless of location."

Sorry for the long answer but that's what I heard when he posted that statement ... an evolution of understanding for the non-technical people using the "plug and play devices" like Alexa; plug it in, put it on Wifi and get me my stuff ...