Remote Exploit of Vista Speech Control
An anonymous reader writes "George Ou writes in his blog that he found a remote exploit for the new and shiny Vista Speech Control. Specifically, websites playing soundfiles can trigger arbitrary commands. Ou reports that Microsoft confirmed the bug and suggested as workarounds that either 'A user can turn off their computer speakers and/or microphone'; or, 'If a user does run an audio file that attempts to execute commands on their system, they should close the Windows Media Player, turn off speech recognition, and restart their computer.' Well, who didn't see that coming?"
or (Score:3, Informative)
Hey, no need to panic... (Score:4, Informative)
"Microsoft has said that even if the machine was primed to accept voice commands it would be unlikely the user would not be in the room to hear the file with malicious instructions being played."
Yeah, nobody ever leaves their computer unattended.
And of course, it would be completely impossible for a Trojan to pipe appropriate sounds directly to the input buffer of the sound hardware, thus negating the need for it to be played through your speakers at all. As we all know, Windows is completely watertight against that sort of thing.
This raises an interesting possibility, though - what if you could confuse the recogniser itself into making false positives? You could, for example, persuade it to recognise silence as a command of your choosing.
Best way round this is probably to prevent people doing potentially destructive operations via voice commands. But if this isn't suitable, you could employ clever confirmation strategies, like "If you're sure you want to delete c:\windows, please say the following words..." with the words in question being drawn from a dictionary. No malware could anticipate the sequence (although I suppose you could set the recogniser to work against itself, by playing the text-to-speech engine's own output back to it and triggering recognition).
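The random-word confirmation idea above might look something like this minimal Python sketch. The word list, the function names, and the assumption that the recogniser hands back plain text are all made up for illustration, and the challenge would have to be shown on screen rather than spoken, or the text-to-speech loophole mentioned above applies.

```python
import secrets

# Hypothetical word list; a real system would draw from a much larger dictionary.
WORDS = ["amber", "bicycle", "canyon", "dolphin",
         "ember", "falcon", "glacier", "harbor"]

def make_challenge(n_words=3, words=WORDS):
    """Pick n_words at random using a cryptographically strong source."""
    return [secrets.choice(words) for _ in range(n_words)]

def confirm(challenge, recognized_text):
    """True only if the recognised words match the challenge exactly, in order."""
    return recognized_text.lower().split() == challenge
```

A pre-recorded sound file would have to guess the words in advance; with a large dictionary and a few words per challenge, the odds of that become very small.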
Hmm. Promises to be quite fun, this.
howto for Mac users (Score:5, Informative)
$ echo "format sea slash you" | say -o evil.aiff
This renders your message in a nice, clear, even voice. Wouldn't want a bunch of 'um's and 'ah's borking up your exploit, now would you?
`man say` for more options.
Re:The Real Agenda of this Article? (Score:5, Informative)
All voice recognition software, no matter what platform, would suffer from this supposed "exploit". So why this article on Vista specifically?
This is untrue. Speech recognition software can be made to filter out anything coming in the mic that matches something going out the speaker channel. Simpler still, you can require that all commands be preceded by an arbitrary word (like the computer's name). Call your computer "George" and then issue the command "George, kill dash nine star dot star," as opposed to "kill dash nine star dot star." Since the exploit writer won't know to include "George," their exploit fails almost all the time. This was a feature of MacOS 7, more than a decade ago, as I mentioned elsewhere.
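As a rough illustration of the keyword-prefix idea, here is a small Python sketch. The function name and the assumption that the recogniser delivers each utterance as plain text are hypothetical; this is not how MacOS or Vista actually implement it.

```python
def extract_command(utterance, keyword):
    """Return the command that follows the keyword, or None if the keyword is missing."""
    words = utterance.strip().split()
    if not words:
        return None
    # Compare the first word, ignoring case and trailing punctuation.
    first = words[0].strip(",.!?").lower()
    if first != keyword.lower():
        return None
    rest = " ".join(words[1:])
    return rest or None
```

With the keyword set to "george", the utterance "George, kill dash nine star dot star" yields the command, while the bare "kill dash nine star dot star" is ignored.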
Also, if the voice recognition software is trained for a specific user's voice, the chances of an exploit are reduced.
Depending upon the tolerance, this is entirely possible, but I don't see it as being as important or versatile as the other two methods I listed above. MS should have learned from the example of others.
Re:That's hardly an exploit (Score:5, Informative)
MS Security Response Blog: Adrian responds (Score:5, Informative)
Issue regarding Windows Vista Speech Recognition
Hey everyone this is Adrian and I am writing to try and clear up some concerns regarding a recently reported vulnerability in the Speech Recognition feature of Windows Vista. An issue has been identified publicly where an attacker could use the speech recognition capability of Windows Vista to cause the system to take undesired actions. While it is technically possible, there are some things that should be considered when trying to determine what the threat of exposure is to your Windows Vista system.
He goes on to list reasons why this is not a major issue, the first being that voice commands have to be turned on and configured for this to work.
He ends with:
While we are taking the reports seriously and investigating them accordingly I am confident in saying that there is little if any need to worry about the effects of this issue on your new Windows Vista installation.
I think he's right. If this were a serious problem, the MacOS and OS/2 "exploits" mentioned above would've received a lot more press. Still, I expect that in a future version the voice software will be smart enough to ignore the computer's own output.
Personally, I don't like voice commands. They are necessary for users with certain impairments and useful for certain applications such as kiosks, but they are counterproductive in a shared-office environment and just plain weird on my desktop. Even on Star Trek: The Next Generation, much of the computer input was via control consoles, not voice.
Re:The Real Agenda of this Article? (Score:2, Informative)
Maybe they should ask the user for a keyword without offering a default? But how many people would use "computer" anyway?
Filtering? (Score:3, Informative)
Ever used a program such as Skype or other voice-chat software? Notice that when you have speakers and microphone on, you generally don't hear your own voice repeating in endless echoes (if echo cancellation is on, of course). Notice that you don't with the speakerphone on your cell either? That's because the software/hardware is smart enough to take the audio output and subtract it from the audio input (avoiding feedback loops etc). If used properly with voice-recognition software, it would prevent a webpage from playing output that gets picked back up by your input system. Since MS presumably has control over the audio subsystem of the operating system, it should be able to snag the master combined output and filter it in this way.
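For the curious, the "subtract the output from the input" step is usually done with an adaptive filter. Below is a toy sketch in plain Python using normalised LMS (NLMS), one common choice. The function names, the three-tap "room", and the parameter values are all assumptions for illustration. Real echo cancellers deal with far longer echoes, noise, and people talking over the playback.

```python
import random

def simulate_echo(played, room=(0.6, 0.3, 0.1)):
    """Toy room: the mic hears a short, fixed FIR-filtered copy of the output."""
    return [sum(h * played[n - k] for k, h in enumerate(room) if n - k >= 0)
            for n in range(len(played))]

def nlms_cancel(played, recorded, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `played` from `recorded`."""
    w = [0.0] * taps        # running estimate of the speaker-to-mic response
    history = [0.0] * taps  # most recent played samples, newest first
    residual = []
    for x, d in zip(played, recorded):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate  # what is left after removing the estimated echo
        norm = sum(hi * hi for hi in history) + eps
        # Nudge the estimate in the direction that shrinks the error.
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        residual.append(e)
    return residual
```

In this idealised setup the residual shrinks toward zero once the filter has adapted, so a recogniser fed the residual would hear almost nothing of what the speakers played.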
Now, that doesn't stop some annoying twit from walking by and telling your computer to do things it shouldn't. However, that issue could be prevented by building in an element of "speaker recognition" (the person speaking, not the speakers on your computer). Further, the machine could require a user-defined prefix or suffix to the command, such as "Computer, earl grey tea, hot!" or "Open the doors, Hal!"
Re:The Real Agenda of this Article? (Score:3, Informative)
I've seen EXPENSIVE noise canceling speakerphones screw this up.
Re:The Real Agenda of this Article? (Score:3, Informative)
Really? How do I get a shell prompt on a Mac with root access without typing my password?
I notice he hasn't responded to this. I'm thinking it's because, well, there isn't an easy way to do it. In fact, I can't think of a _hard_ way to do it. Maybe an SUID script to open it as root, but then you have the display thing to deal with. Hm... more likely he was just talking out his arse.
Re:That's hardly an exploit (Score:4, Informative)
mkdir
bind
run_noisy_application
Re:That's hardly an exploit (Score:3, Informative)
Couldn't the system simply have a filter that removes the wave signature of what it is outputting before processing input as a command? This is relatively simple technology, as compared to voice recognition itself. You might have to re-calibrate if you move your speakers but I would think that is a small price to pay to not leave open the ability for a web site to control your system through an auto-playing wave file.
The quick answer is "no." Even though the computer knows what waveform it is playing, it has no idea what waveform will actually emerge from the speakers, or arrive at the microphone.
The problem is that the audio system taken as a whole (Sound card DAC -> speaker wire -> speaker driver -> air in the room -> microphone pickup -> microphone wire -> sound card ADC) introduces small but significant spectral distortion into the sound by the time it runs through the entire system. Even if we ignore the nonlinearities of the amplifiers, the finite resolution of the digital-to-analog converters, and everything else, we still run into the problem of objects MOVING in the room (like you, leaning 2 inches forward in your chair), which changes the impulse response of the system and therefore changes the spectrum of the received signal.
Even if we consider only two elements, the speaker cone and the air in the room, it is fairly easy to see that the sound wave generated is NOT equivalent to the wave being sent to the speaker cone. Imagine a step signal (e.g. a Heaviside function) where the speaker deflection instantaneously goes from 0 to 1, then stays there. What does the AIR PRESSURE right next to the speaker cone do? Does it instantaneously jump from 0 to x and then stay there? No, of course not -- a WAVE propagates from the speaker into the air of the room. So the signal applied to the speaker and the signal in the room are not the same signal.
Now in theory, if all of these effects are linear, then the total impulse response can be computed. This is the "calibration" you mention. The problem, though, is that the system is not TIME INVARIANT, meaning its impulse response changes with time simply because of all the variables which affect the system.
So it's not only a matter of "recalibrating when you move your speakers." You have to recalibrate when the speakers move, when the temperature changes, when the air pressure changes, when the microphone moves, when the microphone has dust on it interfering with pickup, when anything at all in the room moves, when there is a draft in the room, etc etc.
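To make the calibrate-once problem concrete, here is a toy Python sketch. The two three-tap "rooms" are invented numbers, and a real room response is thousands of taps long, so take this purely as an illustration of the argument.

```python
def convolve(signal, impulse_response):
    """Output of a linear, time-invariant system with the given impulse response."""
    return [sum(h * signal[n - k]
                for k, h in enumerate(impulse_response) if n - k >= 0)
            for n in range(len(signal))]

def cancel_with_fixed_calibration(played, recorded, calibrated_response):
    """Subtract the echo predicted by a one-time calibration measurement."""
    predicted = convolve(played, calibrated_response)
    return [d - p for d, p in zip(recorded, predicted)]

# Hypothetical responses: the room at calibration time, and after something moved.
ROOM_AT_CALIBRATION = [0.6, 0.3, 0.1]
ROOM_AFTER_MOVING = [0.5, 0.35, 0.15]
```

If the room still matches the calibration, the subtraction leaves nothing behind. Once the response drifts, part of the played signal leaks through to the recogniser, even with a change as small as the one above.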
This would not be simple technology at all. Not impossible, but probably extremely expensive and unreliable.