Remote Exploit of Vista Speech Control
An anonymous reader writes "George Ou writes in his blog that he found a remote exploit for the new and shiny Vista Speech Control. Specifically, websites playing soundfiles can trigger arbitrary commands. Ou reports that Microsoft confirmed the bug and suggested as workarounds that either 'A user can turn off their computer speakers and/or microphone'; or, 'If a user does run an audio file that attempts to execute commands on their system, they should close the Windows Media Player, turn off speech recognition, and restart their computer.' Well, who didn't see that coming?"
That's hardly an exploit (Score:5, Insightful)
Taking a computer that obeys audio instructions, and playing it some audio instructions, is more of a 'duh' than an 'exploit'. But this problem is a very Good Thing. It can only mean:
-- EITHER people stop yakking on about voice computing, which has been the Way Of The Future since about 1935 or something
-- OR pressure is exerted on web designers to NOT make sites that start making noise the moment the page appears!
Either of these, but especially the latter, would be a big win. So here's to you, Mr. Exploit Finding Man!
amusing, but not much else (Score:3, Insightful)
Admittedly, all I can think of is the Dilbert cartoon with Wally getting ticked at Dilbert having voice-driven software.
Bug? (Score:4, Insightful)
The Real Agenda of this Article? (Score:4, Insightful)
Voice controlled video player. Echo cancellation? (Score:1, Insightful)
Microsoft's comments on the BBC site are poor. What microphone feedback? If it's not howling now it's not going to suddenly howl when someone tries this exploit. Clear dictation - but the attacker will make the dictation as clear as possible, and the consolation that the user will likely be in the room to hear it happening - what consolation is that?
A solution would be to use echo cancellation as used in phone systems to prevent output from the speaker being used on the microphone.
- Richard
Re:That's hardly an exploit (Score:2, Insightful)
Re:That's hardly an exploit (Score:5, Insightful)
There has never been any sound from a webpage that didn't make me want to immediately beat the person who wrote it with his own leg. I don't want to listen to your stupid MIDI file of whatever the fsck you think is cool on your web page.
There was never any good reason to embed sounds in web pages unless you have to click a button to specifically play it.
Cheers
Re:In One Ear and Out the Other (Score:2, Insightful)
I guess you never saw a room with more than one computer in it.
Re:The Real Agenda of this Article? (Score:3, Insightful)
"All voice recognition software, no matter what platform, would suffer from this supposed "exploit". So why this article on Vista specifically? What is the real agenda here? Also, if the voice recognition software is trained for a specific user's voice, the chances of an exploit are reduced."
Yup, this is an old one. There's an apocryphal tale of a user group meeting from long ago of a vendor demonstrating voice-control software and a smart aleck in the back of the room yelling "DEL *.*!" (or whatever the MS-DOS command was).
As you implied, the agenda is, of course, to have a laugh at Microsoft's expense. If they hadn't included voice control software, the opportunity would have been to point out that Microsoft spent $BIGNUM person-years working on Vista and didn't even include that feature. OSX's easy access to a shell prompt with root access is about as relevant an exploit as the voice control exploit, and the odds of a cat wandering into my house and walking on the keys in such a way as to generate the wrong "rm" command are about the same as this Vista "exploit" happening to me. But, it's always fun to have a laugh at Microsoft's expense, isn't it?
Maybe a good start, but not that easy (Score:3, Insightful)
I imagine it's not quite so straightforward. You'd need to take into account room acoustics, hardware effects, generic ambient noises, or even other interfering sounds in the same room that could all interfere with a comparison of outgoing sound to incoming sound. It's very rare that you'd ever have a time where your outgoing sound file exactly matches one that is sensed coming from the speakers.
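To make "comparing outgoing sound to incoming sound" concrete, here is a minimal sketch of the usual first step: sliding the played signal along the recording and looking for the lag where the two correlate best. Everything in it is a toy assumption (white noise as the "sound file", a single 37-sample delay, Gaussian room noise, the hypothetical helper `best_lag`); a real room smears the signal far more than this.

```python
import random

def best_lag(played, recorded, max_lag):
    """Return (lag, score): the lag in samples at which `played` lines up
    best with `recorded`, and the normalised correlation at that lag."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        n = min(len(played), len(recorded) - lag)
        dot = sum(played[i] * recorded[i + lag] for i in range(n))
        e_played = sum(played[i] ** 2 for i in range(n))
        e_recorded = sum(recorded[i + lag] ** 2 for i in range(n))
        score = dot / ((e_played * e_recorded) ** 0.5) if e_played and e_recorded else 0.0
        if score > best[1]:
            best = (lag, score)
    return best

rng = random.Random(0)
played = [rng.uniform(-1, 1) for _ in range(2000)]      # stand-in for the outgoing audio

# Toy "room" (assumption): one delay, one gain, plus ambient noise at the mic.
delay, gain = 37, 0.5
recorded = [0.0] * delay + [gain * s for s in played]
recorded = [r + rng.gauss(0, 0.2) for r in recorded]

lag, score = best_lag(played, recorded, max_lag=100)
print("estimated lag:", lag, "correlation:", round(score, 3))
```

Even in this friendly setup the correlation falls short of 1.0 once noise is added, which is the point above: the recording never exactly matches what was sent, so any comparison has to work with a threshold rather than an exact match.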
Re:A Whole Decade of Nothing (Score:5, Insightful)
The sound that is output by the computer sounds similar to us when re-received through the mic and played back, but to the computer it's a totally alien waveform. A lot of distortion happens between when the computer sends a digital signal to the sound card and when it receives an analog signal from your microphone - so basically, the computer may know what it's playing, but it has very little idea how it'll sound when it reaches the mic.
There are advanced filters and algorithms that can try to match and isolate particular patterns and "sounds" within a waveform, but they're not nearly as powerful as CSI would have us believe, and they also require far too much computing power to be run in realtime.
Of course, the obvious low-tech solution to this issue is to wear headphones, as people in recording studios have for decades.
Re:A Whole Decade of Nothing (Score:5, Insightful)
Most simple schemes people come up with to address this are perfectly doable with a free sound program. Play some music, record the area while you're playing the music, then try your great idea. Like, you might think you can start out with inverting the source file and feeding it into the recording with a delay and modified amplitude. If you're really curious about this problem, this is a better way to learn about the difficulties than reading people on the internet, as, in my experience, you're quite likely to be skeptical about the explanations anyhow. The best (and in some sense, only true) explanations involve a lot of math.
I can offer you this meta-rule, though: If it were so easy, it would already have been done. Many things that I see people posting on Slashdot about "Why don't they just do this thing?" are covered by this rule.
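If you'd rather try the "invert it, delay it, scale it" idea without a microphone first, a rough simulation looks something like this. It is only a sketch under made-up assumptions: white noise as the music, a toy room consisting of a direct path plus one reflection, and hypothetical helper names (`delayed`, `subtract_scaled_copy`).

```python
import random

def delayed(signal, delay, gain):
    """A copy of `signal` shifted right by `delay` samples and scaled by `gain`."""
    return [0.0] * delay + [gain * s for s in signal]

def power(x):
    """Mean squared amplitude."""
    return sum(v * v for v in x) / len(x)

def subtract_scaled_copy(mic, played, delay, gain):
    """The naive fix: subtract one delayed, scaled copy of the source from the recording."""
    out = list(mic)
    for i, s in enumerate(played):
        if i + delay < len(out):
            out[i + delay] -= gain * s
    return out

rng = random.Random(1)
played = [rng.uniform(-1, 1) for _ in range(4000)]

# Toy room (assumption): a direct path plus a single wall reflection.
direct = delayed(played, 20, 0.6)
reflection = delayed(played, 75, 0.3)
mic_simple = direct + [0.0] * (len(reflection) - len(direct))   # direct path only
mic_room = [d + r for d, r in zip(mic_simple, reflection)]       # direct + reflection

residual_simple = subtract_scaled_copy(mic_simple, played, 20, 0.6)
residual_room = subtract_scaled_copy(mic_room, played, 20, 0.6)

print("leftover, direct path only:", power(residual_simple))
print("leftover, with reflection :", power(residual_room) / power(mic_room))
```

With only the direct path, the subtraction is perfect. Add a single reflection and a noticeable fraction of the energy survives, and that is with the delay and gain known exactly, which you never get in a real room.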
Re:That's hardly an exploit (Score:3, Insightful)
This is yet another case of Microsoft putting ease-of-use ahead of security and reliability. We've all heard this song before. Same story, different Windows version.
Re:Most Important Part of the Announcement (Score:1, Insightful)
Probably a good idea, though. And while we're at it, since Microsoft recommends rebooting (again, sigh), perhaps it is wise to do so with an installation CD of [linux distro of choice] in the drive. Seriously, who wants Vista? More trouble than it's worth.
I'm feeling anal today, so ... (Score:5, Insightful)
An exploit is, by definition, a successful manipulation of a bug/omission/hole/whatever in a computer system to make it perform something that it was not designed to do. Usually this term is only applied when said action is harmful or potentially harmful.
What is being described here is the possibility of controlling the voice recognition system in Vista remotely to make it perform potentially harmful tasks. Furthermore, this functionality is not something that said system was designed to do; it was only designed to accept commands via microphone.
Therefore, what is being described here is an exploit.
Q.E.D.
Re:Restart? Really? (Score:3, Insightful)
It's not necessary to restart the PC to turn off speech recognition - just say "stop listening" or click on the always-visible recognition toolbar to turn the microphone off. It's also not on by default, and only those interested in it will find it anyway. Not really an "exploit" that's actually exploitable.
Brilliant! (Score:3, Insightful)
Re:A Whole Decade of Nothing (Score:3, Insightful)
Re:The Real Agenda of this Article? (Score:3, Insightful)
For instance, the mic may not pick up any of the low frequencies due to location of a subwoofer, quality of speakers, sound absorbers (carpet, etc.). So in order to match the output to the input, you need to allow for these factors and by the time that you give yourself enough of a margin, you've in effect taken out all functionality.
Sure, it's fun to bash MS here on Slashdot. Just don't let reality get in the way.
-dave
Re:The Real Agenda of this Article? (Score:3, Insightful)
However, this should be a solvable problem with current DSP technology.
If my cellular telephone can perform realtime echo cancellation, and subtract its own speakerphone audio from the microphone audio, and do it for several hours at a time on a battery the size of a matchbook, then I can only fucking hope that a modern dual-core machine would be able to tackle the task handily.
Even after the variables are all multiplied by some factor because the speakers might move relative to the microphone, there seems to be plenty of horsepower available to throw at the problem. The fundamentals have all been solved by folks like Bell Labs, US Robotics, and Polycom a long fucking time ago, with less DSP power than my $20 optical mouse, using the widely variable POTS network as a testbed, where even the -remote- handset affects the quality of your own voice on the line.
Just because there are layers of distortion, band limiting, spurious external noises, with dynamics and delay possibly being anywhere on the map and an echo signature that changes as people move around the room, does not mean that it's not all measurable, quantifiable, and possible to reduce it to acceptable levels.
Remember, you don't have to get rid of all the feedback, and it doesn't have to be perfect. We're talking about limiting a computer's ability to hear itself, which is a far easier task than anything involving a human being. You only have to get rid of enough that the computer does not respond to its own voice. And also, remember that the resultant quality of the recorded microphone audio need not be production-grade, but only good enough for the computer to understand human-generated voice commands.
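For the curious, the textbook way to do this kind of "learn the room and subtract it" cancellation is an adaptive filter. Below is a minimal sketch using normalised LMS (NLMS); it is an illustration only, not a claim about what any phone, modem, or Vista actually ships. The toy room response, the noise level, and the names `room`, `nlms_cancel`, and `erle_db` are all assumptions made up for the example.

```python
import math
import random

def room(played, echoes):
    """Toy room (assumption): a sum of delayed, scaled copies of the played signal."""
    out = [0.0] * len(played)
    for delay, gain in echoes.items():
        for i in range(delay, len(played)):
            out[i] += gain * played[i - delay]
    return out

def nlms_cancel(played, mic, taps=32, mu=0.5, eps=1e-6):
    """Adaptively estimate the speaker-to-mic path and subtract the predicted echo.
    Returns (residual, weights); weights[k] is the learned gain at a delay of k samples."""
    w = [0.0] * taps
    history = [0.0] * taps                      # recent played samples, newest first
    residual = []
    for x, d in zip(played, mic):
        history = [x] + history[:-1]
        estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - estimate                        # what is left after removing the predicted echo
        norm = sum(h * h for h in history) + eps
        step = mu * e / norm
        w = [wi + step * hi for wi, hi in zip(w, history)]
        residual.append(e)
    return residual, w

def erle_db(mic, residual, tail=1000):
    """Echo reduction in dB, measured over the last `tail` samples (after adaptation)."""
    p_mic = sum(v * v for v in mic[-tail:])
    p_res = sum(v * v for v in residual[-tail:])
    return 10 * math.log10(p_mic / max(p_res, 1e-30))

rng = random.Random(2)
played = [rng.uniform(-1, 1) for _ in range(4000)]
echo = room(played, {3: 0.6, 10: -0.3, 21: 0.15})
mic = [v + rng.gauss(0, 0.01) for v in echo]    # a little ambient noise at the mic

residual, w = nlms_cancel(played, mic)
print("echo reduction (dB):", round(erle_db(mic, residual), 1))
```

The filter never removes the echo completely, since the ambient noise and its own adaptation jitter set a floor, but as argued above it doesn't need to. What matters is whether the leftover is quiet enough that the recogniser stops treating it as speech, and where that threshold sits depends on the recogniser, which this sketch does not model.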