Encrypted VoIP Meets Traffic Analysis
Der_Yak writes "Researchers from MIT, Google, UNC Chapel Hill, and Johns Hopkins published a recent paper that presents a method for detecting spoken phrases in encrypted VoIP traffic that has been encoded using variable bitrate codecs. They claim an average accuracy of 50% and as high as 90% for specific phrases."
TFA != Wiretap (Score:2)
No it does not work like that (Wire tapping encrypted video calls).
It does not tap the signal, but increases your odds when guessing whether something was communicated in a specific manner.
Re:Bleh (Score:5, Informative)
I'm pretty sure that identifying a specific word with 50% accuracy is better than random chance. There are more than two words in the English language.
Re:Bleh (Score:5, Funny)
Once they discover a method to wiretap encrypted video calls, that would open a new era in the porn scene.
...
I'm pretty sure that identifying a specific word with 50% accuracy is better than random chance. There are more than two words in the English language.
Maybe he's talking about the porn film. 90% seem to be "oh" or "yes" (or so I am told).
Re:Bleh (Score:5, Funny)
A low German voice - "ooohhh yaaaaa", over and over. Then you have the high-pitched Japanese squeak sound - "ii, ii, ii, kimochi". Which really gets annoying these days. It took a few years, but it IS annoying.
Re:Bleh (Score:5, Funny)
Re: (Score:2)
People only use two phrases when they talk?
Re: (Score:2)
Especially when being wiretapped.
Re: (Score:1)
People only use two phrases when they talk?
The phrases that it detects are "Badda-bing" and "Badda-boom."
Re: (Score:3)
I think if half the time you can identify a phrase in a supposedly encrypted stream ... that's better than 'chance'.
Re: (Score:1)
They are looking for specific words and phrases...
Bomb, president, freedom, take back control, uprising, constitutional....
You know, only words that the evil terrorists would use.
Re: (Score:2)
Oops ... wait a minute ...
Re:Bleh (Score:5, Funny)
Come on, 50% is better than most unencrypted voice recognition!
Re: (Score:1)
How many words are there in the English language - many tens of thousands at least.
Many tens of thousands???
I hope English is your second language.
There are over 1 MILLION English words in common and uncommon use.
[ http://www.languagemonitor.com/no-of-words/ [languagemonitor.com] ]
Yes.... many, many, many tens of thousands.
-AI
FWIW, in response to TFA... I realize their research is on phrases, which very quickly reduces the set, since many of those words would only exist in very few spoken phrases.
Re:Bleh (Score:5, Interesting)
Re: (Score:3)
I remember following this logic... when I was three. No shit, I have a vivid memory of trying to figure out how proportions worked - I knew that a penny tossed would give a 50/50 split, but that other problems with two states - e.g., when I threw a rock, I'd either hit the Matchbox car or I wouldn't - weren't. I gave up, and figured it out later, when I was five or so.
Re: (Score:2)
Well, assuming that he has no knowledge about how the thing works and has no other information, his computation of probabilities is technically correct :)
Re: (Score:1)
but they're recognizing individual words, from a set of many thousands of potential words, half the time or better.
That's really quite impressive. And you're an idiot.
From a set of many thousands of words...
and he's the idiot?
-AI
That's not good (Score:1)
Better stick to a constant bitrate then :)
So...obvious solution then? (Score:5, Interesting)
Use fixed-bitrate encoding for VoIP.
Re: (Score:2)
Use fixed-bitrate encoding for VoIP.
Better still, two cans and a length of string.
Re: (Score:3, Interesting)
Not so obvious --- now you have a much less efficient use of bandwidth to deal with.
The article describes the method used to detect phrases ...
At a high level, the success of our technique stems from exploiting the correlation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.
Essentially, you gather enough information about how a VBR codec could encode a speech phrase you are looking for, then predict where it was spoken by looking at the "data bursts" being sent in the media stream. We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
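To make the idea concrete, here is a toy sketch in Python of the "look for a matching sub-sequence of packet lengths" step. It is not the paper's method: the quoted passage talks about "most likely" packet lengths, which implies a probabilistic model, while this version just does exact set matching. The function name `find_phrase`, the model, and every packet length below are made up for illustration.

```python
from typing import List, Sequence, Set


def find_phrase(packet_lengths: Sequence[int], model: Sequence[Set[int]]) -> List[int]:
    """Return every start offset where the stream matches the model.

    `model` holds one set of plausible packet lengths per position of the
    target phrase (a crude stand-in for a probabilistic phoneme model).
    """
    if not model:
        return []
    hits = []
    window = len(model)
    # Slide a window over the observed lengths and test each position.
    for start in range(len(packet_lengths) - window + 1):
        if all(packet_lengths[start + i] in allowed
               for i, allowed in enumerate(model)):
            hits.append(start)
    return hits


# Hypothetical three-packet phrase model and an observed (encrypted) stream.
# Only the packet sizes are visible to the eavesdropper, never the payload.
model = [{38, 42}, {20}, {46, 50}]
stream = [20, 20, 42, 20, 50, 28, 38, 20, 46, 20]

print(find_phrase(stream, model))  # [2, 6]
```

Note that nothing here touches the encryption at all, which is the whole point: the sizes leak even when the contents do not.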
Re:So...obvious solution then? (Score:5, Informative)
OpenSSH had a similar problem, it would leak information about your login password by the timing/size of the packets:
http://www.ece.cmu.edu/~dawnsong/papers/ssh-timing.pdf
I believe their solution was to introduce random NOP packets into the stream. This approach could work here too.
Re: (Score:2)
I immediately thought of this exploit as well. Seems to me you would need a lot of NOP packets by comparison, since the login info is just a few keystrokes. Plus, login info is not time-sensitive on the receiving end, whereas delays in a voice stream might not be acceptable.
Re: (Score:2)
So I guess it's like how dentists understand their patients when they have their hands and tools in their mouths.
Re: (Score:3)
We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
Any fix is going to "waste" some amount of bandwidth.
One solution to this attack may be to semi-randomly inject "nops" to bridge phoneme breaks. So instead of being able to identify individual phonemes by bandwidth spikes, attackers will be limited to identifying entire word clusters - like filling the "space" between the phonemes in the first three words of a sentence to make it look like one really long phoneme.
But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that t
Re: (Score:1)
But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that they are transmitted somewhat out of order and then re-ordered on the receiving end. That probably won't use up much extra bandwidth but would increase latency.
Might not even need to re-order the audio, just burst it so that multiple phonemes are all "packed" together for transmission so there are far fewer phoneme breaks visible via traffic analysis. You burn latency that way too, but it would be much simpler to implement than a randomizing algorithm.
Re: (Score:2)
Agreed that the problem is the packing, not the data. However, grouping multiple short packets together is still leaking information. The only difference is that instead of looking at the length of packets, you have to look at the timing between packets.
I would suggest that the right solution is to modify your code so that instead of sending out packets of varying length isochronously, you instead send out packets of the same length isochronously, and adjust the average length every... say ten seconds, a
QoS (Score:2)
Thus you increase latency, which is the single most important thing in a phonecall.
Re: (Score:2)
Using a VBR codec and then inserting NOPs sounds like... using a non-variable streaming codec.
Re: (Score:2)
We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eats up un-needed bandwidth.
It seems like there might be some promise in improving the compression method itself using the same techniques, so that the things that currently take more bandwidth would take less and therefore become less distinguishable, but if the compression is already near-optimal then this won't work without an efficiency loss because the change would correspondingly make the things that currently take less bandwidth take more, and those things might be more common.
The only general solution is some kind of padding s
Re: (Score:2)
At a high level, the success of our technique stems from exploiting the correlation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.
Awesome.
It's like listening to the "Mwa mwaa mwaa mwa mwa" voice that adults use in the old Peanuts television specials, and figuring out what they are saying based on the length of the "mwas" and their order in the conversation.
Re: (Score:2)
I wonder what my kids would compare it to
Re: (Score:3)
Not so obvious --- now you have a much less efficient use of bandwidth to deal with.
Enough to matter? According to my cell phone bill, I had over 100MB of data traffic last month. That's about 10 hours of 24 kbps CBR encoded voice, which is the highest possible CBR setting speex has. If it's on my DSL/cable/whatever line, who cares? Even if I did that 24x7 for a month it'd be 7-8 GB and I'm pretty sure even a teenage girl with mouth diarrhea has to sleep sometimes. If that's what it takes, I don't see CBR as being a dealbreaker.
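For anyone who wants to check those numbers, a quick back-of-the-envelope in Python. It counts codec payload only and assumes decimal megabytes and a 30-day month; RTP/UDP/IP header overhead is ignored, so real traffic would be somewhat higher. The function names are just for this example.

```python
def hours_of_voice(data_bytes: float, bitrate_bps: float) -> float:
    """Hours of audio that fit in `data_bytes` at a constant bitrate."""
    return data_bytes * 8 / bitrate_bps / 3600


def bytes_for_duration(seconds: float, bitrate_bps: float) -> float:
    """Payload bytes needed to stream for `seconds` at a constant bitrate."""
    return seconds * bitrate_bps / 8


BITRATE = 24_000  # 24 kbps CBR, as in the comment above

# 100 MB of data at 24 kbps
print(f"{hours_of_voice(100e6, BITRATE):.1f} hours")  # 9.3 hours

# Talking 24x7 for a 30-day month
month_seconds = 30 * 24 * 3600
print(f"{bytes_for_duration(month_seconds, BITRATE) / 1e9:.2f} GB")  # 7.78 GB
```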
Re: (Score:1)
Now take hundreds of thousands of calls like yours running through your service provider's network, being transferred to other providers' networks, etc. Or, hundreds/thousands of calls running within a large enterprise such as from branch offices to HQ. Bandwidth costs money. In situations like these, you try to conserve bandwidth any way you can.
Re: (Score:2)
Not enough to matter.
VBR *does* save bandwidth for equivalent quality, but not a lot of it.
Your 100MB gives you 10 hours of 24kbps of CBR encoded voice, and at a guess, VBR would maybe give you 13-15 hours of voice in the same bandwidth.
Certainly trivial, and certainly the answer to this problem is that encrypted voice should be encoded CBR to make traffic analysis impossible.
Re:So...obvious solution then? (Score:5, Interesting)
Actually most people are using G.711 these days, which is in fact a fixed-bitrate codec (it's the same encoding used on your normal "hard" voice line).
But most VoIP providers do not offer SRTP or any encryption whatsoever so this whole thing is not even a question. More than likely anyone can listen in on your VoIP calls. We need to put more pressure on VoIP providers to offer encryption.
Stalin's Dream II (Score:3)
Teh Recognisining.
"I'd like to order pizza, with pepperoni, pineapple, mushroom and an Iludium Pu-36 space modulator delivered to Hall of Justice."
Re: (Score:3)
http://www.youtube.com/watch?v=7A4HeawmE6A [youtube.com]
Not knowing what an Illudium Pu-36 Explosive Space Modulator means you had a deprived childhood.
--
BMO
Re: (Score:1)
http://www.youtube.com/watch?v=7A4HeawmE6A [youtube.com]
Not knowing what an Illudium Pu-36 Explosive Space Modulator means you had a deprived childhood.
--
BMO
Hear, hear!
Marvin is the man! I mean, he's the silly thought and pseudo I use
for this nickname.
-AI
Duh! (Score:2, Insightful)
When you want to secure something, you must think carefully about how you might be leaking information. You can't just slap some encryption on and call it a day.
3-year-old work (Score:3)
The conference version of the paper appeared in IEEE S&P 2008.
http://cs.unc.edu/~fabian/papers/oakland08.pdf [unc.edu]
No shit? (Score:1)
You mean when you vary a quality of your signal (in this case bitrate) based on content, people can read information about the content from those variations??? OMFG!
then it's shitty encryption (Score:3)
The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.
Pick a better algorithm and/or suck it up and waste a little bandwidth.
Re: (Score:2)
(A common) definition of symmetric encryption is that a message should be indistinguishable from an equal-length string of random bits. In that sense, there's nothing wrong with this encryption scheme.
What is wrong here is that encryption does not hide message length, and in many cases mes
Google Voice (Score:1)
What phrases? (Score:2)
Re: (Score:2)
Somehow I think it's probably best at "hello".
I'm one step ahead of these known-plaintext attacks -- no longer do I use the same, small set of voice greetings. No no -- I prepend a nonce.
"Hello?" ... and you're not supposed to answer the phone like that!!!"
"Shgr'gl'hm-v'va Hi Mom, it's Clyde
Re: (Score:2)
Who answers with "Hello" still? Waste of time. Look at Caller ID, "Hi XXX."
Or... "This is XXX." That one always throws the telemarketers... "Is X there?" "Didn't I just say that?"
Or my favorite, old military and any kind of "Operations" job folks... we just answer with our last name. One word, contact established, identity verified... go with your traffic.
"Goodbye" is silly too. Just hang up.
Variable bit rate? (Score:2)
Did you note that they specified variable bit rate? In this case, I'll bet it had more to do with the timing and flow of the packets and bytes than with the actual content of the bytes. When there's a pause in a person's speech, there is a pause in the network traffic. Imagine someone trying to send Morse code through an encrypted voice channel. Someone watching a bandwidth graph that had a high enough frequency would know exactly what coded message you sent regardless of the compression or encryption algor
RTP blinding (Score:2)
A few solutions...
Add some number of pad bytes to each packet to fill in blanks.
Tweak existing high-complexity codecs (iLBC, Speex, etc.) to maintain a constant bitrate by dynamically scaling quality to even out the per-packet bits.
Use a fixed-bitrate codec (most of these really suck from a bandwidth efficiency vs. quality perspective).
Switch variability to the time domain, adding jitter to mask the signal and control the latency/security tradeoff.
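The first option (pad bytes) could look roughly like this in Python. This is only a sketch of the size arithmetic, not of any real SRTP feature: the 32-byte block size and the sample packet lengths are arbitrary assumptions, and a real implementation would also need a way for the receiver to strip the padding after decryption.

```python
def padded_length(payload_len: int, block: int = 32) -> int:
    """Round a payload length up to the next multiple of `block` bytes."""
    if payload_len <= 0:
        return block
    return -(-payload_len // block) * block  # ceiling division


# Hypothetical VBR packet sizes before padding.
observed = [20, 28, 38, 42, 46, 50]
padded = [padded_length(n) for n in observed]
overhead = sum(padded) / sum(observed) - 1

print(padded)  # [32, 32, 64, 64, 64, 64]
print(f"distinct sizes on the wire: {len(set(observed))} -> {len(set(padded))}")
print(f"bandwidth overhead: {overhead:.0%}")
```

A bigger block hides more but costs more; once the block is at least as large as the biggest packet, every packet is the same size and you are effectively back to fixed bitrate.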
SRTP scares me because it was invented for a single narrow purpose. Would
useless, and easy countermeasures (Score:3)
First of all, statements like "50% accuracy" are nearly useless; you need to know both precision and recall. And to the degree that "50% accuracy" tells you anything, it tells you that the system is pretty bad.
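To illustrate why one number is not enough, here is a small Python sketch with two made-up detectors. The counts are invented for the example and have nothing to do with the paper's results. Both detectors find half of the real occurrences, but one of them also raises far more false alarms.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    flagged = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: the phrase was really spoken 100 times.
noisy = precision_recall(true_pos=50, false_pos=450, false_neg=50)
careful = precision_recall(true_pos=50, false_pos=5, false_neg=50)

print(f"noisy:   precision={noisy[0]:.2f} recall={noisy[1]:.2f}")
print(f"careful: precision={careful[0]:.2f} recall={careful[1]:.2f}")
```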
Finally, the countermeasure for this is the same as the countermeasure for other automated speech analysis techniques: play some singing or theater in the background.
Re: (Score:2)
Exactly. The phrases used are fairly long, for instance: "Laugh, dance, and sing if fortune smiles upon you." In the TIMIT corpus, there are 122 of them. In the English language, there are hmm, lots of sentences of that length. There are about 1000 different syllables in English, and I count 11 syllables in that sentence. Thus, there are some fraction of 10^33 sentences of that length.
So, if you tried this on English, one of two things would happen. If you used that recognizer without any modifi
Nexidia (Score:1)
Average accuracy of 50%? (Score:1)
On any digital signal, comparing a random source of bits should get you 50% accuracy.
Better than guessing? (Score:2)
An exercise of pattern detection (Score:2)
Now, DHS, I know I'm not at MIT, but other [wikipedia.org] cases showed I don't need to... So, just where is my grant for advanced research of the subject?