Encrypted VoIP Meets Traffic Analysis

Der_Yak writes "Researchers from MIT, Google, UNC Chapel Hill, and Johns Hopkins recently published a paper that presents a method for detecting spoken phrases in encrypted VoIP traffic encoded with variable-bitrate codecs. They claim an average accuracy of 50%, and as high as 90% for specific phrases."
  • by Anonymous Coward

    Better stick to a constant bitrate then :)

    • Exactly, or just add enough random data to the stream alongside the voice channel to make it look like a constant stream of random data.
  • by Anthony Mouse ( 1927662 ) on Tuesday March 15, 2011 @10:30AM (#35492122)

    Use fixed-bitrate encoding for VoIP.

    • by ackthpt ( 218170 )

      Use fixed-bitrate encoding for VoIP.

      Better still, two cans and a length of string.

      • by Bengie ( 1121981 )
        Until someone gets a warrant to string-tap you. You'd think the string connecting the two cans is protected by quantum randomness from string theory, but it is not.
    • Re: (Score:3, Interesting)

      by bsquizzato ( 413710 )

      Not so obvious --- now you have a much less efficient use of bandwidth to deal with.

      The article describes the method used to detect phrases ...

      At a high level, the success of our technique stems from exploiting the correlation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.

      Essentially, you gather enough information about how a VBR codec could encode a speech phrase you are looking for, then predict where it was spoken by looking at the "data bursts" being sent in the media stream. We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eat up unneeded bandwidth.
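The matching step the paper describes can be sketched roughly like this. Everything below is a toy illustration with made-up packet sizes; the actual attack builds profile HMMs from phoneme-level models rather than doing naive window matching:

```python
# Toy illustration of the matching idea: scan a stream of observed
# (encrypted) packet lengths for a window that closely matches a
# length "signature" associated with a target phrase.

def match_score(window, signature, tolerance=2):
    """Fraction of positions where the observed packet length is within
    `tolerance` bytes of the signature's expected length."""
    hits = sum(1 for obs, exp in zip(window, signature)
               if abs(obs - exp) <= tolerance)
    return hits / len(signature)

def find_phrase(stream, signature, threshold=0.8):
    """Return start offsets where the packet-length stream resembles
    the phrase signature."""
    n = len(signature)
    return [i for i in range(len(stream) - n + 1)
            if match_score(stream[i:i + n], signature) >= threshold]

# Hypothetical VBR packet sizes (bytes) for a short phrase.
signature = [33, 41, 41, 54, 33, 29, 41]
stream = [29, 29, 33, 41, 40, 54, 33, 29, 41, 29, 29]

print(find_phrase(stream, signature))  # -> [2]
```

The real classifier has to tolerate far more variation (speaker, noise, phoneme duration), which is what the HMM machinery is for; this only shows why packet lengths leak anything at all.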

      • by Anonymous Coward on Tuesday March 15, 2011 @11:09AM (#35492684)

        OpenSSH had a similar problem: it would leak information about your login password through the timing and size of packets:

        http://www.ece.cmu.edu/~dawnsong/papers/ssh-timing.pdf

        I believe their solution was to introduce random NOP packets into the stream. This approach could work here too.
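A minimal sketch of what random NOP injection might look like. All names and framing here are hypothetical; in a real implementation the dummy flag would live inside the encrypted payload so an eavesdropper cannot distinguish dummies from real frames:

```python
# Sketch of the countermeasure suggested above: interleave random dummy
# ("NOP") packets into the outgoing stream so packet timing and count no
# longer track the speech content. The ("real"/"dummy", payload) tagging
# stands in for a flag that would be hidden inside the ciphertext.

import random

def with_dummies(packets, dummy_prob=0.3, dummy_size=40):
    """Sender side: after each real packet, maybe emit a dummy packet
    filled with random bytes."""
    out = []
    for pkt in packets:
        out.append(("real", pkt))
        if random.random() < dummy_prob:
            out.append(("dummy", bytes(random.getrandbits(8)
                                       for _ in range(dummy_size))))
    return out

def strip_dummies(stream):
    """Receiver side: decrypt (omitted here) and keep only real packets."""
    return [pkt for kind, pkt in stream if kind == "real"]

real = [b"frame%d" % i for i in range(5)]
assert strip_dummies(with_dummies(real)) == real
```

The cost is extra bandwidth proportional to `dummy_prob`; as noted below, a voice stream also cares about latency, so the dummies must not delay real frames.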

        • I immediately thought of this exploit as well. Seems to me you would need a lot of NOP packets comparatively, since the login info is just a few keystrokes. Plus, login info is not time-sensitive on the receiving end, whereas delays in a voice stream might not be acceptable.

      • by buback ( 144189 )

        So I guess it's like how dentists understand their patients when they have their hands and tools in their mouths.

      • by tixxit ( 1107127 )
        Some encrypted systems actually specify how much data can be "leaked" out per some amount of time. The idea is that, practically, you'll always lose something, so you need to determine a limit that is acceptable. I guess that while voice/sound "data" is very complex, speech is much less so and it doesn't take much data being leaked to get the gist of what was said. Since their method is essentially looking at a sequence of numbers, the more obvious solution may be to add some padding to the packets to foil
      • We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eat up unneeded bandwidth.

        Any fix is going to "waste" some amount of bandwidth.

        One solution to this attack may be to semi-randomly inject "nops" to bridge phoneme breaks. So instead of being able to identify individual phonemes by bandwidth spikes, attackers will be limited to identifying entire word clusters - like filling the "space" between the phonemes in the first three words of a sentence to make it look like one really long phoneme.

        But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that they are transmitted somewhat out of order and then re-ordered on the receiving end. That probably won't use up much extra bandwidth but would increase latency.

        • by Anonymous Coward

          But perhaps something more exotic might work, like randomly re-ordering chunks of audio so that they are transmitted somewhat out of order and then re-ordered on the receiving end. That probably won't use up much extra bandwidth but would increase latency.

          Might not even need to re-order the audio, just burst it so that multiple phonemes are all "packed" together for transmission so there are much fewer phoneme breaks visible via traffic analysis. You burn latency that way too, but it would be much simpler to implement than a randomizing algorithm.

          • by dgatwood ( 11270 )

            Agreed that the problem is the packing, not the data. However, grouping multiple short packets together is still leaking information. The only difference is that instead of looking at the length of packets, you have to look at the timing between packets.

            I would suggest that the right solution is to modify your code so that instead of sending out packets of varying length isochronously, you instead send out packets of the same length isochronously, and adjust the average length every... say ten seconds, a

          • Thus you increase latency, which is the single most important thing in a phonecall.

        • by NateTech ( 50881 )

          Using a VBR codec and then inserting NOPs sounds like... using a non-variable streaming codec.

      • We'll need to research a way to "scramble" this predictability that's more efficient than using fixed bitrates, which eat up unneeded bandwidth.

        It seems like there might be some promise in improving the compression method itself using the same techniques, so that the things that currently take more bandwidth would take less and therefore become less distinguishable, but if the compression is already near-optimal then this won't work without an efficiency loss because the change would correspondingly make the things that currently take less bandwidth take more, and those things might be more common.

        The only general solution is some kind of padding s

      • It's very clever. Seems like using a CBR encoder would defeat this method, because every packet would have the same number of samples. Being *too* efficient might save you bandwidth, but it reveals something about your speech patterns.
      • At a high level, the success of our technique stems from exploiting the correlation between the most basic building blocks of speech—namely, phonemes—and the length of the packets that a VoIP codec outputs when presented with these phonemes. Intuitively, to search for a word or phrase, we first build a model by decomposing the target phrase into its most likely constituent phonemes, and then further decomposing those phonemes into the most likely packet lengths. Next, given a series of packet lengths that correspond to an encrypted VoIP conversation, we simply examine the output stream for a sub-sequence of packet lengths that match our model.

        Awesome.

        It's like listening to the "Mwa mwaa mwaa mwa mwa" voice that adults use in the old Peanuts television specials, and figuring out what they are saying based on the length of the "mwas" and their order in the conversation.

        • You mean like trying to decipher Kenny from South Park's words?

          I wonder what my kids would compare it to ...
      • by Kjella ( 173770 )

        Not so obvious --- now you have a much less efficient use of bandwidth to deal with.

        Enough to matter? According to my cell phone bill, I had over 100MB of data traffic last month. That's about 10 hours of 24 kbps CBR encoded voice, which is the highest possible CBR setting speex has. If it's on my DSL/cable/whatever line, who cares? Even if I did that 24x7 for a month it'd be 7-8 GB and I'm pretty sure even a teenage girl with mouth diarrhea has to sleep sometimes. If that's what it takes, I don't see CBR as being a dealbreaker.

        • Now take hundreds of thousands of calls like yours running through your service provider's network, being transferred to other providers networks, etc. Or, hundreds/thousands of calls running w/in a large enterprise such as from branch offices to HQ. Bandwidth costs money. In situations like these, you try to conserve bandwidth any way you can.

        • by Eivind ( 15695 )

          Not enough to matter.

          VBR *does* save bandwidth for equivalent quality, but not a lot of it.

          Your 100MB gives you 10 hours of 24 kbps CBR-encoded voice, and at a guess, VBR would give you maybe 13-15 hours of voice in the same bandwidth.

          The savings are trivial, and certainly the answer to this problem is that encrypted voice should be encoded CBR to make traffic analysis impossible.
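The arithmetic behind the figures quoted in this subthread is easy to sanity-check (the 24 kbps number is the comment's claim about Speex's highest CBR mode):

```python
# Back-of-envelope check: how much data does 24 kbps CBR voice use?

kbps = 24                                  # claimed highest Speex CBR mode
bytes_per_hour = kbps * 1000 / 8 * 3600    # bits/s -> bytes/s -> bytes/hour

hours_in_100mb = 100e6 / bytes_per_hour    # talk time that fits in 100 MB
month_24x7_gb = bytes_per_hour * 24 * 30 / 1e9  # a month of nonstop talking

print(round(hours_in_100mb, 1))   # ~9.3 hours, i.e. "about 10 hours"
print(round(month_24x7_gb, 1))    # ~7.8 GB, i.e. "7-8 GB"
```

Both results agree with the estimates in the comments above.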

    • by Cthefuture ( 665326 ) on Tuesday March 15, 2011 @11:22AM (#35492852)

      Actually, most people are using G.711 these days, which is in fact fixed-bitrate (it's the same codec used on your normal "hard" voice line).

      But most VoIP providers do not offer SRTP or any encryption whatsoever so this whole thing is not even a question. More than likely anyone can listen in on your VoIP calls. We need to put more pressure on VoIP providers to offer encryption.

    • Working in telephony and VoIP for the last 8 years, I don't remember ever seeing a VBR codec in actual use. At most, silence detection is used, but that has unpleasant side effects too. I also find it pointless to save 2-3 bytes when the UDP+RTP overhead is 40 bytes (plus at least 4 more if SRTP is used).
  • by ackthpt ( 218170 ) on Tuesday March 15, 2011 @10:31AM (#35492136) Homepage Journal

    Teh Recognisining.

    "I'd like to order pizza, with pepperoni, pineapple, mushroom and an Iludium Pu-36 space modulator delivered to Hall of Justice."

  • Duh! (Score:2, Insightful)

    by Anonymous Coward

    When you want to secure something, you must think carefully about how you might be leaking information. You can't just slap some encryption on and call it a day.

  • by slashdotmsiriv ( 922939 ) on Tuesday March 15, 2011 @10:59AM (#35492548)

    The conference version of the paper appeared in IEEE S&P 2008.

    http://cs.unc.edu/~fabian/papers/oakland08.pdf [unc.edu]

  • by Anonymous Coward

    You mean when you vary a property of your signal (in this case, the bitrate) based on content, people can read information about the content from those variations??? OMFG!

  • by cellocgw ( 617879 ) <cellocgw&gmail,com> on Tuesday March 15, 2011 @11:25AM (#35492890) Journal

    The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.

    Pick a better algorithm and/or suck it up and waste a little bandwidth.

    • The definition (somewhere in the 'net archives) of encryption quality is how distinguishable the encrypted message is from random noise. Clearly setting bitrates, or any other parameter, based on the input, is not random.

      (A common) definition of symmetric encryption is that a message should be indistinguishable from an equal-length string of random bits. In that sense, there's nothing wrong with this encryption scheme.

      What is wrong here is that encryption does not hide message length, and in many cases mes

  • Google is involved in this? Perhaps encryption could help them improve the accuracy of transcription in Google Voice... [twitter.com]
  • I'm hoping it's best at picking up obvious spy phrases, like "the eagle has landed", "the moon fish squicks wickedly at midnight", "long is the gap between cacti"... Somehow I think it's probably best at "hello".
    • Somehow I think it's probably best at "hello".

      I'm one step ahead of these known-plaintext attacks -- no longer do I use the same, small set of voice greetings. No no -- I prepend a nonce.

      "Hello?"
      "Shgr'gl'hm-v'va Hi Mom, it's Clyde ... and you're not supposed to answer the phone like that!!!"

    • by NateTech ( 50881 )

      Who answers with "Hello" still? Waste of time. Look at Caller ID, "Hi XXX."

      Or... "This is XXX." That one always throws the telemarketers... "Is X there?" "Didn't I just say that?"

      Or my favorite, old military and any kind of "Operations" job folks... we just answer with our last name. One word, contact established, identity verified... go with your traffic.

      "Goodbye" is silly too. Just hang up.

  • Did you note that they specified variable bit rate? In this case, I'll bet it had more to do with the timing and flow of the packets and bytes than with the actual content of the bytes. When there's a pause in a person's speech, there is a pause in the network traffic. Imagine someone trying to send Morse code through an encrypted voice channel. Someone watching a bandwidth graph with a high enough sampling frequency would know exactly what coded message you sent, regardless of the compression or encryption algorithm.

  • A few solutions...

    Add some number of pad bytes to each packet to fill in blanks.

    Tweak existing high-complexity codecs (iLBC, Speex, etc.) to maintain a persistent bitrate by dynamically scaling quality to even out the per-packet bits.

    Use a fixed-bitrate codec (most of these really suck from a bandwidth-efficiency-vs-quality perspective).

    Switch variability to the time domain, adding jitter to mask the signal and control the latency/security tradeoff.

    SRTP scares me because it was invented for a single narrow purpose. Would
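The first option in the list above (padding each packet up to a fixed bucket size) might look something like this sketch. The one-byte length prefix is a made-up framing detail, purely for illustration; real padding schemes are negotiated by the protocol:

```python
# Pad every payload up to a fixed bucket size so packet lengths stop
# tracking the codec's variable output. Padding happens before
# encryption, so an observer only ever sees BUCKET-sized ciphertexts.

BUCKET = 64  # bytes; must exceed the codec's largest frame

def pad(payload: bytes, bucket: int = BUCKET) -> bytes:
    if len(payload) >= bucket:
        raise ValueError("payload too large for bucket")
    # Prefix the true length, then fill the rest with zero bytes.
    return bytes([len(payload)]) + payload + bytes(bucket - 1 - len(payload))

def unpad(padded: bytes) -> bytes:
    n = padded[0]
    return padded[1:1 + n]

frame = b"\x01\x02\x03voice"
assert len(pad(frame)) == BUCKET      # constant on-the-wire size
assert unpad(pad(frame)) == frame     # round-trips losslessly
```

The bandwidth overhead is the gap between the average frame size and the bucket size, which is exactly the tradeoff the thread is arguing about.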

  • by t2t10 ( 1909766 ) on Tuesday March 15, 2011 @12:17PM (#35493600)

    First of all, statements like "50% accuracy" are nearly useless; you need to know both precision and recall. And to the degree that "50% accuracy" tells you anything, it tells you that the system is pretty bad.

    Finally, the countermeasure for this is the same as the countermeasure for other automated speech analysis techniques: play some singing or theater in the background.
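For what it's worth, the precision/recall distinction is easy to illustrate with made-up numbers:

```python
# "Accuracy" alone hides the tradeoff between precision and recall.
# The counts below are invented for illustration.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of flagged phrases, how many were right
    recall = tp / (tp + fn)     # of real occurrences, how many were found
    return precision, recall

# A detector that flags 100 phrase occurrences, 50 of them correctly,
# while missing 50 real ones: 50% precision and 50% recall.
p, r = precision_recall(tp=50, fp=50, fn=50)
print(p, r)  # 0.5 0.5
```

A detector can trivially reach high recall by flagging everything, or high precision by flagging almost nothing, which is why a single "accuracy" number says little on its own.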

    • Exactly. The phrases used are fairly long, for instance: "Laugh, dance, and sing if fortune smiles upon you." In the TIMIT corpus, there are 122 of them. In the English language, there are hmm, lots of sentences of that length. There are about 1000 different syllables in English, and I count 11 syllables in that sentence. Thus, there are some fraction of 10^33 sentences of that length.

      So, if you tried this on English, one of two things would happen. If you used that recognizer without any modifi

  • Nexidia has been selling proprietary tech to do this for years
  • On any digital signal, comparing a random source of bits should get you 50% accuracy.

  • I'm sure there's a mathematical/statistical reason why 50% accuracy is better than guessing in this case, but that would be very counter-intuitive. The same goes for "as high as 90%" under certain conditions: I could get to 90% accuracy too if I could select out everything that reduced my accuracy. I don't doubt the full article explains it better, though. I'm not suggesting the MIT, Google, etc. scientists are stupid.
  • Seems that I'm starting to detect a pattern between the current TFA and this [slashdot.org] one.
    Now, DHS, I know I'm not at MIT, but other [wikipedia.org] cases showed I don't need to be... So, just where is my grant for advanced research on the subject?
