New Binary Diffing Algorithm Announced By Google 192
Posted
by
timothy
from the smartness-and-light dept.
from the smartness-and-light dept.
bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."
dictionary (Score:2, Informative)
n. Chiefly British
A zucchini.
Re:wait a minute (Score:2, Informative)
Another problem is that you would need to maintain every little diff from previous versions, and apply them one after another. Every so often you would have a cumulative diff maintained, so you'd do: 3.0.3 -> 3.0.4 -> 3.0.5 -> 3.2 (cumulative 3.0.5 to 3.2) -> 3.2.1 -> 3.2.2 -> 3.5 (cumulative 3.2.2 to 3.5) -> 3.5.1
Re:dictionary (Score:1, Informative)
Like many brilliant ideas... (Score:5, Informative)
...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".
The idea is simple, and reminds me of something I learned in school regarding signals. Some operations are easy to perform in the frequency domain, so you do the Fourier transform, perform the operation, and then transform back.
This is really just the same idea applied to the problem of patches. They're small in source; but big in binary. It seems so obvious that you could apply a transform,patch,reverse process... but only when pointed out and demonstrated.
It's almost like my favorite invention: the phonograph.
The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.
Are the inventors that brilliant, or are we just that stupid.
binary diff (Score:2, Informative)
If you're not familiar with the process of binary diff (I wasn't) there's a paper linked from the article that explains some about bsdiff:
http://www.daemonology.net/papers/bsdiff.pdf [daemonology.net]
Wayback from 2007/07/09:
http://web.archive.org/web/20070709234208/http://www.daemonology.net/papers/bsdiff.pdf [archive.org]
Re:wait a minute (Score:5, Informative)
Re:The cool thing is... (Score:5, Informative)
You didn't RTFA before posting did you? When they say assembler they mean something of their own that takes a binary blob and one you have already and reassembles the original binary. It just so happens that the disassembler knows a lot about windows executables and the archive format that google uses breaks it up into some portions and assigns labels to addresses. Then it runs bsdiff on this smaller subset.
The code outlined in the blog post is really in these files:
win32_x86_generator.h
win32_x86_patcher.h
Notice these names? This is the disappointing aspect to all of this, it is one more new reason that Chrome is x86 and primarily Windows. You would need one for Mach-O and ELF to do this on other platforms and then if you were on another processor, say ARM or PPC, you would need something that understood that as well. Then there is the issue about 64-bit. In any case the assembler is not something like gas or MASM which is what you imagined.
Re:Can a layman get an explanation in English? (Score:3, Informative)
Rather than send the difference in the executable, they send the difference in a sort of source code. Saves space if a small source change (move this line past this function) causes a big executable difference.
Re:Can a layman get an explanation in English? (Score:5, Informative)
Binary executable files contain a lot of addresses (variables, jump locations, ...) that are generated by the assembler at compile time.
Now consider you just add one 1-byte instruction somewhere in the middle of your program (let's say "nop"). When you compile it again, all the code that reference addresses beyond the insert point will have changed because the address has been incremented. So these 4 bytes added to your source code could mean addresses that get incremented in the compiled file in thousands of places.
What they do basically is take the binary file, disassemble it back to pseudo source code (not real asm I guess), and diff that against old version. The patch engine on the client end does the same disassembling, applies the patch, and reassembles the patched source code to an executable file.
This means diffs gets much smaller (4 bytes vs. 1000s in my extreme example), but also makes the diff/patch process much more complex, slower, and not portable.
Re:Also less overhead for Google (Score:3, Informative)
DeltaRPM uses bsdiff - an impressive but generic binary diff algorithm.
Courgette is designed to replace this binary diff with one that understands compiled code well enough to optimize the binary diffs by a significant amount.
Re:Can a layman get an explanation in English? (Score:4, Informative)
A compiler takes source codes and turns them into assembler code. That's lines of human-readable machine instruction mnemonics (for example, "Copy from here to here." "Is that bigger than zero?"). The assembler takes those lines and turns them into machine instructions, a sequence of binary numbers.
Finding the difference between two huge gobs of binary numbers is difficult. Instead, they turn the binary numbers back into lines of mnemonics and use a algorithm that finds the difference between two huge listings of mnemonics.
That method is easier because the listings of a program that has been changed slightly can be very similar to the listing of a unmodified program. That has to do with how compilers work.
Capiche? ;)
Re:The cool thing is... (Score:3, Informative)
The source for the disassembler is pretty simple.
http://src.chromium.org/viewvc/chrome/trunk/src/courgette/disassembler.cc [chromium.org]
Porting that to parse x86 out of ELF or another executable container wouldn't be too difficult. Porting it to parse x64 or PPC would be tougher.
Re:The cool thing is... (Score:3, Informative)
Actually it would be pretty hard to go x86 PEX to x86 ELF as relocations are done completely differently in those formats. They are often names or indexes in DLLs with PEX and handles (think dlopen) while there are all sorts of relocation entries in ELF where the instructions themselves get modified by the run time linker to the correct address or offset (the premunged instruction either has an index in it to a name or another relocation table with more info).
Re:Like many brilliant ideas... (Score:4, Informative)
I've been reviewing various proposals like this, and basically, it's a tradeoff mirrors don't want. This sorta stuff has been proposed for ages. I listened to a recording of the author of rsync give an introduction to the algorithm and program, and among the questions was "is this suitable for .deb?". The answer was "Not unless the archive is completely recompressed with gzip patched to be rsync compatible".
Eventually that patch landed, and you could conceivably do this. Except it expands the entire archive by 1-2 percent. And there's a lot of CPU overhead where there was none before.
Then someone cooked up zsync. It's the same thing as rsync except with a precalculated set of results for the server side. This looked like a winner. But someone else really wants LZMA compressed packages to fit more onto pressed CDs. So now we're at a fundamental impass: optimize for the distribution of install media to new users, or optimize for the distribution of updates to existing users.
The best resolution I've seen is to use LZMA compression on the entire CD volume, but that requires the kernel to get their ass in gear and allow yet another compression in the kernel. That may have finally happened, I haven't checked recently. But generally LZMA requires more RAM to operate, so that could raise the minimum requirements on installs.
In short, it's a balancing act of effort, bandwidth, CPU and RAM. What works for some may not work for all.
Re:Solving the wrong problem (Score:3, Informative)
If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.
You don't understand what the phrase "defective by design" means. It's used by anti-DRM folks to describe "features" that nobody wants and that actually reduce the usefulness of a product, but which are inserted into the product intentionally by the manufacturer out of a misguided desire to support DRM. If a bug/feature is "by design" then you should not expect a patch for it, ever.
A product that needs lots of security patches, on the other hand, is not defective by design; rather, it is simply badly designed.
Don't go out of your way to use catchphrases when simple English will do.