New Binary Diffing Algorithm Announced By Google
bheer writes "Today Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates. Courgette achieves smaller diffs (about 9x in one example) than standard binary-diffing algorithms like bsdiff by disassembling the code and sending the assembler diffs over the wire. This, the Chromium devs say, will allow them to send smaller, more frequent updates, making users more secure. Since this will be released as open source, it should make distributing updates a lot easier for the open-source community."
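The idea can be sketched in a few lines of Python (a toy illustration only, not Courgette's actual code; the "disassembly" here is just substituting labels for raw addresses):

```python
import difflib

def symbolize(code, labels):
    """Stand-in for Courgette's disassembly step: replace raw
    addresses with symbolic labels, so addresses that merely
    shifted no longer show up as differences."""
    return [labels.get(tok, tok) for tok in code]

def diff_ops(old, new):
    """Count the non-equal regions a generic differ must encode."""
    sm = difflib.SequenceMatcher(a=old, b=new)
    return [op for op in sm.get_opcodes() if op[0] != "equal"]

# Inserting one "nop" shifts every address that follows it.
old_code = ["op_a", "0x1000", "op_b", "0x1004", "op_c", "0x1008"]
new_code = ["op_a", "0x1004", "nop", "op_b", "0x1008", "op_c", "0x100c"]

raw = diff_ops(old_code, new_code)   # every shifted address differs
sym = diff_ops(
    symbolize(old_code, {"0x1000": "L1", "0x1004": "L2", "0x1008": "L3"}),
    symbolize(new_code, {"0x1004": "L1", "0x1008": "L2", "0x100c": "L3"}))
print(len(raw), len(sym))            # prints 3 1
```

The generic diff has to encode every shifted address, while the symbolic diff collapses to the single inserted instruction.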
dictionary (Score:2, Informative)
n. Chiefly British
A zucchini.
Re: (Score:2)
it is from French. A cute word in French
Indeed, I couldn't help but lol when I first saw the name.
Re: (Score:2)
Thanks but in case you missed it I'm French, I know what that is.
Re: (Score:2)
Re: (Score:3, Funny)
Because the Americans pronounce the Italian word Zucchini flawlessly.
Re: (Score:2)
Stoopid Brits, Because the Americans pronounce the Italian word Zucchini flawlessly.
So I guess we Italians are lucky that 'zucchini' is _not_ an Italian word (the Italian word being zucchina (s.) / zucchine (pl.), the latter being pronounced zook-kee-neh) ;-)
Re: (Score:2, Insightful)
The same as in French (Score:2)
Re: (Score:2)
Re: (Score:2)
uses a primitive automatic disassembler (Score:4, Insightful)
Re: (Score:3, Insightful)
Why not just ship the affected .class files for Java? Disassemble the Jar that is being updated, replace/add/remove the class(es) as per the update instructions and rebuild the Jar.
Chrome's problem is that a massive, bulky, chrome.dll file needs to be sent out with each update, and that it isn't easy to split it up into its constituent parts because it is lower level.
It's not a new idea, but quite impressive that they've actually gone and worked it all out so that it is reliable. And nicer than patching via
Re: (Score:2)
Re: (Score:2)
If they ship the signature then the manifest can be updated. Shouldn't be a problem.
Re:uses a primitive automatic disassembler (Score:4, Interesting)
It would work fine, if you include Java/.NET specific disassemblers.
In fact, you can already compress Java JARs about ten times by using the pack200 algorithm (it works in essentially the same way).
Re: (Score:3, Interesting)
An interesting approach - I wonder if this would also work as well on compiled bytecode like .NET or Java uses?
I suspect that it would. I once heard Miguel de Icaza give a talk in the early days of the Mono project. He said the "bytecode" that comes out of the C# compiler is analogous to the output of the front end of a C compiler. The JIT is the equivalent of the C compiler's back end, only it runs right before execution. I suspect that what Google's disassembler is doing is reverting the binary to something similar to the C compiler's internal representation -- and if so, this method would work pretty
Overly Complex.. (Score:2)
From my first pass over it this actually looks like a very strange way to describe a rather simple approach.
More generically they could achieve the same, if not more, by using a differential PPM or LZ method, which is very simple to design/implement, needs zero knowledge of what data it is handling, and is in no way a new idea.
I suspect whoever designed this had a good idea but not enough compression experience to know solutions were already on hand.
In effect the new binary should be compressed with
Re:uses a primitive automatic disassembler (Score:5, Insightful)
It's an interesting approach to a problem that should never have arisen. Why does everything have to be thrown into a 10MB dll? People are browsing on computers, not wristwatches, and there's no reason to abandon modular design to tangle things together for trivial performance gains.
Just because the end result is a 10MB dll doesn't mean that the code and design aren't modular. That's like saying Linux isn't modular because the live CDs / DVDs come in a single gigantic 4.7GB .iso file.
Re: (Score:2)
Look at the unholy amounts of research and tricky implementation they put into Apps and Gmail and Wave.. inexplicably designing all their own widgets and developing the mind-bogglingly expensive Google App Engine into a Python- and Java- linkable library and stretching and extending the capabilities of Javascript and the DOM far beyond their intended purpose to get drag-and-drop, fancy transitions, live editing... all of this is standard fare in a desktop app but Google insists on coding it (inevitably ugly
Re: (Score:2)
"You can't exactly hop on your friend's XP box and run an X application from a remote server, unless he happens to have Exceed installed (for $$$)."
Well, at least you can install the Cygwin environment for free, can't you?
Also less overhead for Google (Score:5, Insightful)
Google has to pay the cost of maintaining servers and handling bandwidth for all the OS updates they push out. The more efficient they are in this process, the more money they save.
The good news is that the same benefits could apply to Red Hat, Ubuntu, openSUSE, etc. Lower costs help the profitability of companies trying to make money on Linux.
The end users also see benefits in that their packages download quicker. I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.
Re: (Score:2, Insightful)
Yeah, but it is probably more like unplugging wall warts than insulating your house.
That is, it will be a hassle, be visible, and show a tiny benefit, all the while distracting from more worthwhile activity.
Re: (Score:2)
> I'd be honestly pretty disappointed in any major distro that doesn't start implementing a binary diff solution around this.
Fedora already has Presto using DeltaRPMs [fedoraproject.org]
Re: (Score:3, Informative)
DeltaRPM uses bsdiff - an impressive but generic binary diff algorithm.
Courgette is designed to replace this binary diff with one that understands compiled code well enough to optimize the binary diffs by a significant amount.
Re: (Score:2)
Courgette uses bsdiff as well, after it breaks the binary down into parts that bsdiff handles better.
Re: (Score:2)
DeltaRPM also has to deal with non-code objects. I wonder how much of an average RPM is actually machine code.
Re: (Score:2, Interesting)
Fedora is already using binary diffs to speed up downloading updates - see yum-presto [fedorahosted.org]. With a better binary diff algorithm, the RPM updates can hopefully be made even smaller. As the Google developers note, making smaller update packages isn't just a 'nice to have' - it really makes a difference in getting vulnerabilities patched faster and cuts the bandwidth bill for the vendor and its mirror sites. Remembering my experiences downloading updates over a 56k modem, I am also strongly in favour of anything
Re: (Score:3, Interesting)
Re: (Score:2)
I doubt this is of much concern to them. How many people will be running their new OS and doing updates, vs. the bandwidth required for your average Google Maps session or YouTube video, used by many more people? Even the daily search results and ad serving to the content network would far exceed this.
I'm pretty sure the amount of bandwidth would be trivial in comparison. It's not like Ubuntu or MS require people to re-download the entire OS to perform a patch.
Besides which, with the network of mirrors they
Re: (Score:3, Insightful)
I don't run Ubuntu, but rather openSUSE. When I download updates (including weekly snapshot builds of KDE 4, Firefox and OpenOffice) I end up downloading around 3 gigs each time. That is practically the whole OS. Small binary diffs would make a huge difference here.
I doubt I pull 3 gigs of bandwidth from all Google sites combined in a week.
The other difference is that other Google sites generate revenue. The OS likely will not.
Rsync (Score:2, Interesting)
Would this algorithm be useful to the rsync team?
Re: (Score:3, Interesting)
Not really. Where Google's algorithm really shines is in exactly the field they designed it for: efficiently trying to update a large number of identical binary files of known content (particularly those representing code) on many remote computers, by sending a compact list of the differences.
Rsync actually has to solve a different problem: figuring out where differences exist between two files separated over a slow link when you DON'T precisely know the content of the remote file, but know it's likely simi
Re: (Score:2)
Re: (Score:2)
1. This is geared toward a specific type of file (x86 executable), not generic data files.
2. Adding an educated-guess-where-all-the-pointers-are system might just mess with the rsync protocol.
3. Google has the advantage of knowing, with a quick version number check, exactly what changes need to be made: most data flows from server to client. The rsync destination would have to send back two sets of rolling checksums: first for the disassembly, then for the guess made u
Re: (Score:3, Interesting)
Umm, many of us use rsync like mad on binaries such as ISO images or repository trees full of RPMs which are full of compressed data.
The rsync algorithm (see the cool thesis) is actually quite adept at processing "non-text" byte sequences. Its main innovation is a heuristic that breaks the stream into smaller phrases by identifying canonical start/end boundaries, running the same heuristic on both source and destination files (without ever possessing both files in the same place). It transfers a desc
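The weak rolling checksum at the heart of that heuristic can be sketched like this (a simplified Adler-style pair of sums, not rsync's exact formula or constants):

```python
def weak_checksum(block):
    """rsync-style weak checksum: two running sums over the block."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def roll(csum, old_byte, new_byte, block_len):
    """Slide the window one byte in O(1), without rescanning the block."""
    a = csum & 0xFFFF
    b = (csum >> 16) & 0xFFFF
    a = (a - old_byte + new_byte) & 0xFFFF
    b = (b - block_len * old_byte + a) & 0xFFFF
    return (b << 16) | a

data = bytes(range(20))
n = 8
# The checksum of data[1:9], obtained by rolling forward from data[0:8]...
rolled = roll(weak_checksum(data[0:n]), data[0], data[n], n)
# ...matches the checksum computed from scratch.
assert rolled == weak_checksum(data[1:n + 1])
```

The O(1) slide is what lets the receiver cheaply test every byte offset against the sender's block checksums.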
Re: (Score:2)
I wonder whether their assembler is deterministic. Operating on source code in general could be great, but compilation with optimizers and linkers is not necessarily deterministic, so a different sort of integrity protection system would be needed to validate the results of the process.
Yah, that was kind of my question too (I'm assuming it is). Generally, I'd think that you want everyone with the same version to have exactly the same deliverables.
Re: (Score:3, Interesting)
That was a very good high level explanation of how rsync works, thanks. It shows though that this approach is uninteresting to rsync. The hashes and heuristics are so that you do not have to compare the two ends over a potentially slow link, that is the beauty of rsync. In google's case they have all the versions locally and can do the comparisons orders of magnitude faster. So even if there was some way to get binary 1-1 (which I bet they do in fact) it would be useless for rsync as you would have to read
Re: (Score:2)
Rsync is useful when you don't know anything about the data; if you actually know something about it, there is a better way to do it.
Re: (Score:2)
Umm, many of us use rsync like mad on binaries such as ISO images or repository trees full of RPMs which are full of compressed data.
Neither of which are edited much by the end user, in my experience. What good is a diff algorithm on a file that hasn't changed?
Re:Rsync (Score:4, Interesting)
Well, your use of rsync is not necessarily typical. I use unison for file synchronization, and unison uses rsync; I have quite a few binary files that I sync. You're right that it would be a tradeoff of cpu versus bandwidth, but actually that's the whole point of rsync.
On the other hand, I can think of at least some other reasons that this algorithm would be less appropriate for rsync than for google's intended use. (1) Google knows that their file is an executable, and knows that it's an x86 executable. Rsync would have to detect this using the kind of heuristic algorithms used by the unix "file" utility. (2) It's kind of ugly to imagine building all this x86-specific stuff into a generic program like rsync, which people may be using on ARM machines, or which people may use in the future on architectures that haven't been invented yet. (3) Google is doing one-to-many file distribution (pushing out a single patch to millions of users), so that means that the tradeoff of cpu versus bandwidth is an excellent one for them. With rsync, the typical use is one-to-one (syncing machine A with machine B), so the tradeoff isn't as awesomely spectacular.
BTW, a completely different application of this that Google may be interested in is for Google Native Client, which runs x86 code sandboxed in a browser. Suppose google has 10^8 users using a particular web app that runs x86 code in their browser via Native Client. As long as the program doesn't change, most of the users are probably just getting it from cache, so it loads fast and doesn't cost google any bandwidth. Then on a certain day, google updates the software. Potentially this causes a bottleneck where 10^8 users are all trying to download the same code (could be especially bad on a corporate network with lots of people simultaneously downloading the same stuff on a Monday morning). If Courgette is incorporated in the browser, then potentially it could be smart enough to realize that it already has version x cached, and what it's trying to download is version x+1, and work out the patch without having to download the whole thing again.
Re: (Score:2)
Taking this idea a step further, there's no particular reason that it couldn't be done for other types of file as well. For example, uncompressed images could be losslessly compressed and transmitted, then rec
Re: (Score:2)
>> re: the computation overhead of disassembling, diffing, reassembling versus normal binary diffing
Executable files releases are one-to-many. Front loading computation overhead is useful.
Rsync transfers are (normally) one-to-one.
Re: (Score:2)
of course, there's a command line to override it either way
The cool thing is... (Score:3, Interesting)
The cool thing is, one can easily extend this to other executable formats, as long as the assembler is readily available client-side: Windows users could relate to those pesky, resource-hogging Java updates, and .NET and Mono applications could similarly benefit.
This is, interestingly, the second binary diffing innovation that affects me in the past few months. Fedora just turned on delta updates with Fedora 11, a feature borrowed from the openSUSE folks.
Re:The cool thing is... (Score:5, Informative)
You didn't RTFA before posting, did you? When they say assembler they mean something of their own that takes a binary blob plus the one you already have and reassembles the original binary. It just so happens that the disassembler knows a lot about Windows executables, and the archive format that Google uses breaks the binary up into portions and assigns labels to addresses. Then it runs bsdiff on this smaller subset.
The code outlined in the blog post is really in these files:
win32_x86_generator.h
win32_x86_patcher.h
Notice these names? This is the disappointing aspect of all of this: it is one more reason that Chrome is x86 and primarily Windows. You would need one for Mach-O and ELF to do this on other platforms, and then if you were on another processor, say ARM or PPC, you would need something that understood that as well. Then there is the issue of 64-bit. In any case the assembler is not something like gas or MASM, which is what you imagined.
Re: (Score:3, Informative)
The source for the disassembler is pretty simple.
http://src.chromium.org/viewvc/chrome/trunk/src/courgette/disassembler.cc [chromium.org]
Porting that to parse x86 out of ELF or another executable container wouldn't be too difficult. Porting it to parse x64 or PPC would be tougher.
Re: (Score:3, Informative)
Actually it would be pretty hard to go from x86 PEX to x86 ELF, as relocations are done completely differently in those formats. In PEX they are often names or indexes into DLLs plus handles (think dlopen), while ELF has all sorts of relocation entries where the instructions themselves get modified by the runtime linker to the correct address or offset (the premunged instruction has either an index to a name in it or another relocation table with more info).
Re: (Score:2)
wait a minute (Score:3, Insightful)
announced a new compression technique called Courgette geared towards distributing really small updates
I just RTFA, this has nothing to do with a compression technique.
What they developed is a technique to make small diffs of *executable binary files*, and it doesn't look like it's portable to anything other than x86, because the patch engine has to embed an architecture-specific assembler + disassembler.
Re: (Score:2, Informative)
Another problem is that you would need to maintain every little diff from previous versions, and apply them one after another. Every so often you would have a cumulative diff maintained, so you'd do: 3.0.3 -> 3.0.4 -> 3.0.5 -> 3.2 (cumulative 3.0.5 to 3.2) -> 3.2.1 -> 3.2.2 -> 3.5 (cumulative 3.2.2 to 3.5) -> 3.5.1
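A version-chain planner along those lines might look like this (a toy sketch using the made-up version numbers from the example above; a real update system would also verify each intermediate result, and this greedy rule assumes the chain ends at the target):

```python
def plan_patch_path(installed, target, patches, order):
    """Greedily walk the patch graph, always taking the edge that
    jumps furthest toward the target, so cumulative diffs are
    preferred over long chains of incremental ones."""
    path, cur = [], installed
    while cur != target:
        nxt = max((dst for src, dst in patches if src == cur),
                  key=order.index)
        path.append((cur, nxt))
        cur = nxt
    return path

order = ["3.0.3", "3.0.4", "3.0.5", "3.2", "3.2.1", "3.2.2", "3.5", "3.5.1"]
# One incremental diff per step, plus two hypothetical cumulative diffs.
patches = [(order[i], order[i + 1]) for i in range(len(order) - 1)]
patches += [("3.0.3", "3.2"), ("3.2", "3.5")]

path = plan_patch_path("3.0.3", "3.5.1", patches, order)
print(path)   # three hops instead of seven
```

With the two cumulative diffs available, a 3.0.3 install reaches 3.5.1 in three downloads instead of seven.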
Re:wait a minute (Score:5, Informative)
Re: (Score:2)
I had this idea (Score:2)
I remember having this idea a while back and thinking that it would surely be out there already implemented. When it wasn't, I didn't follow through and try to invent it myself. I was focused on another project at the time. Never really came back to that original idea. Awesome that someone did, and that they are keeping it open so everyone can use it! Thanks, Google!
Re: (Score:2)
What is cool about this text is, you could claim it for anything and still look smart. ;-)
Bad explanation (Score:5, Insightful)
That's potentially very misleading. I can compress any document, down to a single but, if my compression algorithm is sufficiently tailored to that document. For example:
if (compressed_data[0] == 0):
return = get_Magna_Carta_text()
else:
return unzip(compressed_data[1:])
What we need to know is the overall distribution of compression ratios, or at least the average compression ratio, over a large population of documents.
Re: (Score:3, Funny)
I can compress any document, down to a single but,
Oh crap. There goes any chance of this being a technical discussion.
Re: (Score:3, Funny)
Most of your point is good, but I suspect that, no matter what language you're using, ONE of these will give you a syntax error ;)
Re: (Score:2)
their technique is specifically geared towards executable binary files; it doesn't make sense to use it on anything else, and it won't work at all with any other type of file.
so yes, the 9x thing is quite unfair; that's like comparing bzip2 vs. lame for compressing music.
Even if you consider the set of binary strings that represent object code to be a subset of the set of all binary strings, I think my point still holds.
The only way my point would be totally invalid, I think, is if it were always the case that the assembly-language diff can be expressed in fewer bits than the corresponding object-code diff. If that is always true, then I agree that my point is invalid.
Today? (Score:4, Funny)
Google's Open-Source Chromium project announced a new compression technique called Courgette geared towards distributing really small updates today.
Better hurry! It won't work tomorrow!
Re: (Score:2)
I was thinking the same thing! Proofreading is a lost art these days.
Like many brilliant ideas... (Score:5, Informative)
...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?".
The idea is simple, and reminds me of something I learned in school regarding signals. Some operations are easy to perform in the frequency domain, so you do the Fourier transform, perform the operation, and then transform back.
This is really just the same idea applied to the problem of patches. They're small in source, but big in binary. It seems so obvious that you could apply a transform, patch, reverse process... but only when pointed out and demonstrated.
It's almost like my favorite invention: the phonograph.
The instructions for making an Edison phonograph could have been understood and executed by any craftsman going back thousands of years. Yet, it wasn't done until the late 19th century.
Are the inventors that brilliant, or are we just that stupid?
Re: (Score:2)
Or perhaps hindsight is 20/20.
Re: (Score:2)
Re:Like many brilliant ideas... (Score:4, Insightful)
Probably because the old solution was:
A) Simple
B) Good enough for most purposes.
Sure, you can shave 80% off your patch filesize... but unless you're as big as google, patch bandwidth probably isn't a major priority -- you've likely got much more important things to dedicate engineering resources to.
You know how they say "Necessity is the mother of invention"? Well, when an invention isn't really necessary for most folks, it tends to show up a little later than it might otherwise have.
Re: (Score:3, Insightful)
Actually, it made me smack my head and say "I assumed we were already doing this".
Sure, you can shave 80% off your patch filesize... but unless you're as big as google, patch bandwidth probably isn't a major priority
So, you've never installed a service pack or another OS update? I'd be more than happy to run "sudo aptitude patch-upgrade" to download the 3KB difference between OpenOffice 3.1.foo and 3.1.foo+1.
Re: (Score:2)
Sure, you can shave 80% off your patch filesize... but unless you're as big as google, patch bandwidth probably isn't a major priority
Its priority tends to scale, though. Once you're as big as Google, you have so many people downloading that shaving just a little off each download adds up to something that matters, even at that scale.
Re:Like many brilliant ideas... (Score:4, Informative)
I've been reviewing various proposals like this, and basically, it's a tradeoff mirrors don't want. This sorta stuff has been proposed for ages. I listened to a recording of the author of rsync give an introduction to the algorithm and program, and among the questions was "is this suitable for .deb?". The answer was "Not unless the archive is completely recompressed with gzip patched to be rsync compatible".
Eventually that patch landed, and you could conceivably do this. Except it expands the entire archive by 1-2 percent. And there's a lot of CPU overhead where there was none before.
Then someone cooked up zsync. It's the same thing as rsync except with a precalculated set of results for the server side. This looked like a winner. But someone else really wants LZMA-compressed packages to fit more onto pressed CDs. So now we're at a fundamental impasse: optimize for the distribution of install media to new users, or optimize for the distribution of updates to existing users.
The best resolution I've seen is to use LZMA compression on the entire CD volume, but that requires the kernel folks to get their ass in gear and allow yet another compression format in the kernel. That may have finally happened; I haven't checked recently. But LZMA generally requires more RAM to operate, so that could raise the minimum requirements for installs.
In short, it's a balancing act of effort, bandwidth, CPU and RAM. What works for some may not work for all.
Re: (Score:2)
...it makes you smack yourself on the head and go "why hasn't everybody been doing this for years?". ... It seems so obvious that you could apply a transform,patch,reverse process... but only when pointed out and demonstrated.
Basically google created a big custom 'transform' that applies only to x86 exe files. The reason why this is a boring story and why nobody has done this before is because nobody has found a generic, simple way to do this. And they still haven't.
If google had actually created a 'new binary diffing algorithm' instead of a specific hack, and this worked for most binary files that have similarities that would be newsworthy. For instance if it could out of the box create small diffs for .exe, .doc, .xls, .tt
Re: (Score:2)
It's only a helicopter because it has a spinning thing whose axle is perpendicular to the ground. And it wouldn't actually work. Even if it could achieve lift by turning the screw, without a tail rotor (or counter-rotating screw) it would simply spin and go nowhere. Then there is the problem of the tremendous power it takes to drive a helicopter rotor.
istartedi is right that any decent craftsman could have made a phonograph. Even with the correct information to design a good wing/blade, a helicopter wouldn'
Re: (Score:2)
The phonograph could have been executed thousands of years ago using what materials? I don't believe a practical phonograph could be made without modern materials.
Edison's first device used tinfoil cylinders. That was also found to be impractical for scale production, but it seems conceivable that some machine using a crank or a wound spring, a clay/beeswax/shellac medium and analog amplification could have been made during the Bronze Age.
Because we have no evidence of such devices and we assume people are o
Nothing terribly new (Score:2, Interesting)
There was a old, pre-X MacOS binary-diff program that diffed each element of the resource fork [wikipedia.org] independently.
Back in those days, non-code elements usually were stored in unique "resources," one icon might be in ICON 1, another might be in ICON 2, etc.
Code was split among one or more resources.
If a code change only affected one section of code, then only one CODE resource would be affected.
Since each resource was limited to 32KB, a diff that only affected one resource would never be larger than twice that size.
Re: (Score:2)
One thing is very different between then and now: disk space was at a premium, so there were a ton of optimizations to code fragments in the object file format to that end. For example, most data was zero, so only the portions that weren't were in the object file, bunched up. The utility you refer to was able to take advantage of this. It was otherwise a simple ed-like tool that could insert, remove, and replace. I wish I could remember the name.
binary diff (Score:2, Informative)
If you're not familiar with the process of binary diff (I wasn't) there's a paper linked from the article that explains some about bsdiff:
http://www.daemonology.net/papers/bsdiff.pdf [daemonology.net]
Wayback from 2007/07/09:
http://web.archive.org/web/20070709234208/http://www.daemonology.net/papers/bsdiff.pdf [archive.org]
Re: (Score:2)
Yes, and it is a generic approach that does not need to know anything about PEX, x86, or archives. In fact it is used by this Google implementation. And something much like what Google has implemented here was done back in 1999, cited in the paper you linked:
B.S. Baker, U. Manber, and R. Muth, Compressing Differences of Executable Code, ACM SIGPLAN Workshop on Compiler Support for System Software, 1999.
Can a layman get an explanation in English? (Score:3, Insightful)
Please?
Re: (Score:3, Informative)
Rather than send the difference in the executable, they send the difference in a sort of source code. Saves space if a small source change (move this line past this function) causes a big executable difference.
Re:Can a layman get an explanation in English? (Score:5, Informative)
Binary executable files contain a lot of addresses (variables, jump locations, ...) that are generated by the assembler at compile time.
Now consider that you just add one 1-byte instruction somewhere in the middle of your program (let's say "nop"). When you compile it again, all the code that references addresses beyond the insertion point will have changed, because those addresses have been incremented. So these 4 bytes added to your source code can mean addresses that get incremented in thousands of places in the compiled file.
What they do basically is take the binary file, disassemble it back to pseudo source code (not real asm I guess), and diff that against old version. The patch engine on the client end does the same disassembling, applies the patch, and reassembles the patched source code to an executable file.
This means the diffs get much smaller (4 bytes vs. thousands in my extreme example), but it also makes the diff/patch process much more complex, slower, and not portable.
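To see the scale of the ripple, here is a toy "binary" made of 4-byte little-endian addresses (the addresses are made up, purely illustrative): inserting a single NOP perturbs a thousand bytes, while in label form the diff would be just the one inserted instruction.

```python
import struct

def build_image(nop_inserted):
    """A fake code image: 1000 absolute 4-byte little-endian jump
    targets, all pointing past the insertion point, so inserting a
    single NOP shifts every one of them by 1."""
    base, shift = 0x401000, (1 if nop_inserted else 0)
    return b"".join(struct.pack("<I", base + 16 * i + shift)
                    for i in range(1000))

old, new = build_image(False), build_image(True)
changed = sum(a != b for a, b in zip(old, new))
print(changed)   # prints 1000: a 1-byte source change touches 1000 bytes
```

A raw byte differ has to encode all 1000 changed bytes; a differ that sees "target = label L(i)" would see two identical listings plus one insertion.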
Re: (Score:2)
Wouldn't the diff patch make the code more portable? IANA open source hacker, so this is an honest question.
Let's say you are using the example you gave. All the addresses are referencing memory locations that may change based on the distribution being used (although that wouldn't be a factor for Google). You might have #ifdef statements in the code that include one set of files if compiled under BSD, as an example, and another set if compiled under GNU/Linux, and your memory locations may vary based on the
Re:Can a layman get an explanation in English? (Score:5, Insightful)
But, if the compilers are similar enough to create the same pseudocode/bytecode/ASM, or smart enough to save the source code, and use it for future comparisons, then wouldn't one patch be just as portable as the original source code?
It's a good theory and you're a smart person, but:
Portability doesn't appear to be Google's primary concern, though. They seem to be keen on the idea of delivering binaries over the wire (real binaries, not bytecode) -- see Google Native Client. [infoworld.com]
Re:Can a layman get an explanation in English? (Score:4, Informative)
A compiler takes source code and turns it into assembly code: lines of human-readable machine instruction mnemonics (for example, "copy from here to here", "is that bigger than zero?"). The assembler takes those lines and turns them into machine instructions, a sequence of binary numbers.
Finding the difference between two huge gobs of binary numbers is difficult. Instead, they turn the binary numbers back into lines of mnemonics and use an algorithm that finds the difference between two huge listings of mnemonics.
That method is easier because the listing of a program that has been changed slightly can be very similar to the listing of an unmodified program. That has to do with how compilers work.
Capiche? ;)
Re: (Score:2)
Thanks! That makes sense.
Re: (Score:2)
I'm not sure about the benefits.
1) The bulk of many (most?) software packages is resources, not executables
2) A lot of an executable is linker/DLL overhead, especially in the smaller ones
3) The optimizers (I think) remix the assembly instructions, so small changes in the program logic result in a lot of changes in the assembly and, ergo, in the machine code. The best solution in terms of bandwidth remains sending diffs of the high-level source code.
Re: (Score:2)
Good points; you're pretty much right regarding security patches. I hope those ideas can be implemented in the Debian package manager, since currently many little security fixes require updating several full packages because of the dependency rules. Of course this is a different but related problem.
>There's also the fact that (a) there's little motive to use full-blown optimized compilation on most of an application's code (best save it for the 20% of the code that runs 80% of the time),
AFAIK, the distrib
Re: (Score:2)
Re: (Score:2)
Imagine you had a tape of someone talking and you needed to tell someone else what was different from the last time they spoke. One way would be to listen to the language and words and describe the differences in meaning. The other way would be to just send the new tape to replace the previous one. A similar problem happens with new versions of software: you want to send just the changes, but you end up having to send the whole thing.
This technology essentially writes down what the
That's just a disassembler. How about bittorrent? (Score:2)
This diffs the disassembled version of the original against the update on the server, then does the opposite on the client. I couldn't help but think of this as similar to Gentoo's model ... download a compressed diff of the source and then recompile. Both have the same problem: too much client-side CPU usage (though Gentoo's is an extreme case). Isn't Google Chrome OS primarily targeting netbooks? Can such machines handle that level of extra client-side computation without leaving users frustrated?
Re: (Score:2)
Torrents are problematic in that they are contrary to the way many, if not all, ISPs are set up. Nodes are intended to be consumers, not data providers. Shifting the downloaded data to the UPSTREAM is against the intent of the architects and produces problems.
Redesign the network, sure. Until then, torrents for updates are only taking your problem and shifting it to someone with fewer resources (without permission.)
BCJ filter (Score:3, Interesting)
The concept of a Branch-Call-Jump (BCJ) filter is well known in the data compression community, and is a standard part of quite a few deployed compression products. Used as a front end to a conventional compression algorithm -- or, in this case, to a binary diff algorithm -- it does indeed give significant improvements. The application to binary diff is particularly interesting, since it means you can deal with branches and other references *over* the compressed region, so this is really rather clever.
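For the curious, here is a minimal sketch of what a BCJ pass does (a toy for one opcode only; real BCJ filters, such as the x86 filter in xz/LZMA, handle alignment and are carefully reversible): rewriting CALL rel32 displacements as absolute targets makes every call to the same function byte-identical, which the downstream compressor or differ can then exploit.

```python
import struct

def bcj_encode(code: bytes) -> bytes:
    """Toy x86 BCJ pass: convert E8 (CALL rel32) relative displacements
    into absolute targets, so repeated calls to one function become
    identical byte runs."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == 0xE8:  # CALL rel32 opcode
            rel = struct.unpack_from("<i", out, i + 1)[0]
            target = (i + 5 + rel) & 0xFFFFFFFF  # next-insn address + rel
            struct.pack_into("<I", out, i + 1, target)
            i += 5
        else:
            i += 1
    return bytes(out)

# Two call sites targeting the same function at 0x100: the relative
# operands differ, but the encoded (absolute) operands are identical.
code = bytearray(32)
code[0] = 0xE8
struct.pack_into("<i", code, 1, 0x100 - 5)    # rel32 from offset 0
code[16] = 0xE8
struct.pack_into("<i", code, 17, 0x100 - 21)  # rel32 from offset 16
enc = bcj_encode(bytes(code))
print(enc[1:5] == enc[17:21])  # True
```

Decoding is the same walk in reverse (absolute back to relative), which is why the transform costs almost nothing at patch-apply time.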
LLVM (Score:2)
wrong title, of course (Score:2)
it's not 'binary diffing'.
does it work with gif files? jpg? png? dng? mov? mpg?
it's meant for PROGRAM binary files, of a specific type of processor at that.
that's fine.
but please say so, and don't imply it works on any binary. it's really specific and not helpful for image (pic, sound) formats.
What? (Score:2)
Since this will be released as open source, it should make distributing updates a lot easier for the open-source community.
We open source community don't have no F**KING business in binary distribution!
The dynamic deltup server network (Score:2)
A similar approach for distributing updates to source packages has been around for years: The dynamic deltup server network. You can tell their servers which source archives you already have and which new version you want. The server then unpacks both archives and sends you a deltup diff that can be used to create a bit-by-bit copy of the desired source archive, using the deltup program.
An example use case for this is source-based operating system distributions, like Gentoo GNU/Linux. The saved bandwidth i
source code patch (Score:2)
If Google's OS is to become open source, why the need for binary diffs at all?
Isn't it feasible for the client to store all the source, so that the patch process just involves a diff against the original source code, which then gets recompiled?
Also, this disassemble, edit, reassemble scheme isn't new. There are a few viruses out there that perform similar actions in order to create space in the executable for themselves. Pretty ingenious, actually.
many ways to do that (Score:3, Interesting)
As bandwidth was a tad limited in those days, we too looked for an efficient way to distribute updates. The solution was to distribute the smaller bug fixes as patches, similar to debug scripts. The loader would run those debug scripts after loading the program. To apply a patch, the customer would put the patch file in the same folder as the program, restart the program on the hot-standby side of the cluster, and provoke a switchover to the standby.
The patch was applied without any downtime. If the customer wanted to back out the bug fix, all they had to do was delete the patch file and switch back to the unpatched side of the cluster.
Most patches were small and we only had a few hundred bytes to send out at a time. Afterwards the world upgraded to Windows and forgot such technology :-(
Re: (Score:2)
How dare people ask questions. Nothing good ever came from doing that!
Re: (Score:2)
Don't be so hard on yourself. I'm sure you were already modded down way before that sentence.
Re:Solving the wrong problem (Score:4, Insightful)
If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.
No one is saying that the bandwidth is a problem. They're just saying that the bandwidth is unnecessary. FSM-forbid that anyone try to optimize something.
Plus, as the article points out, smaller updates mean more people can receive the update per unit-bandwidth, which means faster distribution of security updates when something critical is fixed.
Re: (Score:2)
Re: (Score:3, Informative)
If the code is so awful that the bandwidth required for security updates is a problem, the product is defective by design.
You don't understand what the phrase "defective by design" means. It's used by anti-DRM folks to describe "features" that nobody wants and that actually reduce the usefulness of a product, but which are inserted into the product intentionally by the manufacturer out of a misguided desire to support DRM. If a bug/feature is "by design" then you should not expect a patch for it, ever.
A product that needs lots of security patches, on the other hand, is not defective by design; rather, it is simply badly desig
Re: (Score:2)
Still, whatever you call it, it's good to see progress being made. I just wonder why you can't create small, efficient diffs for any kind of binary file?
You can. It's called "bsdiff".
The problem is that when you change a large piece of compiled code, even if just to insert a subroutine, it ends up changing a giant portion of the code because all the jump addresses shift to account for this new code in the middle. So though most of the code is identical, every place that has an address is now shifted. Goo
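The effect described above can be shown in a few lines (a toy sketch of the principle, not bsdiff's actual suffix-sorting implementation): when you subtract approximately matching regions bytewise, a uniform address shift collapses into a long run of identical delta bytes, which compresses to almost nothing.

```python
import zlib

# Toy sketch of bsdiff's core trick: store bytewise deltas against an
# approximately matching region of the old file. A constant shift in
# stored values then becomes a run of one delta byte.
old = bytes(range(256)) * 8              # stand-in for the old binary
new = bytes((b + 1) % 256 for b in old)  # every byte bumped, as if shifted

delta = bytes((n - o) % 256 for n, o in zip(new, old))
print(len(zlib.compress(new)) > len(zlib.compress(delta)))  # True
```

Courgette's extra step is to remove even those deltas where it can, by turning the shifted addresses back into stable symbolic references before diffing.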