Compress Wikipedia and Win AI Prize

Posted by CmdrTaco on Sunday August 13, 2006 @06:50PM from the what-does-this-mean dept.

Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge, the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize including a description of the equivalence of compression and general intelligence."

WikiPedia on iPod! (Score:2, Interesting)

by network23 ( 802733 ) * writes:

I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying [sourceforge.net]
info.edu.org [edu.org] - Speedy information and news from the Top 10 educational organisations.
- Re:WikiPedia on iPod! (Score:3, Insightful)
  
  by Fred Porry ( 993637 ) writes:
  
  Then it would be an encyclopedia, not a wiki; that's another reason why I say: forget about it. Would be nice though. ;)
  - - wikicast (Score:4, Funny)
      
      by VolciMaster ( 821873 ) writes: on Sunday August 13, 2006 @10:14PM (#15900414) Homepage
      
      a method for periodically re-syncing...
      
      So, we need a WikiCast - remember folks, you heard it here first!
      
- Re:WikiPedia on iPod! (Score:4, Insightful)
  
  by CastrTroy ( 595695 ) writes: on Sunday August 13, 2006 @07:25PM (#15899896)
  
  Well, since it's currently only 1 Gig, you could probably put it on a flash card and read it from a handheld. It wouldn't be an iPod, but it probably wouldn't require destroying a perfectly good piece of equipment either. You could probably even get weekly updates (hopefully in a diff file) to make sure your copy is in sync with the rest of the internet. Now that I think about it, this would be a really good application. There are lots of times when I'd like to look something up on Wikipedia but I'm not connected to the internet.
  
  - Re:WikiPedia on iPod! (Score:2)
    
    by Nutria ( 679911 ) writes:
    
    Well, since it's currently only 1 Gig?
    
    You didn't RTFA, did you?
    - Re:WikiPedia on iPod! (Score:4, Funny)
      
      by Asztal_ ( 914605 ) writes: on Sunday August 13, 2006 @08:16PM (#15900044)
      
      Umm... which of the 5 thousand links is the article?
      
      - Re:WikiPedia on iPod! (Score:2)
        
        by Millenniumman ( 924859 ) writes:
        
        The 3897th one.
  - Re:WikiPedia on iPod! (Score:2)
    
    by Fordiman ( 689627 ) writes:
    
    I'm thinking, why not just store a copy of Wikipedia (as just webpages - a full mirror) in a squashfs (or similarly block-compressed) archive.
    
    It's the whole instant-access thing.
But captain (Score:5, Funny)

by Anonymous Coward writes: on Sunday August 13, 2006 @06:52PM (#15899787)

Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence.

But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.

- Re:But captain (Score:5, Funny)
  
  by Anonymous Coward writes: on Sunday August 13, 2006 @07:09PM (#15899836)
  
  You left out the part involving the deflector shield. Remember, the first rule of Star Trek technobabble is always involve the deflector in some way.
  
  - Re:But captain (Score:2)
    
    by bcat24 ( 914105 ) writes:
    
    Come on, just reverse the damn polarity already.
  - Re:But captain (Score:2, Offtopic)
    
    by WilliamSChips ( 793741 ) writes:
    
    I thought the first rule of Star Trek was that the redshirts always die.
    - Re:But captain (Score:2)
      
      by Amouth ( 879122 ) writes:
      
      first.. it is always first.. then you might have the chance of killing off a yellow shirt or maiming a blue one
  - Re:But captain (Score:2)
    
    by ScrewMaster ( 602015 ) writes:
    
    And the second rule is to always use a verteron pulse wherever possible, preferably transmitted through subspace.
Painful to read (Score:4, Insightful)

by CuriHP ( 741480 ) writes: on Sunday August 13, 2006 @06:54PM (#15899793)

For the love of god, proofread!

- Re:Painful to read (Score:2, Insightful)
  
  by Threni ( 635302 ) writes:
  
  > For the love of god, proofread!
  
  Yeah, I just read the write-up twice and have no idea if this is an AI contest, something to do with compression, or what. In fact, all I can remember now is the word "incentivize" which is the sort of thing I expect some bullshit salesman at work to say.
  - Re:Painful to read (Score:3, Insightful)
    
    by ameline ( 771895 ) writes:
    
    Agreed -- why can they not MOTIVATE us instead?
    
    No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.
    - - Re:Not sure if that's a joke. (Score:4, Funny)
        
        by Andrew Kismet ( 955764 ) writes: on Monday August 14, 2006 @01:14AM (#15900890)
        
        Of course he was joking. If he was serious he would've said "verbificate".
        
- Re:Painful to read (Score:5, Funny)
  
  by PeeAitchPee ( 712652 ) writes: on Monday August 14, 2006 @05:10AM (#15901350)
  
  He did, but Slashdot's AI compressed it for him.
  
  :-D
  
lossy compression (Score:5, Insightful)

by RenoRelife ( 884299 ) writes: on Sunday August 13, 2006 @06:55PM (#15899798)

Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would

- Re:lossy compression (Score:3, Insightful)
  
  by Anonymous Coward writes:
  
  Funny? That's the most intelligent and insightful remark I've seen here in months, albeit rather naively stated. The human brain is a fuzzy clustering algorithm; that's what neural networks do: they reduce the space of a large data set by eliminating redundancy and mapping only the salient features of it onto a smaller data set, which in bio systems is the weights for triggering sodium/potassium spikes at a given junction. If such a thing existed, a neural compression algorithm would be capable of immense data red
- Re:lossy compression (Score:3, Insightful)
  
  by Vo0k ( 760020 ) writes:
  
  that's one piece, but not necessarily - "lossy" nature of human mind compression can be overcome by "additional checks".
  
  Lossy relational/dictionary based compression is the base. You hardly ever remember text by order of letters or sound of voice reading it. You remember the meaning of a sentence, plus optionally some rough pattern (like voice rhythm) to reproduce the exact text from rewording the meaning. So you understand meaning of sentence, store it as relation to known meanings (pointers to other entri
  - Re:lossy compression (Score:5, Insightful)
    
    by swillden ( 191260 ) * writes: <shawn-ds@willden.org> on Sunday August 13, 2006 @11:14PM (#15900595) Journal
    
    You just need to re-create a file that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely that any other file can be recognized as some kind of Linux kernel and also matches. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically.
    Mmmm, no. You were fine up until you said "very few will follow the ruleset". That's not true. To see that it's not true, take your kernel, which consists of around 10 million bits. Now find, say, 512 of those bits that can be changed, independently, while still producing a valid-looking kernel executable. The result doesn't even have to be a valid, runnable kernel, but it wouldn't be too hard to do it even with that restriction.
    So you now have 2^512 variants of the Linux kernel, all of which look like a valid kernel. But there are only 2^128 possible hashes, so, on average, there will be 2^384 kernels for each hash value, and the odds are very, very good that your "real" kernel's hash is also matched by at least one of them. If by some chance it isn't, I can always generate a whole bunch more kernel variants. How about 2^2^10 of them?
    A hash plus a filter ruleset does not constitute a lossless compression of a large file, even if computation power/time is unbounded.
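The counting argument above can be tried at toy scale. The following is only a sketch under stated assumptions, not something from the original post: MD5 is truncated to 16 bits so the search finishes instantly, the `base` bytes are a made-up stand-in for a kernel image, and the helper names `short_hash` and `find_collision` are hypothetical. With 2^20 variants and only 2^16 possible hash values, two distinct "valid-looking" variants must share a hash.

```python
import hashlib
from itertools import product

def short_hash(data: bytes, bits: int = 16) -> int:
    """First `bits` bits of MD5, a toy stand-in for the full 128-bit hash."""
    digest = hashlib.md5(data).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)

def find_collision(base: bytes, free_bits: int = 20, hash_bits: int = 16):
    """Toggle `free_bits` independent bits of `base` until two variants collide."""
    seen = {}
    for flips in product((0, 1), repeat=free_bits):
        variant = bytearray(base)
        for i, bit in enumerate(flips):
            variant[i] ^= bit  # toggle the lowest bit of byte i
        candidate = bytes(variant)
        h = short_hash(candidate, hash_bits)
        if h in seen:
            return seen[h], candidate  # two different files, same hash
        seen[h] = candidate
    return None

# Made-up payload; it only needs at least `free_bits` bytes.
base = b"pretend this is a ten-million-bit kernel image"
pair = find_collision(base)
```

Because every bit pattern in `flips` is different, the two returned byte strings are guaranteed to differ, which is exactly why the hash alone cannot pick out the "real" one.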
    
  - - Re:lossy compression (Score:2)
      
      by MindStalker ( 22827 ) writes:
      
      I think the idea is that a lossy compression combined with the hash could reproduce the original by guessing... Maybe..
As long as it is Wiki that we are talking about... (Score:4, Funny)

by gatkinso ( 15975 ) writes: on Sunday August 13, 2006 @06:56PM (#15899799)

There. All of wiki, in 31 bytes.

Who'da thunk... (Score:5, Funny)

by blueadept1 ( 844312 ) writes: on Sunday August 13, 2006 @06:59PM (#15899809)

Man, WinRar is taking its bloody time. But oh god, when it's done, I'll be rich!

- Re:Who'da thunk... (Score:2)
  
  by mobby_6kl ( 668092 ) writes:
  
  Damn you and your WinRAR! When is the deadline? WinRK says it needs 3 days 14 hours, you might be finished before then, but I'll surely take the prize... when it's done®
Easy! (Score:2, Funny)

by RyuuzakiTetsuya ( 195424 ) writes:

arj
Lossy Compression? (Score:4, Funny)

by Millenniumman ( 924859 ) writes: on Sunday August 13, 2006 @07:06PM (#15899826)

Convert it to AOL! tis wikpedia, teh fri enpedia . teh bst in da wrld.

- Re:Lossy Compression? (Score:2, Insightful)
  
  by blueadept1 ( 844312 ) writes:
  
  That is actually an interesting idea. What if you added a layer of compression that converted every possible common acronym, made contractions, etc...
  - Re:Lossy Compression? (Score:2, Interesting)
    
    by larry bagina ( 561269 ) writes:
    
    1) it wouldn't be lossless and 2) most compression techniques use a dictionary of commonly used words.
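The second point can be illustrated with zlib's preset-dictionary support, which primes the compressor with strings it is likely to see while staying fully lossless. This is a minimal sketch: the dictionary contents and sample sentence are invented for the example, and the two helper functions are hypothetical wrappers around the standard `zlib.compressobj` and `zlib.decompressobj` calls.

```python
import zlib

def compress_with_dictionary(text: bytes, dictionary: bytes) -> bytes:
    """Deflate `text`, letting matches point back into a shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(text) + comp.flush()

def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Inverse of compress_with_dictionary; needs the identical dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()

# Invented example data: phrases both sides agree on in advance.
common_words = b"the free encyclopedia that anyone can edit Wikipedia "
sample = b"Wikipedia, the free encyclopedia that anyone can edit."

plain = zlib.compress(sample, 9)                         # no shared dictionary
primed = compress_with_dictionary(sample, common_words)  # with shared dictionary
restored = decompress_with_dictionary(primed, common_words)
```

Unlike swapping words for acronyms, nothing is thrown away here: `restored` is byte-for-byte the original, and the saving comes only from both sides already holding the dictionary.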
    - Would be useful for images (Score:5, Funny)
      
      by aliquis ( 678370 ) writes: on Sunday August 13, 2006 @10:14PM (#15900417)
      
      ... now all we need is a dictionary for nudity and we could save a lot of bandwidth on the Internet!
      
Comparison (Score:2, Informative)

by ronkronk ( 992828 ) writes:

There are some amazing compression programs out there, trouble is they tend to take a while and consume lots of memory. PAQ [fit.edu] gives some impressive results, but the latest benchmark figures [maximumcompression.com] are regularly improving. Let's not forget that compression is not good unless it is integrated into a usable tool. 7-zip [7-zip.org] seems to be the new archiver on the block at the moment. A closely related, but different, set of tools are the archivers [freebsd.org], of which there are lots with many older formats still not supported by open s
- - Re:Comparison (Score:2)
    
    by DarkProphet ( 114727 ) writes:
    
    erm... isn't that already a part of how data compression algorithms like ZIP work right now?
  - Re:Comparison (Score:2)
    
    by Breakfast Pants ( 323698 ) writes:
    
    Whoa, you are an inventive genius! Oh wait, that's kinda how nearly all compression works.
    - Re:Comparison (Score:3, Funny)
      
      by joshier ( 957448 ) writes:
      
      Well, if I had known that 15 years ago, I would indeed have been a genius; sadly I realized too late and my genius talents are wasted yet again.
      
      Have no fear though, I'm working on a new one.
It's a big world out there (Score:5, Interesting)

by Harmonious Botch ( 921977 ) writes: on Sunday August 13, 2006 @07:09PM (#15899835) Homepage Journal

"The basic theory...is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment (like SHRDLU: put the red cylinder on top of the blue box) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.

TFA is a neat idea theoretically, but it's progeny will never be able to leave the lab.

--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.

- Re:It's a big world out there (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
  This is precisely the assumption of Hutter's theory.
  Chapter 2 of his book "Simplicity & Uncertainty" deals with this in more detail but the link provided does do an adequate job of stating:
  
  The universal algorithmic agent AIXI. AIXI is a universal theory of sequential decision making akin to Solomonoff's celebrated universal theory of induction. Solomonoff d
- Re:It's a big world out there (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
  There is no allowance for lossy compression. The requirement of lossless compression is there for precisely the reason you state.
- Re:It's a big world out there (Score:2)
  
  by nacturation ( 646836 ) writes:
  
  ... but it's progeny will never be able to leave the lab.
  
  "it is progeny"? Damn, I thought we'd fixed that bug. Back to the lab with you!
- Re:It's a big world out there (Score:2)
  
  by Kjella ( 173770 ) writes:
  
  But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
  
  No, that's just deductive science. I (or we, as a society) haven't tested that every cup, glass and plate in my kitchen (or the world) is affected by gravity, but I'm preeeeeeetty sure they are.
  
  The problem - and the really hard AI problem - is that there is no single "program", there's in fact several billion independent "programs" running. These "programs" operate i
- Re:It's a big world out there (Score:5, Funny)
  
  by gardyloo ( 512791 ) writes: on Sunday August 13, 2006 @07:54PM (#15899973)
  
  TFA is a neat idea theoretically, but it's progeny will never be able to leave the lab.
  
  Your use of "TFA" is a good compressional technique, but you could change "it's" to "its" and actually GAIN in meaning while losing a character! You're well on your way...
  
- Re:It's a big world out there (Score:5, Informative)
  
  by DrJimbo ( 594231 ) writes: on Sunday August 13, 2006 @08:13PM (#15900036)
  
  Harmonious Botch said:
  
  This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
  
  I encourage you to read E. T. Jaynes' book: Probability Theory: The Logic of Science [amazon.com]. It used to be available on the Web in pdf form before a published version became available.
  
  In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing existing belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.
  
  Since even an optimal decision maker has this undesirable trait, I don't think the existence of this trait is a good criterion for rejecting decision making models.
  
  - Re:It's a big world out there (Score:3, Informative)
    
    by Baldrson ( 78598 ) * writes:
    
    In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing existing belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.
    I think what Hutter has shown is that there is a solution which unifies the new data with the old within a new optimum, which is most likely unique. I think
    - Re:It's a big world out there (Score:2)
      
      by DrJimbo ( 594231 ) writes:
      
      Be that as it may (and I highly doubt there will always be unique solutions) it is a separate issue entirely from what the original poster and I were talking about. We were talking about the problem of deciders becoming prejudiced over time, falling into a rut where new information tends to confirm and reinforce existing "beliefs".
      
      This effect is often seen clearly when, as I said before and you quoted:
      ... optimal observers who have reached opposite conclusions (due to differing initial sets of data)
      - Local optima (Score:2)
        
        by Baldrson ( 78598 ) * writes:
        
        The fact that there are local optima in utility functions in which you can get stuck is a problem for all learning systems but to varying degrees depending on the details of how they look around the space of "programs". The space of programs measured for Kolmogorov complexity has a lot of discontinuity.
- Re:It's a big world out there (Score:3, Insightful)
  
  by Ignis Flatus ( 689403 ) writes:
  
  I think the original premise is wrong. Real world intelligence is not lossless. The algorithms only have to be right most of the time to be effective. And our intelligence is incredibly redundant. If you want robust AI, you're going to have to accept redundancy and imperfection. Same goes for data transmission. Sure, you compress, but then you also add in error-correcting codes with a level of redundancy based on the known reliability of the network.
  - Re:It's a big world out there (Score:2)
    
    by smug_lisp_weenie ( 824771 ) writes:
    
    > Real world intelligence is not lossless. The algorithms only have to be right most of the time to be effective.
    
    Right- then all you need to do is run the data through the AI system and make a list of the few times it is wrong- This would be a small list. Then add this list to the end of the compressed data- If the AI is any good then you should still have fantastic compression.
    
    ...so now you've taken intelligence that is "lossy" and made it "lossless" in a highly efficient manner.
    
    ...I can't see
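The "lossy model plus a short list of its mistakes" idea can be sketched in a few lines. Everything here is an assumption for illustration: `lossy_model` is a deliberately crude, made-up predictor standing in for the AI, and the sample text is arbitrary. The point is only that storing the corrections makes the overall scheme lossless, and that a better predictor means a shorter correction list.

```python
def lossy_model(prev: str) -> str:
    """Hypothetical 'AI': guess the next character from the previous one."""
    return {"q": "u", "t": "h", "h": "e"}.get(prev, " ")

def encode(text: str):
    """Keep only the positions where the model guesses wrong."""
    corrections = {}
    prev = ""
    for i, ch in enumerate(text):
        if lossy_model(prev) != ch:
            corrections[i] = ch
        prev = ch
    return len(text), corrections

def decode(length: int, corrections: dict) -> str:
    """Replay the model, overriding it wherever a correction was stored."""
    out = []
    prev = ""
    for i in range(length):
        ch = corrections.get(i, lossy_model(prev))
        out.append(ch)
        prev = ch
    return "".join(out)

text = "the quick the"
length, corrections = encode(text)
restored = decode(length, corrections)
```

Real compressors such as the PAQ family follow the same principle more economically: instead of listing outright misses, they spend fewer bits on symbols the model considered likely.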
  - Re:It's a big world out there (Score:4, Interesting)
    
    by kognate ( 322256 ) writes: on Sunday August 13, 2006 @09:45PM (#15900333)
    
    Yeah, but you can use Turbo codes to achieve near Shannon limit, and you don't have to worry too much about the addition of the ECC. Remember kids: study that math, you never know when information theory can suddenly pay off.
    
    Just to help (and so you don't think I made Turbo Codes up -- it sounds like I did 'cause it's such a bad name)
    http://en.wikipedia.org/wiki/Turbo_code [wikipedia.org]
    
    - Cool... (Score:2)
      
      by Baldrson ( 78598 ) * writes:
      
      I hope the prize fund becomes very large and someone like you comes up with an algorithm and gets rich enough to retire.
- Re:It's a big world out there (Score:2)
  
  by jpellino ( 202698 ) writes:
  
  "The belief system that explains observations initially is used to filter observations later. "
  
  Hoo-ray for empiricism! Which as Hume pointed out is circular reasoning.
  It does keep one from getting hit by buses, and is the driving force behind the alarm clock industry.
  So it's not a total waste, nevermind the nasty philosophical bit.
Comment removed (Score:4, Interesting)

by account_deleted ( 4530225 ) writes: on Sunday August 13, 2006 @07:15PM (#15899857)

Comment removed based on user account deletion

- Re:Er, I'm not so sure about this. (Score:2, Offtopic)
  
  by Baldrson ( 78598 ) * writes:
  
  Wikipedia is a representation of accumulated human knowledge (experience) presented primarily in natural language. The smallest self-extracting archive will necessarily have rules that imply that knowledge and in such a way that the rules of natural language are represented as well.
  The distinction between compressed experience and rules is an illusion. Rules _are_ compressed experience in the same sense that "x+y=z" is a compressed representation of the table of all ordered pairs (x,y) of numbers and th
Solution. (Score:5, Funny)

by Funkcikle ( 630170 ) writes: on Sunday August 13, 2006 @07:34PM (#15899916)

Removing all the incorrect and inaccurate data from the Wikipedia sample should "compress" it down to at least 20mb.

Then just apply your personal favourite compression utility.

I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise with neutrinos.

That's easy (Score:2)

by CastrTroy ( 595695 ) writes:

That's easy, all you have to do is run your program on a computer that uses 32-bit bytes. That way you can fit more bits in your bytes, and automatically beat the record by 4 times.
- Re:That's easy (Score:2)
  
  by fireboy1919 ( 257783 ) writes:
  
  Bytes are always eight bits. Nibbles are always four bits. Kilobytes are always 1024 bytes. Are you seeing a pattern?
  
  The word you should be using is, well, "word." That's what we used to call the bus width of the system. Today, though, the world is much more complicated. Lots of different buses, lots of different bus widths. CISC instruction sets are more CISC than they used to be (outside of mainframes).
  
  So really, that doesn't apply very well even in the way you intended it.
Well.. doesn't the dictionary make it smaller??? (Score:2)

by popo ( 107611 ) writes:

Doesn't the dictionary in PAQ8A,B,C,D result in smaller filesizes if you're talking about a 100M+ large file?
Incentivize? (Score:5, Funny)

by noidentity ( 188756 ) writes: on Sunday August 13, 2006 @07:47PM (#15899952)

the intent of which is to incentivize the advancement of AI

Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.

- Re:Incentivize? (Score:2)
  
  by DavidD_CA ( 750156 ) writes:
  
  Please join me in my personal crusade to eliminate the word "incent" and "incentivize" from our culture.
  
  The root word is "incentive" and it wasn't until the last few decades that people came up with "incent", "incentivize", and -- god forbid -- "disincent".
  
  May I suggest these alternate words:
  - encourage
  - give incentive
  - influence
  - motivate
  - stimulate
  
  Hell, even "prod" is a better word. Now let us raise our torches and pitch forks and put these rogue words to rest.
I'll try: (Score:5, Funny)

by dcapel ( 913969 ) writes: on Sunday August 13, 2006 @07:59PM (#15899986) Homepage

printf '#!/bin/sh\nwget en.wikipedia.org/enwiki/\n' > archive

Mine wins as it is roughly 40 bytes total. To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!

- Re:I'll try: (Score:4, Funny)
  
  by MarkRose ( 820682 ) writes: on Sunday August 13, 2006 @08:37PM (#15900118) Homepage
  
  printf '#!/bin/cat /dev/tty0\n' > archive
  
  Here's one that's even shorter, but you have to type in the decryption key exactly right.
  
Reductionary Transformation Theory (Score:2)

by transami ( 202700 ) writes:

Not to be conceited, but I thought of that over a decade ago. I labeled it Transformation Theory. The theory essentially says, given an input and a desired output an explicit mapping can be drawn between the two and algorithms (and hence AI) derive from applying reductions to the mapping (e.g. compression). I later dubbed it Reductionary Transformation Theory, so as not to confuse it with another meme by the same name.
Is lossless really best (Score:2, Interesting)

by Anonymous Coward writes:

I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, feel, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant details and making generalizations that is truly intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we could reproduce wh
- Bingo (Score:2)
  
  by DaveAtFraud ( 460127 ) writes:
  
  Absolutely! The "right answer" for best *intelligent* compression would only store a *minimal* set of pertinent data points and then would use intellect to flesh out the details on decompression. Although you could end up with http://www.reducedshakespeare.com/ [reducedshakespeare.com], I'll worry about that when some AI researcher starts down this path
  
  Minor disagreement with what you said... The details are relevant; they just aren't important enough to store in and of themselves. Sort of like mathematics where you only need to lear
Hutter's theory? (Score:2)

by Prof.Phreak ( 584152 ) writes:

Sounds quite similar to Solomonoff's universal prediction.
- He cites Solomonoff (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  Hutter's theory is about an active AI agent seeking to maximize some utility function rather than passive prediction. However, it is probably the case that prizes for human knowledge compression should have been in place a long time ago -- probably as long ago as William of Ockham. IMHO guys like Solomonoff provided sufficient formal rigor to justify massive government funding of a prize competition for compression of human knowledge decades ago, but it seems to be the case that people who control money ge
wikipedia (Score:2)

by gEvil (beta) ( 945888 ) writes:

Since it's wikipedia, am I allowed to edit the entries before I compress them? ;-)
C++ (Score:3, Funny)

by The Bungi ( 221687 ) writes: <thebungi@gmail.com> on Sunday August 13, 2006 @08:32PM (#15900103) Homepage

Interestingly enough, the source code [fit.edu] for the compressor is C++. One would expect the thing to be written in pure C.
A (good) sign of the times, I guess.

My entry (Score:2)

by mbstone ( 457308 ) writes:

echo I'mright!NoI'mright!You'reanasshole!So'syourmother !! >distilled-wiki.txt
how about.... (Score:2)

by ChrisGilliard ( 913445 ) writes:

# tar cvf wikipedia.tar /wikipediapath
# bzip2 wikipedia.tar
Remove words and punctuation (Score:2)

by failedlogic ( 627314 ) writes:

Start with: "the" "and" Then remove periods commas semicolons Speling mistkes ok if readable If other words are commonly used those could be removed It would be a first a fill in the blank-as-you-read encyclopedia Sales staff for door to door enclopedias can convince you new edition saves trees

The above text is a demonstration. No punctuation. Left out "and" and "the". Since some entries are hard to read, the fill-in-the-blank interpretation adds to the experience! ;)

If this new compression is added to the
Could the winning entry... (Score:2)

by dpbsmith ( 263124 ) writes:

...be a program that creates an "encyclopedia" website and invites humans to contribute to it?
I've got it! (Score:2)

by Millenniumman ( 924859 ) writes:

I've got it!

while (result != wikipedia) { result = [randomGenerator randomStringOfLength:[wikipedia length]]; }
And there was me thinking.. (Score:2, Funny)

by baz1860 ( 872019 ) writes:

that the entire knowledge of the world could simply be compressed without loss to

yeah, you guessed it..

42...
There is a theory that (Score:2)

by cdrguru ( 88047 ) writes:

for any given dataset there is a pseudo-random number generator that, with the proper parameters, produces that dataset of bytes in sequence.

The problem is that it is non-trivial to find the proper algorithm and parameters. However, if trial-and-error computation is not a problem, a large library of potential generators can be tried with all possible parameters.
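A quick counting check shows the catch beyond search time. This sketch is an illustration only, with Python's `random.Random` standing in for "a generator" and tiny sizes chosen so it runs instantly: a generator driven by k bits of parameters can emit at most 2^k different outputs, so most n-bit datasets are unreachable unless the parameters are about as long as the data.

```python
import random

def reachable_outputs(seed_bits: int, data_bits: int) -> set:
    """Every data_bits-bit value that some seed_bits-bit seed can produce."""
    return {
        random.Random(seed).getrandbits(data_bits)
        for seed in range(2 ** seed_bits)
    }

# 8-bit seeds trying to cover all 16-bit datasets.
outputs = reachable_outputs(seed_bits=8, data_bits=16)
coverage = len(outputs) / 2 ** 16  # fraction of datasets any seed can hit
```

Adding a library of generators does not escape the bound, because the index of the chosen generator is just more parameter bits that have to be stored.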
Compressing text (Score:2)

by 4D6963 ( 933028 ) writes:

The way I see it, there's mainly one good way to compress text (Disclaimer : I do not know if this idea has ever been used before, but if it has i'd be glad to know about it). Let's say you index the values of the most frequently used characters, and that you keep a 'joker' value to insert a custom character after, let's say that it sums up to about 45 characters for these indexed characters. You replace each character by its indexed value, and instead of having it in base 256, you calculate it all in base
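The comment is cut off, so the following is only a guess at the scheme's general shape, not the poster's actual method. It assumes a hypothetical 45-symbol alphabet, leaves out the "joker" escape for other characters, and packs the text as one large base-45 number. Each symbol then costs log2(45), roughly 5.5 bits, instead of 8.

```python
import math
import string

# Hypothetical 45-symbol alphabet: 26 letters, 10 digits, 9 punctuation marks.
ALPHABET = string.ascii_lowercase + string.digits + " .,;:'-?\n"

def pack(text: str, alphabet: str = ALPHABET) -> bytes:
    """Treat the text as the digits of one big number in base len(alphabet)."""
    base = len(alphabet)
    n = 1  # leading 1 keeps symbols with index 0 at the start from vanishing
    for ch in text:
        n = n * base + alphabet.index(ch)  # ValueError if ch is not covered
    return n.to_bytes((n.bit_length() + 7) // 8, "big")

def unpack(blob: bytes, alphabet: str = ALPHABET) -> str:
    """Peel the base-len(alphabet) digits back off until only the marker is left."""
    base = len(alphabet)
    n = int.from_bytes(blob, "big")
    chars = []
    while n > 1:
        n, idx = divmod(n, base)
        chars.append(alphabet[idx])
    return "".join(reversed(chars))

bits_per_symbol = math.log2(len(ALPHABET))
text = "compress wikipedia and win the hutter prize."
packed = pack(text)
restored = unpack(packed)
```

This fixed-width packing treats every symbol as equally likely; Huffman or arithmetic coding, which give frequent characters shorter codes, generally do better on real text.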
Doug in a dress? (Score:2)

by Ace905 ( 163071 ) writes:

douginadress [douginadress.com]
Hutter's Theory - Disproved (Score:4, Insightful)

by giafly ( 926567 ) writes: on Sunday August 13, 2006 @10:57PM (#15900538)

The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.

A "Hutter AI" will be at a disadvantage when competing against an opponent which knows it's acting as above and can do the same calculations. Under these circumstances, the opponent will be one step ahead. The Hutter AI is predictable and so can be outmanoeuvered. Hence the Hutter AI's moves are not optimal.

Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).

Occam's razor (also spelled Ockham's razor) is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as a heuristic maxim that advises economy, parsimony, or simplicity in scientific theories. Occam's razor states that the explanation of any phenomenon should make as few assumptions as possible - Wikipedia [wikipedia.org]

Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.

Compress Wikipedia and win a prize? (Score:5, Funny)

by Dachannien ( 617929 ) writes: on Sunday August 13, 2006 @10:59PM (#15900551)

Can't I just punch the monkey for $20 instead?

- Re:for those who rtfa (Score:2, Informative)
  
  by kfg ( 145172 ) * writes:
  
  a) how big the compressed size was
  
  18MB
  
  b) how many bytes was wikipedia before it was compressed
  
  A sample of 100MB
  
  Your goal:
  .
  
  KFG
- Re:Can it be "lossy" compression? (Score:5, Funny)
  
  by Bill Kilgore ( 914825 ) writes: on Sunday August 13, 2006 @07:22PM (#15899879)
  
  I have a program that compresses 100M of Wikipedia to one bit with no loss at all. The program is somewhat special-purpose, and at 100,024,076 bytes, a little chunkier than I'd like.
  
  - Re:Can it be "lossy" compression? (Score:2)
    
    by KiloByte ( 825081 ) writes:
    
    Compared to 104857600, you at least got some compression.
    - - Re:Can it be "lossy" compression? (Score:3, Informative)
        
        by KiloByte ( 825081 ) writes:
        
        Why so? The test file is exactly 10^8 bytes.
        I downloaded the corpus, and indeed, you're right -- it's 10^8 bytes. The article is incorrect, it says 100M where it means 95.3M.
        
        This inconsistency doesn't have any effect on the challenge, though -- that 50kEUR[1] is offered for compressing the given data corpus, not for compressing a string of 100MB.
        
        [1] 1kEUR=1000EUR. 1M EUR=1000000EUR. 1KB=1024B. 1MB=1048576B.
        And by the way, what about fixing Slash to finally allow Unicode -- either natively or at least as
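The unit arithmetic in this sub-thread can be checked directly. This is just a worked example using the figures quoted above (10^8 bytes for the corpus, 1 MB = 1048576 B from the footnote, and the 100,024,076-byte program from the joke further up); the variable names are illustrative only.

```python
corpus_bytes = 10**8        # size of the test file, as stated in the thread
MB_BINARY = 1024**2         # 1 MB = 1048576 B, the footnote's convention

size_in_binary_mb = corpus_bytes / MB_BINARY
hundred_binary_mb = 100 * MB_BINARY
joke_program_bytes = 100_024_076

print(size_in_binary_mb)                        # about 95.37
print(hundred_binary_mb)                        # 104857600
print(joke_program_bytes < hundred_binary_mb)   # True
```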
  - Wrong contest (Score:4, Informative)
    
    by Baldrson ( 78598 ) * writes: on Sunday August 13, 2006 @07:55PM (#15899977) Homepage Journal
    
    That's another contest that is useless for the reason you cite.
    The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or failing that to add the size of the compressor to the compressed corpus.
    
    - Self-extracting on what platform? (Score:2)
      
      by tepples ( 727027 ) writes:
      
      The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive
      
      For what architecture?
      - Barebones Windows or Linux (Score:3, Informative)
        
        by Baldrson ( 78598 ) * writes:
        
        See the detailed rules for specifics [fit.edu] but generally the rules are just what you would expect: The program runs (and completes in a reasonable time) on a relatively recent system running Windows (currently XP) or Linux with no external inputs, e.g. no dynamically loaded libraries not included in the submission, no net communication and no disk I/O that isn't generated by the program itself.
        Points are not awarded for attempting to circumvent the intent of the competition. I expect such attempts would result
- - Re:Can it be "lossy" compression? (Score:5, Funny)
    
    by richdun ( 672214 ) writes: on Sunday August 13, 2006 @07:15PM (#15899858)
    
    Hmmm...well in that case, someone go edit the Wikipedia entry on "computers" and allow them to store data at the bit level. Also, I heard somewhere where computers in Africa have tripled in the past six months!
    
    - Re:Can it be "lossy" compression? (Score:2)
      
      by bcat24 ( 914105 ) writes:
      
      Now that's an elephant of a tale.
    - Re:Can it be "lossy" compression? (Score:2)
      
      by StikyPad ( 445176 ) writes:
      
      That's ELEPHANTS. Computers in ELEPHANTS have tripled in the past 6 months. Get it right.
- - Re:Easy compression rule (Score:2)
    
    by Kadin2048 ( 468275 ) writes:
    
    the zip archive tool scans the whole code, it finds repeats in the code (1001010), abbreviates them and then indexes them in a separate file within the archived file, then when the other computer begins to extract, it takes on the work of plonking back the repeated code, making the archive tiny.
    
    Huh? What do you think the compression software is doing right now? It searches through the file for blocks of similar information, and then replaces those blocks with pointers to an index, where it stores the bloc
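To see that existing compressors already exploit repeats, here is a minimal sketch using Python's standard `zlib` module, which implements DEFLATE (back-references to earlier repeated data plus Huffman coding), the method most zip archives use. The sample strings are made up for illustration.

```python
import os
import zlib

# Highly repetitive input: the same sentence a thousand times.
repetitive = b"the quick brown fox jumps over the lazy dog. " * 1000
# Random bytes of the same length: nothing to abbreviate.
random_like = os.urandom(len(repetitive))

packed_repetitive = zlib.compress(repetitive, 9)
packed_random = zlib.compress(random_like, 9)

# Lossless: decompressing gives back exactly the original bytes.
assert zlib.decompress(packed_repetitive) == repetitive

print(len(repetitive), len(packed_repetitive))
print(len(random_like), len(packed_random))
```

The repetitive input shrinks to a small fraction of its size, while the random input does not shrink at all, which is why re-applying the same trick by hand on top of a compressor gains nothing.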
    - Re:Easy compression rule (Score:2)
      
      by Jeremi ( 14640 ) writes:
      
      It's unlikely that you could do a better job than any reasonable compression program does already with the same data.
      Very simple way to win this contest:
      
      Write a program that generates the digits of PI
      Have the program run until it comes across a sequence of digits in PI that are equivalent to the Wikipedia file
      At that point, the program can just print out the index/offset into PI where the sequence starts, and that is your compressed output. 'Decompressing' it again is just a matter of looking up that seque
      - Re:Easy compression rule (Score:2)
        
        by 4D6963 ( 933028 ) writes:
        
        Very simple way to win this contest:...
        The only problem is that your index will be as big as your original file.
        Let's say you want to store your ten-digit phone number using that PI technique. A ten-digit phone number will be found at an average index of about 10^10, or 10,000,000,000. The problem is that this index is 11 digits long, so you're really not winning. In this particular case, you could hope to find your index around 256^100,000,000, which would take 100,000,000 bytes (or 100 MB, the original size) to store.
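The argument above can be tried on a small scale. This is a rough sketch, not a proof: it generates the first 2000 digits of pi with Gibbons' unbounded spigot algorithm and measures where each 1-digit and 2-digit string first appears. The digit count and the helper name `average_first_index` are arbitrary choices for illustration, and strings that never appear in the sample are simply skipped.

```python
from itertools import islice, product

def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, ...) via Gibbons' spigot algorithm."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (
                10 * q, 10 * (r - n * t), t, k,
                (10 * (3 * q + r)) // t - 10 * n, l,
            )
        else:
            q, r, t, k, n, l = (
                q * k, (2 * q + r) * l, t * l, k + 1,
                (q * (7 * k + 2) + r * l) // (t * l), l + 2,
            )

digits = "".join(str(d) for d in islice(pi_digits(), 2000))

def average_first_index(length):
    """Average position of the first occurrence over all digit strings of this length."""
    hits = [digits.find("".join(p)) for p in product("0123456789", repeat=length)]
    hits = [h for h in hits if h >= 0]   # ignore strings absent from the sample
    return sum(hits) / len(hits)

for length in (1, 2):
    print(length, average_first_index(length))
```

The average index grows by roughly a factor of ten for each extra digit, so writing the index down takes about as many digits as the string it points to.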
        
        Re:Easy compression rule (Score:2)
        
        by Jeremi ( 14640 ) writes:
        
        The only problem is that your index will be as big as your original file.
        
        Well... maybe. It depends a lot on what data you are trying to compress. As a best-case scenario, it can compress the entire decimal expansion of PI down to a single digit: "0". Infinite digits compressed to a single byte, not bad eh? ;^)
    - - Re:Easy compression rule (Score:2)
        
        by morcego ( 260031 ) * writes:
        
        In theory, you can end up with a location that takes a number so large that you need more than 100MB (mb = milibits ?) to store it.
        
        Also, you would have to either store enough digits of pi (which would defeat any compression ideas), or calculate it "on site", which would be ... hummm ... kind of fun, I suppose.
  - Re:Easy compression rule (Score:2)
    
    by rm999 ( 775449 ) writes:
    
    Indexing into that extra file table takes bits. For example, if you have 1024 different possible abbreviations, you need to use on average 10 bits (at the least) to index that table for each abbreviation. You could use fewer bits for the more common ones (like Huffman coding), but you are still wasting space indexing.
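Both figures in that comment can be made concrete. The sketch below checks that 1024 equally likely table entries need log2(1024) = 10 bits each, then computes Huffman code lengths for a small, made-up frequency table to show how common entries get shorter indexes. The function name and the sample frequencies are hypothetical.

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a dict of symbol frequencies."""
    # Heap items: (total frequency, tie-breaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

fixed_bits = math.log2(1024)   # 10.0 bits per index with a fixed-width code

freqs = {"the": 8, "of": 4, "and": 2, "zip": 1, "pi": 1}
lengths = huffman_code_lengths(freqs)
average_bits = sum(freqs[s] * lengths[s] for s in freqs) / sum(freqs.values())

print(fixed_bits)
print(lengths)
print(average_bits)
```

With these sample frequencies the average index costs 1.875 bits, against 3 bits for a fixed-width index over five entries. The cost never reaches zero, which is the commenter's point.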
- Re:I can convert the data to 1 bit. (Score:2)
  
  by gkhan1 ( 886823 ) writes:
  
  If you read the rules of the contest it states that a submission has to be a single executable that produces the 100 mb file. It's the size of that decompressor that counts. So no, you couldn't do that.
  - libwikipedia? (Score:2)
    
    by tepples ( 727027 ) writes:
    
    If you read the rules of the contest it states that a submission has to be a single executable that produces the 100 mb file.
    
    What libraries is this executable allowed to call?
- Yes (Score:2)
  
  by Baldrson ( 78598 ) * writes:
  
  And indeed you can come up with all kinds of rules to help you predict the characters in a stream like Wikipedia, not just linguistic rules, but higher level rules are less likely to become crucial with the 100M contest and may need to wait for the 1G (or complete archive of Wikipedia) contest. I can imagine that certain rules relating to relational similarity (analogy) might come into play at the 100M level, particularly since there are now programs that can perform as well as college bound seniors on the
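The link between predicting characters and compressing them can be illustrated with a toy model. This is a sketch under stated assumptions, not how paq8f or any contest entry works: an adaptive model counts which character followed each context of `order` previous characters, and the ideal code length of a character is -log2 of the probability the model gave it (an arithmetic coder could get close to that length). The function name, the add-one smoothing and the sample text are all illustrative choices.

```python
import math
from collections import defaultdict

def bits_per_char(text, order):
    """Ideal code length, in bits per character, under an adaptive order-`order` model."""
    alphabet_size = len(set(text))
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        seen = counts[context]
        # Add-one smoothing keeps unseen characters at a non-zero probability.
        p = (seen[ch] + 1) / (sum(seen.values()) + alphabet_size)
        total_bits += -math.log2(p)
        seen[ch] += 1   # learn from the character only after coding it
    return total_bits / len(text)

text = "the cat sat on the mat. " * 200
order0 = bits_per_char(text, 0)
order2 = bits_per_char(text, 2)
print(order0, order2)
```

The model with two characters of context predicts better and so needs fewer bits per character. Higher-level rules of the kind described above would play the same role: anything that sharpens the prediction shortens the output.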
- Re:lzip! (Score:2)
  
  by Jeremi ( 14640 ) writes:
  
  Just use lzip! 100% compression on any data, even if it's already been compressed by another utility! It works fantastically, but you may run into trouble if you try uncompressing the data.
  Bah, lzip is nothing compared to azip [bebits.com], which has all the infinite-compression goodness of lzip, but also supports lossless decompression. The downside is that it currently only runs under BeOS, but it comes with source so it should be easy enough to port to (whatever).
- Re:Good idea (Score:2)
  
  by Feyr ( 449684 ) writes:
  
  no that's wrong. the posit is for an INFINITE number of monkeys typing for infinity. 1000 isn't enough
