Building a Fast Wikipedia Offline Reader
ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
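A rough sketch of what highlights (2) and (3) amount to, in plain Python. This is not the author's code and it does not use Xapian; it is a stand-in that assumes a simple in-memory index, and the names `tokenize`, `build_index` and `search` are made up for illustration. Ranking here is just "more matching title words first", a crude substitute for the probabilistic ranking Xapian provides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles):
    """Map each title word to the set of article titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in tokenize(title):
            index[word].add(title)
    return index


def search(index, query, limit=10):
    """Return candidate titles, best match first.

    Score = number of distinct query words found in the title.
    Ties go to the shorter title, then alphabetical order.
    """
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for title in index.get(word, ()):
            scores[title] += 1
    ranked = sorted(scores, key=lambda t: (-scores[t], len(tokenize(t)), t))
    return ranked[:limit]


if __name__ == "__main__":
    idx = build_index(["Battle of Hastings", "Hastings", "Battle of Britain"])
    for hit in search(idx, "battle hastings"):
        print(hit)
```

The reader then picks among the returned titles, which is the "you choose amongst them" step in the summary.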
Wow! (Score:3, Funny)
Re: (Score:3, Insightful)
Re:Why? (Score:5, Funny)
Just settle it the old way (Score:5, Funny)
Re: (Score:3, Funny)
Re: (Score:2)
Conversations around a campfire can go anywhere.
Take that, Mr Obviously A. Troll! (Score:5, Funny)
Re: (Score:2, Funny)
Re:Take that, Mr Obviously A. Troll! (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Ironically, you're already reading Slashdot. You have just wasted your time.
You're reading it again, wasting more time, eh?
But the point is, if programming an offline wikipedia makes you happy and you don't need the money then you would understand....
Re:Why? (Score:5, Insightful)
Complex numbers originated from something "useless" like trying to solve the cubic polynomial in radicals...try building a bridge without them. In fact, all of science is built upon people going off on random tangents, doing things they enjoy and discovering seemingly "useless facts", but most of it becomes useful *and* gives us an idea of the universe in which we live.
Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.
Re: (Score:2)
which is one of the greatest motivations for human advancements.
Re: (Score:2)
Re: (Score:2)
Why? Have you seen the price of 4G flash cards recently?
Re: (Score:2)
SDHC? (Score:5, Informative)
Re: (Score:3, Informative)
Ho-Hum ... (Score:5, Funny)
Let us know when you're ready for prime time
Re:Ho-Hum ... (Score:5, Insightful)
Re: (Score:3, Funny)
Not too hard if you have a sub-etha net connection handy. Better check that the article about The Earth which you have been working on hasn't been cut down to two words though.
Re: (Score:2)
I hope (Score:4, Funny)
George W Bush
Is a dick head!!!!11
Re:I hope (Score:5, Funny)
Oh, never mind, I see the problem:
George W Bush
Is a dick head!!!!11
should be
George W Bush
Is a dick head!!!!!!
Man, those out to mess with the content are getting more and more subtle...
Re: (Score:2)
But... (Score:2, Funny)
Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Re: (Score:2)
Re:Hitchhiker's guide here we come! (Score:5, Insightful)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
* 1. It is slightly cheaper
* 2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.
Also like the Guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate.
Re: (Score:2)
Only 2 days huh (Score:2, Funny)
Re: (Score:2, Funny)
Good part of the page: the explanation (Score:5, Insightful)
Comment removed (Score:5, Informative)
Re: (Score:2)
Re: (Score:2)
Your URL leads to a domain parking page. Google search for Wikistick didn't bring results on the first page either. AFAIK, full Wikipedia (text and images) is too large for a USB stick.
What did you want to tell us?
Re: (Score:2)
Re: (Score:2)
Mass inserts into mysql... (Score:4, Informative)
Re: (Score:2)
What?? (Score:5, Funny)
1. Not a thinly-veiled attempt to advertise a crappy product
2. Not bashing Microsoft
3. Not about somebody who is trolling open-source (i.e. SCO)
4. Not about Bush taking away all our rights and ending freedom
5. Not about voting fraud and the end of democracy/America/the world
6. Not decrying Vista DRM and its ties to the MAFIAA
7. Posted on Slashdot
Furthermore, TFA is interesting and informative.
Am I in heaven?
Re:What?? (Score:5, Funny)
The Point? (Score:2)
I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
The beauty of it is that it is online and always up-to-date (wrong, or less wrong).
Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.
If it's an academic project, that's really cool, but I don't see a practical point to it.
Re:The Point? (Score:5, Insightful)
Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.
Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.
Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_f
Still can't see the point?
Re: (Score:2)
The beauty of it is that it is online and always up-to-date (wrong, or less wrong).
Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.
>>
Sure, what's the point of reading an old version of the history of the Battle of Hastings, or the technical specifications of the P-51 Mustang, or the characteristics of a dominant seventh chord? After six months, it's completely obsolete and worthless, right?
What about moulin? (Score:2)
http://moulinwiki.org/l/en/ [moulinwiki.org]
Re: (Score:2)
This guy's work required about 3GB for the compressed Wikipedia data dump (split up into compressed chunks using bzip2recover), plus Python, Perl, a search engine library (Xapian) and a web framework (Django). He seems to be working in English only, and doesn't seem to provide a "why" or say who this might be useful to.
Moulin has a concrete aim in mind, they are starting with the much smaller French version of Wikipedia, and have built a CD-ROM sized offline viewer for released in
Re: (Score:2)
Re: (Score:2)
Um, anyone who wants to have the entire English version of Wikipedia on their local machine, for those times when they're away from the net?
People who "would love to have Wikipedia on their laptop, since this would allow them to instantly check for things they want regardless of their location (business trips, hotels, etc). Others simply don't have an Internet connection - or they don't want to dial up one every time they need to check someth
...or the HTML export feature? (Score:2)
But hey, two days and a few hundred lines of code is cool. You geek (verb). If we always took the easy way out we'd be using Windows and have committed suicide long ago.
can we get a PSP version of it? (Score:4, Interesting)
better yet, a DS version (Score:4, Informative)
Or a PalmOS version of it? (Score:2)
wikipedia for iPod is already here... (Score:2)
http://encyclopodia.sourceforge.net/en/index.html [sourceforge.net]
for the iPod; also, the Encyclopodia Ebook format (basically indexed, bzip2-compressed articles or blocks) is far better suited to portable devices.
Now if any PSP/DS/Palm developer is reading this...
There's a bug in TFA: Missing articles. (Score:5, Insightful)
The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.
The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. EG:
END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti
START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...
So searching for "[title]" in both blocks separately, as TFA does, will miss that article.
(I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
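The boundary problem described above can be sketched in a few lines of Python. This is an illustration only, not TFA's code: it assumes the blocks have already been decompressed into strings, and the function names `titles_per_block` and `titles_with_carry` are hypothetical. The first function searches each block on its own; the second carries the unfinished tail of one block over into the next, which is one common way around split tags.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
OPEN_TAG = "<title>"


def titles_per_block(blocks):
    """Search every block in isolation (misses titles split across blocks)."""
    found = []
    for block in blocks:
        found.extend(TITLE_RE.findall(block))
    return found


def titles_with_carry(blocks):
    """Search blocks in order, prepending the unfinished tail of the previous one."""
    found = []
    carry = ""
    for block in blocks:
        data = carry + block
        last_end = 0
        for match in TITLE_RE.finditer(data):
            found.append(match.group(1))
            last_end = match.end()
        rest = data[last_end:]
        start = rest.rfind(OPEN_TAG)
        if start != -1:
            # An opened but not yet closed <title>: keep all of it.
            carry = rest[start:]
        else:
            # Otherwise keep just enough characters to hold a partial "<title".
            carry = rest[-(len(OPEN_TAG) - 1):]
    return found


if __name__ == "__main__":
    blocks = [
        "<page><title>Alpha</title><text>aaa</text></page><page><ti",
        "tle>Beta</title><text>bbb</text></page>",
    ]
    print(titles_per_block(blocks))
    print(titles_with_carry(blocks))
```

With real dump data the blocks would be bytes rather than strings, and a multi-byte UTF-8 character could be split at a boundary as well, so the same carry-over idea would have to be applied before decoding.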
Re: (Score:2)
Re: (Score:2)
Anyway, gotta go to work. When I come back, I'll do some more in-depth sleuthing.
Re:There's a bug in TFA: Missing articles. (Score:4, Informative)
As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
Hook it in to your desktop search... or... (Score:2)
sdict? (Score:2)
Anyone with experience of sdict [sdict.com]?
They offer a dictionary reader for various systems, including portable devices, and dictionaries including Wikipedia.
Unfortunately, their Wikipedia dict is old (January), but it seems like a good approach for laptops or other small devices. When I get an 8GB SDHC I'm going to try it on my Nokia N800.
Not New (Score:2)
Also, as others have noted, his chopping the file into chunks means you're going to lose at least one article per chunk.
I'd implemented this with a compressed file system and maybe some symlinks. Happily, the static content is already there for the taking. Some find and grep o
Gears? (Score:2)
The author did some nifty hacking that resulted in the following stack of dependencies:
* Perl 5.8.5
* Python 2.5
* PHP 5.2.1
* Xapian 1.0.2
* Django 0.96
He cited not wanting to use an RDBMS since he's not writing to the database, just reading. I can give him that,
Now get it on a mobile phone (Score:2)
Re:Uh.... (Score:5, Interesting)
Re: (Score:2, Insightful)
Have you ever worked on a project called "Clippey", by chance?
Re:Uh.... (Score:5, Funny)
Have you ever worked on a project called "Clippey", by chance?
Re: (Score:2)
As if that was possible.
Re: (Score:2)
As if that was possible.
Re:Uh.... (Score:4, Informative)
I know the feeling (Score:5, Insightful)
Re: (Score:2)
Photography I do enjoy, but I have no delusions that everyone and their brother wants to see my photos, so I don't slap them all over Flickr. On the other hand, my house is decorated entirely with photos I've taken, since *I* like them.
Re: (Score:2)
I couldn't find one of somebody photographing while skiing past a rock-climber on a beach, sorry.
Re: (Score:2)
Re: (Score:3, Interesting)
It's funny how time changes you.
Re: (Score:2)
(Or to always have something to read on your laptop while traveling - this is what I would use it for)
Re: (Score:3, Funny)
(Or to always have something to read on your laptop while traveling - this is what I would use it for)
Sorry, I couldn't resist!
local resource, better interface (Score:2)
Re: (Score:2, Insightful)
There are bastards of every academic, social, and financial background.
Re:Just hope you don't get an effed image. (Score:5, Insightful)
Re:Just hope you don't get an effed image. (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Funny)
Re:2X (Score:5, Informative)
Re:2X (Score:5, Informative)
(1) http://schools-wikipedia.org/ [schools-wikipedia.org]
(2) http://download.wikimedia.org/enwiki/latest/ [wikimedia.org]
1 is 4625 articles hand picked for school age children, hence the website name
2 is a straight dump of wikipedia
Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
In what namespace? (Score:2)
Re: (Score:3, Informative)
I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.
DBs have their place. For a "real" Wiki, or more generally
Re: (Score:2)
I certainly wouldn't go that far. In the memory it takes to tolerably run a Wiki starting with a real dump, you could easily run three or four entire virtual systems. A basic XP or RedHat/Gnome system runs decently in 256MB. Import a 2.5GB BZipped Wiki with MySQL limited to 256MB and tell me how responsive it feels.
I suspect the author just didn't want to bother to tune MySQL.
Nor sho
Re: (Score:2)
Re: (Score:2)