Building a Fast Wikipedia Offline Reader
Posted by
kdawson
on Mon Aug 13, 2007 09:53 PM
from the you-could-look-it-up dept.
ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
This discussion has been archived.
No new comments can be posted.
208 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Wow! (Score:3, Funny)
(http://www.ferrobyte.com/ | Last Journal: Monday October 20 2003, @12:20AM)
SDHC? (Score:5, Informative)
(http://myatomic.com/ | Last Journal: Sunday November 19 2006, @12:31AM)
Re:Why? (Score:5, Funny)
Just settle it the old way (Score:5, Funny)
Take that, Mr Obviously A. Troll! (Score:5, Funny)
Re:Take that, Mr Obviously A. Troll! (Score:5, Funny)
(http://slashdot.org/my/logout)
Re:Why? (Score:5, Insightful)
(Last Journal: Friday September 14, @02:08PM)
Complex numbers originated from something "useless" like trying to solve the quartic polynomial in radicals... try building a bridge without them. In fact, all of science is built upon people going off on random tangents, doing things they enjoy and discovering seemingly "useless facts", most of which become useful *and* give us an idea of the universe in which we live.
Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.
Wow (Score:1)
(http://blog.woodysroom.com/)
Now we need to work on porting that to other OSes and we will be set.
Ho-Hum ... (Score:5, Funny)
(http://www.tftb.com/)
Let us know when you're ready for prime time
Re:Ho-Hum ... (Score:5, Insightful)
Uh.... (Score:1)
(http://www.hormel.com/)
Re:Uh.... (Score:5, Interesting)
(http://www.stopstupidity.com/ | Last Journal: Sunday February 23 2003, @11:16PM)
Re:Uh.... (Score:5, Funny)
Have you ever worked on a project called "Clippey", by chance?
Re:Uh.... (Score:4, Informative)
I know the feeling (Score:5, Insightful)
I hope (Score:4, Funny)
George W Bush
Is a dick head!!!!11
Re:I hope (Score:5, Funny)
Oh, never mind, I see the problem:
George W Bush
Is a dick head!!!!11
should be
George W Bush
Is a dick head!!!!!!
Man, those out to mess with the content are getting more and more subtle...
But... (Score:2, Funny)
Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Insightful)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
Re:Hitchhiker's guide here we come! (Score:5, Funny)
* 1. It is slightly cheaper
* 2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.
Also like the guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate
Only 2 days huh (Score:2, Funny)
Days? Please clarify (Score:1)
(http://www.geocities.com/tablizer | Last Journal: Saturday March 15 2003, @01:22PM)
Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are setting up a bunch of local clones of the wiki, then simply copy down the raw MySQL table data files rather than reloading from delimited files etc. (You need to make sure your version of MySQL is compatible with the table file format.)
Faster than a speeding slug... (Score:1)
But....but....I thought MySQL was fast!
Good part of the page: the explanation (Score:5, Insightful)
(http://www.drones.com/)
It doesn't take days (Score:5, Informative)
(http://en.wikipedia.org/wiki/Slashdot)
Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL and then takes a few hours to import into MySQL. Please note that you need a properly configured MySQL server, with at least 8 GB of RAM, in order to efficiently run a local copy of Wikipedia.
http://meta.wikimedia.org/wiki/Xml2sql [wikimedia.org]
Linda Mack! (Score:1, Funny)
http://yro.slashdot.org/article.pl?sid=07/07/27/1
Mass inserts into mysql... (Score:4, Informative)
Xapian (Score:1)
I've used it for over 2 years on various sites and am really pleased with it.
What?? (Score:5, Funny)
(http://www.icydog.net/)
1. Not a thinly-veiled attempt to advertise a crappy product
2. Not bashing Microsoft
3. Not about somebody who is trolling open-source (i.e. SCO)
4. Not about Bush taking away all our rights and ending freedom
5. Not about voting fraud and the end of democracy/America/the world
6. Not decrying Vista DRM and its ties to the MAFIAA
7. Posted on Slashdot
Furthermore, TFA is interesting and informative.
Am I in heaven?
Re:What?? (Score:5, Funny)
C&D Tomorrow? (Score:1)
The Point? (Score:2)
I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
The beauty of it is that it is online and always up-to-date (wrong, or less wrong).
Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.
If it's an academic project, that's really cool, but I don't see a practical point to it.
Re:The Point? (Score:5, Insightful)
Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.
Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.
Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_f
Still can't see the point?
WP:1.0 wants you (Score:1)
(http://en.wikipedia.org/wiki/User_talk:Titoxd)
Why didn't he post his howto on wikipedia? (Score:1)
What about moulin? (Score:2)
http://moulinwiki.org/l/en/ [moulinwiki.org]
...or the HTML export feature? (Score:2)
But hey, two days and a few hundred lines of code is cool. You geek (verb). If we always took the easy way out we'd be using Windows and have committed suicide long ago.
can we get a PSP version of it? (Score:4, Interesting)
(http://muzzle.footourist.com/ | Last Journal: Wednesday November 03 2004, @01:10PM)
better yet, a DS version (Score:4, Informative)
(http://myatomic.com/ | Last Journal: Sunday November 19 2006, @12:31AM)
There's a bug in TFA: Missing articles. (Score:5, Insightful)
The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.
The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. E.g.:
END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti
START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...
So searching for "[title]" in both blocks separately, like TFA does, will fail for one article.
(I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
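To make the boundary problem concrete, here is a minimal sketch of one way to scan for titles without losing tags that straddle two blocks: keep the unfinished tail of each block and prepend it to the next one. This is not the code from TFA. The function name `find_titles` is hypothetical, and the sketch assumes the blocks have already been decompressed into byte strings and are read in their original order.

```python
def find_titles(chunks, open_tag=b"<title>", close_tag=b"</title>"):
    """Yield the text between open_tag and close_tag from an iterable of
    byte chunks, even when a tag or a title is split across two chunks."""
    buf = b""
    keep = len(open_tag) - 1  # longest possible partial open tag
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find(open_tag)
            if start == -1:
                # No complete open tag: keep only a tail that might be
                # the beginning of one, and wait for the next chunk.
                buf = buf[-keep:] if keep else b""
                break
            end = buf.find(close_tag, start + len(open_tag))
            if end == -1:
                # Open tag seen but the title is unfinished: carry it over.
                buf = buf[start:]
                break
            yield buf[start + len(open_tag):end]
            buf = buf[end + len(close_tag):]


# Hypothetical blocks mirroring the example above, with real angle brackets.
blocks = [
    b"blabla</text><othertag><ti",
    b"tle>Foo</title><sometag>bla<title>Ba",
    b"r</title>",
]
titles = list(find_titles(blocks))
```

The carried-over tail is at most one unfinished title plus a few bytes, so memory use stays small no matter how large the dump is. The cost is that blocks have to be processed in sequence rather than independently.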
Re:There's a bug in TFA: Missing articles. (Score:4, Informative)
(http://ttsiodras.googlepages.com/)
As for the title, what you describe can't happen because of a fortunate side effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split. They will be encoded as "references", and therefore what you describe shouldn't happen (I hope :-)
awesome. just what I need. why a waste???? (Score:1)
(http://brianvaughan.net/)
I cruise around in a sailboat. my longest passage was 35 days. how I would have LOVED to have been able to read wikipedia articles on that passage, even if they were a few weeks old. What do I care if an article is a few weeks old? 35 days at sea I'd have read a paper encyclopedia if I had one, but my boat isn't big enough to carry the weight of a paper encyclopedia. It sails like shit as it is from how many books I have stuffed in my V-berth.
and sometimes I'm on some random little island with no internet access for periods of time... hanging out with a bunch of other sailors, and of course we get into discussions that leave us wishing we could go google something.
Even in Bora Bora, they had internet but it was $24/hour, on crappy old computers! this would have been great!
and now! now I'm in China! They block parts of wikipedia. yeah I can set up an SSH tunnel when I happen to have internet access available, but how great it is to have a local (though somewhat outdated) copy of wikipedia, including any blocked articles!
sounds great!
Hook it in to your desktop search... or... (Score:2)
(http://www.scarydevil.com/~peter/ | Last Journal: Monday September 26 2005, @06:53PM)
wikipedia on java mobile? (Score:1)
I have one called firefox (Score:1)
(http://www.gamerslastwill.com/)
lucene (Score:1)
(http://www.footballfans.tv/)
Not enough framewurkz (Score:1)
The later execution, however... Perl, Python, PHP, Xapian, Django all together as the runtime, and add to this C code for the preparation (I might be wrong about the details, just skimmed), for such a small application.
The other poster who rejoices about how wonderfully old-skool TFA is, is obviously right. This kind of duct-tape Linux development feels badly sooo 90's and smells like maintenance, performance, installation and portability problems.
Keeping it all in just one of the scripting languages would make it much more serious. (Perl, or maybe Bash for the easiest installation?)
sdict? (Score:2)
Anyone with experience of sdict [sdict.com]?
They offer a dictionary reader for various systems, including portable devices, and dictionaries including Wikipedia.
Unfortunately their Wikipedia dict is old (January), but it seems like a good approach for laptops or other small devices. When I get an 8 GB SDHC card I'm going to try it on my Nokia N800.
Not New (Score:2)
(http://www.chaingang.org/code/)
Also, as others have noted, his chopping the file into chunks means you're going to lose at least one article per chunk.
I'd implemented this with a compressed file system and maybe some symlinks. Happily, the static content is already there for the taking. Some find and grep on a file system should be enough to do a title search with little overhead. The web server (for searches from a browser) need be little more than a daemon; in Perl, Python, etc., you could do it in less than fifty lines.
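As a rough illustration of the find-and-grep idea, here is a minimal sketch of a title search over a directory tree. It assumes a hypothetical layout in which each article is one static file named after its title. The function name `title_search` and that layout are illustration-only assumptions, not something the static dump is known to guarantee.

```python
import os


def title_search(root, words):
    """Return the sorted relative paths of files under root whose name
    contains every word in words (case-insensitive substring match)."""
    wanted = [w.lower() for w in words]
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lowered = name.lower()
            if all(w in lowered for w in wanted):
                full = os.path.join(dirpath, name)
                hits.append(os.path.relpath(full, root))
    return sorted(hits)
```

This walks every file name on each query, much like `find | grep` would, so it trades the ranked results of a real index for having nothing to build up front.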
Gears? (Score:2)
(Last Journal: Wednesday December 08 2004, @01:13PM)
The author did some nifty hacking that resulted in the following stack of dependencies:
* Perl 5.8.5
* Python 2.5
* PHP 5.2.1
* Xapian 1.0.2
* Django 0.9.6
He cited not wanting to use an RDBMS since he's not writing to the database, just reading. I can give him that, but it seems like it caused more trouble than it's worth.
This leaves me wondering: why not just use Google Gears and be done with it? Sure, the hacking part would shift largely to the JavaScript side of things (it would be mostly wiki conversion), but you'd have the other bits (web server and storage) already worked out. All you'd have to do is slap together some little app to insert the XML data into the database.
Now get it on a mobile phone (Score:2)
OLPC (Score:1)
(http://www.thekaran.com/ | Last Journal: Tuesday June 14 2005, @11:06AM)
TomeRaider does this for Pocket PCs (Score:2)
(http://www.dailygrrl.com/)
http://www.tomeraider.com/ [tomeraider.com]
They provide Wikipedia versions that you can use with their e-Reader. I bought one because I can use it on my Pocket PC, and it's just awesome having the Wikipedia available instantly, anytime. It's the fucking Hitchhiker's Guide version 0.1, gyat damn. =)
Why didn't he use 'like' ? (Score:1)
"The result of the import process was also not exactly what I wanted: I could search for an article, if I knew it's exact name; but I couldn't use parts of the name to search; it was all or nothing. To allow these "free-style" searches to work, one must create the search index - which I'm told, takes days to build. DAYS!"
This seems kind of stupid to me... could he not have done an SQL query using the moral equivalent of "select articletext where title like '%thingiaminterestedin%'", meaning you don't need to know the exact title?
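For reference, here is a minimal sketch of that kind of substring query. It uses Python's built-in sqlite3 module rather than MySQL so it runs without a server, and the `articles` table, its columns, and the sample rows are hypothetical stand-ins for the real import.

```python
import sqlite3


def search_titles(conn, fragment):
    """Return titles containing fragment anywhere, using LIKE with wildcards
    on both sides. Any % or _ inside fragment is treated as a wildcard too."""
    cur = conn.execute(
        "SELECT title FROM articles "
        "WHERE title LIKE '%' || ? || '%' ORDER BY title",
        (fragment,),
    )
    return [row[0] for row in cur]


# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Wikipedia", "..."),
        ("Wikimedia Foundation", "..."),
        ("Slashdot", "..."),
    ],
)
matches = search_titles(conn, "wiki")
```

One caveat: a pattern with a leading wildcard generally cannot use an ordinary B-tree index on the title column, so the database has to scan every title for each query. It also returns matches unranked, unlike the probability-sorted results described in the summary.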
so, xapian is great (Score:1)
(http://ls-themes.org/)
Re:2X (Score:1)
Re:Just hope you don't get an effed image. (Score:2, Insightful)
There are bastards of every academic, social, and financial background.
Re:Just hope you don't get an effed image. (Score:5, Insightful)
(Last Journal: Friday June 23 2006, @01:26PM)
Re:Just hope you don't get an effed image. (Score:4, Funny)
(Last Journal: Saturday October 26 2002, @11:59PM)
Re:2X (Score:5, Informative)
Re:2X (Score:5, Informative)
(Last Journal: Saturday February 25 2006, @11:02PM)
(1) http://schools-wikipedia.org/ [schools-wikipedia.org]
(2) http://download.wikimedia.org/enwiki/latest/ [wikimedia.org]
1 is 4625 articles hand picked for school age children, hence the website name
2 is a straight dump of wikipedia
Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
Re:Just hope you don't get an effed image. (Score:2)
Re:Just hope you don't get an effed image. (Score:1)
(http://shortcircuit.us/ | Last Journal: Sunday October 14, @02:01AM)
Re:Just hope you don't get an effed image. (Score:3, Funny)
(http://www.flickr.com/photos/tomhaines/ | Last Journal: Thursday January 04 2007, @06:29PM)