Open Source Moving in on the Data Storage World
pararox writes "The data storage and backup world is one of stagnant technologies and cronyism. A neat little open source project, called Cleversafe, is trying to dispel that notion. Using the information dispersal algorithm originally conceived of by Michael Rabin (of RSA fame), the software splits every file you backup into small slices, any majority of which can be used to perfectly recreate all the original data. The software is also very scalable, allowing you to run your own backup grid on a single desktop or across thousands of machines."
I don't think you know what that word means. . . . (Score:2, Interesting)
Re:I don't think you know what that word means. . (Score:5, Funny)
Re:I don't think you know what that word means. . (Score:1)
I just rely on Echelon for my data backups...
Re:I don't think you know what that word means. . (Score:2, Funny)
Re:I don't think you know what that word means. . (Score:1, Funny)
Re:I don't think you know what that word means. . (Score:2)
Re:I don't think you know what that word means. . (Score:3, Funny)
Nice fileserver you 'ave there. Shame if somefing were to 'appen to it. Know what I mean, 'squire?
Re:I don't think you know what that word means. . (Score:2)
Imagine a scenario where the Legato sales rep is buddy-buddy with the CIO: he takes him out golfing and to dinners on Legato's dime, and in exchange the CIO sticks with Legato no matter how much the product may diverge from the company's true needs. The sales rep keeps a lucrative contract and the CIO gets free stuff, regardless of the performance of each.
I don't think genuine friendship is ne
Editors, please note! (Score:5, Informative)
Editors, please note that there is some incorrect information in this post. Firstly, the original concept of the IDA was designed by Shamir of RSA fame, not Rabin.
Also note that the Cleversafe IDA is a custom algorithm, and is only similar to Shamir's initial concept.
Re:Editors, please note! (Score:3, Interesting)
Re:Editors, please note! (Score:2)
I think this is wrong again (Score:2)
Shamir invented secret sharing.
Rabin invented the Rabin public key cryptosystem, and IDA.
IDA is not like secret sharing.
With secret sharing, you have a secret, which you break up into shares. You can decide how many shares you need to reconstruct the secret when you break it up. Without the right number of shares, you know nothing about the secret. But the big difference is that EACH SHARE IS SLIGHTLY BIGGER THAN THE INITIAL SECRET.
With IDA, you have lots of data. Yo
Re:I think this is wrong again (Score:3, Interesting)
It is a classic example of a bad patent. There was prior art (though admittedly this was kept top secret till 1997) and it also failed the obviousness test. Clearly if someone else came up with the same algorithm four years earlier it was clearly obvious to someone skilled in the art of cryptograp
Re:I think this is wrong again (Score:4, Insightful)
Backup for Backuper? (Score:3, Interesting)
If there is a creator/seeder, then we are still burdened by having to keep this seeder safe so that we can retrieve the distributed slices.
If there is no creator/seeder, is this safe enough so that people cannot patch slices together by way of trial-and-error?
Looking at it here for work (Score:2)
At work we're looking into this to store critical data on our intranet, which spans several states and facilities. Looks great, but only time will tell.
I seem to remember a project months ago that was going to use P2P to back up your data on other P2P users' computers, which to me sounds quite insane. Anyone know if this is related?
Re:Looking at it here for work (Score:1)
If it could be set up properly to be used on a large corporate intranet, then there's some merit to it. If you could use this system to spread chunks of data out over an intranet that spans several states, then it could be a useful way to store critical data during hurricane season or the like. If a building took sufficient damage from weather, earthquake, terrorist, broken water main, etc. so that the data center in that building was a loss, t
Re: (Score:1)
Re:Looking at it here for work (Score:2)
The 'R' stands for Rivest, not Rabin (Score:5, Informative)
While Michael Rabin was inventor of the Rabin cryptosystem [wikipedia.org] in 1979, it was Ronald Rivest, Adi Shamir and Len Adleman behind RSA [wikipedia.org] two years earlier.
Re:The 'R' stands for Rivest, not Rabin (Score:3, Funny)
Think RAID5, only way better (Score:5, Interesting)
Using the information dispersal algorithm originally conceived of by Michael Rabin (of RSA fame), the software splits every file you backup into small slices, any majority of which can be used to perfectly recreate all the original data.
It seems like this can be tuned to provide varying levels of fault tolerance. According to the abstract (I don't have an ACM web account, and I couldn't find the full text), it seems like I can take a file and make it so that any four chunks can be used to rebuild the file. I can then distribute eight such chunks to eight different machines. Thus, five of the eight machines would have to be rendered inoperable before I was unable to retrieve my data.
If I understand it correctly, then this is really slick.
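The 4-of-8 arrangement described above can be sketched as follows. This is a minimal illustration of a Rabin-style dispersal, assuming arithmetic over the prime field GF(257) and evaluation points 1 through n; it is not Cleversafe's algorithm, and the names `encode`, `decode`, and `solve_mod` are made up for this example.

```python
P = 257  # assumption: smallest prime above 255, so every byte is a field element

def encode(data: bytes, k: int, n: int) -> list[list[int]]:
    """Split data into n slices such that any k of them rebuild it."""
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    slices = [[] for _ in range(n)]
    for start in range(0, len(padded), k):
        block = padded[start:start + k]
        for i in range(n):
            x = i + 1  # distinct evaluation point per slice
            # dot product of the block with (1, x, x^2, ...), taken mod P
            slices[i].append(sum(b * pow(x, j, P) for j, b in enumerate(block)) % P)
    return slices

def solve_mod(matrix, rhs):
    """Solve matrix * unknowns = rhs over GF(P) by Gauss-Jordan elimination."""
    k = len(matrix)
    rows = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if rows[r][col] % P)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % P for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

def decode(available: dict[int, list[int]], k: int, length: int) -> bytes:
    """Rebuild the data from any k slices, given as {slice_index: values}."""
    chosen = sorted(available)[:k]
    matrix = [[pow(i + 1, j, P) for j in range(k)] for i in chosen]
    out = bytearray()
    for pos in range(len(available[chosen[0]])):
        out.extend(solve_mod(matrix, [available[i][pos] for i in chosen]))
    return bytes(out[:length])
```

With `k=4` and `n=8`, each slice holds about a quarter of the data, so the eight slices together take roughly twice the original size, and any four surviving machines are enough to call `decode`.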
Re:Think RAID5, only way better (Score:1, Informative)
Re:Think RAID5, only way better (Score:4, Interesting)
Rabin has shown how to come up with l vectors of which k are mutually orthogonal.
MOD PARENT REDUNDANT (Score:4, Funny)
Re:Think RAID5, only way better (Score:2, Funny)
I think I've suddenly gone blind because your "[non-]fancy way of saying" doesn't sound a damn thing like the gibberish my eyes just read. "mutually orthogonal vectors" ?!
If I'm wrong, then I should probably go and lie down, but I just showed my wife and now she's crying... so I think it's your explanation and not me.
*goes to find advil*
Re:Think RAID5, only way better (Score:2)
"mutually orthogonal vectors" simply means that two separate things are going in the X-Y plane, which is good. If one of them might be travelling in the Z plane, it might have poked you in the eye for reading it. That would be bad.
Re:Think RAID5, only way better (Score:2)
Consider three dimensional space (l = 3, k = 2). Let the k dimensional vector be v = (1/2, 1/3) in the x-y plane. Then the l dot products are v.i = 1/2, v.j = 1/3, v.k = 0. I cannot pick any two products (say 1/2 and 0) to reconstruct v.
Something must be lost in translation
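A small sketch of what seems to be lost here: the scheme works if any k of the l vectors are linearly independent, which is a different property from mutual orthogonality. The example below reuses v = (1/2, 1/3) with k = 2 and l = 3, but takes the three vectors to be (1, 1), (1, 2), and (1, 3); that choice and the helper name `recover` are assumptions made for illustration.

```python
from fractions import Fraction

def recover(products: dict[int, Fraction]) -> tuple[Fraction, Fraction]:
    """Recover v = (v0, v1) from two dot products v . (1, a) with distinct a."""
    (a1, p1), (a2, p2) = products.items()
    v1 = (p1 - p2) / (a1 - a2)  # subtracting the two equations eliminates v0
    v0 = p1 - a1 * v1
    return v0, v1

v = (Fraction(1, 2), Fraction(1, 3))
points = [1, 2, 3]  # stands for the vectors (1, 1), (1, 2), (1, 3)
products = {a: v[0] + a * v[1] for a in points}

# Any two of the three products are enough to rebuild v.
for pair in [(1, 2), (1, 3), (2, 3)]:
    print(pair, recover({a: products[a] for a in pair}))
```

No two of these three vectors are parallel, so every pair gives a solvable 2x2 system. With the basis vectors i, j, k from the counterexample, the product with k carries no information about a vector in the x-y plane, which is why that choice fails.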
HUM?!? (Score:1)
Do you mean that we have a k-dimensional vector space V, a vector on this vector space and calculate the dot product with l mutually orthogonal vectors where l>k?
Is that it? Because if it is it's strange to say the least.
Re:Think RAID5, only way better (Score:1)
In a k-dimensional vector space you can't come up with l>k (non-null) mutually orthogonal vectors. After all, k non-null mutually orthogonal vectors form a basis for the vector space.
Re:Think RAID5, only way better (Score:2, Informative)
Re:Think RAID5, only way better (Score:1)
I suspect the original poster expressed himself incompletely because this is nonsense.
The only way this can make any sense is if the vector belongs to a k dimensional vector *subspace* of another vector space of at least dimension l>k.
In that scenario the subspace can't be orthogonal to any of the l mutually orthogonal vectors for things to work as described.
This needs further clarification.
Re:Think RAID5, only way better (Score:4, Informative)
Byzantine for Beginners (Score:3, Interesting)
Re:Think RAID5, only way better (Score:2)
In the classic formulation, the secret is split into N parts, such that no part reveals any information about the secret (that is, knowing one of the parts does not make any possible secret more likely than any other possible secret). The really cool thing is that you can decide that's not good enough, and can split up your secret such that knowing M or fewer parts reveals no information about the secret (for sufficiently large N). Normally
stagnant?? (Score:4, Insightful)
Re:stagnant?? (Score:1, Funny)
Wouldn't distributed storage be loosing data? After all, it's being set loose from one device, to be stored upon many...
Re:stagnant?? (Score:2)
The MP3s of the many, outweigh the MP3s of the few.
Re:stagnant?? (Score:2)
Re:stagnant?? (Score:2)
One method of reducing risk is to place redundant vowels in some of your words. In case the first one gets loost somehow, you still have the second one.
Re:stagnant?? (Score:1)
1 create say 500 meg chunks (compressed)
2 write a base system + itself to disk
3 build an iso with 8 chunks (or 16 if target is a DL disc)
4 write out the disc
5 loop until disk has been backed up
(so what's the state of backup to stone tablets???)
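The loop in the steps above comes down to two ceiling divisions. A rough sketch, assuming the input is the already-compressed size in megabytes and using a hypothetical helper name `plan_discs`:

```python
import math

def plan_discs(total_mb: int, chunk_mb: int = 500, chunks_per_disc: int = 8):
    """Return (number of chunks, number of discs) for a backup of total_mb."""
    chunks = math.ceil(total_mb / chunk_mb)      # step 1: fixed-size chunks
    discs = math.ceil(chunks / chunks_per_disc)  # steps 3-5: one ISO per group
    return chunks, discs

print(plan_discs(100_000))                      # single-layer discs
print(plan_discs(100_000, chunks_per_disc=16))  # dual-layer discs
```

The base system written to each disc in step 2 is not counted here; it would have to fit in whatever space the chunks leave free.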
Re:stagnant?? (Score:2)
Addendum (Score:2, Funny)
KFG
oh yea (Score:1)
Re:oh yea (Score:2)
Rar + Par + BitTorrent? (Score:5, Interesting)
Par files (for use with QuickPar, etc) are great, saving all sorts of extra posting on binary newsgroups.
Re:Rar + Par + BitTorrent? (Score:4, Funny)
Re:Rar + Par + BitTorrent? (Score:1)
Here, have some of my advil - it sounds like you may need them more than I.
Re:Rar + Par + BitTorrent? (Score:2)
Re:Rar + Par + BitTorrent? (Score:2)
Re:Rar + Par + BitTorrent? (Score:2)
Re:Rar + Par + BitTorrent? (Score:2)
It's much easier to get any 50% of a file that is twice as large than to get 100% of the original. You'd never need to see a complete download, thus you wouldn't get cases where your transfer gets delayed waiting for a seed for the one little piece that nobody else has.
50% is probably overkill for bittorrent. If a zip format were built that could reconstruct its contents from any arbitrary 90% of the file, that would be *amazing* for torrents.
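The trade-off between those two figures can be made concrete. The sketch below assumes an ideal code in which any sufficiently large subset of equal-sized pieces rebuilds the file; the function name `total_pieces` is illustrative.

```python
def total_pieces(original_pieces: int, percent_needed: int) -> int:
    """Pieces to publish so that any percent_needed% of them rebuild the file."""
    # ceiling division in integer arithmetic, to avoid float rounding
    return -(-original_pieces * 100 // percent_needed)

print(total_pieces(900, 50))  # twice the original piece count
print(total_pieces(900, 90))  # about 11% extra
```

Under that assumption a 50% threshold doubles the amount of data in the swarm, while a 90% threshold costs only about one extra piece for every nine.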
So er...
Not a new idea (Score:5, Informative)
Sourceforge page (Score:1, Informative)
http://it.slashdot.org/it/06/04/26/2039224.shtml [slashdot.org]
Re:Sourceforge page (Score:2)
I really hope they have a backup handy.
You mean Shamir, not Rabin (Score:5, Interesting)
Even more amazingly, Shamir's secret sharing scheme allows computing math functions, such as digital signatures, without ever recovering the secret keys. This is called threshold cryptography; some of you may be interested to learn about its many wonders. Shamir rocks and so does threshold crypto!
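For readers who have not seen it, the basic splitting step of Shamir's scheme can be sketched in a few lines. This covers only sharing and recombining a secret, not threshold signatures; the prime 2**127 - 1 and the function names `split` and `combine` are choices made for this illustration.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this

def split(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Hide the secret as the constant term of a random polynomial."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def f(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule
            result = (result * x + c) % PRIME
        return result

    return [(x, f(x)) for x in range(1, shares + 1)]

def combine(points: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Calling `split(s, 3, 5)` gives five shares, any three of which passed to `combine` return `s`. Each share is a full field element, so it is as large as the secret itself, which is the contrast with information dispersal that other comments in this thread point out.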
Implementations? (Score:2)
Re:You mean Shamir, not Rabin (Score:2)
innovation (Score:2, Interesting)
Maybe one day vendors will stop pushing overly expensive and utterly bland storage solutions. For example, the last time I had a meeting about storage, the product was: 2x servers and 2x disk arrays with possible storage of a little under 2TB (using 24 80GB SCSI HDDs) with RAID 5. Oh, and the storage was presented as 4 x 500GB drives to the OS (some proprietary thing). all
Storage should be Boring! (Score:5, Insightful)
I'm kinda missing the point of the "editorializing" in this article: when a storage system is doing its job, it IS boring. You put bytes in, assured they will be stored, and you get them out on demand. You want nothing "interesting" to happen to the data that your business is built on! Sure, the technology is stagnant, if that means customers can get access to the data, reliably, year after year. We Slashdotters are prepared to take "bleeding edge" risks that enterprise customers are not.
Re:Storage should be Boring! (Score:2)
maybe you should look at iSCSI (Score:2)
They basically sell a stack of drive arrays which you can configure as volumes as you see fit. Some notable features:
Ability to configure multiple RAID types within the stack. So you could have RAID 10 and RAID 0 within the same stack of drives depending on if you need speed or redundancy.
The ability to stripe the data and parity across units in the whole stack (RAID 10 level 2 and 3). So if you have 3 4 drive sys
Re:maybe you should look at iSCSI (Score:2)
I'm sure you can. So why not show me exactly the hardware and software components you could use?
"Spend some time building your own; don't spend that 100k because someone sells it well in a glossy mag"
Time = money. You spend money rolling your own in staff time. Then you need to maintain it. Rather than have a single source for support you have multiple vendors or open source software components to m
been done before (Score:5, Informative)
Related companies/projects happened in this order: MojoNation [archive.org] .. MNet [mnetproject.org] .. HiveCache [archive.org] .. AllMyData [allmydata.com]
good luck!
Re:been done before (Score:2)
Publius (Score:3, Interesting)
It's nice to see another attempt that's free. Free speech requires anonymity.
Re:Publius (Score:2)
Anonymity contributes to meaningless and criminal communication. Perfect anonymity will result in nearly worthless communication. Take a look at the "p3n15 pi11z!!" offers in your Email inbox for an excellent example.
Free speech requires VIGILANCE by a population to ensure that the rights to speak freely are not suppressed, and that takes organization, effort, and might.
Re:Publius (Score:2)
Any right X "contributes to meaningless and criminal" Y.
You are making the classic argument in support of a police state. Just because X can be used by criminals, or X makes harder the police's job in catching criminals, does not mean that we can or should criminalize X itself.
Perfect anonymity will result in nearly worthless communication... Email
Just because most people use a lousy email system with a rotten design and limited capabilities in
Re:Publius (Score:2)
You are making the classic argument in support of a police state. Just because X can be used by criminals, or X makes harder the police's job in catching criminals, does not mean that we can or should criminalize X itself.
I don't believe I said that. I only said that anonymous communication results in meaningless communication, and is certainly not a requirement for living in a free society.
Just because most people use a lousy email system with a rotten design and limited capabilities in no way means that
Virtual file server -- was a program for old Macs (Score:5, Interesting)
By chance, anyone remember this technology? I have no idea what happened to it, but it would be a blockbuster open source app if done today, and if it were platform independent. If done right, one could create data brokerage houses, where people could buy and sell storage space, and also reliability, where space on a RAID or server array would be of higher value than space on a laptop that is rarely on the Internet.
Its great-grandchild, Google file system (Score:2)
Very roughly, this is what GFS does. I don't have 25,000 servers at my disposal, so I haven't been able to test it though. Maybe next week. Meanwhile, I muddle through with tape.
Re:Virtual file server -- was a program for old Ma (Score:2, Informative)
Re:Virtual file server -- was a program for old Ma (Score:2)
By chance, anyone remember this technology? I have no idea what happened to it, but it would be a blockbuster open source app if done today, and if it were platform independent.
That's very interesting. If I understand what you're saying, was it something like this [willden.org]? That's a description I wrote up for a system I'd like to build if I ever get the time.
Re:Virtual file server -- was a program for old Ma (Score:2)
I wonder what ever happened to the company's IP and source code. Hopefully it's not sitting on some old SCSI-1 drive in some clearinghouse, slowly bit-rotting away. Even worse would be the code of this program ending up lost to history.
This is why I think software vendors that wish to obtain copyright protection for their software should be required to publish source code along with the binaries. They wouldn't be required to grant any license whatsoever, so no one would have any right to use the source -- not ev
Re:Virtual file server -- was a program for old Ma (Score:2)
Re:Virtual file server -- was a program for old Ma (Score:2)
Re:Virtual file server -- Mango (Score:2)
Re:Virtual file server -- Mango (Score:2)
redundancy = your secret is safe (with us) (Score:1)
Re:redundancy = your secret is safe (with us) (Score:3, Insightful)
For example, if I tell you my 8 character password has a "q" in it, you've only lowered the number of possible passwords from 2821109907456 to 78364164096. Not exactly useful, either way.
And of course, what good is keeping the data out of the wrong hands if the RIGHT HANDS can never get to it?
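The two figures quoted above are 36**8 and 36**7, which suggests a 36-symbol alphabet (an assumption: lowercase letters plus digits). A quick check of the arithmetic, using illustrative variable names, shows that 36**7 is the count when the position of the "q" is also known; merely knowing that a "q" occurs somewhere leaves more candidates.

```python
ALPHABET = 36  # assumption: 26 lowercase letters plus 10 digits
LENGTH = 8

total = ALPHABET ** LENGTH                      # every 8-character password
known_position = ALPHABET ** (LENGTH - 1)       # "q" at one known position
contains_q = total - (ALPHABET - 1) ** LENGTH   # at least one "q" somewhere

print(total, known_position, contains_q)
```

Either way the comment's point stands: the search space shrinks, but it remains far too large for the leak to be useful on its own.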
Re:redundancy = your secret is safe (with us) (Score:1)
Re:redundancy = your secret is safe (with us) (Score:3, Informative)
I am the chief designer of the Cleversafe dispersed-storage system (aka a grid-storage software system) and am one of the project's co-founders. The Cleversafe system never stores a complete copy of the data in any one place (or "grid node" in our terminology). At most 1/11th of the file data--we call it a file "slice"--is stored at any one grid node in a "scrambled" (i.e., non-contiguous), compressed, and encrypted/signed fashion. The grid _never_ stores more than one copy of the data on the grid
Borg Technology (Score:5, Funny)
I was immediately visualizing a Borg Cube regenerating after a hit from the Enterprise.
regardless, it sounds cool.
Link to pay-for-view contents (Score:4, Insightful)
Re:Link to pay-for-view contents (Score:2)
Re:Link to pay-for-view contents (Score:2)
New idea... NOT. (Score:5, Informative)
I just hope they don't patent it [uspto.gov]!
Cleversafe mirror (Score:2)
The others are there too.
Sounds familiar. Like my master's thesis. (Score:5, Interesting)
In fact, I wrote a RSRaid driver for Linux for my thesis and did some performance testing on it. I'll save you the 30 pages and just tell you that the algorithm is far too CPU intensive to scale up very well for fileserver use (my original intent,) but I did conclude it could be used as a backup alternative to tape. Hmmmm.
Direct Link [dyndns.org]
Google Cache [72.14.203.104]
Please forgive the double brackets, I fought with Word and lost.
Contact me if you'd like to play with the code. I never did any reconstruction code, but the system did work in a degraded state, and was written for the Linux 2.6 kernel.
Re:Nice thesis (Score:2)
However, that would req
I hope they backed up (Score:2)
And how can you say backing up to a *single* desktop pc is of any value?
Shameless plug... (Score:2)
Cleversafe's headquarters are located at the new University Technology Park [university...gypark.com] at IIT...no, not that IIT, this one [iit.edu].
Par and Par2? (Score:1)
RAID 5 at the File Level (Score:3, Interesting)
From the summary : "the software splits every file you backup into small slices, any majority of which can be used to perfectly recreate all the original data."
So, basically it is like RAID 5 striping and parity [wikipedia.org] applied to the file level.
Neat concept.
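For comparison, the parity idea behind RAID 5 can be sketched with plain XOR. This is a simplified illustration with equal-length stripes and made-up function names; note that a single parity stripe survives the loss of only one stripe, whereas the dispersal described in the summary is said to survive the loss of any minority of slices.

```python
def xor_parity(stripes: list[bytes]) -> bytes:
    """XOR equal-length stripes together, byte by byte."""
    parity = bytearray(len(stripes[0]))
    for stripe in stripes:
        for i, b in enumerate(stripe):
            parity[i] ^= b
    return bytes(parity)

def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """XOR of the survivors and the parity yields the one missing stripe."""
    return xor_parity(surviving + [parity])

stripes = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(stripes)
print(rebuild_missing([stripes[0], stripes[2]], parity))  # the lost middle stripe
```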
Notes from lead Cleversafe designer (Score:5, Informative)
Hello-
I am the lead designer of the first Cleversafe dispersed-storage system (aka a grid-storage software system) and am one of the project's co-founders. The Cleversafe system never stores a complete copy of the data in any one place (or "grid node" in our terminology). At most 1/11th of the file data--we call it a file "slice"--is stored at any one grid node in a "scrambled" (i.e., non-contiguous), compressed, and encrypted/signed fashion. The grid _never_ stores more than one copy of the data on the grid, and that one copy is never stored all in the same place--it's dispersed using an optimized information-dispersal algorithm that we created but that has similar properties to the previously-published info-dispersal algorithms (IDAs).
If a grid node and its associated content--i.e., the user's file slices on that node--are ever completely compromised (firewall comes down, all encryption and scrambling is cracked, etc), then the cracker acquires at most 1/11th (one-eleventh) of the user's data.
Further, if up to half (that is, as many as 5 out of the 11) of the grid nodes are for any reason destroyed or otherwise unavailable, all of the user's data is still accessible. This is done by generating a "coded" file slice for every data slice that we store on the node, and regenerating missing file slices from down nodes by pumping the available data and coded slices through our info-dispersal algorithms (which are all open-sourced, by the way) that are executed on the client side or when the grid "self heals" for destroyed nodes.
The system can also be implemented in a cost-effective fashion. The grid system can sustain so many concurrent, per-node outages that the availability/uptime requirements for each node are minimal. Also, the grid-node servers need not support much processing capability, for the client offloads much of the work from the servers.
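The claim about modest per-node uptime requirements can be checked with a back-of-the-envelope calculation. The sketch below assumes that nodes fail independently and that any 6 of the 11 slices suffice, as "any majority" in the summary suggests; both assumptions, and the function name `availability`, are this example's and not taken from the Cleversafe design documents.

```python
from math import comb

def availability(nodes: int, needed: int, p_up: float) -> float:
    """Probability that at least `needed` of `nodes` independent nodes are up."""
    return sum(
        comb(nodes, up) * p_up**up * (1 - p_up) ** (nodes - up)
        for up in range(needed, nodes + 1)
    )

print(availability(11, 6, 0.90))  # each node is up only 90% of the time
print(availability(1, 1, 0.90))   # a single server, for comparison
```

Under those assumptions, nodes that are each reachable only 90% of the time still give roughly 99.97% availability for the data as a whole.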
We feel this system provides a powerful combination of reliability, scalability, economy, and security.
The hardest part of the design, imo, is to be able to reliably track all of these file slices across a large and heterogeneous set of grid-node machines housing these info-dispersed file slices. We designed the grid meta-data system from the ground up to do this and to be capacity-expandable, performance-scalable, and easily serviceable. More details for the open-source flavor of the grid-software design can be found here:
http://wiki.cleversafe.org/Grid_Design [cleversafe.org]
There's much more that I can say about this system; I plan to add additional comments to this thread as more questions and comments arise. I'm sure there are new comments I have yet to read, for they're coming in pretty quickly...
I also encourage further discussion at our newly-created web forums: http://forums.cleversafe.org/ [cleversafe.org]
Mailing lists (that will be synchronized with the web forums) will also be available at cleversafe.org in the near future.
-Matt
Cleversafe project lead
Comparing Cleversafe IDA algorithms with others (Score:2, Informative)
The Cleversafe information dispersal algorithms (IDAs) were designed to provide real-time performance with large amounts of data storage and retrieval (gigabytes, petabytes and above). Previous algorithms, like Rabin, Shamir and Reed-Solomon, are very effective at storing smaller amounts of data (kilobytes), but their computational overhead, which is proportional to the square of the data block size or greater, isn't well suited for quickly dispersing/restoring l
"any majority of which" (Score:2)
OK, I'm numb in the morning, but what the hell does that mean?
Re:"any majority of which" (Score:2, Interesting)
Microsoft has a similar concept (Score:2)
http://research.microsoft.com/sn/Farsite/ [microsoft.com]
Pretty cool stuff, check this out:
Re:Correction!!! (Score:2, Informative)
Most people seem to know the RSA names well, but the IDA algorithm in this article is not related to RSA. So their comments on what the R in RSA stands for can mislead readers!