
Cassandra 0.7 Can Pack 2 Billion Columns Into a Row

angry tapir writes "The cadre of volunteer developers behind the Cassandra distributed database have released the latest version of their open source software. The new Large Row Support feature of Cassandra version 0.7 allows the database to hold up to 2 billion columns per row. Previous versions set no limit on the number of columns, but a single row could hold only about 2GB of data; that upper limit has been eliminated."

  • by Sarusa ( 104047 ) on Sunday January 16, 2011 @09:16PM (#34900940)

    Well good on them for solving an interesting technical problem, but the use cases for this are all bad.

    Obvious first use: boss will suggest we optimize the database by using only one gigantic row with two billion columns.

  • by Son of Byrne ( 1458629 ) on Sunday January 16, 2011 @09:28PM (#34901012) Journal

    Cassandra appears to be a multi-dimensional datastore that doesn't store data in the same fashion as a typical RDBMS. It uses both columns and rows to store sets of data uniquely. If you're familiar with BigTable, then, apparently, it's kinda like that.

    That just means they've added even more storage vectors to it than before... not sure why it made the Slashdot front page...

  • by ogrisel ( 1168023 ) on Sunday January 16, 2011 @09:35PM (#34901044)
    Not with column store databases such as Cassandra, HBase and BigTable.
  • Indexes (Score:4, Informative)

    by Twillerror ( 536681 ) on Sunday January 16, 2011 @10:32PM (#34901328) Homepage Journal

    Cassandra, like many of the "NoSQL"-type databases, doesn't have classic indexes.

    So instead of having an index, you typically have a separate table that acts as the index.

    Imagine you have a users table. One of the fields is country. Now you want to know all the users for a particular country.

    In standard RDBMS-type systems you just scan each row, or have an index that has done that "ahead of time" or as rows are inserted.

    In Cassandra the rows of users are distributed, possibly among hundreds of servers. So scanning for all users that have a particular country would require scanning all rows, which could take a long time.

    Unlike RDBMS-like systems, rows don't have a 2D structure and don't have a real limitation on the number of columns they can have. And columns can essentially be arrays/rows of objects.

    So as you design/bang out your application, you typically realize you need to know "users by country" for some stupid report. So you create a new table to hold these values. This has one row per country. As users are entered, you append to this row. This essentially creates an array-like structure. You then look up the row for a particular country and you now know all the users for that particular country.

    Sounds like Cassandra is getting rid of a limitation that could have caused a very large index to require multiple rows.
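
    Here's a minimal Python sketch of that "index as a separate table" pattern, modeling column families as plain dicts (the users and users_by_country names are just made up for illustration):

        # Toy in-memory model: each column family is a dict of rows,
        # and each row is a dict of columns.
        users = {}              # row key = user id, columns = attributes
        users_by_country = {}   # row key = country, one column per user

        def add_user(user_id, name, country):
            # Write the user row itself.
            users[user_id] = {"name": name, "country": country}
            # Append to the country's index row: column name = user id,
            # value left empty.
            users_by_country.setdefault(country, {})[user_id] = ""

        add_user("u1", "Alice", "US")
        add_user("u2", "Bob", "US")
        add_user("u3", "Carol", "SE")

        # "All users for a country" is now a single-row lookup rather
        # than a scan across every user row on every server.
        print(sorted(users_by_country["US"]))  # ['u1', 'u2']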

  • by red_blue_yellow ( 1353825 ) on Monday January 17, 2011 @12:50AM (#34901900)

    Columns in Cassandra aren't analogous to columns in an RDBMS. Every row is basically a list of (key, value) pairs; each pair is referred to as a column, with the key being the column name. There's no requirement that rows have the same set of column names.

    Typically large rows are used for indexes or timelines. In a timeline example, you might use a timestamp for every column name and store the entry as the column value. Cassandra keeps the row sorted by column name, so all of the entries in the row (timeline) will be in chronological order.

    In the case of indexes, you may use one row for every indexed value (say, one row for all users from Utah, one for all from Texas, etc). Here, each column would store the row key (primary key) of a row in another column family (table) that matches that indexed value; in this case, every column might hold a userId.
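
    A toy Python sketch of the timeline example above, faking Cassandra's sorted-by-column-name behavior with a sorted list (the post() helper is invented for illustration):

        from bisect import insort

        # One "row" modeling a timeline: each (timestamp, entry) pair
        # stands in for a column whose name is the timestamp.
        timeline = []

        def post(ts, entry):
            insort(timeline, (ts, entry))  # keep columns sorted by name

        post(3.0, "third entry")
        post(1.0, "first entry")
        post(2.0, "second entry")

        for ts, entry in timeline:  # reads back in chronological order
            print(ts, entry)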

  • by mini me ( 132455 ) on Monday January 17, 2011 @01:39AM (#34902104)

    Cassandra did not support said indexes until this very release. Even with secondary indexes, storing data in columns is still a reasonable design choice for many requirements. A column in Cassandra is not like a column in a relational database.

    I am sure that this is welcome news for big Cassandra users, but I do agree that it is a strange choice for the front page of Slashdot. Then again, with the number of comments asking why you would need so many columns, it seems that Slashdot needs to talk about Cassandra a little more.

  • by NFN_NLN ( 633283 ) on Monday January 17, 2011 @01:50AM (#34902146)

    "Any application developed by one or more Visual Basic developers, given enough time."

    How could that possibly be true? MS Access only supports 255 columns.

    And now you understand why Cassandra is so important! :-)

    In all seriousness I had no idea what Cassandra was or what made it unique as a database. However, I did find this tutorial that others might also find useful:

    http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model [arin.me]

  • by Sarten-X ( 1102295 ) on Monday January 17, 2011 @02:16AM (#34902214) Homepage

    Welcome to the first five minutes of using a column store. Screwy, ain't it?

    My understanding is that rows' contents are indexed such that they may be retrieved quickly. Think of a row name as a primary key. It's easy to get the whole row when you know its name. Continuing the census application, it'd be like asking for all the birth years of everyone in a geographical region. The requested column family (geographical region) is opened, and each column (person) is quickly checked for the particular row's contents (in case the birth year wasn't provided). Partitioning is done by both row and column family, so only some of the column family's data is actually scanned. That's where the cluster provides a very nice speedup, as well.

    "locating a value in a specific row can't tell how to retrieve that entire column"

    Now, I'm not sure if I understand your rage-induced rambling correctly, but if you're trying to make a SQL example, you're starting from the wrong premise, which explains why you're having trouble making sense of it all.

    Quick review: The "R" in "RDBMS" stands for "relational", referring to an n-ary relation. SQL is intended to manipulate those relations, isolating the data you want to extract. Something that is not described as an RDBMS should not be expected to have relations.

    Cassandra functions (from the application perspective) as a key-value store, with no relation structure. That means you don't work with sets, and you don't need to think about set operations. Pull out a row, and you get a list of columns with defined values, as well as those values. Iterate through each value looking for whatever value you're looking for. When you find it, you already have the column name. Just ask for the whole column next. Since the whole thing is running in a cluster, you can parallelize the iterations (I think... I've used HBase, but not Cassandra personally) to speed up the scan.
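
    A tiny Python sketch of that "pull out a row and iterate" idea (the row contents are invented for illustration):

        # A row modeled as a dict of column name -> value.
        row = {"alice": 1954, "bob": 1987, "carol": 1954}

        # Scan the row's columns for a target value; the matching
        # column names come along for free, so fetching each whole
        # column by name afterwards is cheap.
        target = 1954
        matches = [name for name, value in row.items() if value == target]
        print(matches)  # ['alice', 'carol']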

    If that's not fast enough for you (which is likely), you can use Hadoop's MapReduce framework to scan each cell and create an index, possibly laid over the other table as just more rows & columns (though a different table would be better, from a sanity perspective). Since there's no mandatory structure, that's legit.

    Of course, that's only valid for this particular census application, which assumes that the only reason for the database is either basic statistics or something complex enough for a MapReduce program.

    It's entirely possible to run Cassandra arranged similar to a normal RDBMS. Use only a few column families with very specific columns (such as a single family for all the "Name, address, etc."). Throw in a bunch of index families, updated with MapReduce. Then, your processing can be a complex MapReduce job, iterating over the particular set of rows meeting all your needed criteria. It'd be just like a normal RDBMS, except you have better scalability and you maintain the indexes yourself.
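
    A very rough single-process Python sketch of building such an index family with a map/reduce-shaped pass (the column family and field names are invented):

        from collections import defaultdict

        # Source column family: row key -> columns.
        people = {
            "u1": {"name": "Alice", "state": "UT"},
            "u2": {"name": "Bob",   "state": "TX"},
            "u3": {"name": "Carol", "state": "UT"},
        }

        # Map: emit (indexed value, row key) for every row.
        emitted = [(cols["state"], key) for key, cols in people.items()]

        # Reduce: group into index rows, one row per indexed value,
        # with one column per matching row key.
        index_by_state = defaultdict(dict)
        for state, key in emitted:
            index_by_state[state][key] = ""

        print(sorted(index_by_state["UT"]))  # ['u1', 'u3']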

    If the trouble of indexing is too much for you, you can follow Google's route with Percolator, which runs incremental update tasks when rows are changed. That's your dynamic indexing.


  • by red_blue_yellow ( 1353825 ) on Monday January 17, 2011 @02:17AM (#34902218)

    "Indeed, and there are edge cases, like Facebook, or Google, or whatever. The edge cases are gigantic databases that are accessed in certain specific ways."

    It's true that many people attempt to prematurely optimize by using Cassandra first instead of something they are already familiar with. However, when faced with some of the pains of growing an RDBMS beyond what a single box can handle, it's worth it to consider your other options. Keep in mind that if it's easy to store and make use of a huge pile of data, you're more tempted to gather that data in the first place, where 10 years ago it might have been prohibitively expensive or difficult.

    "There are probably fewer edge cases than actual NoSQL codebases, which is pretty surreal. There are more actual products than the number of people who need the products. And 99.99% of the people playing with them don't need them at all."

    I can assure you that you're incorrect, but since you don't have any data to back this up, I won't bother either.

    "The real joke is people using them in ways that are actually slower than any RDBMS, but they think it's 'easier', usually because they never bothered to learn how JOINs work, and don't understand that it's perfectly fine to make a dozen SQL queries on a web page... that's what indexes are for."

    Yes, only knuckle-dragging imbeciles are interested in new systems... *sigh*. This is an often-touted piece of flamebait that has little basis in reality. Some of the largest Cassandra users are companies who already have extensive experience scaling MySQL and other RDBMSes.

    While some might find that document stores like MongoDB are "easier" and use it for that reason, Cassandra has a reputation for being difficult to get started with; the reason it gets used nevertheless is because the benefits outweigh the steep learning curve.

  • by Sarten-X ( 1102295 ) on Monday January 17, 2011 @02:56AM (#34902336) Homepage

    Close. It's more of a hash table of sorted hash tables... Rows are unsorted (with the default partitioner), but the columns within each row are (I think... I've only used HBase personally).

    If you know what you'll be looking for ahead of time, you can make your life easy with a write-heavy system. What's missing in standard Cassandra is a way to run ad-hoc queries. My understanding is that Cassandra can now run with Hadoop's MapReduce framework. Any query or computation can be run against the Cassandra table in a widely-distributed fashion as a MapReduce job. It's not as fast as an SQL query on an indexed column, but far better than a query on an unindexed one, because everything runs in parallel across the cluster.
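
    A toy Python rendering of that nested structure (row and column names invented):

        # "Hash table of sorted hash tables": the outer dict maps row
        # keys to rows; each row keeps its columns sorted by name,
        # modeled here by sorting on the way in.
        def sorted_row(columns):
            return dict(sorted(columns.items()))

        store = {
            "row2": sorted_row({"z": 26, "y": 25}),
            "row1": sorted_row({"b": 2, "a": 1, "c": 3}),
        }

        print(list(store["row1"]))  # ['a', 'b', 'c'] -- defined order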

  • by bjourne ( 1034822 ) on Monday January 17, 2011 @08:28AM (#34903356) Homepage Journal
    Maybe Cassandra should have chosen some other terminology for its database, terminology that doesn't so obviously conflict with already existing terms. A column in Cassandra is a tuple, which in an RDBMS is a row. Confusion all around.
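
    A quick Python illustration of the clash (the values are invented):

        # In an RDBMS, the "row" is the tuple and "column" names a
        # field of the fixed schema:
        rdbms_row = ("u1", "Alice", "US")

        # In Cassandra, each "column" is itself a (name, value,
        # timestamp) triple living inside a row:
        cassandra_row = {
            "name":    ("name", "Alice", 1295222400),
            "country": ("country", "US", 1295222400),
        }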

"A car is just a big purse on wheels." -- Johanna Reynolds

Working...