IT

The Big Promise of 'Big Data'

snydeq writes "InfoWorld's Frank Ohlhorst discusses how virtualization, commodity hardware, and 'Big Data' tools like Hadoop are enabling IT organizations to mine vast volumes of corporate and external data — a trend fueled increasingly by companies' desire to finally unlock critical insights from thus far largely untapped data stores. 'As costs fall and companies think of new ways to correlate data, Big Data analytics will become more commonplace, perhaps providing the growth mechanism for a small company to become a large one. Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly. It's no accident that many of the underpinnings of Big Data came from the methods these very businesses developed. But today, these methods are widely available through Hadoop and other tools for enterprises such as yours.'"
  • LiveSQL (Score:4, Interesting)

    by ka9dgx ( 72702 ) on Tuesday September 14, 2010 @01:51PM (#33578432) Homepage Journal

    I think that the real innovation will be a variation of SQL that allows for the persistence of queries, such that they continue to yield new results as new data is found to match them in the database. If you have a database of a trillion web pages, and you continue to put more in, it doesn't make sense to re-scan all of the existing records each time you decide you need to get the results of the query again. It should be possible, and far more computationally efficient, to have a LiveSQL query emit a continuous stream of results instead of running in batch mode.

    I've registered the domain name livesql.org as a first step to helping to organize this idea and perhaps set up a standard.

    • by PPH ( 736903 )

      This sounds like some sort of agent or RSS feed. You tell it to message you every time some event matching your query occurs. Not a big deal for a single database or enterprise-wide data store.

      This could get a bit involved for an Internet-wide system with the scope of Google. How many Slashdotters would submit queries like "Tell me whenever new p0rn appears"?

      • by Anonymous Coward

        That query is peanuts. How about "tell me whenever new p0rn about pink haired furry hermaphrodite with big tits being raped by purple tentacles appears".

    • Re: (Score:2, Insightful)

      by starsky51 ( 959750 )
      Couldn't this be done using regular sql and an indexed timestamp column?
    • by tigre ( 178245 )

      Interesting idea. Basically would need to establish event triggers on relevant tables. Should also be able to invalidate results that were previously found, and provide updates as well. Would require a lot of memory to persist enough information about the previous results that you don't end up with duplicates. I'll try and check in when you actually have a site.

    • by Drumpig ( 13514 )

      whois livesql.org
      NOT FOUND

      Liar.

      • by ka9dgx ( 72702 )
        one has to wait a bit for these things, as DNS updates are still batch oriented. 8)
    • Re:LiveSQL (Score:5, Informative)

      by Laxitive ( 10360 ) on Tuesday September 14, 2010 @02:45PM (#33579266) Journal

      There are some serious technical challenges to overcome when you think about actually implementing something like this.

      Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

      This issue is also present in queries using ordered results (as changes to a single value participating in the ordering would affect the global ordering of results for that query).

      The issue that "Big Data" presents is really the need to run -global- data analysis on extremely large datasets, utilizing data parallelism to extract performance from a cluster of machines.

      What you're suggesting (basically a functional reactive framework for querying volatile persistent data), would still involve a number of limitations over the SQL model: basically disallowing the usage of any truly global algorithm across large datasets. Tools like Hadoop get around these limitations by taking the focus away from the data model (which is what SQL excels in dealing with), and putting it on providing an expressive framework for describing distributable computations (which SQL is not so great at dealing with).

      -Laxitive

      • Re: (Score:3, Informative)

        by Rakishi ( 759894 )

        Take something like "select stddev(column) from table" - there's no way to get an incremental update on that expression given the original data state and a point mutation to one of the entries for the column. Any change cascades globally, and is hard to recompute on the fly without scanning all the values again.

        Stddev is trivial to recompute on the fly and I'd be surprised if any decent sql engine didn't compute it one row at a time. Store mean(column) and mean(column^2). SD = sqrt(mean(c^2)-mean(c)^2) not considering the unbiasing stuff. Add new row value deltas to both, do some simple math and you're done.
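The running-sums approach described above can be sketched as follows. The class and method names are hypothetical, and this computes the population standard deviation (the "not considering the unbiasing stuff" case). One caveat worth labeling: subtracting two large, nearly equal sums can lose precision in floating point, so a production system would more likely use something like Welford's algorithm; this sketch just mirrors the formula in the comment.

```python
import math


class RunningStddev:
    """Keeps sum(x) and sum(x^2) so stddev never needs a full rescan."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        """Account for a newly inserted row value."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def update(self, old, new):
        """Account for a point mutation of one existing value."""
        self.total += new - old
        self.total_sq += new * new - old * old

    def stddev(self):
        """Population stddev: sqrt(mean(x^2) - mean(x)^2)."""
        mean = self.total / self.n
        variance = self.total_sq / self.n - mean * mean
        # Clamp tiny negative values caused by rounding error.
        return math.sqrt(max(variance, 0.0))
```

Both `add` and `update` are constant-time, which is the point being made: a single changed value adjusts two stored sums rather than cascading through the whole column.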

        Now medians and quantiles are a bitch.

        Frankly complex data mining of large data is a pain in the ass on hadoop as much as anywhere else. You can't do anything too global with hadoop because then you'd need to s

        • Frankly complex data mining of large data is a pain in the ass on hadoop as much as anywhere else. You can't do anything too global with hadoop because then you'd need to send all your data to one box anyway. You need specialized complex algorithms since you can only keep a fraction of your data in memory at a time. Simple regression? Have fun.

          That said if you're already using hadoop it's quite possible you're using some sort of online learning algorithm anyway for just that reason so converting it to real time updating would be easy.

          Maybe. But I'm looking at doing it by pushing data to two boxes and two hadoop clusters (mainly for DR purposes). As for complex algorithms, Hive solves many of those problems by allowing people with SQL backgrounds to mine their information without learning a new language.

          It's new, it's not perfect but then again, it is improvement over what's already in the market. Over the next couple years I plan on becoming a hadoop expert (even looking at learning java now).

          Gotta start somewhere right?

        • by Laxitive ( 10360 )

          I should have thought things out a bit better with the stddev example - and realized that it does indeed have a reasonable closed form. Good catch.

          Complex data mining is hard everywhere, that's true. The problem is that even straightforward data mining is hard once the dataset sizes reach into the hundred-millions or billions or trillions in size (implying absolute dataset sizes of terabytes or more). For google it's webpages, for biology labs it's sequences.

          The big killer is the cost of transferring dat

      • by ka9dgx ( 72702 )

        It turns out there is already a protocol [waveprotocol.org] in place for doing a lot of the grunt work. It allows for the federation of changes to a given object across organizations. While I wouldn't want to try to build an ACID database on it in my free time, I suppose it could eventually be done with a larger team of programmers.

        A query can be distributed across machines, which is what the map-reduce meme is all about. The next stage is to eliminate redundant calculations across time. LiveSQL will do that.
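The map-reduce idea mentioned above can be illustrated with a toy, single-process sketch. This is not Hadoop's API: the function names are hypothetical, and each inner list of lines merely stands in for the data held on one machine. The map step runs independently per partition, and the reduce step merges partial results with an associative operation, which is what lets a real framework spread the work across a cluster.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, independent of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(a, b):
    """Reduce step: merge two partial counts (associative and commutative)."""
    return a + b


def word_count(partitions):
    """Run the map step per partition, then fold the partial results together."""
    return reduce(reduce_counts, map(map_partition, partitions), Counter())
```

Because the merge is associative, partial counts can be combined in any order or grouping, so a new partition's counts can also be folded into an existing total without recounting the old partitions.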

    • Aren't you basically talking about a materialized view [oracle.com]? (This FAQ item [orafaq.com] has a simpler explanation than you'll get digging through the documentation above)

      I haven't worked with materialized views, but if you want notification when the data changes, usually you can set up a trigger...

      • I have worked with materialized views and, yes, you are entirely correct. It takes one "CREATE MATERIALIZED VIEW" statement to have exactly what GP is describing. Unfortunately, in my experience, Oracle often requires so much tuning that a roll-your-own solution can be favorable (though less uniform and thus not suit-friendly).
    • Re:LiveSQL (Score:4, Informative)

      by BitZtream ( 692029 ) on Tuesday September 14, 2010 @03:20PM (#33579754)

      I think that the real innovation will be a variation of SQL that allows for the persistence of queries

      That's been done for years, materialized views, using triggers on INSERT/UPDATE/DELETE to update the views on the fly.
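As a rough illustration of the trigger approach, here is a sketch using SQLite through Python's `sqlite3` module. SQLite is only a convenient stand-in here: as far as I know it has no built-in materialized views, so an ordinary summary table plays that role. The `orders` and `order_totals` tables, the trigger names, and the helper functions are all hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount INTEGER NOT NULL
);
-- Summary table standing in for a materialized view of per-customer totals.
CREATE TABLE order_totals (
    customer TEXT PRIMARY KEY,
    total INTEGER NOT NULL
);

CREATE TRIGGER orders_ai AFTER INSERT ON orders BEGIN
    INSERT OR IGNORE INTO order_totals (customer, total) VALUES (NEW.customer, 0);
    UPDATE order_totals SET total = total + NEW.amount WHERE customer = NEW.customer;
END;

CREATE TRIGGER orders_au AFTER UPDATE ON orders BEGIN
    UPDATE order_totals SET total = total - OLD.amount WHERE customer = OLD.customer;
    INSERT OR IGNORE INTO order_totals (customer, total) VALUES (NEW.customer, 0);
    UPDATE order_totals SET total = total + NEW.amount WHERE customer = NEW.customer;
END;

CREATE TRIGGER orders_ad AFTER DELETE ON orders BEGIN
    UPDATE order_totals SET total = total - OLD.amount WHERE customer = OLD.customer;
END;
"""


def make_db():
    """Create an in-memory database with the tables and triggers above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def totals(conn):
    """Read the maintained summary without scanning the orders table."""
    return dict(conn.execute("SELECT customer, total FROM order_totals"))
```

Reading `order_totals` never rescans `orders`; each write pays a small, fixed cost to keep the summary current. Note that this sketch leaves a zero-total row behind when a customer's last order is deleted.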

      Streaming results as needed is done with cursors.

      I know you think you're probably talking about something that 'materialized views and cursors don't do'. Fortunately, you're wrong and just don't understand how to use them.

      It really bothers me how people who talk about problems with SQL really have no fucking clue what they are talking about or how to work with the data in the first place.

    • So you describe a set of conditions, let's be blatant and call them "rules" or even "filters", and when they match something you act upon them?

      Stunning :)

    • by Prune ( 557140 )
      Can't this be recast as a form of the eventual consistency paradigm?
  • ...and 'Big Data' tools like Hadoop are enabling IT organizations...

    ...these methods are widely available through Hadoop and other tools...

    Oh... also... did I mention HADOOP!!??

  • Looks like somebody got their PR spin piece relayed as a news story again. Bravo!

  • End of Science (Score:2, Informative)

    by mysterons ( 1472839 )
    Related to using Big Data in Business is Big Data in Science. Wired ran a nice series of articles looking at this (http://www.wired.com/wired/issue/16-07). This raises all sorts of problems (for example, how can results be reproduced? What if the model of the data is as complex as the data? Are all results obtained with Small Data simply artefacts of sparse counts?).
    • How can results be reproduced?

      I don't follow how near infinite storage affects the ability of researchers to re-perform an experiment to gather data a second time.

      What if the model of the data is as complex as the data?

      Then it is, by definition, not a model. A model is a system that describes another system more complex than itself; a model that is as complex as the system it is trying to describe is just a different way of looking at the system. It can still be useful, but it doesn't simplify the problem the same way a real model does.

      Are all results obtained with Small Data simply artefacts of sparse counts?

      Science has had a way to handle that question for centur

      • Experiments being reproduced can be hard if no-one else has the data (this can happen --for example if you are Google and publish results using large fractions of the Web as data) or even if something as trivial as moving it from one site to another requires a lot of effort. This is not really a question of storage costs --it is a question of having the data in the first place and the mechanics of moving it around. Models are used in Science as idealisations; but if you really really want to model the l
  • by BitZtream ( 692029 ) on Tuesday September 14, 2010 @03:31PM (#33579894)

    Consider that Google, Yahoo, and Facebook were all once small companies that leveraged their data and understanding of the relationships in that data to grow significantly.

    Because their business is based entirely on how that data correlates.

    99.999999999% of the rest of the world do other things as their primary business model. Small businesses aren't going to do this because it requires a staff that KNOWS how to work with this software and get the data out.

    Walmart might care, but they aren't a small business.

    The local auto mechanic, or plumber, or even small companies like lawn services or maid services simply aren't big enough to justify having a staff of nerds to make the data useful to them, and they really don't have enough data to matter. It simply is too expensive on the small scale.

    Companies that can REALLY benefit from the ability to comb vast quantities of data have been doing it for well over a hundred years. Insurance companies are a prime example. You know what? They aren't small in general, so they have the staff to do the data correlation and find out useful information because it works on that scale.

    Anyone who cares about churning through massive amounts of data already has ways to do it. Computing will make it faster, but it's not going to change the business model.

    I'm kind of puzzled why virtualization has anything to do with this, unless someone is implying that a smart thing to do is set up a VM server and then run a bunch of VMs on it to get a 'cluster' to run distributed apps on ... if that's the point being made then I think someone needs to turn in their life card (they clearly never had a geek card).

    So now that I've written all that, I went and read the article.

    Now I realize that article is written by someone who has absolutely no idea what they are talking about and simply read a wikipedia page or two and threw in a bunch of names and buzzwords.

    Hadoop doesn't help the IT department do anything with the data at all.

    It's the teams of analysts and developers that write the code to make Hadoop (which is only mentioned because of its OSS nature here) and a bunch of other technologies and code all work together and produce useful output.

    This article is basically written like the invention of the hammer made it so everyone would want to build their own homes because they could. That's a stupid assumption and statement.

    Slashdot should be fucking ashamed that this is posted anywhere, let alone front page.

    • You're missing the point.

      Sure, insurance companies have kept claims data around for many years. They make some pretty good observations about obvious correlations. People who speed too much tend to hit more things. People with chronic diseases tend to die.

      What about the data they couldn't handle, though? What about the effects of someone's purchases? Did they buy quality brake pads? What about the circuit breakers installed in their house the last time it was remodeled? What contractor did they hire? How ma

    • by Prune ( 557140 )
      What's missing is a killer app for most businesses, and it's not the data gathering and management side that's lacking, but the analytics side. I think that advanced analytics is not nearly as user friendly and accessible as it could be, and hopefully will be in the future. Visualization/analytics tools like Tableau are a good start, but we need more and better (as in smarter in AI/machine learning terms) automation. Eventually I see analytics useful not only to businesses but even individuals, as a way of makin
    • Actually, while I was also irked by the buzzword-compliance of TFA, I think the point about linking virtualization and the cloud with giving small businesses access to data tools is actually quite valid. Storage and processing are commodities now thanks to these technologies, which significantly reduces the staff and overhead required for a startup or small company to utilize large data sets. I work for a small web design and hosting company and we certainly wouldn't be considering scaling up our data manag

    • "99.999999999% of the rest of the world do other things as their primary business model. Small businesses aren't going to do this because it requires a staff that KNOWS how to work with this software and get the data out."

      Of course, 99.999999999% of the world doesn't have electricity as their primary business model. Does this mean that small businesses are going to stay with candles and bonfires? Because, you know, they won't have the needed staff for producing and distributing their own electricity.

      This ne

  • Rummaging in the Bitlocker

    Starring everybody's favorite...

    Peta Bites

    and costarring...

    Bare Bones

    and making his professional debut:

    Big Data!

  • you need to know what to look for. In order to know what to look for you need to know what's meaningful and that requires some sort of useful model. Accumulating data in itself isn't that interesting.

  • EVAR?

    The so-called computational scientists I used to work for through an entire alphabet soup of FFRDCs were barely able to program in FORTRAN, much less something as sophisticated as Hadoop.

    Notice that the article skirts this issue -- yes, they work with "Big Data" but they don't use any dev tools developed post-1963 to do it in, believe me.

  • For the most part, Google has moved on to Caffeine and GFS2 for their support. Apparently, Bigtable was taking too long to regenerate the entire index, forcing Google to refresh only part of their index frequently. The new Caffeine framework supposedly lets Google get closer-to-real-time search results because newly-indexed/crawled data can be continuously tossed into the search database without requiring an entire batch process. Perhaps that's why quotes from Slashdot comments show up in Google so quickly.

  • @CmdrTaco, et al., you might go to the 'in-depth guide' Ohlhorst mentions [http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml], and assess that separately. We did a lot of research with the CIO and the rest of the C-suite in mind as a target audience. Of course Google has moved on beyond Bigtable, etc.... According to @royans [http://www.royans.net/arch/pregel-googles-other-data-processing-infrastructure/], Google uses Pregel to mine graphs. Allegedly 20% of their data they mine with Prege
