Bug | Cloud | Security | IT

Amazon Forced To Reboot EC2 To Patch Bug In Xen

Bismillah writes: AWS is currently emailing EC2 customers that it will need to reboot their instances for maintenance over the next few days. The email doesn't explain why the reboots are being done, but it is most likely to patch the embargoed XSA-108 bug in Xen. ZDNet takes this as a spur to remind everyone that the cloud is not magical. Also at The Register.
This discussion has been archived. No new comments can be posted.

  • Compared to Azure (Score:3, Informative)

    by Anonymous Coward on Thursday September 25, 2014 @10:11AM (#47993553)

    It's funny for me to read that Amazon is notifying its users of an impending reboot.

    I've been suffering with Azure for over a year now, and the only thing that's constant is rebooting....

    My personal favorite Azure feature is that SQL Azure randomly drops database connections by design.

    Let that sink in for a while. You are actually required to program your application to expect failed database calls.

    I've never seen such a horrible platform, or a less reliable database server...

    • by CodeReign ( 2426810 ) on Thursday September 25, 2014 @10:20AM (#47993669)

      You are actually required to program your application to expect failed database calls.

      Yes, of course you are. Only an idiot would expect 100% of db calls to be successful.
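
      (Purely illustrative: a minimal sketch, in plain JDBC, of what "expecting failed database calls" can look like: retry the call a couple of times with a short backoff before giving up. The connection URL, query, and retry count are made up for the example.)

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.SQLException;
      import java.sql.Statement;

      public class RetryingQuery {
          // Hypothetical connection string, just for the sketch.
          private static final String URL = "jdbc:sqlserver://example.database.windows.net;databaseName=shop";

          static int countOrders() throws SQLException, InterruptedException {
              SQLException last = null;
              for (int attempt = 1; attempt <= 3; attempt++) {
                  try (Connection c = DriverManager.getConnection(URL);
                       Statement s = c.createStatement();
                       ResultSet r = s.executeQuery("SELECT COUNT(*) FROM orders")) {
                      r.next();
                      return r.getInt(1);
                  } catch (SQLException e) {
                      last = e;                     // dropped connection, timeout, failover, ...
                      Thread.sleep(250L * attempt); // brief backoff before retrying
                  }
              }
              throw last; // still failing after a few attempts: surface the error
          }
      }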

      • hahaha

        so you've never worked on serious computer systems? The mainframe and VMS clusters I've used had databases working for years (over a decade in one case, as new hardware joined the cluster sequentially and old hardware retired).

        Failures were very occasional, to say the least.

        Even where I am now, the main database is Oracle on virtualized Linux servers; it's been up for 3+ years.

        Not everything is an Apache server hooking to a single MySQL instance....

        • by Shados ( 741919 )

          We're not talking about the thing going down here, just database connections sometimes failing. If you have a 100% failure-proof network and you can replicate it, go tell Google, Amazon, etc. They have a job for you.

        • by Yebyen ( 59663 )

          So what you're saying is, there were occasional failures.

        • No, some things are load-balanced banks of Apache servers hooked to Galera MySQL clusters.

          Really, though. Unless Oracle has been spending a LOT more time on version compatibility than IBM or PostgreSQL, I have to wonder if those 3+ years don't mean that the database is something like 9i still running. And Oracle DEFINITELY knows how to break things in their Financials product from major release to major release.

        • so you've never worked on vertical computer systems?

          Fixed that for you. You're conflating vertically scaled monoliths with "serious systems". That's quaint. While there are certainly still use cases for that kind of bulletproof all-your-eggs-in-one-basket architecture, that's a niche compared to the number of applications where horizontally scaled eventually consistent architecture is more appropriate.

          The mainframe and vms clusters I've used had databases working for years (over a decade in one case as new hardware joined sequentially to cluster as old retired).

          Undoubtedly, and the distributed clusters I've used where you can make progress as long as at least some reasonable subset of nodes are still alive have simila

          • You are confused. The architecture of Google is utterly useless for most business cases; it does not and can not provide accurate answers to queries.

            • The architecture of Google is utterly useless for many business cases.

              There are many use cases where it'd be perfectly appropriate.

              it does not and can not provide accurate answers to queries.

              In most cases, businesses don't really care about accurate answers to queries; they want quick, more-or-less correct answers. For example, suppose Amazon has a dashboard that shows their book sales on an hourly basis. Timeliness is more important than exactness here, and answers more precise than the pixel resolution of the graph on the big TV are wasted. A "big data" style qu

        • I mean, I'm an Oracle FMW developer working with several Oracle servers with serious uptime and SLAs, but even then, hiccups happen. A good developer programs with the expectation that not everything will work smoothly, and so long as not everything breaks at once, I could have a DB fall off the face of the earth or a server get shot and we'd still chug along with minimal perceived downtime.

      • My DB servers have a 0.07% failure rate. I imagine the parent is seeing a far higher percentage than that.

    • by Anonymous Coward

      You are actually required to program your application to expect failed database calls.

      Shouldn't you be doing that anyway? Handling failed DB calls sounds like a best practice to me.

      I get it being annoying that it drops DB connections but decent code should be able to handle that.

      I do sympathize in a way, though: I hate when the language/platform/etc... forces me to do something "their" way. Java says everything must be an object; fuck you Java, not everything needs to be/should be an object!

      • You can "handle" a dropped connection, but if you're in a transaction in the middle of updating data, it's probably not going to be transparent to the user.

        -E

        • by Shados ( 741919 ) on Thursday September 25, 2014 @10:47AM (#47993927)

          If you're in a transaction and it fails, you can just redo it. That's the whole damn point.
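
          (Not claiming this is any platform's official recipe; just a rough sketch of "redo it" with plain JDBC: run the statements in one transaction, roll back if anything fails, and rerun the whole thing. Table and column names are invented.)

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.PreparedStatement;
          import java.sql.SQLException;

          public class TransferWithRetry {
              // Hypothetical example: move money between two accounts, rerunning the
              // whole transaction if the connection drops partway through.
              static void transfer(String url, long fromId, long toId, long cents) throws SQLException {
                  SQLException last = null;
                  for (int attempt = 1; attempt <= 3; attempt++) {
                      try (Connection c = DriverManager.getConnection(url)) {
                          c.setAutoCommit(false);
                          try (PreparedStatement debit = c.prepareStatement(
                                   "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                               PreparedStatement credit = c.prepareStatement(
                                   "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                              debit.setLong(1, cents);  debit.setLong(2, fromId);  debit.executeUpdate();
                              credit.setLong(1, cents); credit.setLong(2, toId);   credit.executeUpdate();
                              c.commit();      // both updates happen, or neither does
                              return;
                          } catch (SQLException e) {
                              c.rollback();    // nothing was applied, so rerunning is safe
                              last = e;
                          }
                      } catch (SQLException e) {
                          last = e;            // couldn't even connect; try again
                      }
                  }
                  throw last;
              }
          }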

          • That isn't the point of a transaction _at all_. The point is to ensure that all of the operations contained in the transaction block are atomic - they either all happen or none of them happen.

            • by Shados ( 741919 )

              Exactly. Either all happen OR NONE HAPPEN.

              OR NONE HAPPEN.
              OR NONE HAPPEN.
              OR NONE HAPPEN.

              Which means you can simply rerun any failed transaction safely.

              Thanks for repeating what I said in different words.

              • You said that the point of a transaction is to enable the ability to retry it. I said that was not the point. I don't see how we are saying the same thing.

      • by Anonymous Coward

        I do sympathize in a way, though: I hate when the language/platform/etc... forces me to do something "their" way. Java says everything must be an object; fuck you Java, not everything needs to be/should be an object!

        What does Java have to do with this? My guess is the OP is using Python or some other script-kiddie language. You don't want to use objects in Java? Here ya go:

        byte[] myManagedMemory = new byte[1024 * 1024 * 1024]; // 1 GB of raw bytes (a single Java array can't quite reach 2 GB)

        I won't hold my breath to see how your memory management is going to be better than what the JVM already provides.

        BTW: primitives are not Objects in Java.

    • by Junta ( 36770 )

      My personal favorite Azure feature is that SQL Azure randomly drops database connections by design.

      I have seen that mentality in a few places beyond Azure, and I find it moderately annoying. I guess the theory is that by assuring *some* failure will happen to you soon, even if you don't properly test, you never go too long without a failure and get surprised. However, it tends to lead to stacks that occasionally spaz out for a particular user, and to accepting that as OK because the user can just retry.

      You are actually required to program your application to expect failed database calls.

      On the other hand, you should always design your application to expect failed database calls. There might be some

    • You are actually required to program your application to expect failed database calls.

      The problem with programming is that most programmers consider basic error handling and sanity checking to be optional.

    • When hosting your app in the cloud, regardless of provider, it is considered best practice to design for failure. That means your code should anticipate any/all stack layers becoming unavailable. If you're doing it right, a service failure should be detected and automatic failover executed. Alternatively, a new instance should be provisioned, bootstrapped, and thrown into production. Think: infrastructure as code. Welcome to the 21st century.
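
      (One toy illustration of "design for failure", assuming a service published at a primary endpoint and a replica: probe the primary, and fall back to the replica when the call fails. The URLs are placeholders, not real AWS endpoints.)

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.time.Duration;
      import java.util.List;

      public class Failover {
          private static final HttpClient CLIENT = HttpClient.newBuilder()
                  .connectTimeout(Duration.ofSeconds(2))
                  .build();

          // Placeholder endpoints: a primary and a replica in another availability zone.
          private static final List<String> ENDPOINTS = List.of(
                  "https://orders-primary.example.com/api/orders",
                  "https://orders-replica.example.com/api/orders");

          static String fetchOrders() throws Exception {
              Exception last = null;
              for (String url : ENDPOINTS) {
                  try {
                      HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                              .timeout(Duration.ofSeconds(2))
                              .build();
                      HttpResponse<String> resp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
                      if (resp.statusCode() == 200) {
                          return resp.body();   // healthy endpoint answered
                      }
                      last = new IllegalStateException("HTTP " + resp.statusCode() + " from " + url);
                  } catch (Exception e) {
                      last = e;                 // timeout, refused connection, DNS failure...
                  }
              }
              throw last;                       // every endpoint failed
          }
      }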

      • Re:Compared to Azure (Score:4, Informative)

        by Just Some Guy ( 3352 ) <kirk+slashdot@strauser.com> on Thursday September 25, 2014 @11:30AM (#47994403) Homepage Journal

        When hosting your app in the cloud, regardless of provider, it is considered best practice to design for failure.

        Netflix goes so far as to randomly kill services [netflix.com] throughout the day. Their idea is that it's better to find systems that aren't auto-healing correctly by testing recovery during routine operations than to be surprised by it at 3AM. It's successful to the point that you generally don't know that the streaming server you were connected to has been killed and a peer took over for it. That is how you make reliable cloud services.
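
        (Netflix's actual tooling is Chaos Monkey; what follows is only a toy sketch of the idea, with a hypothetical Instance interface standing in for whatever you can terminate: once an hour, pick a random member of a group and kill it, so recovery paths get exercised during normal operations.)

        import java.util.List;
        import java.util.Random;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class TinyChaosMonkey {
            /** Hypothetical handle to something killable: a VM, a container, a process... */
            public interface Instance {
                String id();
                void terminate();
            }

            private final Random random = new Random();
            private final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();

            /** Once an hour, terminate one randomly chosen member of the group. */
            public void unleash(List<Instance> group) {
                scheduler.scheduleAtFixedRate(() -> {
                    if (group.isEmpty()) {
                        return;
                    }
                    Instance victim = group.get(random.nextInt(group.size()));
                    System.out.println("Chaos: terminating " + victim.id());
                    victim.terminate();   // everything else should detect this and self-heal
                }, 1, 1, TimeUnit.HOURS);
            }
        }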

        • Netflix still cocks up randomly on a stream and forces retries. I suspect it's not as rosy as they like to say and that the random death of services is more disruptive than they notice or acknowledge.

          Meanwhile, even with their 'kill stuff randomly' methodology, the wrong thing still dies every so often and brings the whole thing to a screeching halt.

          • Netflix certainly isn't perfect, but they're Pretty Darn Good (tm). I haven't experienced any more glitches with streaming Netflix than I have with Comcast breaking other downloads.

            Meanwhile, even with their 'kill stuff randomly' methodology, the wrong thing still dies every so often and brings the whole thing to a screeching halt.

            The whole idea behind Chaos Monkey is to make sure there's no such "the wrong thing" single point of failure. Having talked to their SREs, I think such outages are exceedingly rare.

    • by tlhIngan ( 30335 )

      And unless you run a small website, that can happen way too easily.

      Every e-commerce site has database failures, usually around peak shopping periods - it's usually the weakest point because no matter how many instances you run, it's the bottleneck, as the database's view of the data has to be consistent across all database servers.

      And sometimes, well, the sheer crunch of users buying stuff topples that.

      Even a good /.'ing in the past would return errors of the form "Could not connect to database".

      Anyhow, I thought

  • by Anonymous Coward

    If your design has issues with instances going up & down, you're doing it wrong and shouldn't be using cloud services to begin with.

  • email from ec2... (Score:3, Insightful)

    by Connie_Lingus ( 317691 ) on Thursday September 25, 2014 @10:38AM (#47993841) Homepage

    "we will be re-booting the cloud today,,,in order to protect your 3,2 petabytes of data, you should download it to local storage in case of a fail event. thanks for using cloud storage on computing. have a great day."

    • "we will be re-booting the cloud today,,,in order to protect your 3,2 petabytes of data, you should download it to local storage in case of a fail event. thanks for using cloud storage on computing. have a great day."

      That this inane post is moderated as "3, Insightful" is why I do not visit /. anymore.

  • Does this mean the open source release of Xen doesn't have the diff applied? Do customers of large corporate clouds now have a security advantage over other users?

  • Seriously, if you ran your own server, you think you would never have to reboot it?

    Yes, the cloud will have downtime. Just like we sometimes have blackouts/brownouts from an electricity outage.

    BUT, chances are that downtime is LESS than the downtime you'd have running things on your own.

    In every company I've worked in, there have been days when the internet goes down, some intranet app goes down, Exchange goes down... things need to be updated and are down for a few hours.

  • AWS has been around long enough that this shouldn't be an issue. If a given architecture cannot survive the downtime of a server, or of an availability zone, then the risk is no different than if the servers were in a locally-managed datacenter.

    In short, if you don't take advantage of what the cloud has to offer in terms of redundancy, then don't expect zero downtime.

  • I really don't get it. Every virtualization technology has the ability to live-migrate a virtual machine to a different physical host (VMware, KVM, OpenVZ, Xen, everyone has it), and for at least three of them you don't need shared storage. Why don't they use it?

  • Just migrate the instance to a host running the fixed version of Xen, reboot the host with the broken version when it's empty.

    • Oh:

      Given that what’s underlying EC2 are ordinary physical servers running virtualization without a live migration technology in use,

      EC2 doesn't do migration.

      Low-life.
