Amazon Forced To Reboot EC2 To Patch Bug In Xen
Bismillah writes AWS is currently emailing EC2 customers that it will need to reboot their instances for maintenance over the next few days. The email doesn't explain why the reboots are being done, but it is most likely to patch for the embargoed XSA-108 bug in Xen. ZDNet takes this as a spur to remind everyone that the cloud is not magical. Also at The Register.
Re: (Score:1, Offtopic)
I saw in the Mpls Star Tribune the other day that Amazon are going to start charging (MN residents) sales tax as from 1st October.
I don't know if this will apply to digital content as well but if it does then I will have to cut back on buying books, magazines, and music from them as well.
The only stuff we will be able to buy is clothes...
If Amazon is collecting sales tax, it means that you were supposed to have already been paying the sales tax, and you're practicing tax evasion if you haven't been paying sales or use tax on your purchases.
http://www.revenue.state.mn.us... [state.mn.us]
Re: (Score:1)
AKA Amazon's business model.
Compared to Azure (Score:3, Informative)
It's funny for me to read that Amazon is notifying its users of an impending reboot.
I've been suffering with Azure for over a year now, and the only thing that's constant is rebooting....
My personal favorite Azure feature, is that SQL Azure randomly drops database connections by design.
Let that sink in for a while. You are actually required to program your application to expect failed database calls.
I've never seen such a horrible platform, or a less reliable database server...
Re:Compared to Azure (Score:5, Insightful)
You are actually required to program your application to expect failed database calls.
Yes, of course you are. Only an idiot would expect 100% of db calls to be successful.
Re:Compared to Azure (Score:5, Insightful)
Be sure to thank Microsoft for teaching you the value of robust error checking. Assume any other host you need to talk to was nuked from orbit five seconds ago. Write your code to bounce back from that to the degree possible.
At the very least, DB *connections* should be assumed to have evaporated since the last time you accessed them. Use some sort of pooling library that can deal with that transparently if you like, or just catch & retry if necessary.
Seriously though, sounds like the environments you’ve worked in have been simple enough with low enough transaction volume that you got lucky & everything just worked. DB & app server on the same box maybe? Dealing with temporarily unavailable external hosts is just part of writing multi-tier code.
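The "catch & retry" approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not any particular driver's API: `TransientDBError`, `call_with_retry`, and `flaky_query` are hypothetical names, and the backoff delays are arbitrary.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(); on a transient failure, back off and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientDBError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the failure
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky query (an assumption for the demo): fails twice, then works.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientDBError("connection dropped")
    return "row data"


# The sleep function is swapped out here so the demo does not actually wait.
result = call_with_retry(flaky_query, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```

A pooling library would typically hide this loop, but the idea is the same: treat a dropped connection as a normal event, and only surface the error once retries are exhausted.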
Re: (Score:2)
Which makes it an excellent dev environment, but terrible for production use...
While you want your code to be able to cope with database instability, when that code goes into production you also want to minimise the chances that it will ever have to.
Re: (Score:1)
It might be a shitty platform. but you are a shitty programmer. And it does not matter if it drops every time or never - your code deals with the problem every time it arises. are you retarded or in HR?
Re: (Score:2)
It might be a shitty platform. but you are a shitty programmer.
Apply water to burned area.
Re: (Score:2)
hahaha
so you've never worked on serious computer systems? The mainframe and vms clusters I've used had databases working for years (over a decade in one case as new hardware joined sequentially to cluster as old retired).
failures very occasional, to say the least
even where I am now the main database is oracle on virtualized linux servers, it's been up for 3+ years
Not everything is apache server hooking to single mysql instance....
Re: (Score:3)
We're not talking about the thing going down here, just a database connection sometimes failing. If you have a 100% failure-proof network and you can replicate it, go tell Google, Amazon, etc. They have a job for you.
Re: (Score:2)
So what you're saying is, there were occasional failures.
Re: (Score:2)
No, some things are load-balanced banks of apache servers hooked to Galera MySQL clusters.
Really, though. Unless Oracle has been spending a LOT more time on version compatibility than IBM or PostgreSQL, I have to wonder if those 3+ years don't mean that the database is something like 9i still running. And Oracle DEFINITELY knows how to break things in their Financials product from major release to major release.
Re: (Score:2)
so you've never worked on vertical computer systems?
Fixed that for you. You're conflating vertically scaled monoliths with "serious systems". That's quaint. While there are certainly still use cases for that kind of bulletproof all-your-eggs-in-one-basket architecture, that's a niche compared to the number of applications where horizontally scaled eventually consistent architecture is more appropriate.
The mainframe and vms clusters I've used had databases working for years (over a decade in one case as new hardware joined sequentially to cluster as old retired).
Undoubtedly, and the distributed clusters I've used where you can make progress as long as at least some reasonable subset of nodes are still alive have simila
Re: (Score:2)
You are confused; the architecture of Google is utterly useless for most business cases, it does not and can not provide accurate answers to queries.
Re: (Score:3)
The architecture of Google is utterly useless for many business cases. There are many use cases where it'd be perfectly appropriate.
it does not and can not provide accurate answers to queries.
In most cases, businesses don't really care about accurate answers to queries; they want quick, more-or-less correct answers. For example, suppose Amazon has a dashboard that shows their book sales on an hourly basis. Timeliness is more important than exactness here, and answers more precise than the pixel resolution of the graph on the big TV are wasted. A "big data" style qu
Re: (Score:2)
I mean I'm an Oracle FMW developer working with several Oracle servers having serious uptime and SLAs but even then, hiccups happen. A good developer programs with the expectation that not everything will work smoothly and so long as not everything breaks at once I could have a DB fall off the face of the earth or a server get shot and we'd still chug along with minimal perceived downtime.
Re: (Score:2)
My DB servers have a 0.07% failure rate. I imagine the parent is seeing a far higher percentage than that.
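To put a figure like 0.07% in perspective, here is a back-of-the-envelope sketch. The one-million-calls volume and the three-attempt retry policy are made-up assumptions, and the second function assumes failures are independent, which real outages often are not.

```python
def expected_failures(calls, failure_rate):
    """Expected number of failed calls at a given per-call failure rate."""
    return calls * failure_rate


def all_attempts_fail(failure_rate, attempts):
    """Chance that every attempt fails, ASSUMING independent failures."""
    return failure_rate ** attempts


rate = 0.0007  # the 0.07% figure from the comment above

# Hypothetical volume: one million database calls.
print(expected_failures(1_000_000, rate))  # roughly 700 failed calls

# Hypothetical policy: up to three attempts per call.
print(all_attempts_fail(rate, 3))  # on the order of 1e-10
```

Even a small failure rate produces a steady stream of errors at volume, which is why the handling has to exist either way; a retry policy then shrinks the user-visible rate considerably, as long as the failures are not correlated.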
Re: (Score:1)
You are actually required to program your application to expect failed database calls.
Shouldn't you be doing that anyway? Handling failed DB calls sounds like a best practice to me.
I get it being annoying that it drops DB connections but decent code should be able to handle that.
I do sympathize in a way though, I hate when the language/platform/etc... forces me to do something "their" way. Java says everything must be an object, fuck you Java not everything needs to be/should be an object!
Re: (Score:2)
You can "handle" a dropped connection, but if you're in a transaction in the middle of updating data, it's probably not going to be transparent to the user.
-E
Re:Compared to Azure (Score:5, Insightful)
If you're in a transaction and it fails, you can just redo it. That's the whole damn point.
Re: (Score:1)
That isn't the point of a transaction _at all_. The point is to ensure that all of the operations contained in the transaction block are atomic - they either all happen or none of them happen.
Re: (Score:2)
Exactly. Either all happen OR NONE HAPPEN.
OR NONE HAPPEN.
OR NONE HAPPEN.
OR NONE HAPPEN.
Which means you can simply rerun any failed transaction safely.
Thanks for repeating what I said in different words.
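The "all or none" behavior being argued about can be seen in a small sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` helper, and the simulated failure are illustrative assumptions, not anything from SQL Azure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, amount, fail_midway=False):
    """Move money from alice to bob inside one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
            (amount,))
        if fail_midway:
            # Simulated failure between the two updates (an assumption).
            raise ConnectionError("simulated dropped connection")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
            (amount,))


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


try:
    transfer(conn, 30, fail_midway=True)
except ConnectionError:
    pass
after_failure = balances(conn)  # the half-done debit was rolled back

transfer(conn, 30)              # rerunning the same transaction
after_retry = balances(conn)
print(after_failure, after_retry)
```

One caveat the sketch does not cover: if the connection drops while the commit itself is in flight, the client may not know whether it was applied, so a blind retry is only safe when the operation is idempotent or the outcome can be checked first.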
Re: (Score:1)
You said that the point of a transaction is to enable the ability to retry it. I said that was not the point. I don't see how we are saying the same thing.
Re: (Score:1)
You are right, I am failing to adequately communicate what I am saying. That a transaction can be retried is a byproduct of the atomicity requirement that a transaction fills. Retrying a transaction, because there is an ongoing problem with the database system dropping connections, is a sloppy hack.
Re: (Score:1)
I do sympathize in a way though, I hate when the language/platform/etc... forces me to do something "their" way. Java says everything must be an object, fuck you Java not everything needs to be/should be an object!
What does Java have to do with this? My guess is the OP is using Python or some other script-kiddie language. You don't want to use objects in Java? Here ya go:
byte[] myManagedMemory = new byte[1024 * 1024 * 1024]; // 1 GiB; a 2 GiB size overflows int and exceeds the maximum array length
I won't hold my breath to see how your memory management is going to be better than what the JVM already provides.
BTW: primitives are not Objects in Java.
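A note for anyone trying a 2 GB byte array in Java: the size expression `1024 * 1024 * 1024 * 2` is evaluated in 32-bit `int` arithmetic, so it wraps to a negative number. The Python sketch below mimics that wraparound; `to_int32` is a helper written for this illustration, not a library function.

```python
def to_int32(value):
    """Wrap an integer the way a signed 32-bit Java int would."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set represent negative numbers.
    return value - 0x100000000 if value >= 0x80000000 else value


size = to_int32(1024 * 1024 * 1024 * 2)
print(size)  # a negative array size, which Java rejects at runtime

one_gib = to_int32(1024 * 1024 * 1024)
print(one_gib)  # still fits in an int
```

Since Java array lengths are `int` values, a single `byte[]` cannot hold a full 2 GiB in any case; it would take several arrays or an off-heap buffer.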
Re: (Score:2)
My personal favorite Azure feature, is that SQL Azure randomly drops database connections by design.
I have seen that mentality in a few places beyond Azure, and I find it moderately annoying. I guess the theory is assuring that *some* failure will happen to you soon even if you don't properly test, so you don't go too long without failure and get surprised. However, it tends to lead to stacks that occasionally spaz out for a particular user, with that accepted as OK because the user can just retry.
You are actually required to program your application to expect failed database calls.
On the other hand, you should always design your application to expect failed database calls. There might be some
Re: (Score:2)
The problem with programming is that most programmers consider basic error handling and sanity checking to be optional.
Re: (Score:2)
When hosting your app in the cloud, regardless of provider, it is considered best practice to design for failure. That means your code should anticipate any/all stack layers becoming unavailable. If you're doing it right, a service failure should be detected and automatic failover executed. Alternatively, a new instance should be provisioned, bootstrapped and thrown into production. Think: infrastructure as code. Welcome to the 21st century.
Re:Compared to Azure (Score:4, Informative)
When hosting your app in the cloud, regardless of provider, it is considered best practice to design for failure.
Netflix goes so far as to randomly kill services [netflix.com] throughout the day. Their idea is that it's better to find systems that aren't auto-healing correctly by testing recovery during routine operations than to be surprised by it at 3AM. It's successful to the point that you generally don't know that the streaming server you were connected to has been killed and a peer took over for it. That is how you make reliable cloud services.
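The idea of killing instances at random and checking that requests still get served can be sketched as a toy simulation. This is not Netflix's tooling; `Service`, `chaos_kill`, `serve_with_failover`, and `replace_dead` are hypothetical names, and the pool size is arbitrary.

```python
import random


class Service:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_kill(pool, rng):
    """Kill one randomly chosen live instance, as a chaos test would."""
    victim = rng.choice([s for s in pool if s.alive])
    victim.alive = False
    return victim


def serve_with_failover(pool, request):
    """Try each instance in turn; a dead one is skipped transparently."""
    for service in pool:
        try:
            return service.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instances left")


def replace_dead(pool):
    """Auto-heal: swap every dead instance for a fresh one."""
    for i, service in enumerate(pool):
        if not service.alive:
            pool[i] = Service(service.name + "-replacement")


rng = random.Random(0)  # seeded so the demo is repeatable
pool = [Service(f"stream-{i}") for i in range(3)]
for round_number in range(5):
    victim = chaos_kill(pool, rng)
    reply = serve_with_failover(pool, f"request-{round_number}")
    replace_dead(pool)
    print(f"killed {victim.name}; {reply}")
```

If `serve_with_failover` ever raised during routine rounds like these, that would be the recovery bug showing up during the day rather than at 3AM.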
Netflix is not perfect... (Score:2)
Netflix still cocks up randomly on a stream and forces retries. I suspect it's not as rosy as they like to say and that the random death of services is more disruptive than they notice or acknowledge.
Meanwhile, even with their 'kill stuff randomly' methodology, the wrong thing still dies every so often and brings the whole thing to a screeching halt.
Re: (Score:2)
Netflix certainly isn't perfect, but they're Pretty Darn Good (tm). I haven't experienced any more glitches with streaming Netflix than I have with Comcast breaking other downloads.
Meanwhile, even with their 'kill stuff randomly' methodology, the wrong thing still dies every so often and brings the whole thing to a screeching halt.
The whole idea behind Chaos Monkey is to make sure there's no such "the wrong thing" single point of failure. Having talked to their SREs, I think such outages are exceedingly rare.
Re: (Score:2)
VMware's fault tolerance is decent, but nothing that will recover in milliseconds. Even with vMotion and HA, it will take some time for the machine to reboot.
Of course, there is the FT mode of VMware, which runs two VMs in lockstep, so if the heartbeat drops, the downtime is in seconds, not minutes as with a machine restarting. It has a lot of limitations, though, such as only allowing 1 vCPU.
Re: (Score:3)
How much longer would it take to migrate the existing VMs to a patched version? (Even if you only have 10% unutilized resources, it'd only take at most nine swaps.) I agree it's a bad solution to move every machine overnight, but it's better than forcing an outage.
AWS can't live migrate VM's.
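The "at most nine swaps" figure in the quoted question follows from simple counting, sketched below. The model is an assumption made for illustration: identical hosts, spare hosts that start empty and get patched first, and each swap draining as many busy hosts as there are empty patched ones.

```python
def rolling_patch_swaps(total_hosts, spare_hosts):
    """Count the migrate-then-patch swaps needed to patch every host.

    Assumes the spare hosts are empty and already patched, and that each
    swap drains as many busy hosts as there are empty patched hosts.
    """
    if spare_hosts <= 0:
        raise ValueError("need at least one spare host to migrate into")
    unpatched_busy = total_hosts - spare_hosts
    swaps = 0
    while unpatched_busy > 0:
        drained = min(spare_hosts, unpatched_busy)
        # Drained hosts are now empty; they get patched and become the
        # landing spot for the next swap.
        unpatched_busy -= drained
        swaps += 1
    return swaps


print(rolling_patch_swaps(100, 10))  # 10% spare capacity -> 9 swaps
```

This only counts rounds. It says nothing about how long each round takes, which depends on how much state has to move per VM.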
Re: (Score:2)
That's Xen, and Xen != VMware, but it works for about 99% of the workloads out there. From what I received, it's a cold restart, not a simple reboot. Periodically they upgrade hardware/software, and once a month we go through a cold restart on all of our AWS instances. It's easy with the right tools.
Re: (Score:2)
I'm not saying migrate to another facility but to another machine. If that's what you also meant, would you be able to provide a source? That seems like a very very big oversight.
Re: (Score:2)
That seems like a very very big oversight.
It's nature of the beast. Live migrations without shared storage are really not commonplace. Amazon does not bother with shared storage and thus cannot live migrate. Even if they did have the ability to live migrate with no shared storage, the time to live migrate such a workload would be impractical.
In short, EC2 strives for cheap and no migration is part of 'cheap'.
Re: (Score:2)
A lot of people want the convenience of a virtual server, but not the price tag or hassle of several servers and a load balancer. They don't "get" why they would pay for lots of small machines when one big one would do. Once you do convince them to go with several small servers and a load balancer, they don't understand why their FTP changes take a moment to show up online. Then they don't want to invest in someone to set up the system with Puppet or Ansible or the like... The list goes on, but it usually comes down to people not having the money or desire (usually both) to do things "the cloud way."
Most of these small players would be happier with a single 2-drive RAID-1 server in their closet, except they are too cheap to shell out for a decent machine in the first place as well as business tier internet (they usually don't have the traffic to warrant it, but it is required for ISPs to be OK with it). $5/month for a VPS is much more palatable, even if what they get is a lot less powerful than what they could have in their office.
There's no business tier small office internet that's going to give users the same uptime as a cheap VPS somewhere. No business that wants to maintain a 24x7 internet presence should be running their server on a small server in their closet.
Re: (Score:2)
Amazon doesn't have the capacity to fail over all the VMs to other hardware (maybe some, but not all or the big ones). Or they don't want to bother and force the work on to their customers.
I think you meant "and charge customers for the much larger infrastructure required". Amazon is cheap, and they are clear that what you're buying from them is just a bunch of machines. If you want reliability, use multiple AZ's and regions. Some of their VM's come with a TB or more of instance storage, that's a lot of data to live-migrate when they want to reboot a physical host machine.
If you want live migration, check out Google Compute Engine, but if availability is important to you, you're better off ar
Re: (Score:2)
How much longer would it take to migrate the existing VMs to a patched version? (Even if you only have 10% unutilized resources, it'd only take at most nine swaps.) I agree it's a bad solution to move every machine overnight, but it's better than forcing an outage.
AWS can't live migrate VM's.
Xen can.
Well, actually, for about 100ms, the system isn't technically running, but the point is that you can bounce a VM from one host to another without rebooting it.
Re: (Score:2)
How much longer would it take to migrate the existing VMs to a patched version? (Even if you only have 10% unutilized resources, it'd only take at most nine swaps.) I agree it's a bad solution to move every machine overnight, but it's better than forcing an outage.
AWS can't live migrate VM's.
Xen can.
Well, actually, for about 100ms, the system isn't technically running, but the point is that you can bounce a VM from one host to another without rebooting it.
Xen is software, not AWS, AWS is an entire infrastructure, and they can not (or will not) live migrate customer VM's.
They are very clear in their documentation that customers should be able to tolerate VM restarts and to use multiple AZ's and regions to help mitigate downtime. I have several hundred instances scheduled for reboot, but they are doing one AZ at a time.
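Why spreading across AZs helps with a one-AZ-at-a-time reboot can be shown with a little counting. The zone names and the round-robin placement below are assumptions for the sake of the example, not AWS behavior.

```python
from itertools import cycle


def spread_across_zones(instance_count, zones):
    """Assign instances to zones round-robin."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def capacity_during_reboot(placement, rebooting_zone):
    """Fraction of instances still up while one zone is being rebooted."""
    up = sum(1 for zone in placement if zone != rebooting_zone)
    return up / len(placement)


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical AZ names
placement = spread_across_zones(9, zones)
for zone in zones:
    print(zone, capacity_during_reboot(placement, zone))

# Everything in one zone: a reboot of that zone takes it all down.
single_zone = spread_across_zones(9, ["zone-a"])
print(capacity_during_reboot(single_zone, "zone-a"))
```

With three zones, two thirds of the fleet stays up during each maintenance window, provided the application can actually run on the remaining instances.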
Re: (Score:2)
Xen is software, not AWS, AWS is an entire infrastructure, and they can not (or will not) live migrate customer VM's.
They are very clear in their documentation that customers should be able to tolerate VM restarts and to use multiple AZ's and regions to help mitigate downtime. I have several hundred instances scheduled for reboot, but they are doing one AZ at a time.
Since Xen is rumored to be the VM host for AWS (or at least large parts of it), I'd have to think it's "will not".
Re: (Score:2)
Xen is software, not AWS, AWS is an entire infrastructure, and they can not (or will not) live migrate customer VM's.
They are very clear in their documentation that customers should be able to tolerate VM restarts and to use multiple AZ's and regions to help mitigate downtime. I have several hundred instances scheduled for reboot, but they are doing one AZ at a time.
Since Xen is rumored to be the VM host for AWS (or at least large parts of it), I'd have to think it's "will not".
I can believe it's "can not", since amazon provides gigabytes (or terabytes) of local instance storage for most of their instance types - that's a lot of data to live migrate. Even if the underlying Xen software technically *can* live migrate VM's, that doesn't mean their infrastructure can support migrating thousands of customer instances.
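For a rough sense of scale, here is the transfer-time arithmetic. The 10 Gbit/s link, the 1 TB store, and the 64 GB memory image are all assumptions picked for illustration; the sketch also ignores disk read speed, protocol overhead, and data that changes during the copy.

```python
def transfer_seconds(data_bytes, link_bits_per_second):
    """Time to push data over a link at full line rate (idealized)."""
    return data_bytes * 8 / link_bits_per_second


TB = 10 ** 12
GB = 10 ** 9
GBIT = 10 ** 9

# Hypothetical: 1 TB of local instance storage over a dedicated 10 Gbit/s link.
storage_seconds = transfer_seconds(1 * TB, 10 * GBIT)
print(storage_seconds / 60, "minutes for the instance store")

# For comparison, a hypothetical 64 GB memory image over the same link.
ram_seconds = transfer_seconds(64 * GB, 10 * GBIT)
print(ram_seconds, "seconds for RAM only")
```

Under these assumptions, one instance's local disk ties up a full link for over ten minutes, while memory alone takes under a minute, so whether local storage has to move makes a large difference when multiplied across a fleet.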
Re: (Score:2)
Xen is software, not AWS, AWS is an entire infrastructure, and they can not (or will not) live migrate customer VM's.
They are very clear in their documentation that customers should be able to tolerate VM restarts and to use multiple AZ's and regions to help mitigate downtime. I have several hundred instances scheduled for reboot, but they are doing one AZ at a time.
Since Xen is rumored to be the VM host for AWS (or at least large parts of it), I'd have to think it's "will not".
I can believe it's "can not", since amazon provides gigabytes (or terabytes) of local instance storage for most of their instance types - that's a lot of data to live migrate. Even if the underlying Xen software technically *can* live migrate VM's, that doesn't mean their infrastructure can support migrating thousands of customer instances.
Except that in a cloud, storage is part of the cloud, not part of the server. The only thing that has to physically move is the RAM image of the running VM from one host to another. And it's almost certainly going to be faster to replicate that than to destroy and rebuild it (reboot).
Re: (Score:2)
Xen is software, not AWS, AWS is an entire infrastructure, and they can not (or will not) live migrate customer VM's.
They are very clear in their documentation that customers should be able to tolerate VM restarts and to use multiple AZ's and regions to help mitigate downtime. I have several hundred instances scheduled for reboot, but they are doing one AZ at a time.
Since Xen is rumored to be the VM host for AWS (or at least large parts of it), I'd have to think it's "will not".
I can believe it's "can not", since amazon provides gigabytes (or terabytes) of local instance storage for most of their instance types - that's a lot of data to live migrate. Even if the underlying Xen software technically *can* live migrate VM's, that doesn't mean their infrastructure can support migrating thousands of customer instances.
Except that in a cloud, storage is part of the cloud, not part of the server. The only thing that has to physically move is the RAM image of the running VM from one host to another. And it's almost certainly going to be faster to replicate that than to destroy and rebuild it (reboot).
No, Amazon says that instance storage is directly attached to the host machine, so if they live-migrate a VM, they'd have to carry along the instance storage.
http://docs.aws.amazon.com/AWS... [amazon.com]
Many Amazon EC2 instance types can access disk storage from disks that are physically attached to the host computer. This disk storage is referred to as instance store.
And there's no evidence that they use any type of shared SAN for instance storage -- instance storage only stays around for as long as the machine is running (or rebooted). If you stop the machine (as opposed to rebooting), or if Amazon has to migrate to a new physical host, you lose the instance store.
Re: (Score:2)
Not trying to be contentious here, but if you wanted optimal resource usage, you'd be looking more at blade-style compute nodes with no local drives. It defeats the purpose if every compute node has a fixed amount of local disk space attached to it. There's no elasticity. Some compute nodes might max out, some might be using only a fraction of the drive. The whole reason for virtualizing everything was that there were too many machines burning up tons of resources while sitting more or less idle.
IIRC, Amazo
Re: (Score:2)
Not trying to be contentious here, but if you wanted optimal resource usage, you'd be looking more at blade-style compute nodes with no local drives.
Who would you be contentious with? I'm just telling you what Amazon says in their published docs. If you don't believe what they say, or if you think they could do it better you can bring it up with them, or start your own cloud service that does things "right".
But I can tell you that some use cases are perfect for Amazon's model of providing locally attached instance storage since I/O rates are much better than we can get with EBS volumes.
Re: (Score:2)
Not trying to be contentious here, but if you wanted optimal resource usage, you'd be looking more at blade-style compute nodes with no local drives.
Who would you be contentious with? I'm just telling you what Amazon says in their published docs. If you don't believe what they say, or if you think they could do it better you can bring it up with them, or start your own cloud service that does things "right".
But I can tell you that some use cases are perfect for Amazon's model of providing locally attached instance storage since I/O rates are much better than we can get with EBS volumes.
The days when just anyone could enter the market as an ISP are long since past. The "back bedroom" ISP I started with has been through at least 4 layers of acquisition. I myself stopped providing hosting services before the millennium came. The economies of scale were not available to me and I don't have deep enough pockets - nor rich enough friends - to set up anything even remotely competitive.
So I'll settle for holding Amazon's feet to the fire.
I don't host anymore, but I do work with cloud services int
Reboot (Score:1)
If your design has issues with instances going up & down, you're doing it wrong and shouldn't be using cloud services to begin with.
email from ec2... (Score:3, Insightful)
"we will be re-booting the cloud today,,,in order to protect your 3,2 petabytes of data, you should download it to local storage in case of a fail event. thanks for using cloud storage on computing. have a great day."
Re: (Score:2)
"Please note that $106,000 will be added to your next bill for the egress bandwidth."
Interesting thought: Cloud providers should offer a low-cost option for getting your petabyte of data out of their system.
Perhaps mailing you a drive (or series of drives) with the data on it? If they allow you to run zfs or other filesystem with snapshot capability, create a snapshot and request that it be mailed to you. Maybe they'll even link in the available drives that can handle the data and you pick which one(s) you want.
email from ec2... (Score:2, Funny)
"we will be re-booting the cloud today,,,in order to protect your 3,2 petabytes of data, you should download it to local storage in case of a fail event. thanks for using cloud storage on computing. have a great day."
That this inane post is moderated as "3, Insightful" is why I do not visit /. anymore.
Embargoed Bug? (Score:2)
Does this mean the open source release of Xen doesn't have the diff applied? Do customers of large corporate clouds now have a security advantage over other users?
It is still magical enough (Score:2)
Seriously, if you ran your own server, you think you would never have to reboot it?
Yes, the cloud will have downtime. Just like we sometimes have blackouts/brownouts from an electricity outage.
BUT, chances are that downtime is LESS than the downtime you'd have running things on your own.
In every company I've worked in, there have been days the internet goes down, some intranet app goes down, Exchange goes down... things need to be updated and are down for a few hours.
Why is this a story? (Score:2)
AWS has been around long enough this shouldn't be an issue. If a given architecture cannot survive downtime from a server, or an availability zone, then the risk is no different than if the servers were in a locally-managed datacenter.
In short, if you don't take advantage of what the cloud has to offer in terms of redundancy, then don't expect zero downtime.
I don't get it (Score:2)
I really don't get it. Every virtualization technology has the ability to live migrate a virtual machine to a different physical host: VMware, KVM, OpenVZ, Xen, everyone has it, and for at least three of them you don't need shared storage. Why don't they use it?
Why do they need a reboot? (Score:2)
Just migrate the instance to a host running the fixed version of Xen, reboot the host with the broken version when it's empty.
Re: (Score:2)
Oh:
Given that what’s underlying EC2 are ordinary physical servers running virtualization without a live migration technology in use,
EC2 doesn't do migration.
Low-life.