IT Infrastructure As a House of Cards 216
snydeq writes "Deep End's Paul Venezia takes up a topic many IT pros face: 'When you've attached enough Band-Aids to the corpus that it's more bandage than not, isn't it time to start over?' The constant need to apply temporary fixes that end up becoming permanent is fast pushing many IT infrastructures beyond repair. Much of the blame falls on the products IT has to deal with. 'As processors have become faster and RAM cheaper, the software vendors have opted to dress up new versions in eye candy and limited-use features rather than concentrate on the foundation of the application. To their credit, code that was written to run on a Pentium-II 300MHz CPU will fly on modern hardware, but that code was also written to interact with a completely different set of OS dependencies, problems, and libraries. Yes, it might function on modern hardware, but not without more than a few Band-Aids to attach it to modern operating systems,' Venezia writes. And yet breaking this 'vicious cycle of bad ideas and worse implementations' by wiping the slate clean is no easy task. Especially when the need for kludges isn't apparent until the software is in the process of being implemented. 'Generally it's too late to change course at that point.'"
All comes down to budget (Score:5, Informative)
Take responsibility and stop the magical thinking (Score:4, Informative)
Well, sure, IT departments place the blame there. The problem, though, is not so much with the products that IT "has to deal with" as with the IT departments themselves: either they actively choose the penny-wise-but-pound-foolish course of applying band-aids rather than dealing with problems properly in the first place, or -- when the decision is not theirs -- they simply fail to properly advise the units that are making decisions of the cost and consequences of such a short-sighted approach.
When IT units don't take responsibility for assuring the quality of the IT infrastructure, surprisingly enough, the IT infrastructure, over time, becomes an unstable house of cards, with the IT unit pointing fingers everywhere else.
If your process -- whether it's for development or procurement -- doesn't discover holes before it is too late to do anything but apply "temporary" workarounds, then your process is broken, and you need to fix it so you catch problems when you can more effectively address them.
If your process leaves those interim workarounds in place once they are established, without initiating and following through on a permanent resolution, then, again, your process is broken and needs to be fixed.
You don't fix the problems with your infrastructure that have resulted from your broken processes by "wiping the slate clean" on your infrastructure and starting over. You fix the problems by, first, improving your processes so your attempts to address the holes you've built into your infrastructure don't create two more holes for every one you fix, then by attacking the holes themselves.
If you try to throw the whole thing out because it's junk -- blaming the situation on the environment and the infrastructure without addressing your process -- then:
(a) you'll waste time redoing work that has already been done, and
(b) you'll probably make just as many mistakes rebuilding the infrastructure from scratch as you made building it the first time, whether they are the same or different mistakes.
Magical thinking like "wipe the slate clean" doesn't fix problems. Problems are fixed by identifying them and attacking them directly.
Re:All comes down to budget (Score:5, Informative)
The problem is not with these fixes, it's that nobody ever documents what they did, and documentation is not readily available when needed. So, these kludges become tribal knowledge, and people only know about them because they were around when they were implemented or they've heard stories. When this happens, these wacky fixes can come back and bite you in the ass later when something mysteriously crashes and no one can get it to work like it did because nobody remembers what was done to make it work before. As people come and go, and institutional knowledge of older systems slowly erodes, we end up in a situation where everyone thinks the current system is crap, nobody knows why it was built that way, and everyone figures the only way out is to nuke the site from orbit and start over. The trick is keeping it from getting to that point.
Of course, nobody likes jumping through all these hoops like filing change control requests or writing (and especially maintaining!) documentation, so it gets dropped. IT management is more worried about getting things done quickly than documenting things properly, so there's no incentive for anyone to do any of it. Before long, you get a mass of crap that some people know parts of, but nobody knows all of, and nobody knows how or where to get information about any of it except by knowing that John Geek is the "network guru" and Jane Nerd is the "linux guru".
We will never get hardware and software that works together exactly the way we want them to. We will always have to tweak things to get them to work right for us. Citing lack of budgets or bug-ridden software may be perfectly valid, but those problems are never really going to be solved. Having our own house in order does not mean fixing all the bugs or being able to refresh our technology every 6 months. Having our own house in order means we know exactly what we did to make each system work right, we can repeat what we did, and everyone knows how to find information on what we did and why.
pay off your credit cards? (Score:5, Informative)
This is the essence of technical debt [wikipedia.org]. Whether you're programming or deploying IT infrastructure, it's inescapable that sometimes you're going to have to include kludges to work around edge conditions, a vocal 1% of your users, or whatever. These kludges are eyesores, and fragile, but they're also as far as you could go with the time and budget you had.
Sometimes, accruing debt like this enhances your liquidity and ability to respond to change, so avoiding all kludges introduces other, more obvious costs that slow you down and make you seem unresponsive to users or customers. But you can't just let your debt grow all the time without eventually ending up technically bankrupt. Let it grow when you have to, but just as importantly make time to pay it down. A lot of this stuff can be paid down a little at a time, as you come across it a few months later. The pay-off if you're vigilant is that the next ridiculously urgent fix to that system can often be handled much more easily, without dipping down further... with patience and attention to maintaining this balance, you can reduce your technical debt and make the whole system hum.
The downside is that there isn't a quick fix when you find yourself deep in technical debt. You can't just spend all your time reducing it; your highest aspiration at that point should be maintaining the level of technical debt, rather than letting it grow, but it's generally been my experience that altering the curve of debt growth even a little can set you on the right path.
Re:Software = untouchable mentality (Score:4, Informative)
Ah yes, the sunk cost fallacy [wikipedia.org].
Re:As a non-developer, this is what I see (Score:2, Informative)
The network it was running was not a small network. Not at all. It was a travesty that this poor switch was running the network: well over 200 devices plugged into other 2548s, all bridged back to the poor "core" switch.
Re:like bubblegum under a desk... (Score:4, Informative)
Yeah, I saw that line and immediately thought about some of the "temporary solutions" people have proposed over the years. The statement is an oxymoron. It's either not a solution to the problem, or it's not temporary.
We've got fewer of those being made now, because I've taken to listing the previous "temporary solutions" every time someone proposes a new one.
Re:I was torn between modding this up and commenti (Score:3, Informative)
Some of the concurrency stuff needs a complete rewrite - acquiring synchronization primitives is painful, the new 'amazingly fast' locking that they use for GCD is marginally better than a FreeBSD mutex, and between one and three orders of magnitude (depending on load) faster than a Darwin mutex. Part of this is a userspace problem (not optimising for the uncontended case, which is the most common in good code), but a lot of it comes from the route down through the myriad kernel layers when sleeping a thread.
That problem in Mach is part of what gave microkernels a bad name. QNX, which is a real microkernel (about 65K of code) does thread dispatching, locking, and message passing very fast, in constant time, and without long interrupt lockouts. Those are the functions which must go fast in a microkernel, because they're used so much. In QNX, locking a mutex in the uncontested case is about three instructions in-line, with no system call. Those three functions are most of what the QNX kernel really does. In Mach, they were an afterthought, written on top of BSD.
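The uncontended fast path described above can be sketched with a futex-style mutex. This follows the well-known Linux futex design (essentially the scheme from Drepper's "Futexes Are Tricky"); the `fast_mutex`/`fm_*` names are illustrative, and this is not how QNX or Mach actually implement their primitives:

```c
/* Sketch of a futex-style mutex: the uncontended acquire is one atomic
 * compare-and-swap in userspace; the kernel is entered only on contention.
 * Linux-specific; names are illustrative, not any real library's API. */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct {
    atomic_int state;   /* 0 = free, 1 = locked, 2 = locked with waiters */
} fast_mutex;

static void fm_lock(fast_mutex *m) {
    int c = 0;
    /* Fast path: uncontended acquire is a single CAS, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the lock contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        syscall(SYS_futex, &m->state, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(&m->state, 2);
    }
}

static void fm_unlock(fast_mutex *m) {
    /* Fast path: with no waiters, release is one atomic exchange. */
    if (atomic_exchange(&m->state, 0) == 2)
        syscall(SYS_futex, &m->state, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

The point is the shape, not the specifics: the common, uncontended case never leaves userspace, and the kernel, with all its layers, is entered only when the lock is actually contended.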
This really belongs in the "when is it time to rewrite" thread.