Man, I love being in computers.

nadams
nadams Posts: 5,877
edited January 2008 in The Clubhouse
Realised, at around 7:30pm last night, that I couldn't get into our systems from home. Verified with my boss that she couldn't, either... then headed in to see what was up.

Got to the office (luckily, it's less than 2 miles from my home), and found that some lights came on nice, others were dim and flickering, and the emergency lighting was barely on. I've been here ever since.

We finally got good power back at 2:10am, and I've been bringing the systems back online since then. Everything went down hard when the UPS batteries died, so they're all running checks on startup, and each server has to come online one at a time, at least for the major ones.

Thank God for overtime.

I might just forget about going home, and catch a few hours sleep on my office floor and be here for morning in.... 5 hours.

But first, I've got to get the rest of these systems up.

The end cause: one of the incoming power phases to the building had burnt a wire somehow (overloaded, or maybe it was just its time?), which left that phase putting out about 35 volts. Once they replaced it, everything's worked great.
Ludicrous gibs!
Post edited by nadams on

Comments

  • PolkThug
    PolkThug Posts: 7,532
    edited January 2008
    Don't get sucked in like Tron!
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    nadams wrote: »
    Realised, at around 7:30pm last night, that I couldn't get into our systems from home. Verified with my boss that she couldn't, either... then headed in to see what was up.

    Got to the office (luckily, it's less than 2 miles from my home), and found that some lights came on nice, others were dim and flickering, and the emergency lighting was barely on. I've been here ever since.

    Dude, get software that will page you or send you an email or something when the UPS jumps on in the off-hours so you don't run into a problem like that again. We have software at work that the operators monitor, and when something goes bad, the first thing they do is page the on-call person. Usually it's just a quick outage or a brown-out, but if it's enough to flip the breaker in a UPS or awaken one of the beasts of a generator we have for emergency backups, it's better to know about it and shut stuff down clean than spend an entire night of your last day of vacation fixing the problems it all caused.

    As for the craptastic part of the job, well, hey, it happens. Welcome to IT. That's how it goes, dude. Sorry you had to deal with it, but everyone has put in those late nights or early mornings at one time or another. If they say they haven't, they're either lying or they haven't REALLY done IT work.

    You should check your backups and see when the last one ran. That will give you a good idea of when the power outage happened. Also, see if they are complete or bogus. Either way, once the systems are back up and running, kick off a backup if you can.
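
    If you want to roll your own alerting in the meantime, something like this is all it really takes. It's just a bare-bones sketch, assuming Network UPS Tools (its upsc command) is already talking to the UPS; the UPS name, mail host, and addresses are placeholders you'd swap for your own:

    import subprocess, time, smtplib
    from email.message import EmailMessage

    UPS = "myups@localhost"        # NUT name for the UPS -- placeholder
    MAILHOST = "mail.example.com"  # internal SMTP relay -- placeholder
    ONCALL = "oncall@example.com"  # pager/email gateway -- placeholder

    def ups_status():
        # Ask Network UPS Tools for the current status ("OL" = online, "OB" = on battery)
        out = subprocess.run(["upsc", UPS, "ups.status"], capture_output=True, text=True)
        return out.stdout.strip()

    def send_alert(status):
        msg = EmailMessage()
        msg["Subject"] = "UPS on battery: " + UPS
        msg["From"] = "ups-watch@example.com"
        msg["To"] = ONCALL
        msg.set_content("UPS status just changed to: " + status)
        with smtplib.SMTP(MAILHOST) as smtp:
            smtp.send_message(msg)

    was_on_battery = False
    while True:
        status = ups_status()
        on_battery = "OB" in status.split()
        if on_battery and not was_on_battery:   # alert once per event, not on every poll
            send_alert(status)
        was_on_battery = on_battery
        time.sleep(30)                          # poll every 30 seconds

    The commercial UPS management packages will do the same thing (plus a clean shutdown) out of the box, so this is really just for the duct-tape crowd.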
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • Kris Siegel
    Kris Siegel Posts: 309
    edited January 2008
    Jstas wrote: »
    Dude, get software that will page you or send you an email or something when the UPS jumps on in the off-hours so you don't run into a problem like that again. We have software at work that the operators monitor, and when something goes bad, the first thing they do is page the on-call person.
    This was my first thought as well. I have a large UPS at my home for my server and it has the ability to send me a notice regarding a power outage or even cleanly shut down my system.
  • wingnut4772
    wingnut4772 Posts: 7,519
    edited January 2008
    PolkThug wrote: »
    Don't get sucked in like Tron!



    Yes. Please don't.:D:D
    Sharp Elite 70
    Anthem D2V 3D
    Parasound 5250
    Parasound HCA 1000 A
    Parasound HCA 1000
    Oppo BDP 95
    Von Schweikert VR4 Jr R/L Fronts
    Von Schweikert LCR 4 Center
    Totem Mask Surrounds X4
    Hsu ULS-15 Quad Drive Subwoofers
    Sony PS3
    Squeezebox Touch

    Polk Atrium 7s on the patio just to keep my foot in the door.
  • danger boy
    danger boy Posts: 15,722
    edited January 2008
    nadams, WAKE UP! :p


    Sorry you had an emergency at work... I'm not in IT, but I get called in at all hours of the night sometimes too. I hate it.
    PolkFest 2012, who's going?
    Vancouver, Canada Sept 30th, 2012 - Madonna concert :cheesygrin:
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    Yeah, the home stuff is nice, but when you have a data center with multiple racks full of equipment, you usually have a central UPS system that powers the entire computer room. If you are going to do individual UPS systems for each server, and you have 10 racks of servers averaging about 15-20 servers per rack, you are not only talking about a logistics headache but also severely limiting your flexibility in where you can place systems and such.

    Having a central UPS system page you, when millions of dollars' worth of equipment and the company business that depends on it are at stake, is many times better than just shutting down systems for every power outage.

    Every hour your systems are down is an hour of work lost, productivity gone and costs rising through the roof. For a home system, the automatic shutdown is fine and dandy and works like a champ. When you are managing a data center for a profit-driven entity and downtime costs in more ways than one, shutting down every system for a power outage that takes a whole 10 minutes to come back on is pretty much regarded as unsatisfactory. Especially when it can take two hours to bring everything back up properly. A 10-minute power hiccup turns into a 2-hour cost event. Not good, nor is it a good way to ensure continued employment.

    Where I work, we have banks of 450kVA generators that kick on at a moment's notice and provide up to 12 hours of uptime before the fuel runs out. Each one is powered by a Cummins 855 TurboDiesel and has a 1,000 gallon fuel tank. Those are attached to equally large UPS systems that provide between a half hour and an hour of uptime to the systems once the fuel runs out. If we can't get power restored in 12-14 hours, there are probably bigger problems going on.
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • DollarDave
    DollarDave Posts: 2,575
    edited January 2008
    I can do you all one better. I have a 6.5 million dollar data center with a server room that is as nice as any. Dual power distribution units (with UPS), dual air handlers, a 300 kVA generator, a 150 kVA generator, Sapphire fire suppression, etc., etc. Servers out the wazoo and over 100TB of storage on EMC drive arrays. Everything is set up to alert someone if it fails. I sleep at night. Except this Monday night, that is - I always run the year-end processing for the bank and that took 20 hours. I was real tired by 6:30 yesterday afternoon...
  • Strong Bad
    Strong Bad Posts: 4,277
    edited January 2008
    We have a HUGE 1000 gallon CAT diesel generator out back of our building that kicks in when power goes off. It's designed to power our data centers, cooling units in the data centers, security locks on doors and a few lights.

    If anything happens, the boss and a few others get paged to get their arses in here! Thankfully I'm not on that list! :D

    Yep, get set up with an alert system that pages you and/or sends an email alert.

    John
    No excuses!
  • nadams
    nadams Posts: 5,877
    edited January 2008
    We will be reviewing our monitoring systems after this one. There are several outside companies that monitor their own equipment on our premises, and none of them contacted us when it went down. Our maintenance department had no idea anything was wrong...

    The generators, although they only run emergency lighting anyway (they're small, natural gas), never kicked on, because they were monitoring the two phases that were still up.

    It was really just an unfortunate series of events that could've been avoided, I agree. However, if it costs money to implement, it's a tough sell around here. Maybe when they get my overtime slip they'll think about it :)
    Ludicrous gibs!
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    DaveMuell wrote: »
    I can do you all one better. I have a 6.5 million dollar data center with a server room that is as nice as any. Dual power distribution units (with UPS), dual air handlers, a 300 kVA generator, a 150 kVA generator, Sapphire fire suppression, etc., etc. Servers out the wazoo and over 100TB of storage on EMC drive arrays. Everything is set up to alert someone if it fails. I sleep at night. Except this Monday night, that is - I always run the year-end processing for the bank and that took 20 hours. I was real tired by 6:30 yesterday afternoon...

    Why do you have to turn it into a pissing contest? You don't want to go there.

    If I'm goin' to get blamed for a pissing contest, that was not my intent. nadams had an issue where I have a great deal of insight. Others chimed in with equally valuable insight but due to the size and complexity of nadams' environment, not really applicable. Rather than being labeled an arrogant, know-it-all again, I gave good reason why. No harm meant. No intention of making anyone else feel bad.
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    nadams wrote: »
    However, if it costs money to implement, it's a tough sell around here. Maybe when they get my overtime slip they'll think about it :)

    That's the drag about IT. If you did your job right, it's invisible to the user no matter how much effort you have to put in to cobble it together. It won't be an issue until a catastrophic failure like you had where more than 50% of your systems go down and work grinds to a standstill.

    Think your overtime check is big? Tell them to tally up the wasted hours that however many programmers, engineers, and other users of the systems racked up. Then apply that to the bottom line for just the month of December. That's when things change and you get money and time to do what you need to fix the systems you've been ranting about for months. It's just a shame that it comes down to a failure to get them to change. 'Cause invariably, you will be the scapegoat in the end. IT is always a cost center in all businesses. Business types don't see value, only numbers. The only time they see value in IT is when it fails or is dealing with a failure and productivity drops like a stone and costs go up like an Atlas V. Modern business would be nothing without IT. So few beancounters actually understand that.
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • DollarDave
    DollarDave Posts: 2,575
    edited January 2008
    Jstas wrote: »
    Why do you have to turn it into a pissing contest? You don't want to go there.

    If I'm goin' to get blamed for a pissing contest, that was not my intent. nadams had an issue where I have a great deal of insight. Others chimed in with equally valuable insight but due to the size and complexity of nadams' environment, not really applicable. Rather than being labeled an arrogant, know-it-all again, I gave good reason why. No harm meant. No intention of making anyone else feel bad.


    No pissing contest at all. Just trying to show that in my environment IT is valued and the investment in it shows. I can see how you might see my post as one-upping yours, but I didn't mean it that way.
  • nadams
    nadams Posts: 5,877
    edited January 2008
    Jstas wrote: »
    That's the drag about IT. If you did your job right, it's invisible to the user no matter how much effort you have to put in to cobble it together. It won't be an issue until a catastrophic failure like you had where more than 50% of your systems go down and work grinds to a standstill.

    Think your overtime check is big? Tell them to tally up the wasted hours that however many programmers, engineers, and other users of the systems racked up. Then apply that to the bottom line for just the month of December. That's when things change and you get money and time to do what you need to fix the systems you've been ranting about for months. It's just a shame that it comes down to a failure to get them to change. 'Cause invariably, you will be the scapegoat in the end. IT is always a cost center in all businesses. Business types don't see value, only numbers. The only time they see value in IT is when it fails or is dealing with a failure and productivity drops like a stone and costs go up like an Atlas V. Modern business would be nothing without IT. So few beancounters actually understand that.

    I hear you, Jstas. Unfortunately, my overtime check won't be big enough to make them change a damn thing. The only thing we can do is sort it out internally as cheaply as possible.

    Some of the systems only stayed up 6 minutes before the UPSs failed (according to the event logs), but one was up for at least an hour on battery. All the UPSs are similar, and under similar distributed load, so that tells me we have some failing batteries, too.

    My best estimate is that 75% of our network was down for at least 14 hours, and the other 25% was inaccessible to anyone after the main switch went down. Only one of the servers was managed by its UPS and shut down gracefully... Luckily we had no hardware failures from the hard shutdowns.
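
    A quick way to put numbers on that battery suspicion (just a sketch, assuming the UPSs are visible through Network UPS Tools; the unit names below are made up) is to walk the units and compare their estimated runtimes under today's load:

    import subprocess

    # Hypothetical list of UPS units as NUT "name@host" pairs -- replace with your own
    UNITS = ["rack1-ups@ups-mon", "rack2-ups@ups-mon", "core-ups@ups-mon"]

    def nut_get(unit, var):
        # Query one NUT variable from one UPS and return it as a number
        out = subprocess.run(["upsc", unit, var], capture_output=True, text=True)
        return float(out.stdout.strip())

    readings = []
    for unit in UNITS:
        runtime_min = nut_get(unit, "battery.runtime") / 60   # estimated minutes on battery
        charge = nut_get(unit, "battery.charge")              # percent
        load = nut_get(unit, "ups.load")                      # percent of rated load
        readings.append((unit, runtime_min, charge, load))

    # Worst first, so the likely bad batteries float to the top
    for unit, runtime_min, charge, load in sorted(readings, key=lambda r: r[1]):
        print(f"{unit:25s} {runtime_min:6.1f} min runtime  {charge:5.1f}% charge  {load:5.1f}% load")

    Anything reporting only a few minutes of runtime at full charge and modest load is a battery that's due for replacement.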
    Ludicrous gibs!
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    nadams wrote: »
    Luckily we had no hardware failures from the hard shutdowns.

    Dodged a bullet there! :cool:
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • jwhitakr
    jwhitakr Posts: 568
    edited January 2008
    Jstas wrote: »
    That's the drag about IT. If you did your job right, it's invisible to the user no matter how much effort you have to put in to cobble it together. It won't be an issue until a catastrophic failure like you had where more than 50% of your systems go down and work grinds to a standstill.

    Think your overtime check is big? Tell them to tally up the wasted hours that however many programmers, engineers, and other users of the systems racked up. Then apply that to the bottom line for just the month of December. That's when things change and you get money and time to do what you need to fix the systems you've been ranting about for months. It's just a shame that it comes down to a failure to get them to change. 'Cause invariably, you will be the scapegoat in the end. IT is always a cost center in all businesses. Business types don't see value, only numbers. The only time they see value in IT is when it fails or is dealing with a failure and productivity drops like a stone and costs go up like an Atlas V. Modern business would be nothing without IT. So few beancounters actually understand that.

    I agree with a lot of what you said ... there are a LOT of upper management types (especially in finance) who have yet to grasp the value that IT brings to the table. But, I would say that it varies a whole lot between industries.

    For example, I think most financial service companies are way ahead of the curve in terms of recognizing and utilizing IT. I think those companies dedicate the needed funds to IT infrastructure and proactively use technology to their benefit. By contrast, I think a lot of manufacturing and traditionally "old school" companies are still playing catch-up when it comes to IT adoption. They try to nickel-and-dime their IT departments and don't invest any money until something goes wrong.

    That's been my experience, at least. YMMV. :p
    My HT
    HDTV: Panasonic PT-61LCX65 61" Rear Proj. LCD
    AVR: Harman Kardon AVR 235
    Video: 80GB PS3, Toshiba HD-XA1 HD DVD
    Fronts: Polk Audio RTi8
    Center: Polk Audio CSi3
    Amp: Emotiva LPA-1
    Surrounds: Polk Audio R150
    Sub: HSU STF-3


    The only true barrier to knowledge is the assumption that you already have it. - C.H. Dodd
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    DaveMuell wrote: »
    No pissing contest at all. Just trying to show that in my environment IT is valued and the investment in it shows. I can see how you might see my post as one-upping yours, but I didn't mean it that way.

    I'm glad your environment is well supported by your management and finance departments. You are the exception to the norm unfortunately. The smallest of the multiple datacenters my department manages has a price tag in the hundreds of millions. We still manage 98%+ uptimes but even with the cost, the environment is still under-supported financially.

    We have computing systems that will rival some of the biggest out there from the likes of IBM and such, and supercomputer-level clusters with processors numbering in the thousands, one in the tens of thousands. One of our database servers is an Oracle 10g RAC cluster with 90TB all by itself. The EMC equipment was costing so much money that we designed our own SAN using Apple Xserves and Veritas Cluster Services, and it has been more reliable and stable than the EMC. The biggest drawback is when we have a problem, we gotta fix it ourselves instead of calling EMC.

    But my point from the start is that cost doesn't matter. There is always something bigger and badder out there. You can have all the high-dollar equipment you want and it can look nice and neat and as pretty as you want but if you don't have good, dedicated people behind it, all that fanciness is about the same as putting makeup on a moose.

    nadams has a good datacenter. He had a problem, he found it and fixed it as fast as possible. Lucky for him no hardware failures and it happened over a holiday when next to no work was happening anyway. Losses were minimal and he had most of it back up and running by morning so when those who are in the profit centers came in, they weren't waiting for anything to happen.

    I can throw numbers around all day. We are a very large company here and our yearly desktop system budget dwarfs what most companies spend on IT as a whole. But I would honestly trade all of the high-dollar equipment for a staff of good, dedicated people who knew what they were doing and were creative in their solutions. You could run ancient equipment, and if you give me a good staff to support it, you'll rarely see downtime. When ya gotta argue just to get the money to buy a tank of diesel to make sure the generator stays running, there's problems there that you just don't need. Ya know?

    But everyone I talk to, every place I've been, it's the same story. You're lucky, Dave. You don't seem to have to deal with that, and you should be happy. I'd say half the stress of this job comes from dealing with the money and begging for it. If you get it readily, then your job isn't necessarily easy, but it's not nearly as difficult as it could be. So I'll ask you a question... you guys hiring? ;)
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    jwhitakr wrote: »
    I agree with a lot of what you said ... there are a LOT of upper management types (especially in finance) who have yet to grasp the value that IT brings to the table. But, I would say that it varies a whole lot between industries.

    For example, I think most financial service companies are way ahead of the curve in terms of recgonizing and utilizing IT. I think those companies dedicate the needed funds to IT infrastructure and proactively use technology to their benefit. By contrast, I think a lot of manufacturing and traditionally "old school" companies are still playing catch up when it comes to IT adoption. They try to nickel-and-dime their IT departments and don't invest any money until something goes wrong.

    That's been my experience, at least. YMMV. :p

    Oh, no, I agree completely! The reaction to IT depends on the industry. I think financial houses are big proponents of IT, but only because they have been using IT extensively for so long that they got burned by neglecting it a long time ago. One thing about finance people: they live on money and getting more of it. That is their whole job. When you hit a snag like nadams hit and it costs thousands to tens of millions in a matter of hours, well, that's a hell of a burr in the saddle. Finance people let that happen once and then, never again. So that's good there.

    But honestly, the driving force in IT is not finance. They exploit tested technologies. It's the houses that do research and development and try new things that make the biggest difference. Medical insurance is a huge factor in driving security. Thank the federal privacy and medical laws for that. Defense contractors are also driving security well past what the medical field is doing, but you won't see any of those innovations any time soon. Pharmaceuticals are driving database technology. One of the biggest sources of ideas for Oracle for the longest time was SmithKline Beecham. Oracle has also been working with defense contractors to constantly tighten up security on its products.

    It's those places where the IT departments have to fight the hardest for money. Mainly because those are the places with the innovative ideas to handle new issues that pop up due to innovative products that they are making. When you have to convince management that your idea has merit and then put a value on it and beg for that amount of money, it makes it just that much more difficult. Don't get me wrong, there is innovation everywhere. But finance guys are working with a fairly unchanging environment. Money is money and the only time rules really change is when laws change and they can find a loophole to make more money. When you're pushing the laws of physics and our understanding of the world around us, your IT department has to be adaptable. Starving them for management support and money makes it that much more difficult.

    But innovation comes in many forms. Not only is it about developing a new idea, but also about taking existing technology, applying it creatively to a complex problem, and developing a solution. Innovation comes more often when you have resources to support it. I've had the opportunity to do IT work for several different kinds of companies and departments, so my experience is varied. IT guys have to be a jack of all trades. You have to be technically astute and knowledgeable in your environment. You have to be a people person and know how to deal with users of all kinds. You have to have at least a basic idea of the kind of work your users are doing so you can understand their problems and give them effective solutions that they need and want. But above all, you've got to be a business guy, and that last part right there is most of the stress in IT. The rest is users! ;)
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • sucks2beme
    sucks2beme Posts: 5,600
    edited January 2008
    There's nothing like a big failure to drive you nuts. I've seen more than a few. Telecom here, but normally IS is scrambling around with me when this stuff happens. You'd be surprised how often UPS battery maintenance is a low priority. The only thing better than that is a good lightning strike. Or maybe fire protection tripping off. :D
    "The legitimate powers of government extend to such acts only as are injurious to others. But it does me no injury for my neighbour to say there are twenty gods, or no god. It neither picks my pocket nor breaks my leg." --Thomas Jefferson
  • nadams
    nadams Posts: 5,877
    edited January 2008
    I was a bit nervous about the halon system, myself. Power was dipping and spiking and setting off the alarm. Luckily, it reset itself when the power came back up. ADT never called to see what was going on, which is a little concerning.
    Ludicrous gibs!
  • sucks2beme
    sucks2beme Posts: 5,600
    edited January 2008
    I could go my whole life without having another "fire suppression" event!
    Ever have anyone dump the computer center emergency power shutoff? That's always good.
    I have seen several places where the button for this was right next to the door release.
    They could at least put a plexiglass cover over it!
    "The legitimate powers of government extend to such acts only as are injurious to others. But it does me no injury for my neighbour to say there are twenty gods, or no god. It neither picks my pocket nor breaks my leg." --Thomas Jefferson
  • DollarDave
    DollarDave Posts: 2,575
    edited January 2008
    Jstas wrote: »
    I'm glad your environment is well supported by your management and finance departments. You are the exception to the norm unfortunately. ....... So I'll ask you a question...you guys hiring? ;)

    It hasn't always been well supported in terms of physical environment. So, unfortunately, I have been in his situation and it sucked. We have grown through acquisition and one of them brought me this data center. We kept putting off the investment knowing that eventually one of the banks that we bought would have what we needed.

    Yes, we are hiring one additional person. I have a very small staff of IT professionals that supports over 1,500 users in 130 locations, so I can appreciate your comments about quality over quantity. Quantity only adds to our management burden. So, if you or anyone else wants to work in a fast-paced, non-micro-managed IT environment, PM me. There is always one drawback - our datacenter is located in Sugar Land, just southwest of Houston. So, if you enjoy cold weather, you won't like it here...

    Sorry for the thread derail.
  • Jstas
    Jstas Posts: 14,806
    edited January 2008
    sucks2beme wrote: »
    I could go my whole life without having another "fire suppression" event!
    Ever have anyone dump the computer center emergency power shutoff? That's always good.
    I have seen several places where the button for this was right next to the door release.
    They could at least put a plexiglass cover over it!

    That is nothing! I had a guy actually standing next to one ON THE PHONE and he was wrapping the phone cord AROUND THE BUTTON! When he pulled the phone cord off, EVERYTHING went down and the generator did not come on. I never heard silence so deafening and horrific in my life. Then my department lead yelled from across the room "WTF ARE YOU DOING!"

    We lost hardware in about 30 machines and had an entire 7.5 TB array go corrupt. It took 4 days to return to normalcy.

    Haven't had fire suppression yet. We did have an ice storm which put about 3 inches of ice everywhere and then snow on top with a 45 degree day the next day. We ended up with 6.5 inches of water in one of the server rooms over a weekend. A bunch of people brought in shop vacs from home and we had like 9 of them sucking water out of the raised floor for like 4 or 5 hours before facilities management showed up with a proper pump. Yeah, that was scary. Twelve 400+ kVA service leads from the municipal lines would have made half the structure of the building go live if the water flooded high enough. Even if the emergency cut-off tripped, it could still get high enough to flood the feeds from the street power and then we'd have to rely on the power company's trips. But then we'd take out an entire grid from that building anyway. 3 floors, each about 10,000 sq. ft. and the entire first floor was computer rooms and HVAC equipment.
    Expert Moron Extraordinaire

    You're just jealous 'cause the voices don't talk to you!
  • PolkWannabie
    PolkWannabie Posts: 2,763
    edited January 2008
    For critical applications it's best to have at least the following ...

    - Generator backup and fuel for at least 12 hours. 24 - 48 hours worth is better ... Better yet is independent building connections to multiple autonomous power grids with automatic failover.
    - Independent connections to multiple autonomous communication carriers with different infrastructure.
    - Full and automated failover of systems to other locations (a rough sketch of the detection side follows below). Once you have 3 or more cooperative data centers this is no longer cost prohibitive.
    - Fully functioning disaster recovery sites where procedures are tested regularly by those not particularly knowledgeable in the hardware, operating systems, or applications, so that when those who are get blown up ... business carries on ...
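
    On the automated-failover point, the detection side doesn't have to be exotic. Here's a deliberately toy sketch (every URL, script path, and threshold below is a placeholder, not a real setup) of the probe-then-trigger loop; the real work lives in whatever the trigger hands off to (DNS, a VIP, or the cluster manager):

    import time, subprocess, urllib.request

    PRIMARY_HEALTH_URL = "http://primary.example.com/health"   # placeholder health endpoint
    FAILOVER_CMD = ["/usr/local/bin/promote-secondary.sh"]     # hypothetical failover hook
    FAILURES_BEFORE_FAILOVER = 3                               # don't fail over on one blip
    POLL_SECONDS = 15

    def primary_healthy():
        # Probe the primary site's health URL; anything but a clean 200 counts as a failure
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
                return resp.getcode() == 200
        except Exception:
            return False

    failures = 0
    while True:
        if primary_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                # Hand off to whatever actually moves the service to the other site
                subprocess.run(FAILOVER_CMD)
                break
        time.sleep(POLL_SECONDS)

    The hard part, as the list says, is everything behind that trigger: the data already has to be at the other site, and somebody who isn't the resident expert has to have rehearsed flipping it.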
  • sucks2beme
    sucks2beme Posts: 5,600
    edited January 2008
    Jstas wrote: »
    That is nothing! I had a guy actually standing next to one ON THE PHONE and he was wrapping the phone cord AROUND THE BUTTON! When he pulled the phone cord off, EVERYTHING went down and the generator did not come on. I never heard silence so deafening and horrific in my life. Then my department lead yelled from across the room "WTF ARE YOU DOING!"
    We lost hardware in about 30 machines and had an entire 7.5 TB array go corrupt. It took 4 days to return to normalcy.
    I've seen the dump before. At that site it was called the million-dollar switch, since that was the cost of someone hitting it. The water one is always good. A couple of decades ago, I saw a flood like you described in a huge IBM computer center. That was on the second floor, and the water did find some holes to run out of. But what a mess!
    I've also seen a total power failure at American Airlines' data center.
    "The legitimate powers of government extend to such acts only as are injurious to others. But it does me no injury for my neighbour to say there are twenty gods, or no god. It neither picks my pocket nor breaks my leg." --Thomas Jefferson