Importance of Baselines

In my world, having a standard baseline across servers is important. For example, I am responsible for maintaining a few hundred servers spread across a geographical area. With a standard baseline across those servers, it's easy to apply updates, troubleshoot issues, and generally keep things running smoothly.

Well, we recently received a hardware refresh consisting of new servers, storage, and networking equipment. While installing the equipment, we made a system baseline for our sister sites. The theory is that every site would have the same hardware and the same software. If one site hits an issue or upgrades something, all sites apply the same change.

This is in stark contrast to the old way, where my site and the other sister sites were all islands of development and maintenance. Lessons learned didn't get shared, and when they did, they got morphed into site vs. site competition.

So one would think we are forming a standard baseline. Sadly, that idea, that concept, that goal has failed so far. The organization responsible for providing the hardware decided to change things between our site and the next. Where we have a Cisco 6500 switch, they have a 4500. Where we have a set of blades and another set of standalone servers, they have only blades. Where our blade chassis have one type of built-in switch, theirs have a different brand that supports a different set of features from the ones we are using.

So, in review: if your environment looks like this:

  • Multiple subnets/VLANs on each server for traffic segregation and security
  • Bonding/LACP at the server and between switches for redundancy/throughput

and you can't hire enough smart network engineers who understand trunking, bonding, and VLANs, then don't switch your hardware mid-game, and don't call the site that got it working to ask for help.

Written by ruckc

50% blade failure


So at work, we needed to have some work performed on one of the SANs. To do the work, we powered off the blade servers that use that SAN. After the SAN was repaired (by means of a cold boot), 16 of our 32 BL25p blades refused to power on. There was no pattern to it: no single blade enclosure, no unique software. When you pressed the power button, the blades just refused to turn on. It was more like they thought about turning on for a moment and then replied "Hell No." Luckily, we made do with some spares and reallocated blades from another similar setup elsewhere, but seriously, how does this happen?

According to HP, this is an issue we should have fixed by upgrading iLO back in the spring. You gotta love quality engineering.

Written by ruckc

The Beginning

This is, in theory, the beginning of the SQ blog. The intention is to describe my daily distractions from work. I currently work 7 days a week, 12 hours a day, so the distractions are sometimes frequent and many.

Written by ruckc
