Interruptions ITOps professionals are grateful to avoid

As we settle into the time of year when we reflect on what we are thankful for, we tend to focus on important basics like health, family and friends.

But on a professional level, IT operations (ITOps) practitioners are grateful to avoid catastrophic outages that can cause confusion, frustration, lost revenue and damaged reputations. The last thing an ITOps, network operations center (NOC) or site reliability engineering (SRE) team wants while eating turkey and enjoying time with family is to be contacted about an outage. Outages can be extremely expensive: $12,913 per minute, and up to $1.5 million per hour for larger organizations.

However, to understand the peace of mind that comes with avoiding downtime, you must have endured the pain and anxiety of an outage firsthand. Here are a handful of the horror stories ITOps pros are thankful to avoid this season.

A case of janky command structure

A long-time IT professional was on shift with three others early one evening when the crew received a notification of an issue affecting the front-end user interface of a global traffic manager. Luckily, there was a runbook for it in a database, so it looked like the problem would be solved quickly. One of the team members saw two things to type: a command and a secondary input. He entered the command and, based on how the runbook read, waited for the command line to prompt for input, such as “What do you want to restart?”


The way the command structure was set up, if you didn’t give an input, the device itself would restart. He typed what he thought was the correct command – “bigstart restart” – and the entire front-end global traffic manager was taken down.

Just a reminder, this took place in the early evening. The customer was a financial company, and the system went down right around the time businesses were closing and trying to do their books and other finance-related tasks. Terrible timing, to say the least.

Five minutes into the outage, the ITOps team realized what had happened: The tool they were using for their runbook wrapped text by default, so what looked like two separate commands was actually a single command. Although the outage was relatively short, it came at a critical time and created a chain reaction of headaches. The lesson? Make sure the command structure is optimized so that a missing input cannot take down an entire device.
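One way to guard against this class of mistake is to refuse to run a destructive command that is missing its target. The following is a minimal Python sketch, not part of any vendor tooling: `REQUIRES_TARGET` and `check_command` are hypothetical names, the service name in the usage note is only an example, and the sole assumption taken from the story is that “bigstart restart” with no input restarts everything.

```python
import shlex

# Hypothetical policy table: command prefixes that must name an explicit target.
REQUIRES_TARGET = {("bigstart", "restart")}


def check_command(line: str) -> list[str]:
    """Split a runbook line into arguments and reject it if a
    destructive command is missing its explicit target."""
    argv = shlex.split(line)
    for prefix in REQUIRES_TARGET:
        n = len(prefix)
        # The prefix matches and nothing follows it: no target was given.
        if tuple(argv[:n]) == prefix and len(argv) == n:
            raise ValueError(
                f"'{line.strip()}' has no target; refusing to run it. "
                "Check whether the runbook wrapped the line."
            )
    return argv
```

With this check in front of the shell, `check_command("bigstart restart httpd")` returns the parsed arguments, while `check_command("bigstart restart")` raises an error instead of passing the bare command through.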

When Google is your best friend in the middle of the night

For an IT veteran of over 15 years, what seemed like a quiet night shift quickly turned into an anxiety-filled nightmare. “I’ve never seen myself panic as quickly as when the remote terminal I was in suddenly ran out of space,” he said.

He was trying to restart a service on a remote machine, but he inadvertently disabled the network adapter in the process. Calling someone and waking them up in the middle of the night to tell them he’d knocked out a network adapter was less than ideal, so he and his teammates did some digging.

After what he calls “not an inconsiderable amount of Googling,” he was able to find his way to a Dell server and restarted the network adapter from there. It took longer than it should have to fix, but the problem was eventually resolved.

His pro tip: “Don’t disable the network adapter on a machine you’re remote controlling in the middle of the night.” It may sound obvious, but the underlying lesson is to have a contingency plan in place should something go terribly wrong.
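One common form of contingency plan for risky remote changes is an automatic rollback that fires unless the operator confirms they still have access. Below is a minimal Python sketch of that idea; `apply_with_rollback` and its parameters are hypothetical names, and a real version would have to run on the remote machine itself, detached from the operator's session, so that it survives a dropped connection.

```python
import threading
from typing import Callable


def apply_with_rollback(
    apply: Callable[[], None],
    rollback: Callable[[], None],
    timeout_s: float,
) -> Callable[[], None]:
    """Run `apply`, and run `rollback` after `timeout_s` seconds
    unless the returned confirm function is called first."""
    timer = threading.Timer(timeout_s, rollback)
    timer.daemon = True
    timer.start()        # arm the rollback before making the change
    apply()              # the risky change, e.g. reconfiguring an adapter
    return timer.cancel  # call this once access has been verified
```

If the change cuts the operator off, nobody calls the confirm function and the rollback restores the previous state on its own after the timeout.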

ITOps: Relying on email was great – until it wasn’t

When email was the main way NOC teams received alerts, one longtime IT pro recalls having a teammate whose sole job was essentially to monitor emails and create tickets: some for incidents that needed attention now, and others for those they could get to later. The system worked well, but at a large multinational company it was a ticking time bomb.

That fear was realized when the company’s entire data center went down.

That was a problem in itself, but the incident also generated so many email alerts that it crashed the company’s Outlook server. “At that point, you’re really blind,” the IT pro recalled.

The event happened to take place in the middle of the night, so the on-duty team reluctantly had to start waking up teammates. After the problem was finally solved, the team developed a sense of humor about it. As they recalled: “We used to joke that we DDoS ourselves with our own alarm noise. Good times!”
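An alert storm like this one is usually tamed by collapsing repeated alerts before they are delivered. The following Python sketch shows one simple way to do that, assuming each alert carries a source, a check name and a timestamp; the `Alert` fields and `collapse_alerts` are hypothetical and do not describe the tooling the team in this story used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # hypothetical field: host or service raising the alert
    check: str        # hypothetical field: name of the failing check
    timestamp: float  # seconds since an arbitrary starting point


def collapse_alerts(alerts: list[Alert], window_s: float) -> list[tuple[Alert, int]]:
    """Merge alerts with the same (source, check) that arrive within
    `window_s` of the first alert in their group.
    Returns (first_alert, count) pairs in order of first appearance."""
    groups: list[list] = []                      # each entry: [first_alert, count]
    open_group: dict[tuple[str, str], int] = {}  # key -> index into groups
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        idx = open_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][0].timestamp <= window_s:
            groups[idx][1] += 1                  # duplicate inside the window
        else:
            open_group[key] = len(groups)        # start a new group for this key
            groups.append([alert, 1])
    return [(first, count) for first, count in groups]
```

Sending one notification per group, with the count attached, keeps the volume proportional to the number of distinct problems instead of the number of raw alerts.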

In the end, the overall moral of the story is this: Every time a hand touches a keyboard, there is a risk that something could go wrong. Of course, this is unavoidable at times, but teams that are able to automate and simplify their IT operational processes as much as possible give themselves the best chance of avoiding costly outages – so they can enjoy their Thanksgiving celebration uninterrupted.

Mohan Kompella is vice president of product marketing at BigPanda.
