As a long-time IT Professional, the most important skill I have developed and consider to be my most valuable tool is Effective Troubleshooting. Step through my problem-solving procedure in this 10-step guide adapted from The Universal Troubleshooting Process.
Captain Jack Sparrow's infinite wisdom never ceases to amaze. His famous quote illustrated in the image above has really impacted the way I interpret "problem situations". In this field, we often find ourselves in reactive situations -- with staff, managers, and team leads (I could go on...) breathing down our necks while we try to figure out what happened and get it corrected STAT. It's stressful. We're fighting off the panic, caught between fight and flight -- we've had better times. If only there were some kind of checklist or guideline to follow in situations like these. What if there were a list of ten steps to help better organize our troubleshooting procedure? There is! It's right here on Experts Exchange, and you have front-row seats!
To begin, the strategy and steps I will be discussing are adapted from The 10 Step Universal Troubleshooting Process, the methodology I've chosen to follow. This process is not specific to Information Technology and could easily be modified to work for just about any profession or trade. In this piece, I will be discussing its application in Information Technology.
The 10 Steps to Effective Troubleshooting
1. Prepare
This COULD mean getting required tools ready or pulling out product documentation. Though valuable, these are not my primary focus in a troubleshooting scenario. The preparation stage for me is mental -- getting my attitude in check and shifting to a CCC mindset (cool, calm, collected). There IS a solution and I CAN fix this. Jumping straight into a situation without any preparation can be catastrophic. Rebooting a server the SECOND the Internet goes down isn't the best approach. Get yourself prepared to tackle the situation first, and you just might realize that your network cable was unplugged.
2. Outline a Damage Control Plan
We're often under huge amounts of pressure to just get things back up and running. It's a mad dash to the server room, and anyone in our way is simply getting trampled. Kick open the door and hear the fire alarm going off... grab the water bucket as outlined by the disaster-recovery steering committee and toss it into the server rack! Success! Just saved some lives. What you failed to realize was that it was just a fire drill, and now the backup tapes from last week are waterlogged. Alright, fine, that is a little extreme -- but the point is, critical data needs to be considered before jumping into corrective action. We know our networks, and we know what matters most. Consider the worst-case scenario for your actions, and have a backup plan.
3. Get the Symptom Description
This is an imperative step in the troubleshooting process. It might sound trivial -- of course we need to know what's happening, otherwise we don't know there's a problem! Though that may be true, it is also true that in our field, many issues have symptoms similar to those of another problem, and we need to distinguish them carefully. When you walk into your physician's office to report that your ear hurts, chances are they are not going to give you antibiotics without first checking for a foreign object.
Get as much information as possible. This is where a "script" or "flow chart" comes in handy for the front-line staff taking problem calls. The quality of the information passed to the people who will be troubleshooting directly affects how well the incident is handled.
Because this is an area that I feel is vitally important to the troubleshooting process, I urge you to read the full section here. In summary, however, the types of questions that need to be asked are as follows:
- When did it start happening?
- What else happened around that time?
- Any installations or configuration changes done around that time?
- The who, what, where, when, and why
I'll reiterate. The quality of the information gathered during this step has a direct impact on the end result.
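To show how these intake questions might be captured consistently, here is a minimal sketch of a structured checklist a front-line script could follow. The field names and the `collect_symptom_report` helper are illustrative assumptions, not part of any standard.

```python
# Hypothetical symptom-intake checklist for front-line call handlers.
# Field names and questions are illustrative assumptions, not a standard.

INTAKE_QUESTIONS = {
    "started_at": "When did it start happening?",
    "concurrent_events": "What else happened around that time?",
    "recent_changes": "Any installations or configuration changes around that time?",
    "who": "Who is affected, and where does the problem occur?",
}

def collect_symptom_report(answers):
    """Pair each intake question with its answer, flagging gaps explicitly."""
    report = {}
    for field, question in INTAKE_QUESTIONS.items():
        report[field] = {
            "question": question,
            "answer": answers.get(field, "NOT ASKED"),
        }
    return report

report = collect_symptom_report({"started_at": "09:15, after the patch window"})
```

The point of flagging unanswered questions as "NOT ASKED" is that gaps in the symptom description become visible to the person troubleshooting, rather than silently missing.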
4. Reproduce the Symptom
This one is simple, but still vital. You can't possibly begin to implement corrective measures without a full understanding of the problem. Using the information gathered in Step 3, try to re-create the issue so it can be witnessed first-hand. When that isn't possible, the alternative is to see it from the end user's perspective -- whether that means a walk to their office or a remote-assistance session, you should see for yourself what you're working with. Many of us, by nature, develop a more solid understanding of concepts and ideas through visual exposure than through reading or hearing. Having this first-hand knowledge of the problem will greatly aid in carrying out Step 5.
5. Take Corrective Action
This step is what I believe to be the easiest, strange as it may sound. Steps 1-4 have you collecting appropriate information, developing your contingency/back-out/backup plans, and generally just getting ready to tackle the task at hand -- solving the problem. Too often, IT administrators and technicians jump straight to this step, setting themselves up for a world of potential new issues, not to mention the lost time.
Based on the sound, detailed information you now have, this is where you make your best-informed decision about what the issue is and take the actions needed to resolve it.
6. Narrow it Down (Isolate the Root-Cause)
During this step, I perform final validation of what the problem was by reviewing the pertinent Event Logs, specific application logs, device logs, and so on, to pin down exactly what caused the issue. In the ITIL world, this step is critical to the Service Operation stage, as it is brought to the table for review and root-cause analysis. The general idea here is to work out a plan to A) stop this from occurring again and B) if it does happen again -- because technology is unpredictable -- decide how we can handle it better. That analysis is not done during this step, of course, because we still have work to do!
7. Replace or Repair Defective Equipment
The ugly truth of working in IT is that sometimes we have no clue about an issue until it explodes in our faces. This stage is about making sure you don't get blindsided again. If faulty equipment or a bad configuration was the issue, correct it or install a replacement device -- whatever you need to do to decrease the chance of a repeat problem.
8. Test
Once the fire is out and the smoke has settled, it's time to reflect on the incident and verify that the correct response to the problem was taken. Ask the following questions:
- Did the symptom go away?
- Did the right symptom go away?
- Did I fix the right cause?
- Did I create any other problems?
Having just dealt with a crisis, we're starting to feel relief, and users are back to work as usual. We don't want any unexpected surprises surfacing as a result of the incident, so asking ourselves these questions and performing the corresponding validation will help keep those users happy and will ultimately help prevent a relapse.
9. Take Pride
Though not directly linked to the troubleshooting process itself, this step is, indeed, vital. You've just been involved in a stressful situation, with people coming at you from every which way looking for updates and ETAs. Now that things are back up and running, take some time to talk about the incident. Tell your co-workers, managers, and team leads about the process you went through to arrive at the solution. Brag with your teammates -- respectfully, of course. In our field, the concept of Burnout is very real. (Burnout: physical or mental collapse caused by overwork or stress.) This step is a great way to help prevent it from happening to you -- a chance to gloat, a chance to feel great about getting to the bottom of things. Always take this "debrief" period for your own mental sanity -- these situations are nerve-racking.
10. Prevent Future Occurrence
This stage is all about communication and documentation. Document the issue, including initial symptoms, affected areas, affected systems, and any other pertinent details. Document your corrective measures and your root-cause analysis (if completed). Meet with your team and discuss the findings so that everyone is on the same page. Make sure there is plenty of supporting detail -- enough that you are confident that, should this issue recur, your colleagues would have a much easier time with diagnosis and resolution.
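One lightweight way to keep that documentation consistent is a fill-in template. The section names and the `render_incident_report` helper below are just an example structure, not a formal post-incident standard.

```python
# Hypothetical incident write-up template; section names are
# illustrative, not a formal post-incident standard.

TEMPLATE = """\
Incident Report: {title}
Date: {date}

Initial Symptoms:
{symptoms}

Affected Systems:
{systems}

Corrective Measures:
{actions}

Root-Cause Analysis:
{root_cause}
"""

def render_incident_report(**fields):
    """Fill the template, marking any missing section as TBD."""
    defaults = {k: "TBD" for k in
                ("title", "date", "symptoms", "systems", "actions", "root_cause")}
    defaults.update(fields)
    return TEMPLATE.format(**defaults)

doc = render_incident_report(title="Mail outage", symptoms="Users unable to send")
```

Leaving unfinished sections marked "TBD" keeps the gaps visible, so a half-written report still tells the next person exactly what remains to be documented.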
The 10 steps outlined above were adapted from The Universal Troubleshooting Process, the methodology I have followed throughout my career, and indeed throughout my adult life. I have held many roles in the IT field since graduating from college, and in each of those roles I have seen "cart-before-horse" troubleshooting VERY often. The audience for this article is unrestricted. It does not matter whether you are an Expert, a novice, an intermediate, or a plumber. These steps can be applied to any industry or field, which is one of the things I like most about them.
I am extremely grateful to my audience for taking the time to read through this article. I only hope that I am able to give back to you by having this article and The 10 Steps to Effective Troubleshooting pop into your head the next time the fire alarm in your server room engages.