Monday, May 18, 2009

Problem Solving


System problems fall into several categories. The first category is difficult to describe and even more difficult to track down. For lack of a better word, I am going to use the word "glitch." Glitches are problems that occur infrequently and under circumstances that are not easily repeated. They can be caused by anything from users with fat fingers to power fluctuations that change the contents of memory.

Next are special circumstances in software that are detected by the CPU while it is in the process of executing a command. I discussed these briefly in the section on kernel internals. These problems are traps, faults, and exceptions, including such things as page faults. Many of these events are normal parts of system operation and are therefore expected. Other events, like following an invalid pointer, are unexpected and will usually cause the process to terminate.

Kernel Panics

What if the kernel causes a trap, fault, or exception? As I mentioned in the section on kernel internals, there are only a few cases when the kernel is allowed to do this. If this is not one of those cases, the situation is deemed so serious that the kernel must stop the system immediately to prevent any further damage. This is a panic.

When the system panics, using its last dying breath, the kernel runs a special routine that prints the contents of the internal registers onto the console. Despite the way it sounds, if your system is going to go down, this is the best way to do it. The rationale behind that statement is that when the system panics in this manner, at least there is a record of what happened.

If the power goes out on the system, it is not really a system problem, in the sense that it was caused by an outside influence, similar to someone pulling the plug or flipping the circuit breaker (which my father-in-law did to me once). Although this kind of problem can be remedied with a UPS, the first time the system goes down before the UPS is installed can make you question the stability of your system. There is no record of what happened and unless you know the cause was a power outage, it could have been anything.

Another annoying situation is when the system just "hangs." That is, it stops completely and does not react to any input. This could be the result of a bad hard disk controller, bad RAM, or an improperly written or corrupt device driver. Because there is no record of what was happening, trying to figure out what went wrong is extremely difficult, especially if this happens sporadically.

Because a system panic is really the only time you can easily track down the problem, I will start there. The first thing to think about is that as the system goes down, it does two things: writes the registers to the console screen and writes a memory image to the dump device. The fact that it does this as it's dying makes me think that this is something important, which it is.

The first thing to look at is the instruction pointer. This is actually composed of two registers: the CS (code segment) and EIP (instruction pointer) registers. Together they identify the instruction that the kernel was executing at the time of the panic. By comparing the EIP of several different panics, you can make some assumptions about the problem. For example, if the EIP is consistent across several different panics, the system was executing the same piece of code every time it panicked. This usually indicates a software problem.

On the other hand, if the EIP changes from one panic to the next, then probably no one piece of code is the problem, and a hardware problem is more likely. This could be bad RAM or something else. Keep in mind, however, that a hardware problem could cause repeated EIP values, so this is not a hard-and-fast rule.
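The EIP comparison described above can be sketched in a few lines of Python. This is only an illustration, not a diagnostic tool: the function name `classify_panics` and the 60 percent threshold are assumptions made for the example, and the EIP values are ones you would have copied from the console by hand after each panic.

```python
from collections import Counter

def classify_panics(eip_values, threshold=0.6):
    """Return a rough hint based on how often the same EIP recurs.

    eip_values: EIP register values (ints) noted after each panic.
    threshold:  share of panics that must hit the same EIP before we
                call it "consistent" (an arbitrary assumption).
    """
    if len(eip_values) < 2:
        return "not enough panics to compare"
    most_common_eip, count = Counter(eip_values).most_common(1)[0]
    if count / len(eip_values) >= threshold:
        return f"EIP {most_common_eip:#010x} recurs: suspect software first"
    return "EIP varies: suspect hardware (for example RAM) first"

# Three of four panics at the same address point toward software.
print(classify_panics([0xC0123456, 0xC0123456, 0xC0123456, 0xC0ABCDEF]))
```

The result is a starting point only. As noted above, bad RAM under a fixed kernel location can produce the same EIP every time.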

The problem with this approach is that the kernel is generally loaded the same way all the time. That is, unless you change something, it will occupy the same area of memory. Therefore, it's possible that bad RAM makes it look as though there is a bad driver. The way to verify this is to change where the kernel is physically loaded. You can do this by rearranging the order of your memory chips.

Keep in mind that this technique may not tell you which SIMM is bad, but only indicate that you may have a bad SIMM. The only sure-fire test is to swap out the memory. If the problem goes away with new RAM and returns with the old RAM, you have a bad SIMM.

Getting to the Heart of the Problem

Okay, so we know what types of problems can occur. How do we correct them? If you have a contract with a consultant, this might be part of that contract. Take a look at it and read it. Sometimes the consultant is not even aware of what is in his or her own contract. I have talked to customers who have had a consultant charge them for maintenance or repair of hardware, insisting that it was an extra service. However, the customer could whip out the contract and show the consultant that these services were included.

If you are not fortunate enough to have such an expensive support contract, you will obviously have to do the detective work yourself. If the printer catches fire, it is pretty obvious where the problem is. However, if the printer just stops working, figuring out what is wrong is often difficult. I like to think of problem solving the way Sherlock Holmes described it in The Seven Percent Solution (and maybe other places):

"Eliminate the impossible and whatever is left over, no matter how improbable, must be the truth."

Although this sounds like a basic enough statement, it is often difficult to know where to begin to eliminate things. In simple cases, you can begin by eliminating almost everything. For example, suppose your system was hanging every time you used the tape drive. It would be safe at this point to eliminate everything but the tape drive. So, the next big question is whether it is a hardware problem or not.

Potentially, that portion of the kernel containing the tape driver was corrupt. In this case, simply rebuilding the kernel may be enough to correct the problem, because relinking links in a fresh copy of the driver. If that is not sufficient, then restoring the driver from the distribution media is the next step. However, depending on how easily you can get at the media, checking the hardware first might be easier.

If this tape drive requires its own controller and you have access to another controller or tape drive, you can swap components to see whether the behavior changes. However, just as you don't want to install multiple pieces of hardware at the same time, you don't want to swap multiple pieces. If you do and the problem goes away, how do you know whether it was the controller or the tape drive? If you swap out the tape drive and the problem goes away, that would indicate that the problem was in the tape drive. However, does the first controller work with a different tape drive? You may have two problems at once.
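The reasoning behind swapping one component at a time can be written down as a small decision table. The sketch below assumes you have run exactly two tests, each changing a single part; the function name `interpret_swaps` and the wording of the conclusions are made up for this illustration.

```python
def interpret_swaps(spare_drive_ok, spare_controller_ok):
    """Interpret two single-component swap tests.

    spare_drive_ok:      original controller + spare tape drive works
    spare_controller_ok: spare controller + original tape drive works
    """
    if spare_drive_ok and not spare_controller_ok:
        return "original tape drive is the likely fault"
    if spare_controller_ok and not spare_drive_ok:
        return "original controller is the likely fault"
    if spare_drive_ok and spare_controller_ok:
        return "each part works with another partner: suspect the pairing, cabling, or settings"
    return "both swaps fail: possibly two faults at once, or the cause is elsewhere"

# Only the drive was swapped and the problem went away.
print(interpret_swaps(spare_drive_ok=True, spare_controller_ok=False))
```

If both parts had been swapped in a single step, only one combined result would exist and none of these distinctions could be drawn.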

If you don't have access to other equipment that you can swap, there is little that you can do other than verify that it is not a software problem. I have had at least one case while in tech support in which a customer called in, insisting that our driver was broken because he couldn't access the tape drive. Because the tape drive worked under DOS and the tape drive was listed as supported, either the documentation was wrong or something else was. Relinking the kernel and replacing the driver had no effect. We checked the hardware settings to make sure there were no conflicts, but everything looked fine.

Well, we had been testing it using tar the whole time because tar is quick and easy when you are trying to do tests. When we ran a quick test using cpio, the tape drive worked like a champ. When we tried outputting tar to a file, it failed. Once we replaced the tar binary, everything worked correctly.
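The cross-check in this story (same tool to a plain file, different tool to the same device) can be sketched as follows. The helper names `command_succeeds` and `diagnose` are hypothetical, and the conclusions are deliberately worded as suspicions rather than verdicts.

```python
import subprocess

def command_succeeds(argv):
    """Run a command and report whether it exited with status 0."""
    try:
        result = subprocess.run(argv, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except OSError:
        return False  # command is missing or cannot be executed
    return result.returncode == 0

def diagnose(tool_to_file_ok, other_tool_to_device_ok):
    """Interpret two cross-checks made after a tool fails on a device."""
    if other_tool_to_device_ok and not tool_to_file_ok:
        return "suspect the tool itself (for example a damaged binary)"
    if tool_to_file_ok and not other_tool_to_device_ok:
        return "suspect the device, its driver, or the hardware settings"
    if tool_to_file_ok and other_tool_to_device_ok:
        return "suspect how this tool is used with this device"
    return "both checks fail: look for something the tests have in common"

# The case above: cpio reached the tape, tar could not even write a file.
print(diagnose(tool_to_file_ok=False, other_tool_to_device_ok=True))
```

On a real system the first check might be something like `command_succeeds(["tar", "cf", "/tmp/test.tar", "/etc/hosts"])`, where both paths are just placeholders for a scratch file and a small file to archive.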

If the software behaves correctly, there is still the potential for hardware conflicts. These normally occur only when you add something to the system. If you have been running for some time and suddenly the tape drive stops working, then it is unlikely that there are conflicts, unless, of course, you just added some other piece of hardware. If problems arise after you add hardware, remove it from the kernel and see whether the problem goes away. If it doesn't go away, remove the hardware physically from the system.

Another issue that people often forget is cabling. It has happened to me a number of times when I had a new piece of hardware and after relinking and rebooting, something else didn't work. After removing it again, the other piece still didn't work. What happened? When I added the hardware, I loosened the cable on the other piece. Needless to say, pushing the cable back in fixed my problem.

I have also seen cases in which the cable itself is bad. One support engineer reported a case to me in which just pin 8 on a serial cable was bad. Depending on what was being done, the cable might work. Needless to say, this problem was not easy to track down.

Potentially, the connector on the cable is bad. If you have something like SCSI, on which you can change the order on the SCSI cable without much hassle, this is a good test. If you switch hardware and the problem moves from one device to the other, this could indicate one of two things: either the termination or the connector is bad.

If you do have a hardware problem, it is often the result of a conflict. If your system has been running for a while and you just added something, it is fairly obvious what is causing the conflict. If you have trouble installing, it is not always as clear. In such cases, the best thing is to remove everything from your system that is not needed for the install. In other words, strip your machine to the "bare bones" and see how far you get. Then add one piece at a time so that once the problem re-occurs, you know you have the right piece.
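The "bare bones, then add one piece at a time" procedure is simple enough to sketch in Python. Everything here is hypothetical: `find_first_bad` is a name invented for the example, and `system_works` stands in for whatever manual test you perform (an install attempt, a reboot) after each addition.

```python
def find_first_bad(components, system_works):
    """Add components one at a time; return the first that breaks the system.

    components:   ordered list of optional parts to add back
    system_works: callable that takes the list of installed parts and
                  returns True if the system still behaves
    """
    installed = []
    if not system_works(installed):
        return None, "bare-bones system already fails"
    for part in components:
        installed.append(part)
        if not system_works(installed):
            return part, f"problem re-occurs after adding {part}"
    return None, "all components installed without problems"

# Simulated machine in which the sound card causes the trouble.
works = lambda parts: "sound card" not in parts
print(find_first_bad(["network card", "sound card", "scanner"], works))
```

Because only one part changes between tests, the part added just before the first failure is the one to look at. Note that it may be conflicting with something added earlier rather than being faulty itself.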

As you try to track down the problem yourself, examine the problem carefully. Can you tell whether there is a pattern to when and/or where the problem occurs? Is the problem related to a particular piece of hardware? Is it related to a particular software package? Is it related to the load that is on the system? Is it related to the length of time the system has been up? Even if you can't tell what the pattern means, the support representative probably has one or more pieces of information to help track down the problem. Did you just add a new piece of hardware or software? Does removing it correct the problem? Did you check to see whether there are any hardware conflicts such as base addresses, interrupt vectors, and DMA channels?
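Checking for conflicting base addresses, interrupts, and DMA channels amounts to looking for a setting claimed by more than one device. Here is a minimal sketch, assuming you have already written down each device's settings (on Linux, /proc/interrupts, /proc/ioports, and /proc/dma are places to look). The device names and values are invented for the example.

```python
from collections import defaultdict

def find_conflicts(devices):
    """Report settings claimed by more than one device.

    devices: mapping of device name -> dict with optional keys
             'io_base', 'irq', and 'dma'.
    """
    claims = defaultdict(list)
    for name, settings in devices.items():
        for resource in ("io_base", "irq", "dma"):
            if resource in settings:
                claims[(resource, settings[resource])].append(name)
    return {key: sorted(names) for key, names in claims.items()
            if len(names) > 1}

# Hypothetical settings noted from jumpers and configuration files.
devices = {
    "tape":  {"io_base": 0x300, "irq": 5, "dma": 1},
    "net":   {"io_base": 0x300, "irq": 10},
    "sound": {"io_base": 0x220, "irq": 5, "dma": 1},
}
for (resource, value), names in find_conflicts(devices).items():
    print(resource, value, "claimed by", ", ".join(names))
```

Treat the output as a list of leads rather than proof. Some buses allow devices to share an interrupt legitimately, so an overlap is worth checking but is not automatically the cause.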

I have talked to customers who were having trouble with one particular command. They insisted that it did not work correctly and that there was therefore a bug in either the software or the doc. Because they were reporting a bug, we allowed them to speak with a support engineer even though they did not have a valid support contract. They kept saying that the documentation was bad because the software did not work the way it was described in the manual. After pulling some teeth, I discovered that the doc the customers used was for a product that was several years old. In fact, there had been three releases since then. They were using the latest software, but the doc was from the older release. No wonder the doc didn't match the software.

Collect information
Instead of a simple list, I suggest you create a mind map. Your brain works in a non-linear fashion, and unlike a simple list, a mind map helps you gather and analyse information the way your brain actually works.
Work methodically and stay on track
Unless you have a very specific reason, don't jump to some other area before you complete the one you are working on. It is often a waste of time, not because that other area is not where the problem is, but because "finding yourself" again in the original test area almost always requires a little bit of extra time ("Now where was I?"). Let your test results in one area guide you to other areas, even if that means jumping somewhere else before you are done. But make sure you have a reason.
Split the problem into pieces
Think of a chain that has a broken link. You can tie the end onto something, but when you pull, nothing happens. Each link needs to be examined individually. Also, the larger the pieces, the easier it is to overlook something.
Keep track of where you have been
"Been there, done that." Keep a record of what you have done or tested and what the results were. This can save a lot of time with complex problems that involve many different components.
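A record of what you have tested does not need to be fancy. As one possible approach, the sketch below appends one CSV row per test and lets you ask whether a given test has already been done. The function names and the column layout (time, area, action, result) are assumptions made for this example.

```python
import csv
import io
from datetime import datetime

def record_test(log, area, action, result, when=None):
    """Append one row describing a test and its outcome to an open log."""
    when = when or datetime.now()
    csv.writer(log).writerow(
        [when.isoformat(timespec="seconds"), area, action, result])

def already_tested(log_text, area, action):
    """Return True if this area/action pair already appears in the log."""
    rows = csv.reader(io.StringIO(log_text))
    return any(row[1:3] == [area, action] for row in rows)

# An in-memory log; a real one would be a file opened in append mode.
log = io.StringIO()
record_test(log, "tape", "swapped controller", "still hangs")
record_test(log, "tape", "replaced tar binary", "works")
print(already_tested(log.getvalue(), "tape", "swapped controller"))
```

A plain text file or a paper notebook does the same job. The point is only that each entry pairs what was done with what happened.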
Listen to the facts
One key concept I think you need to keep in mind is that appearances can be deceiving. The way the problem presents itself on the surface may not be the real problem at all. Especially when dealing with complex systems like Linux or networking, the problem may be buried under several different layers of "noise". Therefore, you should try not to make too many assumptions, and if you do, verify those assumptions before you go wandering off on the wrong path. Generally, if you can figure out the true nature of the problem, then finding the cause is usually very easy.
Be aware of all limitations and restrictions
Maybe what you are trying to do is not possible given the current configuration or hardware. For example, maybe there is a firewall rule which prevents two machines from communicating. Maybe you are not authorized to use resources on a specific machine. You might be able to see a machine using some tools (e.g. ping) but not with others (e.g. traceroute).
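One reason tools can disagree is that they use different protocols, and a firewall may pass one while blocking another. Ping uses ICMP, so a successful ping says nothing about whether a particular TCP service is reachable. The sketch below tests a single TCP port directly; `tcp_port_reachable` is a name chosen for this example, and the host and port in the usage line are placeholders.

```python
import socket

def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False

# Placeholder target: is anything accepting connections on local port 22?
print(tcp_port_reachable("127.0.0.1", 22, timeout=1.0))
```

A False result cannot tell a firewall rule apart from a service that simply is not running, so it narrows the question rather than answering it.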
Read what is in front of you
Pay particular attention to error messages. I have had "experienced" system administrators report problems to me and say that there was "some error message" on the screen. It's true that many errors are vague or come from the last link in the chain, but more often than not they provide valuable information. This also applies to the output of commands. Does the command report the information you expect it to?
Keep calm
Getting upset or angry will not help you solve the problem. In fact, just the opposite is true. You begin to be more concerned with your frustration or anger and forget about the true problem. Keep in mind that if the hardware or software were as buggy as you now think it is, the company would be out of business. (Obviously, that statement does not apply to Microsoft products.) It's probably one small point in the doc that you skipped over (if you even read the doc), or something else in the system is conflicting. Speaking from experience, getting upset can also cause you to miss some of the details you are looking for.
Recreate the problem
As in many branches of science, you cause something to happen and then examine both the cause and the results. This not only verifies your understanding of the situation, it also helps prevent wild goose chases. Users with little or no technical experience tend to overdramatize problems. This often results in comments like "I didn't do anything. It just stopped working." By recreating the problem yourself, you have ensured that the problem does not exist between the chair and the keyboard.
Stick with known tools
There are dozens (if not hundreds) of network tools available. The time to learn about their features is not necessarily when you are trying to solve a business-critical problem. Find out what tools are already available and learn how to use them. I would also recommend using the tools that are available on all machines (or at least as many as possible). That way you don't need to spend time learning the specifics of each tool.
Don't forget the obvious
Cables can accidentally get kicked out or damaged. I have seen cases where the cleaning crew turned off a monitor and the next day the user reported that the computer didn't work because the screen was blank.
