Repeat after me: When a problem becomes very strange, I will always check the basics.
Not that this mantra prevents me from routinely forgetting it in the middle of some crazy problem. Too often I’m trying to fix a problem, iterating over different tests and hypotheses, until the results stop making sense. How can this configuration work perfectly here, but not at all on a similar machine? How do I have visibility into this part of my network but not that part? Why does it show up intermittently? Just as I’ve exhausted every possibility, my mantra finally comes to my weary and exhausted brain: “Test your assumptions.”
Assumptions often get skipped. “Oh, that is definitely working!” If you’re troubleshooting in a group, it’s even more likely. “Is this working?” “Of course it is.” It can be socially risky to ask about certain things. For instance, when I did app support, you’d swear that everyone’s firewall was always configured flawlessly. If you even asked about a firewall, you’d get an earful.
The funny thing is that the likelihood of a software problem resulting from a misconfigured firewall was directly proportional to how fiercely someone denied it. Often the real task was getting the customer to actually look. When they did, very commonly you’d get a sheepish “Oh yeah, I guess it was” and everything would work. Sometimes people would get so stuck in their assumptions that they’d do anything to avoid checking. They’d yell, question your competence, demand to speak to someone else. Imagine what it’s like to make all this noise and then find that your problem is the one you insisted was not the problem, could not be the problem.
And it happens all the time! It happens to me.
In tech, we value competence highly. We should. But our toxic addiction to confusing rightness with competence is a blind spot. Expecting everyone to know the answer to every question and to do everything perfectly creates some of the weirdest working environments I’ve ever seen. It has made problems that should take 5 minutes to fix take 5 days, all because we are culturally trained in tech to believe that if we are competent, we don’t have to question our assumptions.
The problem with that lies in the fact that though we work with computers, we are not computers; we are beautiful, fallible little engines and we should embrace that. When we don’t, our assumptions turn our inevitable errors into damaging blows against us. We need to make assumptions to move forward (we can’t give deep thought to everything in front of us), so making assumptions isn’t the crime. The crime is not circling back when there could be a problem. Normally the cost of checking is so minimal that it’s worth it to look. The cost of not doing so can be really high.
I’m having a network issue. Quick, before I do anything else: is the NIC working on this machine? This is the type of thing we often assume is true, and don’t check until we’ve exhausted the higher-level options, finding only strange, perplexing results from our tests.
By often, of course, I mean this past weekend. By we, I mean me. The machine was in our DMZ and not in any of our monitoring systems yet, so I didn’t get any warning signs like I’d expect to.
My solution, of course, is to attempt to automate checks. When I’m having an issue, I’d like to be able to run a quick script that checks very common things and gives me quick feedback on the items in question. If I can make assumptions even more painless to test, I’m more willing to test them, and it’s easier to convince someone with heavy assumptions of their own. “This will take a minute and answer some basic questions, just in case.”
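To make that concrete, here is a rough sketch of what such a script could look like. It is not the script I actually run; it is a minimal illustration using only the Python standard library. The interface name eth0, the hostname example.com, and the function names are all placeholders I made up for the example, and the interface check assumes a Linux machine that exposes /sys/class/net.

```python
import socket
from pathlib import Path


def check_interface_up(iface, sys_net=Path("/sys/class/net")):
    """Return True if a Linux interface reports operstate 'up'.

    Assumes Linux sysfs; on other systems this simply returns False.
    """
    try:
        return (sys_net / iface / "operstate").read_text().strip() == "up"
    except OSError:
        return False


def check_dns(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run (label, function) pairs; a check that raises counts as failed."""
    results = []
    for label, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append((label, passed))
    return results


def format_report(results):
    """Render one '[OK] label' or '[FAIL] label' line per check."""
    return "\n".join(
        f"[{'OK' if passed else 'FAIL'}] {label}" for label, passed in results
    )


if __name__ == "__main__":
    # Placeholder targets: swap in your own interface, hosts, and ports.
    basics = [
        ("interface eth0 is up", lambda: check_interface_up("eth0")),
        ("DNS resolves example.com", lambda: check_dns("example.com")),
        ("TCP 443 reachable on example.com",
         lambda: check_tcp("example.com", 443)),
    ]
    print(format_report(run_checks(basics)))
```

The point of keeping each check as a tiny function is that adding the next embarrassing assumption to the list costs one line, and the report is short enough to paste into a chat with whoever swears their firewall is fine.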
How do you circumvent or avoid assumptions in your troubleshooting?