Translated

talk/20220118 How curiosity helped me solve a hardware problem.md
toknow-gh 2023-02-20 19:41:49 +08:00
parent 72b8cae485
commit 341f47405e
2 changed files with 99 additions and 101 deletions

@@ -1,101 +0,0 @@
[#]: subject: "How curiosity helped me solve a hardware problem"
[#]: via: "https://opensource.com/article/22/1/troubleshoot-hardware-sysadmin"
[#]: author: "David Both https://opensource.com/users/dboth"
[#]: collector: "lujun9972"
[#]: translator: "toknow-gh"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
How curiosity helped me solve a hardware problem
======
Curiosity fuels the quest for knowledge and truth, whether it's about
hardware, open source software, programming, building a PC, optimizing
settings, or just learning a new application.
![Puzzle pieces coming together to form a computer screen][1]
I typically have a dozen computers up and running on my home network—yes, 12. And I am responsible for several more in other locations. With so many computers, there are always failures of various types, and I ultimately diagnose many of them as hardware problems. But it can be difficult to diagnose which hardware component is causing the issue.
Just this week, I had a perplexing problem on my primary workstation, and I misdiagnosed its cause twice. This article takes you through the process I followed. I show you where and why I went down the wrong path and how easy it can be to do so.
### The first symptoms
I have been working on several projects. Recently, I had many applications open on multiple desktops and was just starting to work when the display went blank. Most (not all) of the fans in my primary workstation came to a stop, and I sucked in a deep breath. I'd never seen anything quite like this before, but I did know that my system was in trouble.
There were two primary clues I had to work with. The display went dark, and several fans had stopped. However, the front-panel power and disk activity LEDs were still on, although at a lower brightness level than usual. Most of the decorative RGB LED lights on my motherboard, memory DIMMs, and fans also went out.
I tried the power and reset buttons with no results. I turned off the power supply directly using the PSU rocker switch. Powering it back on resulted in the same set of symptoms.
### Initial thoughts
These symptoms and decades of experience with all kinds of failures pointed me to the power supply.
I removed the power supply and used my PSU tester to check it. The tester indicated that the PSU was good, and all voltages were within specs. However, I knew the tester could be wrong. PSU testers do not test under full load conditions such as those that exist when the computer is running and drawing a few hundred watts of power. I went with my gut and installed my spare 1000W power supply.
With an average of 12 computers in my home network, I have learned to keep plenty of spare parts on hand. It saves me the frustration of running to the local computer store or ordering online and waiting for delivery when things break, and things are always breaking with that many computers around.
That replacement power supply solved the problem despite the result I got from the PSU tester. Even though the tester has been correct many times in the past, my experience, my knowledge, and my gut told me differently.
Unfortunately, my gut instinct was wrong.
### Second thoughts
My workstation was exhibiting the same symptoms again. It is very unlikely that two different PSUs would fail exactly the same way.
Next idea: It had to be the motherboard. I don't keep spare motherboards around, so I ordered a new one online and figured that I could use extra memory I already had and move the CPU to the new motherboard along with its all-in-one liquid cooling unit.
### Disciplined troubleshooting
The new motherboard would take a couple of days to arrive, so I decided to prepare by removing the old one from the workstation. But before I unplugged the power feeds to the motherboard, my curiosity took over and forced me to power on the system with only the motherboard, CPU, and memory installed. I had disconnected everything else.
Good troubleshooting demands that you isolate all potential variables, and all I'd done so far was test the PSU. I had to test every component.
This process required me to disconnect the front panel cables for sound and the dashboard media panel that includes various USB, SATA, and memory card slots.
With just the motherboard connected, I got a surprise: Everything worked as normal!
The computer itself wouldn't boot because there were no connected storage drives, and nothing was displayed because I had removed the display adapter. But there were no symptoms of either power or motherboard failure. That piqued my curiosity even more. If the motherboard were truly bad, the symptoms would still exist.
So I started a sequence of powering off, reinstalling one of the removed components, and powering back on.
It turns out that the front panel media dashboard caused the symptoms.
I removed the media dashboard and plugged everything else back in. My workstation booted up properly and performed as expected. I had identified the culprit.
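The one-part-at-a-time sequence described above can be sketched in Python. This is only an illustration under stated assumptions, not something from the original article: the function name `find_faulty_part`, the part names, and the `simulated_power_on` check are all hypothetical. In practice the check is a person pressing the power button and watching the fans and LEDs.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical part names, loosely based on the parts mentioned in the article.
BASELINE: List[str] = ["motherboard", "CPU", "memory"]
REMOVED: List[str] = [
    "storage drives",
    "display adapter",
    "front-panel audio cable",
    "media dashboard",
]


def find_faulty_part(
    baseline: Sequence[str],
    removed: Sequence[str],
    shows_fault: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Reinstall removed parts one at a time; return the first part that brings the fault back."""
    installed = list(baseline)
    if shows_fault(installed):
        # The minimal system already fails, so the fault is in the baseline itself.
        raise ValueError("fault is present in the baseline configuration")
    for part in removed:
        installed.append(part)  # power off, reinstall one part
        if shows_fault(installed):  # power back on and look for the symptoms
            return part
    return None  # every part is back in and the fault never reappeared


def simulated_power_on(installed: Sequence[str]) -> bool:
    """Stand-in for a manual check: the fault appears whenever the media dashboard is connected."""
    return "media dashboard" in installed


culprit = find_faulty_part(BASELINE, REMOVED, simulated_power_on)
print(culprit)  # prints "media dashboard"
```

Changing only one part per power cycle means that a single variable differs between the last working state and the first failing one, so the part added last is the suspect. The sketch assumes one faulty part whose symptoms show up as soon as it is connected.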
### How it started
Having figured out the actual problem, I immediately understood the root cause. It had started a couple of days previously. I was working with and testing several external USB devices, including various cameras, storage devices that I use for backups, and an external USB hub.
I picked up one USB cable and plugged it into a USB 2.0 slot on the media dashboard. Everything ground to a halt, and most of the lights and fans went out. I unplugged the USB cable, which was now very hot, and burned my fingers. I had inadvertently plugged the type C end into the USB 3.0 type A socket, which had shorted the power.
After unplugging the USB cable, everything went back to "normal"—except it didn't. The media dashboard lasted a few more days and then shorted out completely, having been weakened by my careless mistake.
### Jumping to conclusions
Knowledge and experience can sometimes count for more than tools like PSU testers. Except when they don't. I eventually found the actual cause of the problem, but I should have seen it sooner.
Although I was correct about this being a power problem, I was sidetracked by not correctly reading the symptoms and following that line of inquiry to its logical conclusion. I could have isolated the true cause of the problem sooner than I did and saved the time I spent configuring my laptop to be a temporary primary device until I could fix my primary workstation.
Sysadmins work with complex devices, and it can be easy to jump to conclusions. I have over 50 years of experience in the computer industry, and I still do it. I just need to remember to take a few deep [yoga breaths][2] and keep digging until I isolate the root cause of the problem.
### Curiosity
At least I followed my curiosity while waiting for the replacement motherboard to arrive. That allowed me to return things to normal much sooner than had I waited until the new motherboard arrived. And I might have discarded a perfectly good motherboard by not testing it further.
There is a saying about curiosity killing the cat. I hate that saying because it is all too frequently used by parents, colleagues, pointy-haired bosses, teachers, and others who just want us curious folk to leave them alone. In reality, curiosity fuels the quest for knowledge and truth, whether it's about hardware, open source software, programming, building a PC, optimizing settings, or just learning a new application. Feed your curiosity!
--------------------------------------------------------------------------------
via: https://opensource.com/article/22/1/troubleshoot-hardware-sysadmin
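Author: [David Both][a]
Topic selection: [lujun9972][b]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally translated and compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)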
[a]: https://opensource.com/users/dboth
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/puzzle_computer_solve_fix_tool.png?itok=U0pH1uwj (Puzzle pieces coming together to form a computer screen)
[2]: https://opensource.com/article/21/11/linux-yoga

@@ -0,0 +1,99 @@
[#]: subject: "How curiosity helped me solve a hardware problem"
[#]: via: "https://opensource.com/article/22/1/troubleshoot-hardware-sysadmin"
[#]: author: "David Both https://opensource.com/users/dboth"
[#]: collector: "lujun9972"
[#]: translator: "toknow-gh"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
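How curiosity helped me solve a hardware problem
======
Curiosity fuels the quest for knowledge and truth, whether it's about hardware, open source software, programming, building a PC, optimizing system settings, or learning to use a new application.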
![Puzzle pieces coming together to form a computer screen][1]
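I usually keep a home network of a dozen computers up and running; yes, exactly 12. I am also responsible for maintaining some computers in other locations. With so many computers, failures of all kinds are bound to occur, and many of them are ultimately diagnosed as hardware problems. Determining which hardware component caused the failure, however, can be difficult.
Just this week, a tricky problem appeared on my primary workstation, and I misjudged its cause twice. This article walks you through how I tracked it down. I will show you where I made the wrong call and why, and how easy it is to do so.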
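### The first symptoms
I have had several projects on the go. Recently, I had many applications open across multiple desktops and had just started working when the display suddenly went blank. Most (not all) of the fans in my primary workstation stopped, and I drew a sharp breath. I had never run into anything like this before, but my system was clearly in trouble.
I had two main clues to follow: the display had gone dark, and some of the fans had stopped spinning. However, the power and disk activity LEDs on the front panel were still lit, only a little dimmer than usual. Most of the decorative RGB lights on the motherboard, memory DIMMs, and fans had also gone out.
I tried the power and reset buttons, with no response. I cut the power directly with the rocker switch on the PSU. When I restored power, the same symptoms appeared.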
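### Initial thoughts
The symptoms, together with my decades of experience handling all kinds of failures, pointed to the power supply.
I removed the PSU and checked it with a PSU tester. The tester reported that the PSU had no problems and that every voltage was within spec. Of course, the tester's result could be wrong. It does not test under full load, such as when a running computer is drawing a few hundred watts. Going with my gut, I swapped in a spare 1000W power supply.
Because there are 12 computers on my home network, I have made a habit of keeping some spare parts on hand. That way, when a part fails, I do not have to make a trip to the local computer store or order online and wait for delivery. With this many computers, parts fail all the time.
Although the PSU tester told me the power supply was fine, the problem did go away after I replaced it. Even though the tester had been correct in the past, my experience, knowledge, and gut told me it was the power supply.
Unfortunately, my gut was wrong.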
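### Second thoughts
Before long, my workstation showed the same symptoms again. The chance of two different PSUs failing in the same way is very low.
My next thought was that it had to be the motherboard. I do not keep a spare motherboard, so I ordered a new one online. I figured I could reuse the extra memory I had on hand and move the CPU, along with its all-in-one liquid cooler, to the new motherboard.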
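### Disciplined troubleshooting
The new motherboard would take a few days to arrive, so I decided to start by removing the old motherboard from the workstation. Just before I unplugged the motherboard's power connectors, my curiosity surfaced and drove me to power on the system with only the motherboard, CPU, and memory left in it. I had already removed everything else.
Good troubleshooting requires isolating every potential variable, and so far I had only tested the PSU. I needed to test every component.
This meant first unplugging the front-panel audio cable and the cable for the multifunction media panel, which combines various USB, SATA, and memory card slots.
To my surprise, with only the motherboard powered, everything worked normally.
The computer itself would not boot, because no storage drives were connected. There was no display output either, because I had removed the graphics card. But there was no sign of a power supply or motherboard failure. That stoked my curiosity further. If the motherboard were truly faulty, the symptoms should still be there.
So I began a series of repeated trials: power off, reinstall one of the removed parts, and power back on.
In the end, the problem turned out to be caused by the front-panel media panel.
I removed the media panel and put all the other parts back. The workstation booted normally and ran well. I had finally caught the culprit.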
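### How it started
Once I had pinned down the real problem, I immediately understood its root cause. It went back a few days. I had been testing several external USB devices, including a few cameras, several storage devices I use for backups, and an external USB hub.
I plugged a USB cable into a USB 2.0 port on the media panel. Everything ground to a halt; most of the lights went out and the fans stopped spinning. The USB cable became very hot, and I burned my fingers pulling it out. It turned out that I had accidentally plugged the cable's type C end into a USB 3.0 type A socket, which shorted the power.
After I unplugged the USB cable, everything went back to "normal", except that it had not. My careless mistake had damaged the media panel, and after holding on for a few days it shorted out completely.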
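### Jumping to conclusions
Knowledge and experience are sometimes more important than tools such as PSU testers. Of course, sometimes knowledge and experience do not work either. I eventually found the real cause of the problem, but I should have found it much sooner.
Although I was right that the problem was power-related, I still went down the wrong path, because I failed to read the symptoms correctly and follow the clues to their logical conclusion. I could have found the root cause earlier, which would have spared me the time I spent turning my laptop into a temporary primary machine while my primary workstation was out of action.
System administrators deal with complex devices all the time, and jumping to conclusions too early is hard to avoid. I have more than 50 years of experience in the industry, and I still made this mistake. I just need to remember to take a few [deep breaths][2] and then keep digging until I find the root cause of the problem.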
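### Curiosity
At least I followed my curiosity while waiting for the new motherboard to arrive. That got things back to normal much sooner than waiting for the new motherboard would have. It also kept me from throwing away a perfectly good motherboard without testing it properly.
There is a saying that curiosity killed the cat. I dislike that saying because it is used far too often by parents, colleagues, short-sighted bosses, teachers, and others who do not want to be bothered by curious people like us. In fact, curiosity fuels the quest for knowledge and truth, whether it's about hardware, open source software, programming, building a PC, optimizing system settings, or learning to use a new application. Feed your curiosity!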
--------------------------------------------------------------------------------
via: https://opensource.com/article/22/1/troubleshoot-hardware-sysadmin
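Author: [David Both][a]
Topic selection: [lujun9972][b]
Translator: [Translator ID](https://github.com/toknow-gh)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally translated and compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)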
[a]: https://opensource.com/users/dboth
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/puzzle_computer_solve_fix_tool.png?itok=U0pH1uwj (Puzzle pieces coming together to form a computer screen)
[2]: https://opensource.com/article/21/11/linux-yoga