diff --git a/sources/tech/20171216 Sysadmin 101- Troubleshooting.md b/sources/tech/20171216 Sysadmin 101- Troubleshooting.md
deleted file mode 100644
index cdb3a369c4..0000000000
--- a/sources/tech/20171216 Sysadmin 101- Troubleshooting.md
+++ /dev/null
@@ -1,125 +0,0 @@
-wxy applied
-
-Sysadmin 101: Troubleshooting
-======
-I typically keep this blog strictly technical, keeping observations, opinions and the like to a minimum. But this, and the next few posts will be about basics and fundamentals for starting out in system administration/SRE/system engineer/sysops/devops-ops (whatever you want to call yourself) roles more generally.
-Bear with me!
-
-"My web site is slow"
-
-I just picked the type of issue for this article at random, this can be applied to pretty much any sysadmin related troubleshooting. It's not about showing off the cleverest oneliners to find the most information. It's also not an exhaustive, step-by-step "flowchart" with the word "profit" in the last box. It's about general approach, by means of a few examples.
-The example scenarios are solely for illustrative purposes. They sometimes have a basis in assumptions that doesn't apply to all cases all of the time, and I'm positive many readers will go "oh, but I think you will find…" at some point.
-But that would be missing the point.
-
-Having worked in support, or within a support organization for over a decade, there is one thing that strikes me time and time again and that made me write this;
-**The instinctive reaction many techs have when facing a problem, is to start throwing potential solutions at it.**
-
-"My website is slow"
-
- * I'm going to try upping `MaxClients/MaxRequestWorkers/worker_connections`
- * I'm going to try to increase `innodb_buffer_pool_size/effective_cache_size`
- * I'm going to try to enable `mod_gzip` (true story, sadly)
-
-
-
-"I saw this issue once, and then it was because X. So I'm going to try to fix X again, it might work".
-
-This wastes a lot of time, and leads you down a wild goose chase. In the dark. Wearing greased mittens.
-InnoDB's buffer pool may well be at 100% utilization, but that's just because there are remnants of a large one-off report someone ran a while back in there. If there are no evictions, you've just wasted time.
-
-### Quick side-bar before we start
-
-At this point, I should mention that while it's equally applicable to many roles, I'm writing this from a general support system adminstrator's point of view. In a mature, in-house organization or when working with larger, fully managed or "enterprise" customers, you'll typically have everything instrumented, measured, graphed, thresheld (not even word) and alerted on. Then your approach will often be rather different. We're going in blind here.
-
-If you don't have that sort of thing at your disposal;
-
-### Clarify and First look
-
-Establish what the issue actually is. "Slow" can take many forms. Is it time to first byte? That's a whole different class of problem from poor Javascript loading and pulling down 15 MB of static assets on each page load. Is it slow, or just slower than it usually is? Two very different plans of attack!
-
-Make sure you know what the issue reported/experienced actually is before you go off and do something. Finding the source of the problem is often difficult enough, without also having to find the problem itself.
-That is the sysadmin equivalent of bringing a knife to a gunfight.
-
-### Low hanging fruit / gimmies
-
-You are allowed to look for a few usual suspects when you first log in to a suspect server. In fact, you should! I tend to fire off a smattering of commands whenever I log in to a server to just very quickly check a few things; Are we swapping (`free/vmstat`), are the disks busy (`top/iostat/iotop`), are we dropping packets (`netstat/proc/net/dev`), is there an undue amount of connections in an undue state (`netstat`), is something hogging the CPUs (`top`), is someone else on this server (`w/who`), any eye-catching messages in syslog and `dmesg`?
-
-There's little point to carrying on if you have 2000 messages from your RAID controller about how unhappy it is with its write-through cache.
-
-This doesn't have to take more than half a minute. If nothing catches your eye - continue.
-
-### Reproduce
-
-If there indeed is a problem somewhere, and there's no low hanging fruit to be found;
-
-Take all steps you can to try and reproduce the problem. When you can reproduce, you can observe. **When you can observe, you can solve.** Ask the person reporting the issue what exact steps to take to reproduce the issue if it isn't already obvious or covered by the first section.
-
-Now, for issues caused by solar flares and clients running exclusively on OS/2, it's not always feasible to reproduce. But your first port of call should be to at least try! In the very beginning, all you know is "X thinks their website is slow". For all you know at that point, they could be tethered to their GPRS mobile phone and applying Windows updates. Delving any deeper than we already have at that point is, again, a waste of time.
-
-Attempt to reproduce!
-
-### Check the log!
-
-It saddens me that I felt the need to include this. But I've seen escalations that ended mere minutes after someone ran `tail /var/log/..` Most *NIX tools these days are pretty good at logging. Anything blatantly wrong will manifest itself quite prominently in most application logs. Check it.
-
-### Narrow down
-
-If there are no obvious issues, but you can reproduce the reported problem, great. So, you know the website is slow. Now you've narrowed things down to: Browser rendering/bug, application code, DNS infrastructure, router, firewall, NICs (all eight+ involved), ethernet cables, load balancer, database, caching layer, session storage, web server software, application server, RAM, CPU, RAID card, disks.
-Add a smattering of other potential culprits depending on the set-up. It could be the SAN, too. And don't forget about the hardware WAF! And.. you get my point.
-
-If the issue is time-to-first-byte you'll of course start applying known fixes to the webserver, that's the one responding slowly and what you know the most about, right? Wrong!
-You go back to trying to reproduce the issue. Only this time, you try to eliminate as many potential sources of issues as possible.
-
-You can eliminate the vast majority of potential culprits very easily: Can you reproduce the issue locally from the server(s)? Congratulations, you've just saved yourself having to try your fixes for BGP routing.
-If you can't, try from another machine on the same network. If you can - at least you can move the firewall down your list of suspects, (but do keep a suspicious eye on that switch!)
-
-Are all connections slow? Just because the server is a web server, doesn't mean you shouldn't try to reproduce with another type of service. [netcat][1] is very useful in these scenarios (but chances are your SSH connection would have been lagging this whole time, as a clue)! If that's also slow, you at least know you've most likely got a networking problem and can disregard the entire web stack and all its components. Start from the top again with this knowledge (do not collect $200). Work your way from the inside-out!
-
-Even if you can reproduce locally - there's still a whole lot of "stuff" left. Let's remove a few more variables. Can you reproduce it with a flat-file? If `i_am_a_1kb_file.html` is slow, you know it's not your DB, caching layer or anything beyond the OS and the webserver itself.
-Can you reproduce with an interpreted/executed `hello_world.(py|php|js|rb..)` file? If you can, you've narrowed things down considerably, and you can focus on just a handful of things. If `hello_world` is served instantly, you've still learned a lot! You know there aren't any blatant resource constraints, any full queues or stuck IPC calls anywhere. So it's something the application is doing or something it's communicating with.
-
-Are all pages slow? Or just the ones loading the "Live scores feed" from a third party?
-
-**What this boils down to is; What's the smallest amount of "stuff" that you can involve, and still reproduce the issue?**
-
-Our example is a slow web site, but this is equally applicable to almost any issue. Mail delivery? Can you deliver locally? To yourself? To ? Test with small, plaintext messages. Work your way up to the 2MB campaign blast. STARTTLS and no STARTTLS. Work your way from the inside-out.
-
-Each one of these steps takes mere seconds each, far quicker than implementing most "potential" fixes.
-
-### Observe / isolate
-
-By now, you may already have stumbled across the problem by virtue of being unable to reproduce when you removed a particular component.
-
-But if you haven't, or you still don't know **why**; Once you've found a way to reproduce the issue with the smallest amount of "stuff" (technical term) between you and the issue, it's time to start isolating and observing.
-
-Bear in mind that many services can be ran in the foreground, and/or have debugging enabled. For certain classes of issues, it is often hugely helpful to do this.
-
-Here's also where your traditional armory comes into play. `strace`, `lsof`, `netstat`, `GDB`, `iotop`, `valgrind`, language profilers (cProfile, xdebug, ruby-prof…). Those types of tools.
-
-Once you've come this far, you rarely end up having to break out profilers or debugers though.
-
-[`strace`][2] is often a very good place to start.
-You might notice that the application is stuck on a particular `read()` call on a socket file descriptor connected to port 3306 somewhere. You'll know what to do.
-Move on to MySQL and start from the top again. Low hanging fruit: "Waiting_for * lock", deadlocks, max_connections.. Move on to: All queries? Only writes? Only certain tables? Only certain storage engines?…
-
-You might notice that there's a `connect()` to an external API resource that takes five seconds to complete, or even times out. You'll know what to do.
-
-You might notice that there are 1000 calls to `fstat()` and `open()` on the same couple of files as part of a circular dependency somewhere. You'll know what to do.
-
-It might not be any of those particular things, but I promise you, you'll notice something.
-
-If you're only going to take one thing from this section, let it be; learn to use `strace`! **Really** learn it, read the whole man page. Don't even skip the HISTORY section. `man` each syscall you don't already know what it does. 98% of troubleshooting sessions ends with strace.
-
---------------------------------------------------------------------------------
-
-via: http://northernmost.org/blog/troubleshooting-101/index.html
-
-作者:[Erik Ljungstrom][a]
-译者:[lujun9972](https://github.com/lujun9972)
-校对:[校对者ID](https://github.com/校对者ID)
-
-本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
-
-[a]:http://northernmost.org
-[1]:http://nc110.sourceforge.net/
-[2]:https://linux.die.net/man/1/strace
diff --git a/translated/tech/20171216 Sysadmin 101- Troubleshooting.md b/translated/tech/20171216 Sysadmin 101- Troubleshooting.md
new file mode 100644
index 0000000000..560f07c3b4
--- /dev/null
+++ b/translated/tech/20171216 Sysadmin 101- Troubleshooting.md
@@ -0,0 +1,136 @@
+系统管理员入门:排除故障
+======
+
+我通常会严格保持此博客的技术性,将观察、意见等内容保持在最低限度。但是,这篇和接下来的几篇文章将介绍刚进入系统管理/SRE/系统工程师/sysops/devops-ops(无论你想称自己是什么)角色的常见的基础知识。
+
+请跟我来!
+
+> “我的网站很慢!”
+
+我只是随机选择了本文的问题类型,这也可以应用于任何与系统管理员相关的故障排除。我并不是要炫耀那些可以发现最多的信息的最聪明的“金句”。它也不是一个详尽的、一步步指导的、并在最后一个方框中导向“利润”一词的“流程图”。
+
+我会通过一些例子展示常规的方法。
+
+示例场景仅用于说明本文目的。它们有时会做一些不适用于所有情况的假设,而且肯定会有很多读者在某些时候说“哦,但我觉得你会发现……”。
+
+但那可能会让我们错失重点。
+
+十多年来,我一直在从事于支持工作,或在支持机构工作,有一件事让我一次又一次地感到震惊,这促使我写下了这篇文章。
+
+**有许多技术人员在遇到问题时的本能反应,就是不管三七二十一去尝试可能的解决方案。**
+
+*“我的网站很慢,所以”,*
+
+* 我将尝试增大 `MaxClients`/`MaxRequestWorkers`/`worker_connections`
+* 我将尝试提升 `innodb_buffer_pool_size`/`effective_cache_size`
+* 我打算尝试启用 `mod_gzip`(遗憾的是,这是真实的故事)
+
+*“我曾经看过这个问题,它是因为某种原因造成的 —— 所以我估计还是这个原因,它应该能解决这个问题。”*
+
+这浪费了很多时间,并会让你在黑暗中盲目乱撞,胡乱鼓捣。
+
+你的 InnoDB 的缓冲池也许达到 100% 的利用率,但这可能只是因为有人运行了一段时间的一次性大型报告导致的。如果没有排除这种情况,那你就是在浪费时间。
+
+### 开始之前
+
+在这里,我应该说明一下,虽然这些建议同样适用于许多角色,但我是从一般的支持系统管理员的角度来撰写的。在一个成熟的内部组织中,或与规模较大的、规范管理的或“企业级”客户合作时,你通常会对一切都进行检测、测量、绘制、整理(甚至不是文字),并发出警报。那么你的方法也往往会有所不同。让我们在这里先忽略这种情况。
+
+如果你没有这种东西,那就随意了。
+
+### 澄清问题
+
+首先确定实际上是什么问题。“慢”可以是多种形式的。是收到第一个字节的时间吗?从糟糕的 Javascript 加载和每页加载要拉取 15 MB 的静态内容,这是一个完全不同类型的问题。是慢,还是比通常慢?这是两个非常不同的解决方案!
+
+在你着手做某事之前,确保你知道实际报告和遇到的问题。找到问题的根源通常很困难,但即便找不到也必须找到问题本身。
+
+否则,这相当于系统管理员带着一把刀去参加枪战。
+
+### 唾手可得
+
+首次登录可疑服务器时,你可以查找一些常见的嫌疑对象。事实上,你应该这样做!每当我登录到服务器时,我都会发出一些命令来快速检查一些事情:我们是否发生了页交换(`free` / `vmstat`),磁盘是否繁忙(`top` / `iostat` / `iotop`),是否有丢包(`netstat` / `proc` / `net` / `dev`),是否处于连接数过多的状态(`netstat`),有什么东西占用了 CPU(`top`),谁在这个服务器上(`w` / `who`),syslog 和 `dmesg` 中是否有引人注目的消息?
+
+如果你从 RAID 控制器得到 2000 条抱怨直写式缓存没有生效的消息,那么继续进行是没有意义的。
+
+这用不了半分钟。如果什么都没有引起你的注意 —— 那么继续。
+
+### 重现问题
+
+如果某处确实存在问题,并且找不到唾手可得的信息。
+
+那么采取所有步骤来尝试重现问题。当你可以重现该问题时,你就可以观察它。**当你能观察到时,你就可以解决。**如果在第一步中尚未显现出或覆盖了问题所在,询问报告问题的人需要采取哪些确切步骤来重现问题。
+
+对于由太阳耀斑或只能运行在 OS/2 上的客户端引起的问题,重现并不总是可行的。但你的第一个停靠港应该是至少尝试一下!在一开始,你所知道的是“某人认为他们的网站很慢”。对于那些人,他们可能还在用他们的 GPRS 手机,也可能正在安装 Windows 更新。你在这里挖掘得再深也是浪费时间。
+
+尝试重现!
+
+### 检查日志
+
+我对于有必要包括这一点感到很难过。但是我曾经看到有人在运行 `tail /var/log/...` 之后几分钟就不看了。大多数 *NIX 工具都特别喜欢记录日志。任何明显的错误都会在大多数应用程序日志中显得非常突出。检查一下。
+
+### 缩小范围
+
+如果没有明显的问题,但你可以重现所报告的问题,那也很棒。所以,你现在知道网站是慢的。现在你已经把范围缩小到:浏览器的渲染/错误、应用程序代码、DNS 基础设施、路由器、防火墙、网卡(所有的)、以太网电缆、负载均衡器、数据库、缓存层、会话存储、Web 服务器软件、应用程序服务器、内存、CPU、RAID 卡、磁盘等等。
+
+根据设置添加一些其他可能的罪魁祸首。它们也可能是 SAN,也不要忘记硬件 WAF!以及…… 你明白我的意思。
+
+如果问题是接收到第一个字节的时间,你当然会开始对 Web 服务器去应用上已知的修复程序,就是它响应缓慢,你也觉得几乎就是它,对吧?但是你错了!
+
+你要回去尝试重现这个问题。只是这一次,你要试图消除尽可能多的潜在问题来源。
+
+你可以非常轻松地消除绝大多数可能的罪魁祸首:你能从服务器本地重现问题吗?恭喜,你刚刚节省了自己必须尝试修复 BGP 路由的时间。
+
+如果不能,请尝试使用同一网络上的其他计算机。如果可以的话,至少你可以将防火墙移到你的嫌疑人名单上,(但是要注意一下那个交换机!)
+
+是所有的连接都很慢吗?虽然服务器是 Web 服务器,但并不意味着你不应该尝试使用其他类型的服务进行重现问题。[netcat][1] 在这些场景中非常有用(但是你的 SSH 连接可能会一直有延迟,这可以作为线索)! 如果这也很慢,你至少知道你很可能遇到了网络问题,可以忽略掉整个 Web 软件及其所有组件的问题。用这个知识(我不收 200 美元)再次从顶部开始,按你的方式由内到外地进行!
+
+即使你可以在本地复现 —— 仍然有很多“因素”留下。让我们排除一些变量。你能用普通文件重现它吗? 如果 `i_am_a_1kb_file.html` 很慢,你就知道它不是数据库、缓存层或 OS 以外的任何东西和 Web 服务器本身的问题。
+
+你能用一个需要解释或执行的 `hello_world.(py|php|js|rb..)` 文件重现问题吗?如果可以的话,你已经大大缩小了范围,你可以专注于少数事情。如果 `hello_world` 可以马上工作,你仍然学到了很多东西!你知道了没有任何明显的资源限制、任何满的队列或在任何地方卡住的 IPC 调用,所以这是应用程序正在做的事情或它正在与之通信的事情。
+
+所有页面都慢吗?或者只是从第三方加载“实时分数数据”的页面慢?
+
+**这可以归结为:你仍然可以重现这个问题所涉及的最少量的“因素”是什么?**
+
+我们的示例是一个缓慢的网站,但这同样适用于几乎所有问题。邮件投递?你能在本地投递吗?能发给自己吗?能发给<常见的服务提供者>吗?使用小的、纯文本的消息进行测试。尝试直到遇到 2MB 拥堵时。使用 STARTTLS 和不使用 STARTTLS 呢?按你的方式由内到外地进行!
+
+这些步骤中的每一步都只需要几秒钟,远远快于实施大多数“可能的”修复方案。
+
+### 隔离观察
+
+到目前为止,当你去除特定组件时无法重现问题时,你可能已经偶然发现了问题所在。
+
+但如果你还没有,或者你仍然不知道**为什么**:一旦你找到了一种方法来重现问题,你和问题之间的“东西”(某个技术术语)最少,那么就该开始隔离和观察了。
+
+请记住,许多服务可以在前台运行和/或启用调试。对于某些类别的问题,执行此操作通常非常有帮助。
+
+这也是你的传统武器库发挥作用的地方。`strace`、`lsof`、`netstat`、`GDB`、`iotop`、`valgrind`、语言分析器(cProfile、xdebug、ruby-prof ……)那些类型的工具。
+
+一旦你走到这一步,你就很少能摆脱剖析器或调试器了。
+
+[strace][2] 通常是一个非常好的起点。
+
+你可能会注意到应用程序停留在某个连接到端口 3306 的套接字文件描述符上的特定 `read()` 调用上。你会知道该怎么做。
+
+转到 MySQL 并再次从顶部开始。显而易见:“等待某某锁”、死锁、`max_connections` ……进而:是所有查询?还是只写请求?只有某些表?还是只有某些存储引擎?等等……
+
+你可能会注意到调用外部 API 资源的 `connect()` 需要五秒钟才能完成,甚至超时。你会知道该怎么做。
+
+你可能会注意到,在同一对文件中有 1000 个调用 `fstat()` 和 `open()` 作为循环依赖的一部分。你会知道该怎么做。
+
+它可能不是那些特别的东西,但我保证,你会发现一些东西。
+
+如果你只是从这一部分学到一点,那也不错;学习使用 `strace` 吧!**真的**学习它,阅读整个手册页。甚至不要跳过历史部分。`man` 每个你还不知道它做了什么的系统调用。98% 的故障排除会话以 `strace` 而终结。
+
+---------------------------------------------------------------------
+
+via: http://northernmost.org/blog/troubleshooting-101/index.html
+
+作者:[Erik Ljungstrom][a]
+译者:[wxy](https://github.com/wxy)
+校对:[wxy](https://github.com/wxy)
+
+本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
+
+[a]:http://northernmost.org
+[1]:http://nc110.sourceforge.net/
+[2]:https://linux.die.net/man/1/strace