diff --git a/sources/tech/20220709 Monitoring tiny web services.md b/sources/tech/20220709 Monitoring tiny web services.md deleted file mode 100644 index 707775adde..0000000000 --- a/sources/tech/20220709 Monitoring tiny web services.md +++ /dev/null @@ -1,143 +0,0 @@ -[#]: subject: "Monitoring tiny web services" -[#]: via: "https://jvns.ca/blog/2022/07/09/monitoring-small-web-services/" -[#]: author: "Julia Evans https://jvns.ca/" -[#]: collector: "lujun9972" -[#]: translator: "geekpi" -[#]: reviewer: " " -[#]: publisher: " " -[#]: url: " " - -Monitoring tiny web services -====== - -Hello! I’ve started to run a few more servers recently ([nginx playground][1], [mess with dns][2], [dns lookup][3]), so I’ve been thinking about monitoring. - -It wasn’t initially totally obvious to me how to monitor these websites, so I wanted to quickly write up what how I did it. - -I’m not going to talk about how to monitor Big Serious Mission Critical websites at all, only tiny unimportant websites. - -### goal: spend approximately 0 time on operations - -I want the sites to mostly work, but I also want to spend approximately 0% of my time on the ongoing operations. - -I was initially very wary of running servers at all because at my last job I was on a 24⁄7 oncall rotation for some critical services, and in my mind “being responsible for servers” meant “get woken up at 2am to fix the servers” and “have lots of complicated dashboards”. - -So for a while I only made static websites so that I wouldn’t have to think about servers. - -But eventually I realized that any server I was going to write was going to be very low stakes, if they occasionally go down for 2 hours it’s no big deal, and I could just set up some very simple monitoring to help keep them running. - -### not having monitoring sucks - -At first I didn’t set up any monitoring for my servers at all. This had the extremely predictable outcome of – sometimes the site broke, and I didn’t find out about it until somebody told me! - -### step 1: an uptime checker - -The first step was to set up an uptime checker. There are tons of these out there, the ones I’m using right now are [updown.io][4] and [uptime robot][5]. I like updown’s user interface and [pricing][6] structure more (it’s per request instead of a monthly fee), but uptime robot has a more generous free tier. - -These - - 1. check that the site is up - 2. if it goes down, it emails me - - - -I find that email notifications are a good level for me, I’ll find out pretty quickly if the site goes down but it doesn’t wake me up or anything. - -### step 2: an end-to-end healthcheck - -Next, let’s talk about what “check that the site is up” actually means. - -At first I just made one of my healthcheck endpoints a function that returned `200 OK` no matter what. - -This is kind of useful – it told me that the server was on! - -But unsurprisingly I ran into problems because it wasn’t checking that the API was actually _working_ – sometimes the healthcheck succeeded even though the rest of the service had actually gotten into a bad state. - -So I updated it to actually make a real API request and make sure it succeeded. - -All of my services do very few things (the nginx playground has just 1 endpoint), so it’s pretty easy to set up a healthcheck that actually runs through most of the actions the service is supposed to do. - -Here’s what the end-to-end healthcheck handler for the nginx playground looks like. It’s very basic: it just makes another POST request (to itself) and checks if that request succeeds or fails. - -``` - - func healthHandler(w http.ResponseWriter, r *http.Request) { - // make a request to localhost:8080 with `healthcheckJSON` as the body - // if it works, return 200 - // if it doesn't, return 500 - client := http.Client{} - resp, err := client.Post("http://localhost:8080/", "application/json", strings.NewReader(healthcheckJSON)) - if err != nil { - log.Println(err) - w.WriteHeader(http.StatusInternalServerError) - return - } - if resp.StatusCode != http.StatusOK { - log.Println(resp.StatusCode) - w.WriteHeader(http.StatusInternalServerError) - return - } - w.WriteHeader(http.StatusOK) - } - -``` - -### healthcheck frequency: hourly - -Right now I’m running most of my healthchecks every hour, and some every 30 minutes. - -I run them hourly because updown.io’s pricing is per healthcheck, I’m monitoring 18 different URLs, and I wanted to keep my healthcheck budget pretty minimal at $5/year. - -Taking an hour to find out that one of these websites has gone down seems ok to me – if there is a problem there’s no guarantee I’ll get to fixing it all that quickly anyway. - -If it were free to run them more often I’d probably run them every 5-10 minutes instead. - -### step 3: automatically restart if the healthcheck fails - -Some of my websites are on fly.io, and fly has a pretty standard feature where I can configure a HTTP healthcheck for a service and restart the service if the healthcheck starts failing. - -“Restart a lot” is a very useful strategy to paper over bugs that I haven’t gotten around to fixing yet – for a while the nginx playground had a process leak where `nginx` processes weren’t getting terminated, so the server kept running out of RAM. - -With the healthcheck, the result of this was that every day or so, this would happen: - - * the server ran out of RAM - * the healthcheck started failing - * it get restarted - * everything was fine again - * repeat the whole saga again some number of hours later - - - -Eventually I got around to actually fixing the process leak, but it was nice to have a workaround in place that could keep things running while I was procrastinating fixing the bug. - -These healthchecks to decide whether to restart the service run more often: every 5 minutes or so. - -### this is not the best way to monitor Big Services - -This is probably obvious and I said this already at the beginning, but “write one HTTP healthcheck” is not the best approach for monitoring a large complex service. But I won’t go into that because that’s not what this post is about. - -### it’s been working well so far! - -I originally wrote this post 3 months ago in April, but I waited until now to publish it to make sure that the whole setup was working. - -It’s made a pretty big difference – before I was having some very silly downtime problems, and now for the last few months the sites have been up 99.95% of the time! - --------------------------------------------------------------------------------- - -via: https://jvns.ca/blog/2022/07/09/monitoring-small-web-services/ - -作者:[Julia Evans][a] -选题:[lujun9972][b] -译者:[译者ID](https://github.com/译者ID) -校对:[校对者ID](https://github.com/校对者ID) - -本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出 - -[a]: https://jvns.ca/ -[b]: https://github.com/lujun9972 -[1]: https://nginx-playground.wizardzines.com -[2]: https://messwithdns.net -[3]: https://dns-lookup.jvns.ca -[4]: https://updown.io/ -[5]: https://uptimerobot.com/ -[6]: https://updown.io/#pricing diff --git a/translated/tech/20220709 Monitoring tiny web services.md b/translated/tech/20220709 Monitoring tiny web services.md new file mode 100644 index 0000000000..36c3be3ef7 --- /dev/null +++ b/translated/tech/20220709 Monitoring tiny web services.md @@ -0,0 +1,143 @@ +[#]: subject: "Monitoring tiny web services" +[#]: via: "https://jvns.ca/blog/2022/07/09/monitoring-small-web-services/" +[#]: author: "Julia Evans https://jvns.ca/" +[#]: collector: "lujun9972" +[#]: translator: "geekpi" +[#]: reviewer: " " +[#]: publisher: " " +[#]: url: " " + +监测微型网络服务 +====== + +你好! 我最近又开始运行一些服务器([nginx playground][1]、[mess with dns][2]、[dns lookup][3]),所以我一直在考虑监控问题。 + +最初我并不完全清楚如何监控这些网站,所以我想快速写下我是如何做到的。 + +我根本不打算谈如何监控大型严肃的关键任务网站,只谈微型的不重要的网站。 + +### 目标:在操作上几乎不花时间 + +我希望网站大部分时间都能正常工作,但我也希望在持续的运营上几乎不花时间。 + +我最初对运行服务器非常警惕,因为在我的上一份工作中,我是 24/7 轮流值班,负责一些关键的服务,在我的印象中,“负责服务器”意味着“在凌晨 2 点被叫起来修理服务器”和“有很多复杂的仪表盘”。 + +所以有一段时间我只做静态网站,这样我就不用考虑服务器的问题。 + +但最终我意识到,我所要写的任何服务器的风险都很低,如果它们偶尔宕机 2 小时也没什么大不了的,我只需设置一些非常简单的监控来帮助它们保持运行。 + +### 没有监控很糟糕 + +起初,我根本没有为我的服务器设置任何监控。这样做的结果是非常可预见的:有时网站坏了,而我却没有发现,直到有人告诉我! + +### 步骤 1:uptime 检查器 + +第一步是建立一个 uptime 检查器。外面有很多这样的东西,我现在使用的是 [updown.io][4] 和 [uptime robot][5]。我更喜欢 updown 的用户界面和[定价][6]结构(它是按请求而不是按月收费),但u ptime robot 有一个更慷慨的免费套餐。 + +它们会: + + 1. 检查网站是否正常 + 2. 如果出现故障,它会给我发电子邮件 + + + +我发现电子邮件通知对我来说是一个很好的级别,如果网站宕机,我会很快发现,但它不会唤醒我或任何东西。 + +### 步骤 2:端到端的健康检查 + +接下来,让我们谈谈“检查网站是否正常”到底是什么意思。 + +起初,我只是把我的健康检查端点之一变成一个函数,无论如何都会返回 `200 OK`。 + +这倒是挺有用的 – 它告诉我服务器是启动着的! + +但不出所料,我遇到了问题,因为它没有检查 API 是否真的在_工作_ – 有时健康检查成功了,尽管服务的其他部分实际上已经进入了一个糟糕的状态。 + +所以我更新了它,让它真正地发出 API 请求,并确保它成功了。 + +我所有的服务都只做了很少的事情(nginx playground 只有一个端点),所以设置一个健康检查是非常容易的,它实际上贯穿了服务应该做的大部分动作。 + +下面是 nginx playground 的端到端健康检查处理程序的样子。它非常基本:它只是发出一个 POST 请求(给自己),并检查该请求是成功还是失败。 + +``` + + func healthHandler(w http.ResponseWriter, r *http.Request) { + // make a request to localhost:8080 with `healthcheckJSON` as the body + // if it works, return 200 + // if it doesn't, return 500 + client := http.Client{} + resp, err := client.Post("http://localhost:8080/", "application/json", strings.NewReader(healthcheckJSON)) + if err != nil { + log.Println(err) + w.WriteHeader(http.StatusInternalServerError) + return + } + if resp.StatusCode != http.StatusOK { + log.Println(resp.StatusCode) + w.WriteHeader(http.StatusInternalServerError) + return + } + w.WriteHeader(http.StatusOK) + } + +``` + +### 健康检查频率:每小时一次 + +现在,我大部分健康检查每小时运行一次,有些每 30 分钟运行一次。 + +我每小时运行一次,因为 updown.io 的定价是按健康检查次数计算的,我正在监控 18 个不同的 URL,而且我想把我的健康检查预算保持在 5 美元/年的最低水平。 + +花一个小时来发现这些网站中的一个出现故障,对我来说是可以的 – 如果有问题,我也不能保证能很快修复它。 + +如果可以更频繁地运行它们,我可能会每 5-10 分钟运行一次。 + +### 步骤 3:第三步:如果健康检查失败,自动重新启动 + +我的一些网站在 fly.io 上,fly 有一个相当标准的功能,我可以为一个服务配置一个 HTTP 健康检查,如果健康检查失败,就重新启动服务。 + +“经常重启”是一个非常有用的策略来弥补我尚未修复的 bug,有一段时间,nginx playground 有一个进程泄漏,`nginx` 进程没有被终止,所以服务器的内存一直在耗尽。 + +通过健康检查,其结果是,每隔一天左右就会发生这样的情况: + + * 服务器的内存用完了 + * 健康检查开始失败 + * 它被重新启动 + * 一切又正常了 + * 几个小时后再次重复整个传奇 + + + +最终,我开始实际修复进程泄漏,但很高兴有一个解决方法可以在我拖延修复 bug 时保持运行。 + +这些用于决定是否重新启动服务的运行状况检查更频繁地运行:每 5 分钟左右。 + +### 这不是监控大型服务的最佳方式 + +这可能很明显,我在一开始就已经说过了,但是“编写一个 HTTP 健康检查”并不是监控大型复杂服务的最佳方法。 但我不会深入讨论,因为这不是这篇文章的主题。 + +### 到目前为止一直运行良好! + +我最初在 3 个月前的四月写了这篇文章,但我一直等到现在才发布它以确保整个设置正常工作。 + +这带来了很大的不同 – 在我遇到一些非常愚蠢的停机问题之前,现在在过去的几个月里,网站的运行时间达到了 99.95%! + +-------------------------------------------------------------------------------- + +via: https://jvns.ca/blog/2022/07/09/monitoring-small-web-services/ + +作者:[Julia Evans][a] +选题:[lujun9972][b] +译者:[geekpi](https://github.com/geekpi) +校对:[校对者ID](https://github.com/校对者ID) + +本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出 + +[a]: https://jvns.ca/ +[b]: https://github.com/lujun9972 +[1]: https://nginx-playground.wizardzines.com +[2]: https://messwithdns.net +[3]: https://dns-lookup.jvns.ca +[4]: https://updown.io/ +[5]: https://uptimerobot.com/ +[6]: https://updown.io/#pricing