Monitoring tiny web services

Hello! I've started to run a few more servers recently (nginx playground, mess with dns, dns lookup), so I've been thinking about monitoring.

It wasn't initially totally obvious to me how to monitor these websites, so I wanted to quickly write up how I did it.

I'm not going to talk about how to monitor Big Serious Mission Critical websites at all, only tiny unimportant websites.

goal: spend approximately 0 time on operations

I want the sites to mostly work, but I also want to spend approximately 0% of my time on the ongoing operations.

I was initially very wary of running servers at all because at my last job I was on a 24/7 oncall rotation for some critical services, and in my mind “being responsible for servers” meant “get woken up at 2am to fix the servers” and “have lots of complicated dashboards”.

So for a while I only made static websites so that I wouldn't have to think about servers.

But eventually I realized that any server I was going to write would be very low stakes: if it occasionally goes down for 2 hours it's no big deal, and I could just set up some very simple monitoring to help keep it running.

not having monitoring sucks

At first I didn't set up any monitoring for my servers at all. This had the extremely predictable outcome that sometimes the site broke and I didn't find out about it until somebody told me!

step 1: an uptime checker

The first step was to set up an uptime checker. There are tons of these out there; the ones I'm using right now are updown.io and uptime robot. I like updown's user interface and pricing structure more (it's per request instead of a monthly fee), but uptime robot has a more generous free tier.

These services do two things:

  1. check that the site is up
  2. email me if it goes down

I find that email notifications are the right level for me: I'll find out pretty quickly if a site goes down, but it doesn't wake me up or anything.
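
If you'd rather script the check setup than click through a web UI, updown.io also exposes a small HTTP API. Here's a rough sketch in Go; the /api/checks endpoint, the api-key parameter, and the period field (seconds between checks) are how I remember updown.io's API working, and the target URL is a placeholder, so treat this as an unverified sketch and check their API docs before using it.

    package main

    import (
        "fmt"
        "net/http"
        "net/url"
    )

    func main() {
        // Create an hourly check. The field names here are assumptions based
        // on my memory of updown.io's API docs -- verify before relying on them.
        form := url.Values{
            "api-key": {"YOUR_API_KEY"},
            "url":     {"https://example.com/healthcheck"}, // placeholder URL
            "period":  {"3600"},                            // seconds between checks
        }
        resp, err := http.PostForm("https://updown.io/api/checks", form)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }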

step 2: an end-to-end healthcheck

Next, let's talk about what “check that the site is up” actually means.

At first I just made one of my healthcheck endpoints a function that returned 200 OK no matter what.

This is kind of useful: it told me that the server was on!

But unsurprisingly I ran into problems, because it wasn't checking that the API was actually working: sometimes the healthcheck succeeded even though the rest of the service had gotten into a bad state.

So I updated it to actually make a real API request and make sure it succeeded.

All of my services do very few things (the nginx playground has just 1 endpoint), so it's pretty easy to set up a healthcheck that actually runs through most of the actions the service is supposed to do.

Here's what the end-to-end healthcheck handler for the nginx playground looks like. It's very basic: it just makes another POST request (to itself) and checks if that request succeeds or fails.


    func healthHandler(w http.ResponseWriter, r *http.Request) {
        // make a request to localhost:8080 with `healthcheckJSON` as the body
        // if it works, return 200
        // if it doesn't, return 500
        client := http.Client{}
        resp, err := client.Post("http://localhost:8080/", "application/json", strings.NewReader(healthcheckJSON))
        if err != nil {
            log.Println(err)
            w.WriteHeader(http.StatusInternalServerError)
            return
        }
        // close the response body so the underlying connection can be reused
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            log.Println(resp.StatusCode)
            w.WriteHeader(http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
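
For completeness, here's a minimal sketch of how a handler like that could be wired into the server. It assumes the healthHandler above; the /healthcheck path, the healthcheckJSON value, and the playgroundHandler stub are all made up for illustration, and the real nginx playground is organized differently.

    package main

    import (
        "log"
        "net/http"
        "strings" // used by the healthHandler defined above
    )

    // hypothetical healthcheck payload; the real request body isn't shown in the post
    const healthcheckJSON = `{"example": "request"}`

    // stand-in for the real endpoint that does the actual work
    func playgroundHandler(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    }

    func main() {
        http.HandleFunc("/", playgroundHandler)        // the real API endpoint
        http.HandleFunc("/healthcheck", healthHandler) // the end-to-end check from above
        log.Fatal(http.ListenAndServe(":8080", nil))
    }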

healthcheck frequency: hourly

Right now I'm running most of my healthchecks every hour, and some every 30 minutes.

I run them hourly because updown.io's pricing is per healthcheck, I'm monitoring 18 different URLs, and I wanted to keep my healthcheck budget pretty minimal at $5/year.

Taking an hour to find out that one of these websites has gone down seems ok to me: if there is a problem, there's no guarantee I'll get to fixing it all that quickly anyway.

If it were free to run them more often I'd probably run them every 5-10 minutes instead.

step 3: automatically restart if the healthcheck fails

Some of my websites are on fly.io, and fly has a pretty standard feature where I can configure an HTTP healthcheck for a service and restart the service if the healthcheck starts failing.

“Restart a lot” is a very useful strategy to paper over bugs that I haven't gotten around to fixing yet. For a while the nginx playground had a process leak where nginx processes weren't getting terminated, so the server kept running out of RAM.

With the healthcheck, the result of this was that every day or so, this would happen:

  • the server ran out of RAM
  • the healthcheck started failing
  • it got restarted
  • everything was fine again
  • repeat the whole saga again some number of hours later

Eventually I got around to actually fixing the process leak, but it was nice to have a workaround in place that could keep things running while I was procrastinating fixing the bug.

These healthchecks (the ones that decide whether to restart the service) run more often: every 5 minutes or so.
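
For reference, here's roughly what that configuration looks like in a fly.toml. The field names are from Fly's [[services.http_checks]] section as I remember it, and the /healthcheck path and the interval values are made up for illustration, so double-check this sketch against fly.io's current docs.

    # sketch of an HTTP healthcheck in fly.toml (unverified -- see fly.io docs)
    [[services]]
      internal_port = 8080
      protocol = "tcp"

      [[services.http_checks]]
        interval = "30s"      # how often Fly runs the check
        timeout = "5s"
        grace_period = "10s"  # wait a bit after a restart before checking again
        method = "get"
        path = "/healthcheck" # hypothetical path for the handler above
        protocol = "http"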

this is not the best way to monitor Big Services

This is probably obvious and I said this already at the beginning, but “write one HTTP healthcheck” is not the best approach for monitoring a large complex service. But I won't go into that because that's not what this post is about.

it's been working well so far!

I originally wrote this post 3 months ago in April, but I waited until now to publish it to make sure that the whole setup was working.

It's made a pretty big difference: before, I was having some very silly downtime problems, and now for the last few months the sites have been up 99.95% of the time!


via: https://jvns.ca/blog/2022/07/09/monitoring-small-web-services/

Author: Julia Evans  Topic selection: lujun9972  Translator: 译者ID  Proofreader: 校对者ID

This article is an original translation by LCTT and is proudly presented by Linux中国 (Linux China).