Mirror of https://github.com/LCTT/TranslateProject.git, synced 2025-01-13 22:30:37 +08:00
Translated by qhwdw
parent c120bbf959
commit 480d4f1f18
@@ -1,55 +0,0 @@
Translating by qhwdw
Process Monitoring
======

Since forking the Mon project to [etbemon [1]][1] I've been spending a lot of time working on the monitor scripts. Actually monitoring something is usually quite easy; deciding what to monitor tends to be the hard part. The process monitoring script ps.monitor is the one I'm about to redesign.

Here are some of my ideas for monitoring processes. Please comment if you have any suggestions for how to do things better.

For people who don't use mon: a monitor script returns 0 if everything is OK and 1 if there's a problem, and prints an error message on stdout. I'm not aware of anyone hooking mon scripts into a different monitoring system, but that should be easy to do. One thing I plan to work on in the future is interoperability between mon and other systems such as Nagios.
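
The convention described above can be sketched in a few lines of Python. This is only an illustration: the function name `report` and the idea of collecting problems in a list are assumptions, not part of mon.

```python
def report(problems):
    """Map a list of problem strings to (exit_status, stdout_text).

    Follows the convention described above: status 0 and no message when
    everything is OK, status 1 and a message when something is wrong.
    """
    if not problems:
        return 0, ""
    return 1, "; ".join(problems)


# Hypothetical result of a check that found one problem.
status, message = report(["master: 0 running, expected 1-2"])
```

A real script would print `message` and pass `status` to `sys.exit()`, which is all another monitoring system would need to look at.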

### Basic Monitoring
```
ps.monitor tor:1-1 master:1-2 auditd:1-1 cron:1-5 rsyslogd:1-1 dbus-daemon:1- sshd:1- watchdog:1-2
```

I'm currently planning some sort of rewrite of the process monitoring script. The current functionality is to take a list of process names on the command line, each with minimum and maximum numbers for the instances of that process. The above is a sample of the monitor's configuration. There are some limitations to this: the "master" process in this instance refers to the main process of Postfix, but other daemons use the same process name (it's one of those names that's wrong because it's so obvious). One obvious solution is to give the option of specifying the full path, so that /usr/lib/postfix/sbin/master can be differentiated from all the other programs named master.
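
As a rough illustration of the `name:min-max` arguments, here is a Python sketch of parsing and checking them. It is not the ps.monitor code: the function names are hypothetical, and it assumes that an empty maximum (as in `sshd:1-`) means "no upper limit". Splitting on the last colon means a full path works as the name too.

```python
def parse_spec(spec):
    """Split 'name:min-max' into (name, minimum, maximum).

    Assumption: an empty maximum, as in 'sshd:1-', means no upper limit
    and is returned as None.
    """
    name, _, limits = spec.rpartition(":")
    low, _, high = limits.partition("-")
    return name, int(low), (int(high) if high else None)


def check_counts(specs, counts):
    """Compare observed instance counts (name -> number) against the specs."""
    problems = []
    for spec in specs:
        name, low, high = parse_spec(spec)
        running = counts.get(name, 0)
        if running < low or (high is not None and running > high):
            wanted = f"{low}-{'' if high is None else high}"
            problems.append(f"{name}: {running} running, expected {wanted}")
    return problems
```

The `counts` mapping is assumed to come from whatever lists the running processes; keying it by full path instead of by short name is what would distinguish the Postfix master from other programs called master.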

The next issue is processes that may run on behalf of multiple users. With sshd there is a single process running as root to accept new connections, plus a process running under the UID of each logged-in user. So the number of sshd processes running as root will be one greater than the number of root login sessions. This means that if a sysadmin logs in directly as root via ssh (which is controversial and not the topic of this post, merely something that people do which I have to support) and the master process then crashes (or the sysadmin stops it, either accidentally or deliberately), there won't be an alert about the missing process, because the root-owned process for the login session still satisfies a minimum count of one. Of course the correct thing to do is to have a monitor talk to port 22 and look for the string "SSH-2.0-OpenSSH_". Sometimes there are multiple instances of a daemon running under different UIDs that need to be monitored separately. So obviously we need the ability to monitor processes by UID.
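
One way to picture per-UID monitoring is to count processes by (name, UID) pair rather than by name alone. The Python sketch below assumes the process list has already been collected as `(name, uid)` tuples; the function names and the `limits` structure are hypothetical, not an existing ps.monitor interface.

```python
from collections import Counter


def count_by_name_and_uid(processes):
    """processes: iterable of (name, uid) pairs -> Counter keyed by the pair."""
    return Counter(processes)


def check_uid_limits(processes, limits):
    """limits maps (name, uid) -> (minimum, maximum); maximum may be None."""
    counts = count_by_name_and_uid(processes)
    problems = []
    for (name, uid), (low, high) in limits.items():
        running = counts[(name, uid)]
        if running < low or (high is not None and running > high):
            problems.append(f"{name} (uid {uid}): {running} running")
    return problems
```

Note that this does not fix the root-login case described above: a root session still counts as an sshd process with UID 0, which is why the port check remains the better test for sshd itself.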

In many cases process monitoring can be replaced by monitoring of service ports. If something is listening on port 25 then it probably means that the Postfix "master" process is running, regardless of what other "master" processes there are. But for my use I find it handy to have multiple monitors: if I get a Jabber message about being unable to send mail to a server, immediately followed by a Jabber message from that server saying that "master" isn't running, I don't need to fully wake up to know where the problem is.
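
A port check of the kind mentioned for sshd (connect, then look for the "SSH-2.0-OpenSSH_" greeting) might look roughly like this in Python. It is a sketch using only the standard `socket` module; the function names, the 256-byte limit and the 5-second timeout are arbitrary choices for illustration.

```python
import socket


def banner_starts_with(sock, expected, limit=256):
    """Read one greeting line (at most `limit` bytes) from a connected socket
    and report whether it starts with the expected bytes."""
    data = b""
    while b"\n" not in data and len(data) < limit:
        chunk = sock.recv(limit - len(data))
        if not chunk:  # peer closed the connection
            break
        data += chunk
    return data.startswith(expected)


def check_port(host, port, expected, timeout=5.0):
    """Return (status, message) in the 0 = OK, 1 = problem convention."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            ok = banner_starts_with(sock, expected)
    except OSError as exc:  # refused, unreachable, timed out, ...
        return 1, f"{host}:{port}: {exc}"
    if ok:
        return 0, ""
    return 1, f"{host}:{port}: unexpected banner"
```

For example, `check_port("localhost", 22, b"SSH-2.0-OpenSSH_")` would test the local sshd regardless of how many sshd processes happen to be running.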

### SE Linux

One feature that I want is monitoring SE Linux contexts of processes in the same way as monitoring UIDs. While I'm not interested in writing tests for other security systems I would be happy to include code that other people write. So whatever I do I want to make it flexible enough to work with multiple security systems.

### Transient Processes

Most daemons have a second process of the same name running during the startup process. This means that if you monitor for exactly 1 instance of a process you may get an alert about 2 processes running when "logrotate" or something similar restarts the daemon. You may also get an alert about 0 instances if the check happens to run at exactly the wrong time during the restart. My current way of dealing with this on my servers is to not alert until the second failure event, using the "alertafter 2" directive. The "failure_interval" directive allows specifying the time between checks when the monitor is in a failed state; setting that to a low value means that waiting for a second failure result doesn't delay the notification much.

To deal with this I've been thinking of making the ps.monitor script automatically check again after a specified delay. I think that solving the problem with a single parameter to the monitor script is better than using 2 configuration directives to mon to work around it.
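
The "check again after a delay" idea can be sketched as a small wrapper in Python. The names here are hypothetical and `check` stands for any function that returns a list of problems; the `sleep` parameter exists only so the wait can be replaced when experimenting.

```python
import time


def check_with_recheck(check, delay, sleep=time.sleep):
    """Run check(); if it reports problems, wait `delay` seconds and run it
    once more. Only the result of the second run is reported, so a daemon
    that was caught in the middle of a restart does not raise an alert."""
    problems = check()
    if not problems:
        return []
    sleep(delay)
    return check()
```

With this shape a transient state costs one extra check a few seconds later, while a real failure is still reported by the same monitor run.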

### CPU Use

Mon currently has a loadavg.monitor script to check the load average. But that won't catch the case of a single process using too much CPU time but not enough to raise the system load average. Also it won't catch the case of a CPU-hungry process going quiet (e.g. when the SETI at Home server goes down) while another process goes into an infinite loop. One way of addressing this would be to have the ps.monitor script have yet another configuration option to monitor CPU use, but this might get confusing. Another option would be to have a separate script that alerts on any process that uses more than a specified percentage of CPU time over its lifetime or over the last few seconds, unless it's in a whitelist of processes and users who are exempt from such checks. Probably every regular user would be exempt from such checks because you never know when they will run a file compression program. Also there is a short list of daemons that are excluded (like BOINC) and system processes (like gzip, which is run from several cron jobs).
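
The separate-script option could be sketched like this in Python. Everything here is an assumption for illustration: the record layout (`name`, `user`, `cpu_seconds`, `elapsed_seconds`), the function names, and the way exemptions are passed in. How the numbers are gathered is left out.

```python
def cpu_percent(cpu_seconds, elapsed_seconds):
    """CPU time used as a percentage of the measurement interval
    (the process lifetime, or the last few seconds)."""
    if elapsed_seconds <= 0:
        return 0.0
    return 100.0 * cpu_seconds / elapsed_seconds


def cpu_hogs(processes, limit, exempt_names=frozenset(), exempt_users=frozenset()):
    """Return the names of non-exempt processes using more than `limit` percent."""
    hogs = []
    for proc in processes:
        if proc["name"] in exempt_names or proc["user"] in exempt_users:
            continue  # whitelisted process name or user
        if cpu_percent(proc["cpu_seconds"], proc["elapsed_seconds"]) > limit:
            hogs.append(proc["name"])
    return hogs
```

Keeping the whitelist outside the function makes it easy to exempt all regular users while still checking daemons that are expected to stay mostly idle.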

### Monitoring for Exclusion

A common programming mistake is to call setuid() before setgid(), which means that the program doesn't have permission to call setgid(). If return codes aren't checked (and people who make such rookie mistakes tend not to check return codes) then the process keeps elevated permissions. Checking for processes running as GID 0 but not UID 0 would be handy. As an aside, a quick examination of a Debian/Testing workstation didn't show any obvious way that a process with GID 0 could gain elevated privileges, but that could change with one chmod 770 command.
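
On Linux the effective UID and GID of a process appear in the `Uid:` and `Gid:` lines of `/proc/<pid>/status` (real, effective, saved and filesystem IDs, in that order). A check for "GID 0 but not UID 0" could be sketched in Python as below; the function names are hypothetical and the status text is passed in rather than read from `/proc`, so the logic can be tried anywhere.

```python
def effective_ids(status_text):
    """Extract (effective_uid, effective_gid) from /proc/<pid>/status text."""
    ids = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("Uid", "Gid"):
            # Fields are: real, effective, saved, filesystem.
            ids[key] = int(rest.split()[1])
    return ids["Uid"], ids["Gid"]


def gid0_without_uid0(status_by_pid):
    """status_by_pid maps pid -> status text; return the suspicious pids."""
    suspects = []
    for pid, text in sorted(status_by_pid.items()):
        uid, gid = effective_ids(text)
        if gid == 0 and uid != 0:
            suspects.append(pid)
    return suspects
```

A monitor built on this would report any returned pid as a problem, since a daemon that dropped its UID but kept GID 0 most likely hit the setuid()/setgid() ordering mistake.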

On an SE Linux system there should be only one process running with the domain init_t. Currently that doesn't happen in Stretch systems running daemons such as mysqld and tor, due to policy not matching the recent functionality of systemd as requested by daemon service files. Such issues will keep occurring so we need automated tests for them.
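
A test for this could count processes per SE Linux domain. The Python sketch below assumes the contexts have already been collected as strings (on Linux they can be read from `/proc/<pid>/attr/current`) and takes the domain to be the third colon-separated field of the context. The function names and the `expected` mapping are hypothetical.

```python
from collections import Counter


def selinux_domain(context):
    """Return the type/domain field of a context such as
    'system_u:system_r:init_t:s0', or None if it has fewer than three fields."""
    fields = context.strip("\x00\n ").split(":")
    return fields[2] if len(fields) >= 3 else None


def domain_problems(contexts, expected):
    """expected maps a domain name to the exact number of processes allowed."""
    counts = Counter(selinux_domain(c) for c in contexts)
    return [
        f"{domain}: {counts[domain]} processes, expected {wanted}"
        for domain, wanted in expected.items()
        if counts[domain] != wanted
    ]
```

Calling `domain_problems(contexts, {"init_t": 1})` would then flag a system where daemons have been left running in init_t.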

Automated testing for configuration errors that might impact system security is a bigger issue; I'll probably write a separate blog post about it.

--------------------------------------------------------------------------------

via: https://etbe.coker.com.au/2017/09/28/process-monitoring/

Author: [Andrew][a]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China (Linux中国)](https://linux.cn/).

[a]:https://etbe.coker.com.au
[1]:https://doc.coker.com.au/projects/etbe-mon/
54
translated/tech/20170928 Process Monitoring.md
Normal file
@@ -0,0 +1,54 @@
Process Monitoring
======
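
Since forking the Mon project to [etbemon [1]][1], I have spent a lot of time working on the monitor scripts. Actually monitoring something is usually easy; deciding what to monitor is the hard part. The process monitoring script ps.monitor is the one I am about to redesign.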

Here are some of my ideas about process monitoring. If you have any suggestions for how to do it better, please let me know in the comments.
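
For people who don't use Mon: a monitor script returns 0 if everything is OK and 1 if there is a problem, and uses standard output to display an error message. I don't know of anyone hooking Mon scripts into a different monitoring system, but doing so would be easy. One thing I plan to work on in the future is interoperability between mon and other monitoring systems such as Nagios.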

### Basic Monitoring
```
ps.monitor tor:1-1 master:1-2 auditd:1-1 cron:1-5 rsyslogd:1-1 dbus-daemon:1- sshd:1- watchdog:1-2
```
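
I am currently planning some sort of rewrite of the process monitoring script. Its current functionality is to take a list of process names on the command line, each with the minimum and maximum number of instances of that process. The example above is a sample monitor configuration. This has some limitations: the "master" process in this example refers to the main process of Postfix, but other daemons use the same process name (it is one of those names that is wrong because it is so obvious). One obvious solution is to offer the option of specifying the full path, so that /usr/lib/postfix/sbin/master can be told apart from other programs named "master".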
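
The next issue is processes that may run on behalf of multiple users. Take sshd: a single process running as root accepts new connections, and a process runs under the UID of each logged-in user. So the number of sshd processes running as root will be one greater than the number of root login sessions. This means that if a sysadmin logs in directly as root via ssh (which is controversial, but is not the topic of this post; it is simply something people do and that I have to support) and the master process then crashes (or the sysadmin stops it, accidentally or deliberately), there will be no alert about the missing process. Of course, the correct approach is to have a monitor talk to port 22 and look for the string "SSH-2.0-OpenSSH_". Sometimes multiple instances of a daemon run under different UIDs and need to be monitored separately. So we need the ability to monitor processes by UID.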
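
In many cases process monitoring can be replaced by monitoring of service ports. If something is listening on port 25, it probably means the Postfix "master" process is running, regardless of what other "master" processes exist. But for my own use I find it handy to have multiple monitors: if I get a Jabber message about being unable to send mail to a server, immediately followed by a Jabber message from that server saying that "master" isn't running, I don't need to fully wake up to know where the problem is.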
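
### SE Linux

One feature I want is monitoring the SE Linux contexts of processes in the same way as monitoring UIDs. Although I am not interested in writing tests for other security systems, I would be happy to include code that other people write. So whatever I do, I want it to be flexible enough to work with multiple security systems.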
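
### Transient Processes

Most daemons have a second process of the same name running during startup. This means that if you monitor for exactly 1 instance of a process, you may get an alert about 2 processes running when "logrotate" or something similar restarts the daemon. You may also get an alert about 0 instances if the check happens to run at exactly the wrong moment during the restart. My current way of dealing with this on my servers is to not alert until the second failure event, using the "alertafter 2" directive. The "failure_interval" directive specifies the time between checks while the monitor is in a failed state; setting it to a low value means that waiting for a second failure result doesn't delay the notification much.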
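
To deal with this, I have been thinking of making the ps.monitor script automatically check again after a specified delay. I think solving the problem with a single parameter to the monitor script is better than working around it with two mon configuration directives.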
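
### CPU Use

Mon currently has a loadavg.monitor script that checks the load average. But it won't catch the case of a single process using too much CPU time without raising the system load average. Nor will it catch the case of a CPU-hungry process going quiet (for example, when the SETI at Home server goes down) (Translator's note: SETI@home is a scientific experiment created at the University of California, Berkeley that uses the idle computing resources of networked computers around the world to search for extraterrestrial intelligence) while another process goes into an infinite loop. One way to address this would be to give the ps.monitor script yet another configuration option to monitor CPU use, but that might get confusing. Another option would be a separate script that alerts on any process that uses more than a specified percentage of CPU time over its lifetime or over the last few seconds, unless it is in a whitelist of processes and users exempt from such checks. Probably every regular user would be exempt, because you never know when they will run a file compression program. There is also a short list of excluded daemons (like BOINC) and system processes (like gzip, which is run from several cron jobs).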
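
### Monitoring for Exclusion

A common programming mistake is to call setuid() before setgid(), which means the program no longer has permission to call setgid(). If return codes aren't checked (and people who make such rookie mistakes tend not to check return codes), the process keeps elevated permissions. Checking for processes running with GID 0 but not UID 0 would be handy. As an aside, a quick examination of a Debian/Testing workstation didn't show any obvious way that a process with GID 0 could gain elevated privileges, but that could change with a single chmod 770 command.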
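
On an SE Linux system there should be only one process running in the init_t domain. Currently that is not the case on Stretch systems running daemons such as mysqld and tor, because the policy does not match the recent systemd functionality requested by the daemon service files. Such issues will keep occurring, so we need automated tests for them.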

Automated testing for configuration errors that might affect system security is a bigger issue; I will probably write a separate blog post about it.

--------------------------------------------------------------------------------

via: https://etbe.coker.com.au/2017/09/28/process-monitoring/

Author: [Andrew][a]
Translator: [qhwdw](https://github.com/qhwdw)
Proofreader: [Proofreader ID](https://github.com/校对者ID)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China (Linux中国)](https://linux.cn/).

[a]:https://etbe.coker.com.au
[1]:https://doc.coker.com.au/projects/etbe-mon/