mirror of
https://github.com/LCTT/TranslateProject.git
synced 2024-12-26 21:30:55 +08:00
Merge pull request #3161 from strugglingyouth/master
Translation completed: 20150803 Troubleshooting with Linux Logs.md
commit
d9bb72b1c5
@ -1,117 +0,0 @@
|
||||
translation by strugglingyouth
|
||||
Troubleshooting with Linux Logs
|
||||
================================================================================
|
||||
Troubleshooting is the main reason people create logs. Often you’ll want to diagnose why a problem happened with your Linux system or application. An error message or a sequence of events can give you clues to the root cause, indicate how to reproduce the issue, and point out ways to fix it. Here are a few use cases for things you might want to troubleshoot in your logs.
|
||||
|
||||
### Cause of Login Failures ###
|
||||
|
||||
If you want to check whether your system is secure, review your authentication logs for failed login attempts and unfamiliar successes. Authentication failures occur when someone supplies incorrect or otherwise invalid login credentials, often to ssh for remote access or to su for local access to another user’s permissions. These are logged by the [pluggable authentication module][1], or PAM for short. Look in your logs for strings like "Failed password" and "user unknown". Successful authentication records include strings like "Accepted password" and "session opened".
|
||||
|
||||
Failure Examples:
|
||||
|
||||
pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.0.2.2
|
||||
Failed password for invalid user hoover from 10.0.2.2 port 4791 ssh2
|
||||
pam_unix(sshd:auth): check pass; user unknown
|
||||
PAM service(sshd) ignoring max retries; 6 > 3
|
||||
|
||||
Success Examples:
|
||||
|
||||
Accepted password for hoover from 10.0.2.2 port 4792 ssh2
|
||||
pam_unix(sshd:session): session opened for user hoover by (uid=0)
|
||||
pam_unix(sshd:session): session closed for user hoover
|
||||
|
||||
You can use grep to find which user accounts have the most failed logins. These are the accounts that potential attackers are trying and failing to access. This example is for an Ubuntu system, where authentication events are written to /var/log/auth.log.
|
||||
|
||||
$ grep "invalid user" /var/log/auth.log | cut -d ' ' -f 10 | sort | uniq -c | sort -nr
|
||||
23 oracle
|
||||
18 postgres
|
||||
17 nagios
|
||||
10 zabbix
|
||||
6 test
|
||||
|
||||
You’ll need to write a different command for each application and message because there is no standard format. Log management systems that automatically parse logs will effectively normalize them and help you extract key fields like username.
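The field number passed to cut depends on the exact layout of each line, including the padding syslog adds before single-digit days. As a minimal sketch, assuming sshd messages of the form "Failed password for invalid user NAME from ...", the username can be extracted by pattern instead of by position. The sample file and the function name count_invalid_users are hypothetical and only serve the illustration.

```shell
# Hypothetical sample data in the same format as the sshd messages shown above.
auth_sample=$(mktemp)
cat > "$auth_sample" <<'EOF'
Mar  9 10:01:02 host sshd[101]: Failed password for invalid user oracle from 10.0.2.2 port 4791 ssh2
Mar 12 10:01:05 host sshd[102]: Failed password for invalid user oracle from 10.0.2.2 port 4792 ssh2
Mar 12 10:01:09 host sshd[103]: Failed password for invalid user test from 10.0.2.2 port 4793 ssh2
Mar 12 10:02:00 host sshd[104]: Accepted password for hoover from 10.0.2.2 port 4794 ssh2
EOF

# Print "count username" pairs, most frequently attacked account first.
# sed keeps only the text between "invalid user " and " from", so the
# result does not depend on how many fields precede the message.
count_invalid_users() {
    sed -n 's/.*Failed password for invalid user \([^ ]*\) from .*/\1/p' "$1" |
        sort | uniq -c | sort -nr
}

count_invalid_users "$auth_sample"
```

On a real Ubuntu system you would pass /var/log/auth.log instead of the sample file; other applications still need their own pattern.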
|
||||
|
||||
Log management systems can extract the usernames from your Linux logs using automated parsing. This lets you see an overview of the users and filter on them with a single click. In this example, we can see that the root user logged in over 2,700 times because we are filtering the logs to show login attempts only for the root user.
|
||||
|
||||
![](http://www.loggly.com/ultimate-guide/wp-content/uploads/2015/05/Screen-Shot-2015-03-12-at-11.05.36-AM.png)
|
||||
|
||||
Log management systems also let you view graphs over time to spot unusual trends. If someone had one or two failed logins within a few minutes, it might be that a real user forgot their password. However, if there are hundreds of failed logins or they all use different usernames, it’s more likely that someone is trying to attack the system. Here you can see that on March 12, someone tried to log in as test and nagios several hundred times. This is clearly not a legitimate use of the system.
|
||||
|
||||
![](http://www.loggly.com/ultimate-guide/wp-content/uploads/2015/05/Screen-Shot-2015-03-12-at-11.12.18-AM.png)
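Without a log management system, a rough version of the same trend can be approximated on the command line. The sketch below assumes the traditional syslog timestamp ("Mar 12 10:01:05") and counts "Failed password" messages per hour. The sample data and the function name failed_per_hour are hypothetical.

```shell
# Hypothetical sample data; real input would be /var/log/auth.log on Ubuntu.
trend_sample=$(mktemp)
cat > "$trend_sample" <<'EOF'
Mar 12 10:01:05 host sshd[102]: Failed password for invalid user nagios from 10.0.2.2 port 4792 ssh2
Mar 12 10:01:09 host sshd[103]: Failed password for invalid user test from 10.0.2.2 port 4793 ssh2
Mar 12 11:15:00 host sshd[104]: Failed password for invalid user test from 10.0.2.2 port 4794 ssh2
Mar 12 11:16:00 host sshd[105]: Accepted password for hoover from 10.0.2.2 port 4795 ssh2
EOF

# Count failed password attempts per hour.
# awk splits on runs of whitespace, so padded single-digit days are handled;
# substr($3, 1, 2) keeps only the hour part of the HH:MM:SS field.
failed_per_hour() {
    awk '/Failed password/ { print $1, $2, substr($3, 1, 2) ":00" }' "$1" |
        sort | uniq -c
}

failed_per_hour "$trend_sample"
```

A sudden jump in the count for one hour is the same signal the graph shows, only in text form.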
|
||||
|
||||
### Cause of Reboots ###
|
||||
|
||||
Sometimes a server can stop due to a system crash or reboot. How do you know when it happened and who did it?
|
||||
|
||||
#### Shutdown Command ####
|
||||
|
||||
If someone ran the shutdown command manually, you can see it in the auth log file. Here you can see that someone remotely logged in from the IP 50.0.134.125 as the user ubuntu and then shut the system down.
|
||||
|
||||
Mar 19 18:36:41 ip-172-31-11-231 sshd[23437]: Accepted publickey for ubuntu from 50.0.134.125 port 52538 ssh
|
||||
Mar 19 18:36:41 ip-172-31-11-231 sshd[23437]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
|
||||
Mar 19 18:37:09 ip-172-31-11-231 sudo: ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/shutdown -r now
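Entries like the last one can be found with a search for sudo lines whose COMMAND field runs a power-related binary. This is a sketch under assumptions: the list of command names (shutdown, reboot, halt, poweroff) is illustrative rather than exhaustive, the function name find_power_commands is hypothetical, and the sample file reuses the log lines above.

```shell
# Sample data copied from the log excerpt above.
shutdown_sample=$(mktemp)
cat > "$shutdown_sample" <<'EOF'
Mar 19 18:36:41 ip-172-31-11-231 sshd[23437]: Accepted publickey for ubuntu from 50.0.134.125 port 52538 ssh
Mar 19 18:36:41 ip-172-31-11-231 sshd[23437]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
Mar 19 18:37:09 ip-172-31-11-231 sudo: ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/shutdown -r now
EOF

# Show sudo entries whose command is one of a few power-related binaries.
# The trailing "( |$)" avoids matching longer names that merely start the same way.
find_power_commands() {
    grep -E 'sudo:.*COMMAND=.*/(shutdown|reboot|halt|poweroff)( |$)' "$1"
}

find_power_commands "$shutdown_sample"
```

The matching line names the user (ubuntu) and the time of the command; the SSH lines just before it show where that user connected from.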
|
||||
|
||||
#### Kernel Initializing ####
|
||||
|
||||
If you want to see when the server restarted regardless of the reason (including crashes), you can search for the messages the kernel logs while it initializes. Search for messages from the kernel facility that contain strings like "Initializing cpu".
|
||||
|
||||
Mar 19 18:39:30 ip-172-31-11-231 kernel: [ 0.000000] Initializing cgroup subsys cpuset
|
||||
Mar 19 18:39:30 ip-172-31-11-231 kernel: [ 0.000000] Initializing cgroup subsys cpu
|
||||
Mar 19 18:39:30 ip-172-31-11-231 kernel: [ 0.000000] Linux version 3.8.0-44-generic (buildd@tipua) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #66~precise1-Ubuntu SMP Tue Jul 15 04:01:04 UTC 2014 (Ubuntu 3.8.0-44.66~precise1-generic 3.8.13.25)
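One way to turn this into a list of boot times is to search for the "Linux version" banner at kernel timestamp 0.000000, assuming that banner is logged once per boot as in the excerpt above. In this sketch the second boot entry in the sample data is invented for illustration, and the function name list_boot_times is hypothetical.

```shell
# Sample data: the first kernel lines mirror the excerpt above (shortened);
# the Apr 02 entry is a made-up second boot for illustration.
boot_sample=$(mktemp)
cat > "$boot_sample" <<'EOF'
Mar 19 18:37:09 ip-172-31-11-231 sudo: ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/shutdown -r now
Mar 19 18:39:30 ip-172-31-11-231 kernel: [    0.000000] Initializing cgroup subsys cpuset
Mar 19 18:39:30 ip-172-31-11-231 kernel: [    0.000000] Linux version 3.8.0-44-generic (buildd@tipua)
Apr 02 07:12:44 ip-172-31-11-231 kernel: [    0.000000] Linux version 3.8.0-44-generic (buildd@tipua)
EOF

# Print the syslog timestamp of every kernel start found in the file.
# " *" tolerates the padding the kernel puts inside the brackets.
list_boot_times() {
    grep 'kernel: \[ *0\.000000\] Linux version' "$1" | awk '{ print $1, $2, $3 }'
}

list_boot_times "$boot_sample"
```

A boot time with no shutdown command shortly before it in the auth log is a hint that the restart was a crash or power loss rather than a manual action.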
|
||||
|
||||
### Detect Memory Problems ###
|
||||
|
||||
There are lots of reasons a server might crash, but one common cause is running out of memory.
|
||||
|
||||
When your system is low on memory, processes are killed, typically starting with the ones that will release the most resources. The error occurs when your system is using all of its memory and a new or existing process attempts to access additional memory. Look in your log files for strings like "Out of memory" or for kernel warnings containing "to kill". These strings indicate that your system intentionally killed the process or application rather than allowing the process to crash.
|
||||
|
||||
Examples:
|
||||
|
||||
[33238.178288] Out of memory: Kill process 6230 (firefox) score 53 or sacrifice child
|
||||
[29923450.995084] select 5230 (docker), adj 0, size 708, to kill
|
||||
|
||||
You can find these logs using a tool like grep. This example is for Ubuntu:
|
||||
|
||||
$ grep "Out of memory" /var/log/syslog
|
||||
[33238.178288] Out of memory: Kill process 6230 (firefox) score 53 or sacrifice child
|
||||
|
||||
Keep in mind that grep itself uses memory, so you might cause an out of memory error just by running grep. This is another reason it’s a fabulous idea to centralize your logs!
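Both message styles shown in the examples can be matched in one pass, and the name of the killed process pulled out of the parentheses. This is a minimal sketch that assumes the two message formats above; the third sample line is unrelated filler invented for the illustration, and the function name oom_victims is hypothetical.

```shell
# Sample data: the first two lines are the examples above, the third is filler.
oom_sample=$(mktemp)
cat > "$oom_sample" <<'EOF'
[33238.178288] Out of memory: Kill process 6230 (firefox) score 53 or sacrifice child
[29923450.995084] select 5230 (docker), adj 0, size 708, to kill
[33240.000000] usb 1-1: new high-speed USB device number 2
EOF

# Print the name of each process the kernel chose to kill.
# grep -i tolerates "Out of Memory" / "Out of memory" capitalization;
# sed keeps the text inside the parentheses on each matching line.
oom_victims() {
    grep -E -i 'out of memory|, to kill' "$1" | sed -n 's/.*(\([^)]*\)).*/\1/p'
}

oom_victims "$oom_sample"
```

Piping the result through sort and uniq -c would show which process is killed most often, which is usually the first place to look for a leak.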
|
||||
|
||||
### Log Cron Job Errors ###
|
||||
|
||||
The cron daemon is a scheduler that runs processes at specified dates and times. If the process fails to run or fails to finish, then a cron error appears in your log files. You can find these files in /var/log/cron, /var/log/messages, and /var/log/syslog depending on your distribution. There are many reasons a cron job can fail. Usually the problems lie with the process rather than the cron daemon itself.
|
||||
|
||||
By default, cron emails the output of each job to the job’s owner through the local mail transfer agent, which is Postfix in this example. Here is a log showing that an email was sent. Unfortunately, you cannot see the contents of the message here.
|
||||
|
||||
Mar 13 16:35:01 PSQ110 postfix/pickup[15158]: C3EDC5800B4: uid=1001 from=<hoover>
|
||||
Mar 13 16:35:01 PSQ110 postfix/cleanup[15727]: C3EDC5800B4: message-id=<20150310110501.C3EDC5800B4@PSQ110>
|
||||
Mar 13 16:35:01 PSQ110 postfix/qmgr[15159]: C3EDC5800B4: from=<hoover@loggly.com>, size=607, nrcpt=1 (queue active)
|
||||
Mar 13 16:35:05 PSQ110 postfix/smtp[15729]: C3EDC5800B4: to=<hoover@loggly.com>, relay=gmail-smtp-in.l.google.com[74.125.130.26]:25, delay=4.1, delays=0.26/0/2.2/1.7, dsn=2.0.0, status=sent (250 2.0.0 OK 1425985505 f16si501651pdj.5 - gsmtp)
|
||||
|
||||
You should consider logging the cron standard output to help debug problems. Here is how you can redirect your cron standard output to syslog using the logger command. Replace the echo command with your own script and helloCron with whatever you want to set the appName to.
|
||||
|
||||
*/5 * * * * echo 'Hello World!' 2>&1 | /usr/bin/logger -t helloCron
|
||||
|
||||
This creates the following log entries:
|
||||
|
||||
Apr 28 22:20:01 ip-172-31-11-231 CRON[15296]: (ubuntu) CMD (echo 'Hello World!' 2>&1 | /usr/bin/logger -t helloCron)
|
||||
Apr 28 22:20:01 ip-172-31-11-231 helloCron: Hello World!
|
||||
|
||||
Each cron job will log differently based on the specific type of job and how it outputs data. Hopefully there are clues to the root cause of problems within the logs, or you can add additional logging as needed.
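One form of additional logging is to record the exit status of the job next to its output, since a failing job may print nothing at all. The wrapper below is a sketch, not a standard tool: the function name run_job and the path /path/to/wrapper.sh in the comment are hypothetical.

```shell
# Run a command, print its combined stdout and stderr, then print its exit status.
run_job() {
    output=$("$@" 2>&1)
    status=$?
    printf '%s\n' "$output"
    printf 'exit status: %s\n' "$status"
}

# In a crontab, a script containing this wrapper would be piped to logger
# in the same way as the echo example above, for instance:
#   */5 * * * * /path/to/wrapper.sh | /usr/bin/logger -t helloCron

# Demonstration with a job that prints a message and then fails with status 3.
run_job sh -c 'echo "Hello World!"; exit 3'
```

With the status in syslog, a search for the tag followed by a non-zero exit status finds failed runs even when the job itself was silent.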
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: http://www.loggly.com/ultimate-guide/logging/troubleshooting-with-linux-logs/
|
||||
|
||||
Author: [Jason Skowronski][a1]
|
||||
Author: [Amy Echeverri][a2]
|
||||
Author: [Sadequl Hussain][a3]
|
||||
Translator: [译者ID](https://github.com/译者ID)
|
||||
Proofreader: [校对者ID](https://github.com/校对者ID)
|
||||
|
||||
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
|
||||
|
||||
[a1]:https://www.linkedin.com/in/jasonskowronski
|
||||
[a2]:https://www.linkedin.com/in/amyecheverri
|
||||
[a3]:https://www.linkedin.com/pub/sadequl-hussain/14/711/1a7
|
||||
[1]:http://linux.die.net/man/8/pam.d
|
117
translated/tech/20150803 Troubleshooting with Linux Logs.md
Normal file
@ -0,0 +1,117 @@
|
||||
Troubleshooting with Logs in Linux
|
||||
================================================================================
|
||||
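The main reason people create logs is troubleshooting. You will often need to diagnose why a problem occurred in your Linux system or application. An error message or a series of events can give you clues to the root cause, show how the problem arose, and point to a fix. Here are a few examples of using logs to solve problems.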
|
||||
|
||||
### Cause of Login Failures ###
|
||||
|
||||
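If you want to check whether your system is secure, you can review the authentication logs for failed logins and for successful logins that look suspicious. Authentication failures occur when someone tries to log in with incorrect or invalid credentials, often over SSH for remote login or through su to gain local access to another user's permissions. These events are logged by the [pluggable authentication module][1], or PAM for short. Look in your logs for strings like Failed password and user unknown. Successful authentication records include strings like Accepted password and session opened.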
|
||||
|
||||
Failure examples:
|
||||
|
||||
pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.0.2.2
|
||||
Failed password for invalid user hoover from 10.0.2.2 port 4791 ssh2
|
||||
pam_unix(sshd:auth): check pass; user unknown
|
||||
PAM service(sshd) ignoring max retries; 6 > 3
|
||||
|
||||
Success examples:
|
||||
|
||||
Accepted password for hoover from 10.0.2.2 port 4792 ssh2
|
||||
pam_unix(sshd:session): session opened for user hoover by (uid=0)
|
||||
pam_unix(sshd:session): session closed for user hoover
|
||||
|
||||
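You can use grep to find which users have the most failed logins. These are the accounts that potential attackers are trying, and failing, to access. This example is from an Ubuntu system.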
|
||||
|
||||
$ grep "invalid user" /var/log/auth.log | cut -d ' ' -f 10 | sort | uniq -c | sort -nr
|
||||
23 oracle
|
||||
18 postgres
|
||||
17 nagios
|
||||
10 zabbix
|
||||
6 test
|
||||
|
||||
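Because there is no standard format, you need a different command for each application's logs. Log management systems can parse logs automatically, normalize them effectively, and help you extract key fields such as the username.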
|
||||
|
||||
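Log management systems can use automated parsing to extract usernames from Linux logs. This lets you see an overview of the users and filter on a single one. In this example, we can see that the root user logged in more than 2,700 times, because we filtered the logs to show only login attempts by the root user.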
|
||||
|
||||
![](http://www.loggly.com/ultimate-guide/wp-content/uploads/2015/05/Screen-Shot-2015-03-12-at-11.05.36-AM.png)
|
||||
|
||||
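Log management systems also let you view graphs plotted over time, which makes unusual trends easier to spot. If someone fails to log in once or twice within a few minutes, it may be a real user who forgot the password. But if there are hundreds of failed logins, or they all use different usernames, it is more likely that someone is trying to attack the system. Here you can see that on March 12 someone tried to log in as test and nagios several hundred times. This is clearly not legitimate use of the system.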
|
||||
|
||||
![](http://www.loggly.com/ultimate-guide/wp-content/uploads/2015/05/Screen-Shot-2015-03-12-at-11.12.18-AM.png)
|
||||
|
||||
### Cause of Reboots ###
|
||||
|
||||
|
||||
Sometimes a server goes down because of a system crash or a reboot. How do you know when it happened and who did it?
|
||||
|
||||
#### Shutdown Command ####
|
||||
|
||||
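If someone ran the shutdown command manually, you can see it in the authentication log file. Here you can see that someone logged in remotely from the IP 50.0.134.125 as the user ubuntu and then shut the system down.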
|
||||
|
||||
Mar 19 18:36:41 ip-172-31-11-231 sshd[23437]: Accepted publickey for ubuntu from 50.0.134.125 port 52538 ssh
|
||||
Mar 19 18:36:41 ip-172-31-11-231 sshd[23437]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)
|
||||
Mar 19 18:37:09 ip-172-31-11-231 sudo: ubuntu : TTY=pts/1 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/sbin/shutdown -r now
|
||||
|
||||
#### Kernel Initializing ####
|
||||
|
||||
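If you want to see when the server restarted for any reason (including crashes), you can search the kernel initialization logs. Search for messages from the kernel facility and for Initializing cpu.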
|
||||
|
||||
Mar 19 18:39:30 ip-172-31-11-231 kernel: [ 0.000000] Initializing cgroup subsys cpuset
|
||||
Mar 19 18:39:30 ip-172-31-11-231 kernel: [ 0.000000] Initializing cgroup subsys cpu
|
||||
Mar 19 18:39:30 ip-172-31-11-231 kernel: [ 0.000000] Linux version 3.8.0-44-generic (buildd@tipua) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #66~precise1-Ubuntu SMP Tue Jul 15 04:01:04 UTC 2014 (Ubuntu 3.8.0-44.66~precise1-generic 3.8.13.25)
|
||||
|
||||
### Detect Memory Problems ###
|
||||
|
||||
There are many reasons a server might crash, but one common cause is running out of memory.
|
||||
|
||||
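When your system is low on memory, processes are killed, usually starting with the ones that will free the most resources. The error occurs when the system has used all of its memory and a new or existing process tries to use more. Look in your log files for strings like Out of Memory, or for kernel warnings like to kill. These messages indicate that the system deliberately killed the process or application rather than allowing it to crash.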
|
||||
|
||||
Examples:
|
||||
|
||||
[33238.178288] Out of memory: Kill process 6230 (firefox) score 53 or sacrifice child
|
||||
[29923450.995084] select 5230 (docker), adj 0, size 708, to kill
|
||||
|
||||
You can find these logs with a tool like grep. This example is for Ubuntu:
|
||||
|
||||
$ grep "Out of memory" /var/log/syslog
|
||||
[33238.178288] Out of memory: Kill process 6230 (firefox) score 53 or sacrifice child
|
||||
|
||||
Keep in mind that grep itself uses memory, so simply running grep might cause an out-of-memory error. This is another reason to centralize your logs!
|
||||
|
||||
### Log Cron Job Errors ###
|
||||
|
||||
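The cron daemon is a scheduler that runs processes at specified dates and times. If a process fails to run or fails to finish, a cron error appears in your log files. Depending on your distribution, you can find these files in /var/log/cron, /var/log/messages, and /var/log/syslog. There are many reasons a cron job can fail. Usually the problem lies with the process rather than with the cron daemon itself.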
|
||||
|
||||
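By default, cron jobs send their output by email, here through Postfix. Below is a log showing that an email was sent. Unfortunately, you cannot see the contents of the message here.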
|
||||
|
||||
Mar 13 16:35:01 PSQ110 postfix/pickup[15158]: C3EDC5800B4: uid=1001 from=<hoover>
|
||||
Mar 13 16:35:01 PSQ110 postfix/cleanup[15727]: C3EDC5800B4: message-id=<20150310110501.C3EDC5800B4@PSQ110>
|
||||
Mar 13 16:35:01 PSQ110 postfix/qmgr[15159]: C3EDC5800B4: from=<hoover@loggly.com>, size=607, nrcpt=1 (queue active)
|
||||
Mar 13 16:35:05 PSQ110 postfix/smtp[15729]: C3EDC5800B4: to=<hoover@loggly.com>, relay=gmail-smtp-in.l.google.com[74.125.130.26]:25, delay=4.1, delays=0.26/0/2.2/1.7, dsn=2.0.0, status=sent (250 2.0.0 OK 1425985505 f16si501651pdj.5 - gsmtp)
|
||||
|
||||
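You should consider logging the standard output of your cron jobs to help locate problems. Here is how you can use the logger command to redirect cron standard output to syslog. Replace the echo command with your own script, and set helloCron to whatever application name you want.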
|
||||
|
||||
*/5 * * * * echo 'Hello World!' 2>&1 | /usr/bin/logger -t helloCron
|
||||
|
||||
This creates the following log entries:
|
||||
|
||||
Apr 28 22:20:01 ip-172-31-11-231 CRON[15296]: (ubuntu) CMD (echo 'Hello World!' 2>&1 | /usr/bin/logger -t helloCron)
|
||||
Apr 28 22:20:01 ip-172-31-11-231 helloCron: Hello World!
|
||||
|
||||
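Each cron job logs differently depending on the type of job and how it outputs data. Ideally the logs contain clues to the root cause of a problem; you can also add extra logging as needed.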
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: http://www.loggly.com/ultimate-guide/logging/troubleshooting-with-linux-logs/
|
||||
|
||||
Author: [Jason Skowronski][a1]
|
||||
Author: [Amy Echeverri][a2]
|
||||
Author: [Sadequl Hussain][a3]
|
||||
Translator: [strugglingyouth](https://github.com/strugglingyouth)
|
||||
Proofreader: [校对者ID](https://github.com/校对者ID)
|
||||
|
||||
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
|
||||
|
||||
[a1]:https://www.linkedin.com/in/jasonskowronski
|
||||
[a2]:https://www.linkedin.com/in/amyecheverri
|
||||
[a3]:https://www.linkedin.com/pub/sadequl-hussain/14/711/1a7
|
||||
[1]:http://linux.die.net/man/8/pam.d
|