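Preface

Recently a user reported that a Java service in the test environment kept dying at around 00:00 every night. According to the user, the service has no scheduled tasks, traffic does not spike at that time, and the JVM configuration is reasonable, yet the process dies for no apparent reason.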

Troubleshooting

Reproducing the problem

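To reproduce the problem, we wrote a Spring Boot demo that does nothing but serve hello world and deployed it to the test environment. The application type is web_tomcat (WAR deployment), the base image is base_tomcat/java-centos6-jdk18-60-tom8050-ngx197, and the Java version in that image is 1.8.0_60. Given our earlier experience with MySQL being killed, our first guess was that Linux limits were to blame, so we deployed the same image to two different batches of machines. Sure enough, the service on the new machines died that night, while the service on the old machines kept running normally.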

Here are the limit settings on the machine where the service died:

Investigation

Are Java processes affected by limits?

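In principle, a Java process like this should not be affected by the open files limit (the maximum number of file descriptors a process may open). To rule it out anyway, we changed the value to match the healthy machines. Because the demo is a web_tomcat application, its startup script cannot be modified, so we used prlimit to change the limit of the running Java process: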

prlimit -p 32672 --nofile=1048576 

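The process still died at around 00:00 that night, so it seemed that the open files limit had nothing to do with the Java process dying. dmesg did not reveal anything useful either.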
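Whether a prlimit change actually took effect can be checked by reading the limits of the running process. The following is a minimal Python sketch, assuming the usual column layout of /proc/<pid>/limits on Linux. The sample text and the helper names parse_open_files and open_files_of are illustrative and not part of the original investigation.

```python
import re

# Illustrative excerpt in the layout of /proc/<pid>/limits; the values are made up.
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
"""


def parse_open_files(limits_text):
    """Return (soft, hard) for the 'Max open files' row, or None if it is absent."""
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            # 'unlimited' is mapped to None, numeric values to int
            return tuple(
                None if value == "unlimited" else int(value)
                for value in match.groups()
            )
    return None


def open_files_of(pid):
    """Read the live open files limit of a process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_open_files(handle.read())


print(parse_open_files(SAMPLE_LIMITS))
```

Calling open_files_of with the PID passed to prlimit would show the values the kernel is actually enforcing for that process.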

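Is memory allocated incorrectly because the Java version is too old?

We asked the jdos development team for help. They suspected the Java version: older versions may not cap the amount of memory they request. The reasoning is described here: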

https://blog.softwaremill.com/docker-support-in-new-java-8-finally-fd595df0ca54?gi=a0cc6736ed14

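Java memory on the abnormal machine:

Java memory on the healthy machine:

According to this article, limiting memory with Docker cgroups can get the JVM process killed, because older Java versions size themselves from the host's memory and CPU count rather than from the cgroup limits. Newer Java versions fix this. A screenshot of the solution described in the article: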

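We were skeptical, because our application already sets its JVM parameters explicitly.

Still, it was worth a try, so we added an experimental group running Java 11.0.8.

That night the Java process in the experimental group died as well, so the Java version is not the cause either.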
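For context on the hypothesis that was just ruled out, here is a small worked example of why a JVM that sizes itself from the host could overshoot a container limit. It is a sketch under stated assumptions: it uses the common rule of thumb that, without an explicit heap setting, the default maximum heap is about one quarter of the memory the JVM believes it has, and the host and container sizes are hypothetical numbers, not measurements from the machines in this article.

```python
GIB = 1024 ** 3


def default_max_heap(visible_memory_bytes):
    """Rule of thumb: with no explicit heap setting, max heap is about 1/4 of visible memory."""
    return visible_memory_bytes // 4


# Hypothetical sizes, chosen only for illustration.
host_memory = 64 * GIB
container_limit = 4 * GIB

heap_from_host = default_max_heap(host_memory)
heap_from_cgroup = default_max_heap(container_limit)

print(heap_from_host // GIB, "GiB if the JVM sizes itself from the host")
print(heap_from_cgroup // GIB, "GiB if the JVM sizes itself from the cgroup limit")
print("default heap exceeds container limit:", heap_from_host > container_limit)
```

An explicit heap setting such as -Xmx (an assumption about which parameters were set) bypasses this default entirely, which is consistent with the skepticism expressed above.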

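Is a scheduled task in the container responsible?

The base image is the official one provided by jdos, so we had never suspected a scheduled task. With no other leads left, we checked the container's scheduled tasks:

There are scheduled tasks, but their execution times do not match the time at which Java dies. We decided to delete them anyway and see what happened.

The Java process still died that night, but this time dmesg had logs: crond was killed at the same moment as Java, and the reason was an OOM caused by crond's excessive memory usage.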
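Pulling the victims out of a kernel log can be automated. The sketch below is a hedged example: the log lines are hypothetical (the original dmesg output is not reproduced in this article), and the exact wording of OOM messages varies between kernel versions, so the pattern accepts both a plain form and a form with an embedded UID field.

```python
import re

# Hypothetical kernel log lines; real wording differs between kernel versions.
SAMPLE_DMESG = """\
Out of memory: Kill process 1946 (crond) score 512 or sacrifice child
Killed process 1946, UID 0, (crond) total-vm:4312340kB, anon-rss:4194304kB, file-rss:0kB
Killed process 32672 (java) total-vm:9876540kB, anon-rss:1048576kB, file-rss:0kB
"""

# Matches "Killed process <pid> (<name>)" with an optional ", UID <n>," in between.
KILLED = re.compile(r"Killed process (\d+)(?:, UID \d+,)? \(([^)]+)\)")


def killed_processes(log_text):
    """Return (pid, name) pairs for every 'Killed process' line in the log."""
    return [(int(pid), name) for pid, name in KILLED.findall(log_text)]


for pid, name in killed_processes(SAMPLE_DMESG):
    print(pid, name)
```

Feeding it the real output of dmesg would list every process the OOM killer terminated, which is how crond and Java can be seen dying together.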

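Could there also be system-level cron tasks? We checked /etc/crontab and, sure enough, found more cron tasks. (Who built this image?!)

Their schedule matches the time at which the Java process dies. But here is the puzzle: the script the task runs, logrotate.sh, does not exist, so the task should not be able to cause any trouble.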
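Listing system-level jobs together with their schedule makes this kind of time correlation easier. The following Python sketch parses text in the /etc/crontab layout (minute, hour, day of month, month, day of week, user, command). The sample entry is hypothetical, since the actual file from the image is not reproduced here, and parse_system_crontab is an illustrative helper name.

```python
# Hypothetical /etc/crontab content; the real file from the image is not shown in this article.
SAMPLE_CRONTAB = """\
SHELL=/bin/bash
MAILTO=root
# minute hour day-of-month month day-of-week user command
0 0 * * * root ./logrotate.sh
"""


def parse_system_crontab(text):
    """Return (minute, hour, user, command) for each job line in /etc/crontab format."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 6)
        # Variable assignments such as SHELL=/bin/bash do not have seven fields
        if len(fields) < 7 or "=" in fields[0]:
            continue
        minute, hour, _dom, _month, _dow, user, command = fields
        jobs.append((minute, hour, user, command))
    return jobs


for minute, hour, user, command in parse_system_crontab(SAMPLE_CRONTAB):
    print(f"{hour}:{minute} as {user}: {command}")
```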

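To find out whether the scheduled task really is the cause, we changed the cron schedule to 11:00 in the morning, checked whether the Java process would die at that time, and used strace to capture a trace log of the process.

Sure enough, the Java process died at 11:00, so the cron task really does appear to be the trigger. Let's look at the strace output: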

19:59:01 close(3)                        = 0
19:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19:59:01 open("/etc/pam.d/crond", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=293, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# The PAM configuration file f"..., 4096) = 293
19:59:01 open("/lib64/security/pam_access.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\17\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=18552, ...}) = 0
19:59:01 mmap(NULL, 2113800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769322000
19:59:01 mprotect(0x7fd769325000, 2097152, PROT_NONE) = 0
19:59:01 mmap(0x7fd769525000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x3000) = 0x7fd769525000
19:59:01 close(5) = 0
19:59:01 open("/etc/ld.so.cache", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=16203, ...}) = 0
19:59:01 mmap(NULL, 16203, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fd7707f8000
19:59:01 close(5) = 0
19:59:01 open("/lib64/libnsl.so.1", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p@\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=113432, ...}) = 0
19:59:01 mmap(NULL, 2198192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769109000
19:59:01 mprotect(0x7fd76911f000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd76931e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x15000) = 0x7fd76931e000
19:59:01 mmap(0x7fd769320000, 6832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd769320000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd76931e000, 4096, PROT_READ) = 0
19:59:01 mprotect(0x7fd769525000, 4096, PROT_READ) = 0
19:59:01 munmap(0x7fd7707f8000, 16203) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_unix.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240&\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=51960, ...}) = 0
19:59:01 mmap(NULL, 2196352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768ef0000
19:59:01 mprotect(0x7fd768efc000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7690fb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0xb000) = 0x7fd7690fb000
19:59:01 mmap(0x7fd7690fd000, 45952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd7690fd000
19:59:01 close(6)                       = 0
19:59:01 mprotect(0x7fd7690fb000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096)              = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/lib64/security/pam_loginuid.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\t\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=10240, ...}) = 0
19:59:01 mmap(NULL, 2105480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd768ced000
19:59:01 mprotect(0x7fd768cef000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768eee000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x1000) = 0x7fd768eee000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd768eee000, 4096, PROT_READ) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_keyinit.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\10\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=10224, ...}) = 0
19:59:01 mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                      = 0x7fd768aea000
19:59:01 mprotect(0x7fd768aec000, 2093056, PROT_NONE)                     = 0
19:59:01 mmap(0x7fd768ceb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x1000) = 0x7fd768ceb000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ceb000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_limits.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\20\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18600, ...}) = 0
19:59:01 mmap(NULL, 2113848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7688e5000
19:59:01 mprotect(0x7fd7688e9000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768ae8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd768ae8000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ae8000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\v\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=14384, ...}) = 0
19:59:01 mmap(NULL, 2109624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7686e1000
19:59:01 mprotect(0x7fd7686e4000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7688e3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x2000) = 0x7fd7688e3000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7688e3000, 4096, PROT_READ)                       = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5)                     = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY)                      = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                      = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_env.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\r\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18592, ...}) = 0
19:59:01 mmap(NULL, 2113776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7684dc000
19:59:01 mprotect(0x7fd7684e0000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7686df000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd7686df000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7686df000, 4096, PROT_READ)                     = 0
19:59:01 open("/lib64/security/pam_deny.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\5\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=5952, ...}) = 0
19:59:01 mmap(NULL, 2101272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7682da000
19:59:01 mprotect(0x7fd7682db000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7684da000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0)                      = 0x7fd7684da000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7684da000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 read(3, "", 4096)             = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                      = 0
19:59:01 open("/etc/pam.d/other", O_RDONLY)                      = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)   = 0x7fd770804000
19:59:01 read(3, "#%PAM-1.0\nauth     required     "..., 4096) = 154
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC)   = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 open("/etc/security/access.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=4620, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Login access control table.\n#\n"..., 4096) = 4096
19:59:01 read(3, " should get access from ipv4 net"..., 4096) = 524
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)  = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 geteuid() = 0
19:59:01 open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG, st_size=901, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:$6$4.53VPrJ$1wxMpbsWYp4VKea"..., 4096) = 901
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                      = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                       = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
19:59:01 readlink("/proc/self/exe", "/usr/sbin/crond", 4096) = 15
19:59:01 sendto(3, "p\0\0\0M\4\5\0\1\0\0\0\0\0\0\0op=PAM:accountin"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                      = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/security/pam_env.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=2980, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# This is the configuration fi"..., 4096) = 2980
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3)                      = 0
19:59:01 munmap(0x7fd770804000, 4096)                       = 0
19:59:01 open("/etc/environment", O_RDONLY)   = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)               = 0x7fd770804000
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
19:59:01 sendto(3, "p\0\0\0O\4\5\0\2\0\0\0\0\0\0\0op=PAM:setcred a"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                       = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/proc/self/loginuid", O_WRONLY|O_TRUNC|O_NOFOLLOW)        = 3
19:59:01 write(3, "0", 1) = 1
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 getgid() = 0
19:59:01 keyctl(0, 0xfffffffd, 0, 0, 0) = 496466385
19:59:01 keyctl(0, 0xfffffffb, 0, 0, 0x30) = 785702132
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getrlimit(RLIMIT_CPU, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_DATA, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_CORE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_RSS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 getrlimit(RLIMIT_MEMLOCK, {rlim_cur=64*1024, rlim_max=64*1024}) = 0
19:59:01 getrlimit(RLIMIT_AS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_LOCKS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_SIGPENDING, {rlim_cur=883632, rlim_max=883632}) = 0
19:59:01 getrlimit(RLIMIT_MSGQUEUE, {rlim_cur=800*1024, rlim_max=800*1024}) = 0
19:59:01 getrlimit(RLIMIT_NICE, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getrlimit(RLIMIT_RTPRIO, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getpriority(PRIO_PROCESS, 0) = 20
19:59:01 open("/etc/security/limits.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# /etc/security/limits.conf\n#\n#E"..., 4096) = 1835
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/security/limits.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
19:59:01 getdents(3, /* 3 entries */, 32768) = 88
19:59:01 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY)                       = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=26060, ...}) = 0
19:59:01 mmap(NULL, 26060, PROT_READ, MAP_SHARED, 5, 0) = 0x7fd7707f5000
19:59:01 close(5)  = 0
19:59:01 getdents(3, /* 0 entries */, 32768) = 0
19:59:01 close(3) = 0
19:59:01 open("/etc/security/limits.d/90-nproc.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=193, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Default limit for number of us"..., 4096) = 193
19:59:01 read(3, "", 4096)              = 0
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096)   = 0
19:59:01 setrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 setpriority(PRIO_PROCESS, 0, 0) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                     = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                      = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)                      = 0
19:59:01 sendto(3, "t\0\0\0Q\4\5\0\3\0\0\0\0\0\0\0op=PAM:session_o"..., 116, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 116
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 setgid(0) = 0
19:59:01 open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 3
19:59:01 read(3, "65536\n", 31)         = 6
19:59:01 close(3)                       = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110)                       = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 open("/etc/group", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=497, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 lseek(3, 0, SEEK_CUR) = 0
19:59:01 read(3, "root:x:0:\nbin:x:1:bin,daemon\ndae"..., 4096) = 497
19:59:01 read(3, "", 4096)              = 0
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096)                     = 0
19:59:01 setgroups(1, [0]) = 0
19:59:01 setreuid(0, 4294967295) = 0
19:59:01 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, {0x558826e03b80, [], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, 8) = 0
19:59:01 pipe([3, 5])                   = 0
19:59:01 pipe([6, 7])                   = 0
19:59:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd7707fca70) = 1946
19:59:01 gettid()                     = 1943
19:59:01 open("/proc/self/task/1943/attr/exec", O_RDWR) = 8
19:59:01 write(8, NULL, 0) = -1 EINVAL (Invalid argument)
19:59:01 close(8) = 0
19:59:01 close(3) = 0
19:59:01 close(7) = 0
19:59:01 close(5) = 0
19:59:01 fcntl(6, F_GETFL)                       = 0 (flags O_RDONLY)
19:59:01 fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770804000
19:59:01 lseek(6, 0, SEEK_CUR)                     = -1 ESPIPE (Illegal seek)
19:59:01 read(6, "/bin/bash: ./logrotate.sh: \346\262\241\346\234"..., 4096) = 55
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da000
19:59:01 --- SIGCHLD (Child exited) @ 0 (0) ---
19:59:06 +++ killed by SIGKILL +++


As the trace shows, the last mmap call allocates 4 GB of memory in one go, and the process is killed right after.
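In a trace this long, the oversized allocation is easy to miss by eye. A minimal Python sketch for filtering out large mmap requests follows. The three sample lines are copied from the trace above; the helper name large_mmaps and the 1 GiB threshold are illustrative choices.

```python
import re

# Three lines copied from the trace above.
TRACE = """\
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770804000
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da000
"""

# Captures the length argument (second parameter) of each mmap call.
MMAP = re.compile(r"mmap\((?:NULL|0x[0-9a-f]+), (\d+),")

GIB = 1024 ** 3


def large_mmaps(trace_text, threshold_bytes):
    """Return the lengths of mmap calls that request at least threshold_bytes."""
    lengths = (int(match.group(1)) for match in MMAP.finditer(trace_text))
    return [length for length in lengths if length >= threshold_bytes]


for length in large_mmaps(TRACE, GIB):
    print(length, "bytes =", length / GIB, "GiB")
```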

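Just before that mmap, the process calls getrlimit. As with the earlier MySQL problem, the size of the allocation is derived from a system resource limit.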
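The two numbers in the trace line up arithmetically. The sketch below assumes that the process reserves one 4-byte slot per possible file descriptor and that the allocator rounds the request up to whole 4 KiB pages. Both are assumptions made for illustration; only the limit and the mmap length come from the trace.

```python
NOFILE = 1073741816        # rlim_cur from getrlimit(RLIMIT_NOFILE) in the trace
MMAP_LENGTH = 4294967296   # length of the final mmap call in the trace
SLOT_SIZE = 4              # assumption: one 4-byte entry per descriptor
PAGE_SIZE = 4096           # assumption: 4 KiB pages


def round_up(value, multiple):
    """Round value up to the next multiple of 'multiple'."""
    return -(-value // multiple) * multiple


table_bytes = NOFILE * SLOT_SIZE
print("table size:", table_bytes, "bytes")
print("matches the mmap length:", round_up(table_bytes, PAGE_SIZE) == MMAP_LENGTH)
```

Under these assumptions the open files limit alone accounts for the 4 GiB request, which supports the reading that the allocation is driven by the limit rather than by any real workload.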

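To confirm that cron is what brings Java down, we killed the cron process manually so that no scheduled tasks would run, and then checked again whether the Java process would die.

As expected, the Java process survived, so the cron task really is the cause.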

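Does the same problem occur on newer CentOS versions?

In principle, the OOM killer should kill only the process that uses the most memory, and the Java process was not the biggest consumer. Has the OOM killer's policy changed in newer CentOS releases?

Let's verify whether newer CentOS releases have this problem.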

The current image runs CentOS release 6.6 (Final). We added two more experimental groups, with the base image upgraded to CentOS release 6.10 (Final) and CentOS Linux release 7.9.2009 (Core) respectively, and added the same cron task to both.

On both CentOS release 6.10 (Final) and CentOS Linux release 7.9.2009 (Core), the Java process was not killed; only the cron child process was.

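Conclusion

The container's open files limit (the maximum number of file descriptors per process) is set unreasonably high. As a result, when cron runs a task, the container's memory usage shoots up until it risks running out of memory, and Linux protects itself by killing memory-hungry processes, which takes down the cron child process and the Java process together. (Which raises another question: why does this jdos base image run a shell script that does not exist at all, and run it twice?) Newer CentOS releases do not kill the Java process; our guess is that the victim selection policy differs slightly between CentOS versions.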

Problem analysis

How cron executes a task

On Linux, the crontab tool is provided by the cronie package. Let's walk through how cron executes a task:

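Here child_process() runs the cron child process. When cron runs a child process, it also has a mail-sending step:

When cron_popen runs, it zeroes a block of memory whose size is determined by the open files limit:

That explains the cron OOM: the open files limit is set far too high, and the cron job's output is not redirected (the trace shows crond reading the job's error message from the pipe), so cron enters the mail-sending logic. The memory it zeroes there exceeds the memory available to the container itself, which leads to the OOM.

The problem was fixed after cronie 1.5.4. To check the cronie version in the current container, run: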

rpm -q cronie
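The version string printed by that command can be compared against the 1.5.4 baseline mentioned above. The Python sketch below is illustrative: the sample strings are hypothetical examples of the name-version-release layout that rpm prints, and whether 1.5.4 itself already contains the fix is not stated here, so the check is a strict "newer than".

```python
import re


def cronie_version(rpm_output):
    """Extract (major, minor, patch) from a string such as 'cronie-1.4.4-16.el6.x86_64'."""
    match = re.match(r"cronie-(\d+)\.(\d+)\.(\d+)-", rpm_output.strip())
    if match is None:
        raise ValueError(f"unrecognized rpm output: {rpm_output!r}")
    return tuple(int(part) for part in match.groups())


def newer_than(version, baseline=(1, 5, 4)):
    """True if version is strictly newer than the baseline."""
    return version > baseline


# Hypothetical outputs of 'rpm -q cronie', for illustration only.
for sample in ("cronie-1.4.4-16.el6.x86_64", "cronie-1.5.7-5.el8.x86_64"):
    print(sample, "->", newer_than(cronie_version(sample)))
```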

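The Linux kernel OOM killer

The Linux kernel has a mechanism called the OOM killer (Out Of Memory killer). When the system is about to run out of memory, it kills a process to free memory, and processes that use a lot of memory, especially ones whose usage grows very quickly, are the likely victims. How the kernel detects the shortage and then picks and kills a process can be seen in the kernel source file mm/oom_kill.c: when memory runs short, out_of_memory() is triggered, which calls select_bad_process() to choose a "bad" process to kill.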

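The main criteria for choosing a process include:

  1. Memory usage: the OOM killer prefers the process that uses the most memory, because terminating it frees the most memory.

  2. OOM score: every process has an OOM score, computed from its memory usage and other factors. The OOM killer prefers to terminate the process with the highest score.

  3. Process priority: when choosing a victim, the OOM killer usually avoids system processes that are critical to the system. These processes usually have a higher priority, so they are less likely to be chosen.

  4. Resource demand: the OOM killer also takes a process's resource demands into account. It tends to terminate processes that request fewer resources, to minimize the impact on other running processes.

  5. Process attributes: some processes can be effectively exempted from termination, for example by adjusting their OOM score through /proc/[PID]/oom_score_adj (a value of -1000 disables OOM killing for that process). Such processes are unlikely to be terminated by the OOM killer.

Note: the OOM killer's behavior may differ between Linux versions.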
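The scores described in the list above are exposed under /proc, so likely victims can be ranked from user space. The following is a minimal Python sketch that reads /proc/[PID]/oom_score and /proc/[PID]/oom_score_adj. The function names and the proc_root parameter are illustrative additions, and the ranking mirrors only the "highest score first" criterion, not the kernel's full selection logic.

```python
import os


def read_oom_values(pid, proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for a process as integers."""
    values = []
    for name in ("oom_score", "oom_score_adj"):
        with open(os.path.join(proc_root, str(pid), name)) as handle:
            values.append(int(handle.read().strip()))
    return tuple(values)


def rank_by_oom_score(pids, proc_root="/proc"):
    """Sort pids so that the process with the highest oom_score comes first."""
    scored = []
    for pid in pids:
        try:
            score, _adj = read_oom_values(pid, proc_root)
        except (OSError, ValueError):
            continue  # the process exited or its files are unreadable
        scored.append((score, pid))
    return [pid for _score, pid in sorted(scored, reverse=True)]


if __name__ == "__main__":
    # On Linux this prints the current process; elsewhere it prints an empty list.
    print(rank_by_oom_score([os.getpid()]))
```

Passing the PIDs of crond and the Java process to rank_by_oom_score at the moment memory is tight would show which of the two the kernel currently considers the worse offender.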

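Solution

Use a newer, stable CentOS release. If the business cannot upgrade CentOS, set the open files limit to a reasonable value. An application_worker application can change the limit in its startup script. A web_tomcat application cannot modify its startup script, so either kill the cron process or delete the system cron tasks. Alternatively, upgrade cronie manually to version 1.5.7-5.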
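As an illustration of changing the limit at startup, the Python sketch below launches a command with a lowered soft open files limit. It is a hedged example, not the startup script used by application_worker: run_with_nofile and the probe command are hypothetical, the approach is POSIX only, and the hard limit is deliberately left untouched.

```python
import resource
import subprocess
import sys


def run_with_nofile(command, soft_limit):
    """Run command with the soft RLIMIT_NOFILE lowered to soft_limit (POSIX only)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may never exceed the hard limit
        soft_limit = min(soft_limit, hard)

    def lower_limit():
        # Runs in the child between fork and exec; the hard limit stays as it was
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    return subprocess.run(
        command, preexec_fn=lower_limit, capture_output=True, text=True, check=True
    )


# Usage: the child process reports the soft limit it actually sees.
probe = [
    sys.executable,
    "-c",
    "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])",
]
print(run_with_nofile(probe, 65536).stdout.strip())
```

The limit applies to the launched process and its descendants only; processes started elsewhere in the container, such as crond, keep whatever limit they inherited.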

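Afterword

The open files limit is a deep pit, and we have now fallen into it twice. Please check that the CentOS version and the limit settings of the containers running your services are reasonable. This case happened in the test environment, so it did not cause an incident, but the consequences of the same thing happening in production would be severe.

All of the machines newly added to the test environment have this problem. Our team has reported it to the machine provider, who will adjust the open files limit on these machines in one batch. If the problem is affecting your work in the meantime, you can temporarily delete the tasks in /etc/crontab inside the container.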

References

https://cloud.tencent.com/developer/article/1183262

https://github.com/cronie-crond/cronie

Author: Yang Yunlong, JD Retail

Source: JD Cloud Developer Community. Please credit the source when reposting.
