Fixing an Inaccessible Jira

Updates

2019/04/10 Adjusted the cron job frequency because memory was leaking too fast
2019/03/30 Added a cron job to restart Unity Cache Server
2019/03/28 First published

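Problem

Jira was unreachable: the browser could not connect after the address was entered.

Initial findings:

Jira, installed on the server, was unreachable, and so were the other services on it, Jenkins and GitLab.
The server could not be reached over SSH and local login did not work either; it was completely hung.
After a forced power-off and reboot, Jira reported that it could not access the database.
The monitor attached to the server showed a login screen flooded with disk read errors, and it was impossible to type a username or password.

Disk read errors were clearly behind the failures above. What caused the read errors is a separate question; the first priority was to fix them.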

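Investigation

Oddly enough, the server could be reached over SSH at this point, so most of the steps below were carried out in a remote SSH session:

Check the MySQL service status; it is not running normally: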

# systemctl status mysqld
● mysqld.service - MySQL Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2019-03-28 11:18:57 CST; 3s ago
     Docs: man:mysqld(8)
           http://dev.mysql.com/doc/refman/en/using-systemd.html
  Process: 16135 ExecStartPre=/usr/bin/mysqld_pre_systemd (code=exited, status=0/SUCCESS)
  Control: 16158
   CGroup: /system.slice/mysqld.service
           └─16161 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid

Mar 28 11:18:57 centos systemd[1]: Starting MySQL Server...
Mar 28 11:19:00 centos mysqld[16158]: Initialization of mysqld failed: 0
Mar 28 11:19:00 centos systemd[1]: mysqld.service: control process exited, code=exited status=1

Trying to connect to the database gives:

# mysql -u root -p
Enter password:
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104

Restart MySQL:

# systemctl stop mysqld
# systemctl start mysqld
Job for mysqld.service failed because the control process exited with error code. See "systemctl status mysqld.service" and "journalctl -xe" for details.

Running journalctl -xe shows the errors below. By default the output is colorized, with the key messages in red.

Mar 28 11:48:26 centos kernel: ata1.00: exception Emask 0x0 SAct 0xc000000 SErr 0x0 action 0x0
Mar 28 11:48:26 centos kernel: ata1.00: irq_stat 0x40000008
Mar 28 11:48:26 centos kernel: ata1.00: failed command: READ FPDMA QUEUED
Mar 28 11:48:26 centos kernel: ata1.00: cmd 60/08:d0:b8:35:dd/00:00:6f:00:00/40 tag 26 ncq 4096 in
                                        res 41/40:08:b8:35:dd/00:00:6f:00:00/00 Emask 0x409 (media error) <F>
Mar 28 11:48:26 centos kernel: ata1.00: status: { DRDY ERR }
Mar 28 11:48:26 centos kernel: ata1.00: error: { UNC }
Mar 28 11:48:26 centos kernel: ata1.00: configured for UDMA/133
Mar 28 11:48:26 centos kernel: sd 0:0:0:0: [sda] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 28 11:48:26 centos kernel: sd 0:0:0:0: [sda] tag#26 Sense Key : Medium Error [current] [descriptor]
Mar 28 11:48:26 centos kernel: sd 0:0:0:0: [sda] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
Mar 28 11:48:26 centos kernel: sd 0:0:0:0: [sda] tag#26 CDB: Read(10) 28 00 6f dd 35 b8 00 00 08 00
Mar 28 11:48:26 centos kernel: blk_update_request: I/O error, dev sda, sector 1876768184
Mar 28 11:48:26 centos kernel: ata1: EH complete
Mar 28 11:48:26 centos mysqld[6370]: Initialization of mysqld failed: 0
Mar 28 11:48:26 centos systemd[1]: mysqld.service: control process exited, code=exited status=1
Mar 28 11:48:27 centos systemd[1]: Failed to start MySQL Server.

Key error: blk_update_request: I/O error, dev sda, sector 1876768184. The repeated disk access errors in the log all point to the same location on the disk.
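
To confirm that the errors really do hit the same spot, the failing sectors can be tallied from the kernel log. This is a minimal sketch, assuming the log lines use the same blk_update_request format as above; count_io_error_sectors is a helper name made up for this example.

```shell
# Count how often each (device, sector) pair appears in kernel I/O error lines.
# Reads log text on stdin and prints "<count> <device> <sector>" per distinct pair.
count_io_error_sectors() {
  grep -oE 'I/O error, dev [a-z0-9]+, sector [0-9]+' \
    | awk '{ gsub(",", "", $4); print $4, $6 }' \
    | sort \
    | uniq -c \
    | awk '{ print $1, $2, $3 }'
}
```

On the server this could be fed with journalctl -k | count_io_error_sectors. A single line with a high count points to one bad spot, while many distinct sectors would suggest a wider failure.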

Restarting MySQL repeatedly produced the same error every time.

Environment

macOS 10.14.3
CentOS 7.6 1810
SystemRescueCD 6.0.2

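Fixing the Disk Read Errors

Repair Attempts

MySQL was installed to the system disk by default, and the data directory was never moved, so the data files are on the system disk as well.

Print the disk information: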

# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0 931.5G  0 disk
├─sda1            8:1    0   200M  0 part /boot/efi
├─sda2            8:2    0     1G  0 part /boot
└─sda3            8:3    0 930.3G  0 part
  ├─centos-root 253:0    0    50G  0 lvm  /
  ├─centos-swap 253:1    0   7.8G  0 lvm  [SWAP]
  └─centos-home 253:2    0 872.6G  0 lvm  /home
sdb               8:16   0   477G  0 disk
└─sdb1            8:17   0   477G  0 part /data
# lsscsi
[0:0:0:0]    disk    ATA      ST1000DM010-2EP1 CC43  /dev/sda
[1:0:0:0]    disk    ATA      Samsung SSD 860  1B6Q  /dev/sdb

Check the disk's SMART status:

# yum install smartmontools
# smartctl --all /dev/sda
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
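
An overall PASSED result generally only says that no attribute has crossed its failure threshold, so it does not contradict the read errors in the kernel log. Here is a sketch for pulling the sector-related counters out of the attribute table printed by smartctl -A. It assumes the usual ten-column table layout and the attribute names below, which vary by vendor; smart_sector_counters is a hypothetical helper.

```shell
# Extract sector-health counters from "smartctl -A" output read on stdin.
# Assumes the standard attribute table, where column 2 is the attribute name
# and column 10 is the raw value. Prints "<attribute> <raw value>" per match.
smart_sector_counters() {
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ {
    print $2, $10
  }'
}
```

It could be run as smartctl -A /dev/sda | smart_sector_counters. Non-zero raw values here would be worth watching even while the overall result still reads PASSED.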

Further reading: smartmontools, a monitoring tool for Self-Monitoring, Analysis and Reporting Technology (SMART)

# lvscan
  ACTIVE            '/dev/centos/swap' [7.75 GiB] inherit
  ACTIVE            '/dev/centos/home' [<872.56 GiB] inherit
  ACTIVE            '/dev/centos/root' [50.00 GiB] inherit
# lvdisplay
  --- Logical volume ---
  LV Path                /dev/centos/swap
  LV Name                swap
  VG Name                centos
  LV UUID                F3L08d-MoMj-i95k-rfAq-plaA-XTmp-CZSaey
  LV Write Access        read/write
  LV Creation host, time localhost, 2019-01-22 15:16:07 +0800
  LV Status              available
  # open                 2
  LV Size                7.75 GiB
  Current LE             1984
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/centos/home
  LV Name                home
  VG Name                centos
  LV UUID                fvtd5E-06YL-Sj8K-xTrE-USHa-IorI-Wa2nRm
  LV Write Access        read/write
  LV Creation host, time localhost, 2019-01-22 15:16:08 +0800
  LV Status              available
  # open                 1
  LV Size                <872.56 GiB
  Current LE             223375
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Logical volume ---
  LV Path                /dev/centos/root
  LV Name                root
  VG Name                centos
  LV UUID                sh4wLi-nSsR-PwXs-Pn4U-g0BR-CyeI-6URLuu
  LV Write Access        read/write
  LV Creation host, time localhost, 2019-01-22 15:16:12 +0800
  LV Status              available
  # open                 1
  LV Size                50.00 GiB
  Current LE             12800
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

blk_update_request I/O error, dev sde, sector 0 - Red Hat Customer Portal

Running the repair command manually on the live system fails:

# xfs_repair /dev/centos/root
xfs_repair: /dev/centos/root contains a mounted filesystem
xfs_repair: /dev/centos/root contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library

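A disk that is in use cannot be repaired directly, so the next step was an offline repair from a bootable Linux USB stick with SystemRescueCD, following the article below.

Centos7异常断电宕机无法启动修复记 | 勇敢的心 (Repairing a CentOS 7 machine that would not boot after an unexpected power loss)

Download the Image

Download the latest version of SystemRescueCd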

SystemRescueCd - Installing SystemRescueCd on a USB stick

Create the Bootable USB Stick

Run in the macOS terminal:

$ diskutil list
/dev/disk0 (internal):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                         500.3 GB   disk0
   1:                        EFI EFI                     314.6 MB   disk0s1
   2:                 Apple_APFS Container disk1         500.0 GB   disk0s2

/dev/disk1 (synthesized):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      APFS Container Scheme -                      +500.0 GB   disk1
                                 Physical Store disk0s2
   1:                APFS Volume Macintosh HD            171.4 GB   disk1s1
   2:                APFS Volume Preboot                 45.3 MB    disk1s2
   3:                APFS Volume Recovery                517.0 MB   disk1s3
   4:                APFS Volume VM                      4.3 GB     disk1s4

/dev/disk2 (external, physical):
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:     FDisk_partition_scheme                        *3.9 GB     disk2
   1:             Windows_FAT_32 JUSTIN                  3.9 GB     disk2s1

The USB stick must be unmounted first. Note: the sudo umount /dev/disk2 approach described in other articles online does not work here; it reports umount: /dev/disk2: not currently mounted

$ diskutil unmountDisk /dev/disk2
Unmount of all volumes on disk2 was successful

$ sudo dd if=~/Downloads/systemrescuecd-6.0.2.iso of=/dev/rdisk2 bs=1m
871+0 records in
871+0 records out
913309696 bytes transferred in 167.405588 secs (5455670 bytes/sec)

$ diskutil eject /dev/disk2

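MacOS 系统使用 dd 命令创建 Linux 启动U盘 | StarryLand (Creating a bootable Linux USB stick with dd on macOS)

Note: prefixing the device name with r (/dev/rdisk2 instead of /dev/disk2) speeds up the write considerably, because it bypasses the operating system's disk buffers.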

Note the “r” prefix on the device names – this bypasses the OS disk buffers, and in my experience makes dd run much faster.

macos - Time Machine size explodes when copied to new drive - Super User
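
The naming rule can be captured in a small helper so the raw device is not mistyped. This is only a sketch: raw_device and dd_command are names invented for this example, and dd_command just prints the command instead of running it, so the target can be checked before anything is overwritten.

```shell
# Map a macOS block device (/dev/diskN) to its raw counterpart (/dev/rdiskN).
# Paths that do not start with /dev/disk are returned unchanged.
raw_device() {
  case "$1" in
    /dev/disk*) printf '/dev/r%s\n' "${1#/dev/}" ;;
    *)          printf '%s\n' "$1" ;;
  esac
}

# Print (do not run) the dd command that would write an image to the raw device.
dd_command() {
  printf 'sudo dd if=%s of=%s bs=1m\n' "$1" "$(raw_device "$2")"
}
```

For the stick above, dd_command ~/Downloads/systemrescuecd-6.0.2.iso /dev/disk2 prints a command that targets /dev/rdisk2.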

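Note: Windows users can follow the official documentation to create the bootable USB stick.

Repair

Change the machine's boot order so that the USB stick comes first, then insert the stick and boot from it.

In the SystemRescueCD boot menu, choose the first entry, default boot options, and wait for the boot to finish.

Because the machine could still boot and only some files were unreadable, mounting the root volume succeeded: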

# mount /dev/centos/root /mnt
# ls /mnt
...

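Run the repair (xfs_repair only works on an unmounted filesystem, so unmount /mnt first if it is still mounted from the previous step):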

# xfs_repair /dev/centos/root

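The output showed no obvious error messages; it only mentioned that a few things had been repaired.

After removing the USB stick and rebooting, the system came up normally, and MySQL, Jira, GitLab and Jenkins were all working again.

Fixing High Memory Usage

Check Memory Usage

Later, while Jira was in use, it suddenly became extremely slow again. Logging in over SSH right away showed that used memory was very high, taking up almost all of the RAM, and that swap, which normally sits idle, was in use: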

# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G         14G        152M        131M        573M        214M
Swap:          7.7G        2.2G        5.6G

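Note: on a machine with enough free RAM, disk-backed swap normally stays close to unused. Once the system starts swapping heavily, responsiveness drops sharply, so when things get sluggish, suspect memory first.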
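
The same check can be scripted. The sketch below reads free -m output on stdin and flags the situation described above. It assumes the column layout shown in the listing (available in the seventh column of the Mem: line); the 10% default threshold and the name memory_pressure are choices made for this example only.

```shell
# Summarize "free -m" output read on stdin.
# $1: minimum acceptable percentage of available memory (default 10).
# Reports "pressure" when available memory is below that percentage
# or when any swap is in use; otherwise reports "ok".
memory_pressure() {
  awk -v min_pct="${1:-10}" '
    $1 == "Mem:"  { total = $2; avail = $7 }
    $1 == "Swap:" { swap_used = $3 }
    END {
      pct = (total > 0) ? int(avail * 100 / total) : 0
      status = (pct < min_pct || swap_used > 0) ? "pressure" : "ok"
      printf "available=%d%% swap_used=%dMiB status=%s\n", pct, swap_used, status
    }'
}
```

Usage would be free -m | memory_pressure 10.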

Check Per-Process Memory

List the processes sorted by memory usage:

# ps aux --sort -rss
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
unity     5946  0.0 14.3 4166772 2310272 ?     Sl   Mar28   0:31 /usr/bin/node /bin/unity-cache-server --NODE_CONFIG_DIR=/data/unity-cache-server/config
root      5873  1.1 13.9 6510320 2255416 ?     Sl   Mar28  19:23 /usr/bin/java -Djava.util.logging.config.file=/opt/atlassian/jira/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms384m -Xmx2048m -XX:InitialCodeCacheSize=32m -XX:ReservedCodeCacheSize=512m -javaagent:/opt/
unity     5927  0.0 13.1 3648680 2113632 ?     Sl   Mar28   0:57 /usr/bin/node /bin/unity-cache-server --NODE_CONFIG_DIR=/data/unity-cache-server/config
unity     5934  0.0 11.1 3208120 1797668 ?     Sl   Mar28   0:28 /usr/bin/node /bin/unity-cache-server --NODE_CONFIG_DIR=/data/unity-cache-server/config
jenkins   6031  0.6  7.0 7925472 1132532 ?     Ssl  Mar28  10:54 /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=5000 --debu
unity     5916  0.0  6.5 2170168 1061960 ?     Sl   Mar28   0:33 /usr/bin/node /bin/unity-cache-server --NODE_CONFIG_DIR=/data/unity-cache-server/config
git       6310  2.5  3.8 1027228 626400 ?      Ssl  Mar28  43:04 sidekiq 5.2.5 gitlab-rails [0 of 25 busy]
git      17623  0.0  2.9 798756 472748 ?       Sl   Mar28   1:14 unicorn worker[1] -D -E production -c /var/opt/gitlab/gitlab-rails/etc/unicorn.rb /opt/gitlab/embedded/service/gitlab-rails/config.ru

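Stop the Offending Process

Unity Cache Server tops the list. It is strange that a Unity cache server should need this much memory. After stopping the service, memory usage dropped by half!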

# systemctl stop unity-cache-server
# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        7.0G        7.8G        131M        576M        7.9G
Swap:          7.7G        2.0G        5.7G
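
The drop matches the process list: the RSS values of the four unity-cache-server processes shown there add up to 7283532 KiB, roughly 6.9 GiB. A small sketch for doing that sum directly from ps aux output, assuming RSS is the sixth column and is reported in KiB as in the listing; rss_total_mib_for_user is a made-up helper name.

```shell
# Sum the resident set size (RSS) of all processes owned by one user.
# Reads "ps aux" output on stdin; $1 is the user name.
# Prints the total in whole MiB (RSS in ps aux is in KiB).
rss_total_mib_for_user() {
  awk -v user="$1" '
    NR > 1 && $1 == user { kib += $6 }
    END { printf "%d\n", kib / 1024 }'
}
```

Usage would be ps aux | rss_total_mib_for_user unity. Filtering on the user column rather than on the command text keeps the awk process itself out of the total.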

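Upgrade Unity Cache Server

Checking the Unity Cache Server version shows v6.2.4. The official release page, Releases · Unity-Technologies/unity-cache-server, shows that the latest version is already v6.3.0, and that v6.2.5 fixed a bug where worker processes used too much memory: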

Reduced high memory usage in worker processes (#95)

Releases · Unity-Technologies/unity-cache-server

So, upgrade:

# systemctl stop unity-cache-server
# unity-cache-server --version
6.2.4

# curl -sL https://rpm.nodesource.com/setup_10.x | bash -
# sudo yum install -y nodejs

# npm update -g unity-cache-server
/usr/bin/unity-cache-server -> /usr/lib/node_modules/unity-cache-server/main.js
/usr/bin/unity-cache-server-cleanup -> /usr/lib/node_modules/unity-cache-server/cleanup.js
/usr/bin/unity-cache-server-import -> /usr/lib/node_modules/unity-cache-server/import.js
+ unity-cache-server@6.3.0
updated 2 packages in 7.147s

# unity-cache-server --version
6.3.0
# systemctl start unity-cache-server

Note: v6.3.0 also upgrades its Node.js version

Start Unity Cache Server

# systemctl start unity-cache-server

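The Memory Leak Persists

Although the v6.2.5 changelog says the high memory usage bug was fixed, v6.3.0 still shows excessive memory usage in practice. Unity Cache Server does not need to run continuously, and even if it restarts while Unity is using it, only the current asset download fails and Unity simply re-imports the asset. The workaround is therefore to restart the service on a schedule: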

crontab -e

30 */6 * * * systemctl restart unity-cache-server

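The cron job restarts the service (not the whole server) at 00:30 and then every 6 hours after that. The timing was chosen mainly because few people are using it during the lunch break. Four restarts a day also reflects how fast memory actually grows: without a cleanup for more than 6 hours, memory usage is likely to climb too high.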
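
To double-check which times of day a schedule like 30 */6 * * * covers, the hour step can be expanded by hand. This is a sketch of that arithmetic only, not a cron parser; cron_hour_step_times is a hypothetical helper that takes the minute field and the hour step.

```shell
# List the daily run times for a cron entry of the form "<minute> */<step> * * *".
# $1: minute (0-59), $2: hour step (for example 6).
# Hours start at 0 and advance by the step until the end of the day.
cron_hour_step_times() {
  minute="$1"
  step="$2"
  hour=0
  while [ "$hour" -lt 24 ]; do
    printf '%02d:%02d\n' "$hour" "$minute"
    hour=$((hour + step))
  done
}
```

cron_hour_step_times 30 6 lists 00:30, 06:30, 12:30 and 18:30, so one of the four restarts falls in the lunch break.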
