We have 10 OSDs. They all use the same disk model, Intel SSD DC S3520. The OSDs run on PVE systems that host no VMs and no monitors. The network switch is 10G. The NICs on the nodes are 1G. I have 2 test KVM guests running on Ceph. From a VM outside Ceph we test with rsync to one guest and a Dovecot backup to the other. After about 30 minutes the systems hang. From syslog:
Code:
Mar 30 14:55:45 ceph-test1 kernel: [2089.832107] sd 2:0:0:1: [sdb] abort
Mar 30 14:57:01 ceph-test1 kernel: [2165.630246] sd 2:0:0:1: [sdb] abort
Mar 30 14:58:55 ceph-test1 kernel: [2280.028104] INFO: task jbd2/sdb1-8:725 blocked for more than 120 seconds.
Mar 30 14:58:55 ceph-test1 kernel: [2280.028532] Not tainted 3.16.0-4-amd64 #1
Mar 30 14:58:55 ceph-test1 kernel: [2280.028742] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 30 14:58:55 ceph-test1 kernel: [2280.029143] jbd2/sdb1-8 D ffff88001a6c3a88 0 725 2 0x00000000
Mar 30 14:58:55 ceph-test1 kernel: [2280.029147] ffff88001a6c3630 0000000000000046 0000000000012f40 ffff88001004bfd8
Mar 30 14:58:55 ceph-test1 kernel: [2280.029149] 0000000000012f40 ffff88001a6c3630 ffff88001fc137f0 ffff88001ff9e3f0
Mar 30 14:58:55 ceph-test1 kernel: [2280.029150] 0000000000000002 ffffffff8113ee30 ffff88001004bbd0 ffff88001004bcb8
Mar 30 14:55:53 ceph-test2 kernel: [2090.856057] sd 2:0:0:1: [sdb] abort
Mar 30 14:59:02 ceph-test2 kernel: [2280.028096] INFO: task kworker/u2:0:6 blocked for more than 120 seconds.
Mar 30 14:59:02 ceph-test2 kernel: [2280.028799] Not tainted 3.16.0-4-amd64 #1
Mar 30 14:59:02 ceph-test2 kernel: [2280.029147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 30 14:59:02 ceph-test2 kernel: [2280.029828] kworker/u2:0 D ffff88001e5be4a8 0 6 2 0x00000000
Mar 30 14:59:02 ceph-test2 kernel: [2280.029852] Workqueue: scsi_tmf_2 scmd_eh_abort_handler [scsi_mod]
Mar 30 14:59:02 ceph-test2 kernel: [2280.029855] ffff88001e5be050 0000000000000046 0000000000012f40 ffff88001e5d7fd8
Mar 30 14:59:02 ceph-test2 kernel: [2280.029857] 0000000000012f40 ffff88001e5be050 ffff88001e5d7dc8 ffff88001e5d7d60
Mar 30 14:59:02 ceph-test2 kernel: [2280.029860] ffff88001e5d7dc0 ffff88001e5be050 0000000000002003 0000000000000040
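In case it helps anyone compare timings across the two guests, here is a minimal Python sketch that groups the abort and hung-task lines from the excerpt above by host. It assumes the syslog layout shown in that excerpt; the names `summarize`, `LINE_RE`, `ABORT_RE`, `HUNG_RE` and `SAMPLE` are illustrative only and not part of any existing tool.

```python
import re
from collections import defaultdict

# Assumed layout: "<Mon DD HH:MM:SS> <host> kernel: [<uptime>] <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<msg>.*)$"
)
# e.g. "sd 2:0:0:1: [sdb] abort"
ABORT_RE = re.compile(r"sd \S+ \[(?P<dev>\w+)\] abort")
# e.g. "INFO: task jbd2/sdb1-8:725 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<task>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# A few lines copied from the syslog excerpt in this post.
SAMPLE = [
    "Mar 30 14:55:45 ceph-test1 kernel: [2089.832107] sd 2:0:0:1: [sdb] abort",
    "Mar 30 14:57:01 ceph-test1 kernel: [2165.630246] sd 2:0:0:1: [sdb] abort",
    "Mar 30 14:58:55 ceph-test1 kernel: [2280.028104] INFO: task jbd2/sdb1-8:725 blocked for more than 120 seconds.",
    "Mar 30 14:55:53 ceph-test2 kernel: [2090.856057] sd 2:0:0:1: [sdb] abort",
    "Mar 30 14:59:02 ceph-test2 kernel: [2280.028096] INFO: task kworker/u2:0:6 blocked for more than 120 seconds.",
]


def summarize(lines):
    """Group SCSI abort and hung-task messages by host name."""
    summary = defaultdict(lambda: {"aborts": [], "hung": []})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a kernel line in the assumed layout
        host, msg = m["host"], m["msg"]
        abort = ABORT_RE.search(msg)
        hung = HUNG_RE.search(msg)
        if abort:
            summary[host]["aborts"].append((m["stamp"], abort["dev"]))
        elif hung:
            summary[host]["hung"].append((m["stamp"], hung["task"], int(hung["pid"])))
    return dict(summary)


if __name__ == "__main__":
    for host, events in summarize(SAMPLE).items():
        print(host, len(events["aborts"]), "aborts,", len(events["hung"]), "hung tasks")
```

To run it on a real log instead of `SAMPLE`, pass an open file object to `summarize`, since it only needs an iterable of lines.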
From the system sending data to the test2 system:
Code:
# doveadm backup -A remote:10.1.3.105
dsync-local(user1): Error: dsync(localhost.localdomain): I/O stalled, no activity for 600 seconds
dsync-local(user1): Error: Timeout during state=sync_mails (send=mails recv=recv_last_common)
dsync-local(user1): Error: Remote command process isn't dying, killing it
KVM configs (ceph-test1, then ceph-test2):
boot: c
bootdisk: scsi0
cores: 1
memory: 512
name: ceph-test1
net0: virtio=1A:64:14:A6:16:3A,bridge=vmbr0,tag=3
numa: 0
ostype: l26
protection: 1
scsi0: ceph-kvm:vm-9001-disk-1,discard=on,size=4G
scsi1: ceph-kvm:vm-9001-disk-2,discard=on,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=9035f5f9-60e8-42fa-b6ff-5ab38a160365
sockets: 1

boot: c
bootdisk: scsi0
cores: 1
memory: 512
name: ceph-test2
net0: virtio=E2:20:3B:C0:72:F1,bridge=vmbr0,tag=3
numa: 0
ostype: l26
protection: 1
scsi0: ceph-kvm:vm-9002-disk-1,discard=on,size=4G
scsi1: ceph-kvm:vm-9002-disk-2,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=9035f5f9-60e8-42fa-b6ff-5ab38a160365
sockets: 1
ceph: 10.2.6-1~bpo80+1
The systems are 4-disk Supermicro X10SLM and X9SCi-LN4F boards. They have 32 GB of ECC RAM. Does anyone have any suggestions on how to solve this problem?
