Hi all, I've run into a problem with a Ceph cluster (version 18.2.4 Reef, stable) on `ceph-node1`. The `ceph-mgr` service throws an unhandled exception in the `devicehealth` module with a `disk I/O error`. The error `[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error` appears only on ceph-node1, and only while ceph-node2 is up and joined to the cluster. When I tested with Ceph 19.2.1, the error did not occur, which points to a version-specific issue in 18.2.4. Here's the catch: I plan to deploy an external Rook cluster, and the Ceph image in Rook supports only up to version 18.2.4, so for now I'm stuck on that version. The error shows up in the logs on ceph-node1 shortly after the ceph-mgr service is restarted while node2 is active (e.g. `Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: sqlite3.OperationalError: disk I/O error`).
Here is the relevant information — logs from `journalctl -u ceph-mgr@ceph-node1.service`:
tungpm@ceph-node1:~$ sudo journalctl -u ceph-mgr@ceph-node1.service
Mar 13 18:55:23 ceph-node1 systemd[1]: Started Ceph cluster manager daemon.
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: /lib/python3/dist-packages/scipy/__init__.py:67: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-interpreters. This will likely work for >
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: Improvements in the case of bugs are welcome, but is not on the NumPy roadmap, and full support may require significant effort to achieve.
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: from numpy import show_config as show_numpy_config
Mar 13 18:55:28 ceph-node1 ceph-mgr[7092]: 2025-03-13T18:55:28.018+0000 7ffafa064640 -1 mgr.server handle_report got status from non-daemon mon.ceph-node1
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 devicehealth.serve:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 Traceback (most recent call last):
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: return func(self, *args, **kwargs)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 355, in _do_serve
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: if self.db_ready() and self.enable_monitoring:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: return self.db is not None
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: self._db = self.open_db()
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: sqlite3.OperationalError: disk I/O error
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: During handling of the above exception, another exception occurred:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: Traceback (most recent call last):
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 399, in serve
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: self._do_serve()
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: self.open_db();
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: sqlite3.OperationalError: disk I/O error
Mar 13 19:16:41 ceph-node1 systemd[1]: Stopping Ceph cluster manager daemon...
Mar 13 19:16:41 ceph-node1 systemd[1]: ceph-mgr@ceph-node1.service: Deactivated successfully.
Mar 13 19:16:41 ceph-node1 systemd[1]: Stopped Ceph cluster manager daemon.
Mar 13 19:16:41 ceph-node1 systemd[1]: ceph-mgr@ceph-node1.service: Consumed 6.607s CPU time.
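For context on where I've been looking: the `devicehealth` module stores its SMART database through libcephsqlite, which keeps the SQLite file in the cluster's `.mgr` pool rather than on a local disk, so `sqlite3.OperationalError: disk I/O error` here most likely means the RADOS-backed VFS failed to read/write, not that the node's disk is bad. These are the diagnostic and workaround commands I'm aware of — a sketch assuming a standard cephadm-style deployment; the config option name is taken from the devicehealth module and may differ on your build:

```shell
# Check overall cluster health and that the .mgr pool (libcephsqlite's
# backing store for the devicehealth database) exists
ceph -s
ceph osd pool ls detail | grep '\.mgr'

# See which mgr is active and whether devicehealth is enabled
ceph mgr module ls

# Temporary workaround while pinned to 18.2.4: stop devicehealth polling
# so the mgr stops crashing on the failed sqlite open (re-enable later)
ceph device monitoring off
# equivalent via config (assumed option path for the devicehealth module):
ceph config set mgr mgr/devicehealth/enable_monitoring false
```

Disabling monitoring obviously only silences the symptom; I'd still like to understand why the libcephsqlite open fails on node1 only when node2 is in the cluster.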