TREXman asked:
Ceph Cluster won't get healthy after one node fails
Hello
We are playing around with Ceph (version Luminous) as a new storage server.
For the cluster we have three physical Ubuntu nodes that act as the OSD, monitor, and iSCSI gateway nodes. For installation purposes we have one virtual Ubuntu admin node (ceph-deploy).
We used the official installation documentation (http://docs.ceph.com) for the setup.
The installation went fine and the cluster is up and running, but if we shut down one node, the cluster stays degraded.
If I power the node back on, the cluster gets healthy again after a few seconds.
Any ideas what we can do or check?
The cluster has one pool.
Name: rbd
PGs: 400
Size: 3
Min_size: 2
CRUSH map: autogenerated
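For reference, the pool settings above can be read back from the cluster (pool name rbd as listed):

ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool get rbd pg_num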
Cluster with all 3 nodes
  cluster:
    id:     551080bb-eada-44e4-bcbe-7c952dbca781
    health: HEALTH_OK

  services:
    mon:         3 daemons, quorum HBceph01,HBceph02,HBceph03
    mgr:         HBceph01(active), standbys: HBceph03, HBceph02
    osd:         12 osds: 12 up, 12 in
    tcmu-runner: 3 daemons active

  data:
    pools:   1 pools, 400 pgs
    objects: 1666 objects, 6416 MB
    usage:   148 GB used, 89275 GB / 89424 GB avail
    pgs:     400 active+clean

  io:
    client: 5401 B/s rd, 1941 B/s wr, 4 op/s rd, 0 op/s wr
Cluster with 2 nodes
  cluster:
    id:     551080bb-eada-44e4-bcbe-7c952dbca781
    health: HEALTH_WARN
            Degraded data redundancy: 1689/5067 objects degraded (33.333%), 390 pgs degraded, 400 pgs undersized
            1/3 mons down, quorum HBceph01,HBceph03

  services:
    mon:         3 daemons, quorum HBceph01,HBceph03, out of quorum: HBceph02
    mgr:         HBceph01(active), standbys: HBceph03
    osd:         12 osds: 8 up, 8 in
    tcmu-runner: 2 daemons active

  data:
    pools:   1 pools, 400 pgs
    objects: 1689 objects, 6523 MB
    usage:   97894 MB used, 59520 GB / 59616 GB avail
    pgs:     1689/5067 objects degraded (33.333%)
             390 active+undersized+degraded
             10 active+undersized

  io:
    client: 3715 B/s rd, 3 op/s rd, 0 op/s wr
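The 33.333% is consistent with the pool settings: 1689 objects x 3 replicas = 5067 copies, and with one of the three hosts down exactly one copy of every object (1689/5067) is missing.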
ID CLASS WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
-1       87.33000 root default
-3       29.11000     host HBceph01
 0   hdd  7.27699         osd.0         up  1.00000 1.00000
 1   hdd  7.27699         osd.1         up  1.00000 1.00000
 2   hdd  7.27699         osd.2         up  1.00000 1.00000
 3   hdd  7.27699         osd.3         up  1.00000 1.00000
-5       29.11000     host HBceph02
 4   hdd  7.27699         osd.4       down        0 1.00000
 5   hdd  7.27699         osd.5       down        0 1.00000
 6   hdd  7.27699         osd.6       down        0 1.00000
 7   hdd  7.27699         osd.7       down        0 1.00000
-7       29.11000     host HBceph03
 8   hdd  7.27699         osd.8         up  1.00000 1.00000
 9   hdd  7.27699         osd.9         up  1.00000 1.00000
10   hdd  7.27699         osd.10        up  1.00000 1.00000
11   hdd  7.27699         osd.11        up  1.00000 1.00000
ASKER CERTIFIED SOLUTION
ASKER
Thank you for your hint - you are right.
I was looking at the wrong failure domain in the CRUSH map - it is the node (host), not the OSD.
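For anyone hitting the same thing, the failure domain can be read straight out of the rule. A quick check (assuming the autogenerated rule has the Luminous default name replicated_rule):

ceph osd crush rule dump replicated_rule

The relevant part of the output is the chooseleaf step:

        "op": "chooseleaf_firstn",
        "num": 0,
        "type": "host"

"type": "host" means replicas are spread across nodes, not OSDs. With size 3, failure domain host, and only three hosts, Ceph has nowhere to re-create the third replica while a node is down, so the undersized/degraded state is expected until the node comes back.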