I made a mistake using XenCenter, and I freely admit that. I am trying to pinpoint what damage I caused and what the fix should have been.
The Xen pool (Production) consisted of 9 Xen boxes and Dell iSCSI-attached SAN space. VMs were split between local storage and SAN space.
XEN-4 was Pool Master.
Xen-8 was not working correctly: VMs could be assigned to the box as their home server, but as soon as I started a VM it would move to a different Xen box. Stopping the VM returned it to Xen-8.
My horrible mistake!!!!
Using XenCenter, I went to Xen-8 and clicked on Networking. Forgetting to place the server into maintenance mode, I clicked on the pool-wide Bond 0 and deleted it, completely removing network Bond 0 from the entire pool (sickening thud).
What still worked at this point:
All VMs that were on a Xen box's local storage continued to work fine. My websites loaded and were browsable, SQL was working, etc. However, none of the VMs stored on the SAN responded.
What would the proper method of fixing this issue have been? I would have assumed putting all Xen boxes except the Pool Master into maintenance mode, or some such. I figured on around 2 to 3 hours of downtime to fix it.
It seems like we just needed to re-establish the pool-wide network bond.
I went and told the primary administrator what I'd done. He decided to take the opportunity not just to fix the issue but also to change the pool so that XEN-1 was the pool master. I believe that in doing this he started removing Xen boxes from the pool, an act that is known to format the internal storage of the Xen box. After he rebuilt the pool with XEN-1 as the master, he found that all the local VMs were gone. He then blamed me for this, even though I had tested extensively and knew they were there when he started.
We have been down for days, he is blaming the entire downtime on my mistake, and my job is in jeopardy.
I would love to know what the proper resolution for this issue should have been, step by step, so I can take it into the retro meeting. Perhaps I'm wrong and removing servers from the pool was the only recourse, but I don't think so.
I'm the first one to admit my idiot mistake, but I believe this guy aggravated the situation and caused the extensive downtime.