Accidentally deleted Xen pool-wide bonded network, need to know how it should have been fixed. Xen 1.06, XenCenter 6.2

I made a mistake using XenCenter, and I freely admit that. I am attempting to pinpoint what damage I caused and what the fix should have been.

Setup
Production Xen pool consisting of 9 Xen boxes and a Dell iSCSI-attached SAN. VMs were mixed between local storage and SAN space.
XEN-4 was Pool Master.

Xen-8 was not working correctly. VMs could be assigned to the box as their home server, but as soon as I started a VM it would move to a different Xen box. Stopping the VM returned it to Xen-8.

My horrible mistake!!!!
Using XenCenter, I went to Xen-8 and clicked on Networking. Forgetting to place the server into maintenance mode, I clicked on the pool-wide Bond 0 and deleted it, completely removing network Bond 0 from the entire pool. (Sickening thud.)

What still worked at this point.
At this point, all VMs that were on a Xen box's local storage continued to work fine. My websites loaded and were browsable, SQL was working, etc. However, no VMs that were stored on the SAN responded.

QUESTION:
What would the proper method of fixing this issue have been? I would have assumed putting all Xen boxes except the Pool Master in maintenance mode, or some such. Figured around 2 to 3 hours of down time to fix.

Seems like we just needed to re-establish the Pool Wide network Bond.

I went and told the Primary administrator what I'd done. He decided to take the opportunity to not just fix the issue but to change the pool to make XEN-1 the pool master. I believe in doing this he started removing Xen boxes from the Pool. This act is known to format the internal storage of the Xen box. After he rebuilt the pool with XEN-1 as the master, he found that all the local VMs were gone. He then blamed me for this, while I had tested extensively and knew they were there when he started.
We have been down for days and he is blaming the entire down time on my mistake and my job is in jeopardy.

I would love to know what the proper resolution for this issue should have been, step by step so I can take this into the retro meeting. Perhaps I'm wrong and removing servers from the pool was the only recourse, but I don't think so.

I'm the first one to admit my idiot mistake, but I believe this guy aggravated the situation and caused the extensive down time.

Ayman Bakr

It is really frustrating how some people abuse a situation and hide their own ignorance by blaming their mistakes on others.

The primary admin does not seem to know XenServer or how to administer it. It is very strange that he felt he needed to remove all the XenServers from the pool just to make Xen-1 the pool master. You can do that as follows:

1. Open the console on the host you want to make the pool master, which is Xen-1
2. If you have HA, you need to disable it with this command:

xe pool-ha-disable

3. Assuming that your pool master Xen-4 is functioning properly (which I think it is in your case), issue the following command:

xe host-list

Note down the uuid of the XenServer you want to make the master (Xen-1) from the output above and insert it in the following command:

xe pool-designate-new-master host-uuid=<uuid of Xen-1>

4. Re-enable HA:

xe pool-ha-enable

As for your initial issue: the most dangerous practice performed by many IT professionals is testing in production! Your primary admin also complicated things by trying something on production that he had never tried before while a major issue was still open. Instead, he should have focused on resolving the major issue and left the housekeeping work for later.

Yes, you are correct: the focus should have been on re-establishing the pool-wide network bond. Sites like the following could have helped you, even if you needed to do it in a maintenance window where production would be down for a couple of hours or so:
http://www.virtues.it/2011/05/cxs56fp1-config-nic-settings/
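
As a rough illustration of what "re-establishing the bond" can look like from the command line, here is a minimal sketch using the xe CLI. It is not taken from the linked article. The helper names `run` and `recreate_bond` are made up for this example, the UUIDs are placeholders you would look up first, and it assumes a two-NIC bond. In a pool, the same call may need repeating per host with that host's own PIF UUIDs, and SAN storage that ran over the bond would likely need reconnecting afterwards, which is not shown.

```shell
#!/bin/sh
# Sketch only: recreate a bonded network on one host with the xe CLI.
# DRY_RUN=1 (the default here) prints each xe command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that either echoes or executes a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recreate_bond NETWORK_UUID PIF_UUID_1 PIF_UUID_2
# All three UUIDs are placeholders. You would normally obtain them with:
#   xe network-create name-label="Bond 0"   (prints the new network's UUID)
#   xe pif-list host-uuid=<uuid of host>    (lists that host's physical NICs)
recreate_bond() {
    run xe bond-create network-uuid="$1" pif-uuids="$2,$3"
}
```

Calling `recreate_bond <network uuid> <first pif uuid> <second pif uuid>` with the default DRY_RUN=1 only prints the command, so it can be reviewed before running it for real with DRY_RUN=0 on the XenServer console.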

I am not sure how bad it is now in your situation, but maybe this will help to get your VMs back:
1. Find the uuids of the failing VMs
2. Reset the power state on these VMs
3. Restart the VMs
I hope this recovery guide helps:

http://support.citrix.com/servlet/KbServlet/download/17140-102-671536/XenServer%20System%20Recovery%20Guide.pdf
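
The three VM recovery steps above might be sketched with the xe CLI roughly as follows. This is an illustration under assumptions, not the procedure from the recovery guide: `run` and `recover_vm` are hypothetical helper names, and the VM UUID is a placeholder. Resetting the power state only makes sense when the VM is definitely not running anywhere, since forcing it on a live VM risks disk corruption.

```shell
#!/bin/sh
# Sketch only: reset the power state of a stuck VM and start it again.
# DRY_RUN=1 (the default here) prints each xe command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that either echoes or executes a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 (done by hand): find the UUIDs of the failing VMs, for example with
#   xe vm-list params=uuid,name-label,power-state

# recover_vm VM_UUID
# Step 2: force the recorded power state back to halted.
# Step 3: start the VM again.
# Only use this when the VM is known not to be running on any host.
recover_vm() {
    run xe vm-reset-powerstate uuid="$1" --force
    run xe vm-start uuid="$1"
}
```

With the default DRY_RUN=1, `recover_vm <uuid of VM>` just prints the two commands so they can be checked first.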

adamant40

ASKER

Thank you for your response and your comments. Sorry for the delay, I have been working crazy hours trying to rebuild the entire production infrastructure. I tried looking through the documentation you listed for the exact fix for my mistake, but my low-level skills did not allow me to find the specific, step-by-step instructions that should have been carried out to repair it. I was hoping to have something like that to take to the review. Is it possible for you to provide that? "In the event that someone is a moron and removes the global bonded network from the global pool, you would fix it by doing the following?"

ASKER CERTIFIED SOLUTION

Ayman Bakr

This solution is only available to members.

adamant40

ASKER

Thanks very much, wish I'd had this information available to me at the time of my error.