Solved

Accidentally deleted Xen pool-wide bonded network, need to know how it should have been fixed. Xen 1.06, XenCenter 6.2

Posted on 2013-12-17
Last Modified: 2016-11-23
I made a mistake using XenCenter and freely admit that. I am trying to pinpoint what damage I caused and what the fix should have been.

Setup
Xen pool "Production", consisting of 9 Xen boxes and Dell iSCSI-attached SAN space. VMs were split between local storage and SAN space.
Xen-4 was the pool master.

Xen-8 was not working correctly: VMs could be assigned to the box as their home server, but as soon as I started a VM it would move to a different Xen box. Stopping the VM returned it to Xen-8.

My horrible mistake!
Using XenCenter, I went to Xen-8 and clicked on Networking. Forgetting to place the server into maintenance mode, I clicked on the pool-wide Bond 0 and deleted it, completely removing network Bond 0 from the entire pool (sickening thud).

What still worked at this point:
All VMs that were on a Xen box's local storage continued to work fine. My websites loaded and were browsable, SQL was working, etc. However, none of the VMs stored on the SAN responded.

QUESTION:
What would the proper method of fixing this issue have been? I would have assumed putting all Xen boxes except the pool master into maintenance mode, or something similar. I figured on around 2 to 3 hours of downtime to fix it.

It seems like we just needed to re-establish the pool-wide network bond.

I went and told the primary administrator what I'd done. He decided to take the opportunity not just to fix the issue but also to change the pool so that Xen-1 was the pool master. I believe that in doing this he started removing Xen boxes from the pool, an act that is known to format the internal storage of the Xen box. After he rebuilt the pool with Xen-1 as the master, he found that all the local VMs were gone. He then blamed me for this, even though I had tested extensively and knew they were there when he started.
We have been down for days, he is blaming the entire downtime on my mistake, and my job is in jeopardy.

I would love to know what the proper resolution for this issue should have been, step by step, so I can take it into the retro meeting. Perhaps I'm wrong and removing servers from the pool was the only recourse, but I don't think so.

I'm the first one to admit my idiot mistake, but I believe this guy aggravated the situation and caused the extensive downtime.
Question by:adamant40

Expert Comment

by:Ayman Bakr
ID: 39729188
It is really frustrating how some people abuse a situation and hide their own ignorance by pinning their mistakes on others.

The primary admin is really ignorant about XenServer and how to administer it. It is very strange that he needed to remove all the XenServers from the pool just to make Xen-1 the pool master! You can do that as follows:

1. Open the console on the host you want to make the pool master, which is Xen-1.
2. If you have HA enabled, you need to disable it with this command:
xe pool-ha-disable

3. Assuming that your pool master Xen-4 is functioning properly (which I think it is in your case), issue the following command:
xe host-list

Note down the uuid of the XenServer you want to make the master (Xen-1) from the output above and insert it in the following command:
xe pool-designate-new-master host-uuid=<uuid of Xen-1>

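4. Re-enable HA (depending on your version you may also need to name the heartbeat SR by adding heartbeat-sr-uuids=<uuid of heartbeat SR>):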
xe pool-ha-enable

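The four steps above can be strung together roughly as follows. This is only a sketch, assuming it is run in the console of a pool member with the xe CLI available. The function name promote_master is made up for illustration, and the heartbeat-sr-uuids value is something you would have to look up for your own pool.

```shell
# promote_master HOST_NAME [HEARTBEAT_SR_UUID]
# Hypothetical helper: promotes the named host to pool master.
# Pass the heartbeat SR uuid only if HA is enabled on the pool.
promote_master() {
    name=$1
    heartbeat_sr=${2:-}

    # Look up the host uuid by name-label instead of copying it by hand.
    host_uuid=$(xe host-list name-label="$name" --minimal)
    if [ -z "$host_uuid" ]; then
        echo "no host found with name-label $name" >&2
        return 1
    fi

    # HA has to be off while the master changes.
    if [ -n "$heartbeat_sr" ]; then
        xe pool-ha-disable || return 1
    fi

    xe pool-designate-new-master host-uuid="$host_uuid" || return 1

    # The pool may need a moment to settle before HA can be turned back on.
    if [ -n "$heartbeat_sr" ]; then
        xe pool-ha-enable heartbeat-sr-uuids="$heartbeat_sr" || return 1
    fi
}
```

For example, promote_master Xen-1 on a pool without HA, or promote_master Xen-1 <uuid of heartbeat SR> on a pool with HA. No host has to leave the pool for this, so local storage is not touched.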

As for your initial issue: the most dangerous practice performed by many IT professionals is testing in production! Your primary admin also complicated things by trying something on production that he had never tried before, while a major issue was still open. Instead, he should have focused on resolving the major issue and left the housekeeping work for later.

Yes, you are correct: the focus should have been on re-establishing the pool-wide network bond. Sites like the following could have helped you, even if you needed to do it in a maintenance window where production would be down for a couple of hours or so:
http://www.virtues.it/2011/05/cxs56fp1-config-nic-settings/
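For what it is worth, here is a rough sketch of how a pool-wide bond could be re-created from the xe CLI. It is not taken from the linked article. The function name rebond_host is made up, the NIC names eth0/eth1 and the network name "Bond 0" are placeholders, and it assumes each NIC name matches exactly one PIF on the host (no VLANs defined on those NICs).

```shell
# rebond_host NETWORK_UUID HOST_NAME NIC_A NIC_B
# Hypothetical helper: bonds two physical NICs of one host onto an
# existing pool-wide network.
rebond_host() {
    network_uuid=$1
    host_name=$2
    nic_a=$3
    nic_b=$4

    host_uuid=$(xe host-list name-label="$host_name" --minimal)
    if [ -z "$host_uuid" ]; then
        echo "no host found with name-label $host_name" >&2
        return 1
    fi

    # Find the PIF uuid behind each NIC on this host.
    pif_a=$(xe pif-list host-uuid="$host_uuid" device="$nic_a" --minimal)
    pif_b=$(xe pif-list host-uuid="$host_uuid" device="$nic_b" --minimal)
    if [ -z "$pif_a" ] || [ -z "$pif_b" ]; then
        echo "could not find both PIFs on $host_name" >&2
        return 1
    fi

    # Create the bond on this host and attach it to the shared network.
    xe bond-create network-uuid="$network_uuid" pif-uuids="$pif_a,$pif_b"
}
```

You would create the network once with xe network-create name-label="Bond 0", note the uuid it prints, then run rebond_host for the pool master first (Xen-4 here) and after that for each other host with the same network uuid. If the bond carried the iSCSI traffic, the IP settings on the new bond and the SAN storage connections would probably need checking afterwards as well.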

I am not sure how bad your situation is now, but maybe this would help to get your VMs back:
1. Find the uuids of the failing VMs.
2. Reset the power state on these VMs.
3. Restart the VMs.
I hope this recovery guide helps:

http://support.citrix.com/servlet/KbServlet/download/17140-102-671536/XenServer%20System%20Recovery%20Guide.pdf
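The three steps above might look like this on the command line. Again this is only a sketch: recover_vm is a made-up function name, it assumes the VM name-label is unique in the pool, and it assumes the VM is really not running on any host, because resetting the power state of a VM that is still running somewhere can corrupt its disks.

```shell
# recover_vm VM_NAME
# Hypothetical helper: resets the recorded power state of one VM and
# starts it again.
recover_vm() {
    # Step 1: find the uuid of the failing VM.
    vm_uuid=$(xe vm-list name-label="$1" is-control-domain=false --minimal)
    if [ -z "$vm_uuid" ]; then
        echo "no VM found with name-label $1" >&2
        return 1
    fi

    # Step 2: mark the VM as halted in the pool database.
    xe vm-reset-powerstate uuid="$vm_uuid" force=true || return 1

    # Step 3: start the VM again.
    xe vm-start uuid="$vm_uuid"
}
```

This only helps for VMs whose disks still exist, for example those on the SAN once its storage is reachable again.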

Author Comment

by:adamant40
ID: 39736204
Thank you for your response and your comments. Sorry for the delay; I've been working crazy hours trying to rebuild the entire production infrastructure. I tried looking through the documentation you listed for the exact fix for my mistake, but my low-level skills did not allow me to find the specific, step-by-step instructions that should have been carried out to repair it. I was hoping to have something like that to take to the review. Is it possible for you to provide that? "In the event that someone is a moron and removes the global bonded network from the global pool, you would fix it by doing the following..."

Accepted Solution

by:Ayman Bakr (earned 500 total points)
ID: 39736790
Maybe the steps at the end of this article could have helped:

http://www.raido.be/knowledge-center/blog/detail/recovering-lost-nics-on-xenserver

Author Closing Comment

by:adamant40
ID: 39736870
Thanks very much. I wish I'd had this information available to me at the time of my error.