Solved

Bandwidth needs for SAN to SAN (Site to Site) replication

Posted on 2011-09-03
10
1,963 Views
Last Modified: 2012-06-27
I'm looking at replicating two SANs across the WAN, site to site.  I'd want the replication to be as close to real time as possible, with less than a 15-minute difference. From my calculations, my current SAN averages 1.25 megabytes per second written to it.  What I need to know is the type and speed of connection I'd be looking at to achieve this. There would be nothing else on this link, just the SAN replication traffic.  Assuming the connection runs at full speed, I think I'd be looking at at least 10 Mbit (an 8 Mbit connection ≈ 1 MB/s), but I'm not sure that leaves enough allowance for overhead.
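For reference, the sizing arithmetic from the question as a small sketch. The 1.25 MB/s figure is the question's measured average; the 20% overhead allowance is an assumed placeholder, not a measured value:

```python
# Rough link sizing from a measured average write rate.
# The 20% overhead factor is an assumption for protocol/replication framing.

def required_mbps(write_mb_per_s: float, overhead: float = 0.20) -> float:
    """Return the link speed in Mbit/s needed to carry the write stream."""
    return write_mb_per_s * 8 * (1 + overhead)

print(required_mbps(1.25))  # 12.0 -> a bare 10 Mbit link leaves no headroom
```

This only covers the average; as the comments below note, peaks matter just as much.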
0
Comment
Question by:mikeewalton
  • 5
  • 4
10 Comments
 
LVL 36

Expert Comment

by:ArneLovius
ID: 36479309
What method of replication are you planning on using?

Is there any overhead from it? Or compression?

What type of data is on the SAN? Would replication be better done at the application layer, e.g. an Exchange 2010 DAG, instead of block-level SAN replication?
0
 
LVL 7

Author Comment

by:mikeewalton
ID: 36479384
Two Dell Equallogic SANs. For the sake of this question, assume no compression.
0
 
LVL 36

Expert Comment

by:ArneLovius
ID: 36479852
If you have an average of 1.25 MB/s, what does it spike to? How long would it take to "recover" from the spike?

How much data are you replicating ?

If you already have both on-site, then I would set up replication via a managed switch and monitor bandwidth usage, ideally with sFlow/NetFlow, but SNMP would do at a pinch; alternatively, set up a monitor/SPAN port and use ntop.

You might want to weigh the cost of a 1000 Mbit line against the cost of a Riverbed or similar device to do block-based "compression".

If the link is within the same city, it's probably cheaper to go for the bandwidth; if it's between cities, then a Riverbed or similar might be the better option.
0
 
LVL 25

Assisted Solution

by:madunix
madunix earned 100 total points
ID: 36480351
Basically, replication (LUN copying) comes in two varieties: synchronous and asynchronous. Synchronous replication is comparable to RAID 1, but over a larger distance. IBM has Metro Mirror (synchronous) and Global Mirror (asynchronous). The main challenge in SAN-to-SAN replication is having enough bandwidth, such as dark fiber, between the data centers.
http://www.ibmsystemsmag.com/aix/storage/software/Disaster-Recovery-x-3/?ht=

I use IBM technology to replicate data in various projects. In SAN replication you have two methods (sync and async). IBM Metro Mirror (sync replica) is generally considered a campus-level solution, where the systems are located in fairly close proximity, such as within the same city. However, the distance supported will vary with the write intensity of the application and the network being used. In general, with adequate resources, most customers find up to a 50-kilometer distance acceptable, with some implementing up to 300 kilometers. With Global Mirror (async replica), the target site may trail the production site by a few seconds. The frequency of creating a consistency group is a tunable parameter, and you'll need to balance your recovery point objective against the performance impact of creating a consistency group. Many customers find a three- to five-second consistency group achievable (i.e., in a disaster, you'd lose the last three to five seconds of data).

I set up SAN replication between two sites over a 2 Mbps E1 using Global Mirror (async replica).
A simple calculation for the time to transfer the data from A to B over the E1:

diff = E1 bandwidth - existing A-to-B traffic  (in Mbit/s)
totalSizeInMegabit = size of the data to mirror
time ≈ totalSizeInMegabit / diff

Example with 550 GB, assuming A-to-B traffic = 0, so the full E1 is available for the mirror:
totalSizeInMegabit = 550 GB × 1024 × 8 = 4,505,600 Mbit
totalSizeInMegabit / diff = 4,505,600 / 2 = 2,252,800 seconds ≈ 26 days
That means it would take more than 26 days to mirror the data.
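The same arithmetic as a small sketch (the function name is mine; the 550 GB over a 2 Mbps E1 figures are from the example above):

```python
# Time to complete an initial mirror over a link, per the E1 example above.

def mirror_days(size_gb: float, link_mbps: float, other_traffic_mbps: float = 0.0) -> float:
    """Days to push size_gb across a link after subtracting other traffic."""
    usable = link_mbps - other_traffic_mbps   # Mbit/s left for the mirror
    megabits = size_gb * 1024 * 8             # GB -> Mbit (1 GB = 1024 MB)
    return megabits / usable / 86400          # seconds -> days

print(round(mirror_days(550, 2), 1))  # ~26.1 days, matching the example
```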

So the bandwidth analysis is very important; it is done by analysing the write load of the disks at the primary site. For this purpose, performance data must be collected for all volumes that will participate in the Metro/Global Mirror. The data can be collected either with TPC-Disk (Total Productivity Center) or with any suitable performance monitor. Alternatively, the data can be collected on the disk subsystem itself, which requires a server to receive and collect the data from the box.

To get a correct bandwidth analysis, it is important that a realistic write-load profile is captured during the period of data collection. Especially for the Global Mirror between the intermediate site and the remote site, it is important to understand the distribution of, and relation between, write peaks and the average write rate. For this reason, the data-collection period should cover at least 24 hours and, if possible, a period of high write activity.
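A minimal sketch of summarising such a collected profile. The sample readings below are illustrative, not measured; the point is that the average alone hides the peaks replication must absorb:

```python
# Summarise a write-load profile: average, peak, and 95th percentile.
# Sample readings are hypothetical Mbit/s values from a monitoring period.

def write_load_summary(samples_mbps):
    """Return (average, peak, 95th percentile) of write-rate samples."""
    ordered = sorted(samples_mbps)
    avg = sum(ordered) / len(ordered)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]   # nearest-rank percentile
    return avg, ordered[-1], p95

avg, peak, p95 = write_load_summary([10, 10, 12, 11, 10, 45, 80, 15, 10, 11])
print(avg, peak, p95)  # average 21.4, but the link must handle bursts near 80
```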

Please note that Metro Mirror (sync) and Global Copy (async) alike require an initial copy phase before regular replication can start. There is no other way to set up the bitmaps on each side of the relationship reliably, so there is no workaround for this process. This means the time it takes to copy all tracks to the remote site must be factored into the planning.


Read:

http://www.redbooks.ibm.com/abstracts/tips0340.html
http://www-03.ibm.com/systems/business_resiliency/
http://www.ibm.com/itsolutions/disaster-recovery/
http://www-01.ibm.com/software/success/cssdb.nsf/hardwareL2VW?OpenView&Count=30&RestrictToCategory=corp_StorageDS8100&cty=en_us
http://www-01.ibm.com/software/tivoli/products/storage-mgr/
http://www.drj.com/
0
 
LVL 36

Expert Comment

by:ArneLovius
ID: 36480535
I would usually suggest that the initial sync (especially on small arrays) is done locally.

By small I mean low tens of TB.
0
 
LVL 7

Author Comment

by:mikeewalton
ID: 36480804
The way I figured the average data usage was to take the write data on the SAN listed for each iSCSI connection from the Xen server, divide the total data used by the number of hours connected, and then add those together to get the total write data per hour (see attached). I do see an issue with this: it doesn't give me the peak.

I know the built-in Equallogic SAN replication does its own compression, but I'm not sure what it actually amounts to.

Looking at the Riverbed device, I could definitely see an added benefit of adding it to the mix.

The end result here would be to replicate my Xen storage pools in a colo, so that I could fail over without losing too much data. It wouldn't necessarily have to be real-time failover, but I would at least like to be able to fail over to the remote site in the event of a disaster in under an hour; 30 minutes would be ideal.

The initial amount of data will be approximately 4 TB.
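To put that 4 TB initial copy in perspective, a rough sketch of seed time at a few candidate link speeds (assumes a fully available link and ignores protocol overhead):

```python
# Initial-seed time for ~4 TB at several link speeds.

def seed_hours(size_tb: float, link_mbps: float) -> float:
    """Hours to copy size_tb over a link (1 TB = 1024 GB = 8,388,608 Mbit)."""
    megabits = size_tb * 1024 * 1024 * 8
    return megabits / link_mbps / 3600

for mbps in (10, 100, 1000):
    print(mbps, round(seed_hours(4, mbps), 1))
# 10 Mbit -> ~932 hours (~39 days); 100 Mbit -> ~93 hours; 1000 Mbit -> ~9 hours
```

The 10 Mbit figure illustrates why doing the initial sync locally, as suggested above, is attractive.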
0
 
LVL 36

Expert Comment

by:ArneLovius
ID: 36481096
What will you have running in the guests?

Although SAN replication can be good, usually this would be sync rather than the async you are planning.

You also need a process for failing back, and for preventing the main-site guest from starting up or replicating while the remote-site guest is running.

I would tend to look at moving towards HA rather than DR; this gives you the capability of switching over to the remote site for maintenance etc. with minimal downtime. For a clean move using snapshot-based replication, you would need to shut down the guest, wait for replication to complete, and then start the guest up at the remote site, so your 15-minute replication time could be considerably longer.

0
 
LVL 7

Author Comment

by:mikeewalton
ID: 36481118
Inside the guests is the usual: a couple of DCs (with failover DNS, secondary DHCP scopes, etc.), Exchange 2007, SQL 2008, a file server, TS, SharePoint MOSS, and a couple of application servers that host applications with the SQL server as their back end. All are Server 2008 (most are Enterprise).

I would be open to looking at a secondary HA site rather than failover; in fact, if I can get the log shipping etc. worked out for Exchange and SQL, that would be preferred for the reasons you state, such as maintenance.

That being said, I would still need to figure out the bandwidth type and requirements to keep up with that.

0
 
LVL 7

Author Comment

by:mikeewalton
ID: 36481128
I can spec the servers (Exchange, SQL, FS, etc.) out if needed, i.e. sizes, DBs, etc.
0
 
LVL 36

Accepted Solution

by:
ArneLovius earned 400 total points
ID: 36481392
With DCs, I would just have additional DCs.

For a file server, I'd use DFS-R for replication, possibly with a DFS namespace as well.

A 10 Mbit line is cutting it a bit fine for an Exchange 2007 geo-split CCR cluster; it would depend on your rate of mail flow and generated logs. You would also want a CAS/Hub server at each site, which, if you still use public folders, could also be a public folder server. Exchange CCR clusters need at least one cluster VLAN spanning your connection, but this shouldn't be an issue as long as it is point-to-point Ethernet; if you were planning MPLS, it could be more involved.

SQL replication is good, but I've had issues with some applications in the past; Neverfail might be a better solution.

For SharePoint, you just need to set up a "farm".

If you already have the servers and storage, I would build it all locally on gigabit connections, then reduce the inter-site link to 100 Mbit and see if it still functions, and then down to 10 Mbit. If you have issues at 10 Mbit, then I would speak to Riverbed; you can usually get a trial pair from them for at least a fortnight.
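A quick sanity check for that step-down test: whether a given link can ship the delta accumulated in one replication window (15 minutes, per the question) before the next window starts. The write rate is the questioner's 1.25 MB/s average; the function is my own illustration:

```python
# Can a link ship one replication window's worth of writes within the window?

def window_ok(write_mb_per_s: float, link_mbps: float, window_min: float = 15) -> bool:
    """True if the delta written in one window can be sent within one window."""
    delta_megabits = write_mb_per_s * 8 * window_min * 60  # data written per window
    send_seconds = delta_megabits / link_mbps
    return send_seconds <= window_min * 60

print(window_ok(1.25, 10))  # True, but with zero headroom for peaks or overhead
```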

0
