• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 2321

Bandwidth needs for SAN to SAN (Site to Site) replication

I'm looking at replicating 2 SANs across the WAN, site to site. I want the replication to be as close to real time as possible, with less than a 15-minute difference. From my calculations, my current SAN averages 1.25 megabytes per second written to it. What I need to know is the type and speed of connection I would need to achieve this. There would be nothing else on this link, just the SAN replication traffic. Assuming the link runs at full speed, I think I'd need at least 10 Mbit (an 8 Mbit connection moves roughly 1 megabyte per second), but I'm not sure that allows enough for overhead.
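As a sanity check on the numbers in the question, here is a minimal Python sketch that converts an average write rate in MB/s into a minimum link speed in Mbit/s. The 25% overhead allowance is an illustrative assumption, not a measured or vendor figure.

```python
# Back-of-envelope link sizing: hypothetical sketch, not a sizing tool.
# The overhead margin is an assumed placeholder value.

def required_link_mbit(avg_write_mb_per_s: float, overhead: float = 0.25) -> float:
    """Convert an average write rate (MB/s) to a minimum link speed (Mbit/s)."""
    raw_mbit = avg_write_mb_per_s * 8    # bytes -> bits
    return raw_mbit * (1 + overhead)     # add an allowance for protocol overhead

print(required_link_mbit(1.25))  # 1.25 MB/s -> 10 Mbit raw, 12.5 Mbit with margin
```

This matches the asker's 8-bits-per-byte reasoning: 1.25 MB/s needs 10 Mbit of raw throughput before any overhead is considered.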
Asked by: mikeewalton
2 Solutions
 
ArneLoviusCommented:
What method of replication are you planning on using?

Is there any overhead from it, or compression?

What type of data is on the SAN? Would replication be better done at the application layer, e.g. an Exchange 2010 DAG, instead of block-level SAN replication?
 
mikeewaltonAuthor Commented:
2 Dell EqualLogic SANs; for the sake of this question, assume no compression.
 
ArneLoviusCommented:
If you have an average of 1.25 MB/s, what does it spike to? How long would it take to "recover" from a spike?

How much data are you replicating ?

If you already have both on site, then I would set up replication via a managed switch and monitor bandwidth usage, ideally with sFlow/NetFlow; SNMP would do at a pinch, or set up a monitor/SPAN port and use ntop.

You might want to weigh the cost of a 1000 Mbit line against the cost of a Riverbed or similar device doing block-based "compression".

If the link is within the same city, it's probably cheaper to go for the bandwidth; if it's between cities, then a Riverbed or similar might be a better option.
 
madunixChief Information Security Officer Commented:
Basically, replication (LUN copying) comes in two varieties: synchronous and asynchronous. Synchronous replication is analogous to RAID 1, but over a larger distance. IBM has Metro Mirror (synchronous) and Global Mirror (asynchronous). The main challenge in SAN-to-SAN replication is having enough bandwidth, such as dark fiber, for replication between the data centers.
http://www.ibmsystemsmag.com/aix/storage/software/Disaster-Recovery-x-3/?ht=

I am using IBM technology to replicate data in various projects. In SAN replication you have two methods (sync and async). IBM Metro Mirror (sync replica) is generally considered a campus-level solution, where the systems are located in fairly close proximity, such as within the same city. However, the distance supported will vary based on the write intensity of the application and the network being used. In general, with adequate resources, most customers find up to a 50-kilometer distance acceptable, with some customers implementing up to a 300-kilometer distance. With Global Mirror (async replica), the target site may trail the production site by a few seconds. The frequency of creating a consistency group is a tunable parameter, and you'll need to balance your recovery point objective with the performance impact of creating a consistency group. Many customers find a three- to five-second consistency group achievable (i.e., in a disaster, you'd lose the last three to five seconds of data).

I did a SAN replication between two sites over a 2 Mbps E1 using Global Mirror (async replica).
Simple calculation for the transfer between A and B over a 2 Mbps E1:
diff = E1bandwidth - A2Btraffic   (in megabit/second)
totalSizeInMegaBit = size of data
It will take approximately (totalSizeInMegaBit/diff) seconds.
Example with 550 GB:
In this case A2Btraffic = 0, so the full E1 is available for the mirror:
totalSizeInMegaBit = 550GB*1024*8 = 4505600
(totalSizeInMegaBit/diff) = (4505600/2) = 2252800 seconds = 26 days
That means it will take more than 26 days to mirror the data.
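The arithmetic above can be wrapped in a small Python helper. It follows the post's own convention of GB × 1024 × 8 for the megabit conversion; the function name and default parameter are illustrative.

```python
# Reproduces the E1 initial-copy estimate above as code.
# Conversion factor (GB * 1024 * 8) follows the original post's convention.

def initial_copy_days(size_gb: float, link_mbps: float,
                      other_traffic_mbps: float = 0.0) -> float:
    diff = link_mbps - other_traffic_mbps   # bandwidth left for the mirror
    total_megabit = size_gb * 1024 * 8      # data volume in megabits
    seconds = total_megabit / diff
    return seconds / 86400                  # seconds -> days

print(round(initial_copy_days(550, 2)))  # 550 GB over a 2 Mbps E1 -> ~26 days
```

The same helper makes it easy to see why the asker's proposed 10 Mbit link would still need roughly 5 days for a 4 TB initial copy.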

So the bandwidth analysis is very important; it is done by analysing the write load of the disks at the primary site. For this purpose, performance data must be collected for all volumes that will participate in the Metro/Global Mirror. The data can be collected either with TPC-Disk (Total Productivity Center) or with any suitable performance monitor. Alternatively, the data can be collected on the disk subsystem itself, which requires a server that receives and collects the data from the box.

In order to get a correct bandwidth analysis, it is important that a realistic write-load profile is captured during the data-collection period. Especially for the Global Mirror between the intermediate site and the remote site, it is important to understand the distribution of write peaks and their relation to the average write rate. For this reason, the collection period should span at least 24 hours and, if possible, include a period of high write activity.

Please note that Metro Mirror (sync) and Global Copy (async) both require an initial copy phase before regular replication can start. There is no other way to set up the bitmaps on each side of the relationship in a reliable way, so there is no workaround for this process. This means the time it takes to copy all tracks to the remote site must be factored into the planning.


Read:

http://www.redbooks.ibm.com/abstracts/tips0340.html
http://www-03.ibm.com/systems/business_resiliency/
http://www.ibm.com/itsolutions/disaster-recovery/
http://www-01.ibm.com/software/success/cssdb.nsf/hardwareL2VW?OpenView&Count=30&RestrictToCategory=corp_StorageDS8100&cty=en_us
http://www-01.ibm.com/software/tivoli/products/storage-mgr/
http://www.drj.com/
 
ArneLoviusCommented:
I would usually suggest that the initial sync (especially on small arrays) is done locally.

By small I mean low tens of TB.
 
mikeewaltonAuthor Commented:
The way I figured the average data usage was to take the write data on the SAN listed for each iSCSI connection from the Xen server, divide the total data written by the number of hours connected, and then add those together to get the total write data per hour (see attached). I do see an issue with this, as it doesn't give me the peak.
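To illustrate why an hourly average hides the peak, here is a small Python sketch over a made-up set of per-interval write-rate samples; the numbers are hypothetical, not taken from the attached data.

```python
# Hypothetical per-interval write rates in MB/s; one large spike among
# otherwise modest samples. Values are made up for illustration.
samples_mb_per_s = [0.8, 1.1, 0.9, 6.5, 1.2, 0.7]

avg = sum(samples_mb_per_s) / len(samples_mb_per_s)
peak = max(samples_mb_per_s)
print(f"average {avg:.2f} MB/s, peak {peak:.2f} MB/s")
```

A link sized only for the average would fall behind during the spike interval, which is why Arne asks how long the system takes to "recover" after a peak.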

I know the built-in EqualLogic SAN replication does its own compression, but I'm not sure what it actually amounts to.

Looking at the Riverbed device I could definitely see an added benefit of adding it in the mix.

The end result here would be to replicate my Xen storage pools to a colo, so that I could fail over without losing too much data. It wouldn't necessarily have to be real-time failover, but I would at least like to be able to fail over to the remote site in the event of a disaster in under an hour; 30 minutes would be ideal.

The initial amount of data will be approximately 4 TB.
 
ArneLoviusCommented:
What will you have running in the guests ?

Although SAN replication can be good, it is usually synchronous rather than the asynchronous setup you are planning.

You also need a process for failing back, and for preventing the main-site guest from starting up or replicating while the remote-site guest is running.

I would tend to look at moving towards HA rather than DR; this gives you the capability of switching over to the remote site for maintenance etc. with minimal downtime. To do a clean move using snapshot-based replication, you would need to shut down the guest, wait for replication to complete, and then start the guest at the remote site, so your 15-minute replication window could stretch considerably longer.
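A rough Python sketch of the clean-move timing described above: the guest is down for shutdown, plus the final snapshot delta's transfer time, plus startup. The delta size, link speed, and shutdown/startup times are hypothetical placeholders.

```python
# Clean-move downtime estimate for snapshot-based replication.
# All inputs are assumed placeholder values, not measurements.

def clean_move_minutes(delta_gb: float, link_mbit: float,
                       shutdown_min: float = 5, startup_min: float = 5) -> float:
    # Time to replicate the final delta while the guest is offline
    replicate_min = (delta_gb * 1024 * 8) / link_mbit / 60
    return shutdown_min + replicate_min + startup_min

print(round(clean_move_minutes(2, 10), 1))  # 2 GB delta over 10 Mbit -> 37.3 min
```

Even a modest 2 GB delta pushes the outage well past the 15-minute replication interval, which is the point being made about snapshot-based moves.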

 
mikeewaltonAuthor Commented:
Inside the guests is the usual: a couple of DCs (with failover DNS, secondary DHCP scopes, etc.), Exchange 2007, SQL 2008, a file server, TS, SharePoint MOSS, and a couple of application servers that host applications with the SQL server as their back end. All are Server 2008 (most are Enterprise).

I would be open to just looking at a secondary HA site rather than failover; in fact, if I can get log shipping etc. set up for Exchange and SQL, that would be preferred for the reasons you state, like maintenance.

That being said, I would still need to figure out the bandwidth type and requirements to keep up with that.

 
mikeewaltonAuthor Commented:
I can spec out the servers (Exchange, SQL, FS, etc.) if needed, i.e. size, DBs, etc.
 
ArneLoviusCommented:
With DCs, I would just have additional DCs.

For a file server, I'd use DFSR for replication, possibly with a DFS namespace as well.

A 10 Mbit line is cutting it a bit fine for an Exchange 2007 geo-split CCR cluster; it would depend on your rate of mail flow and the logs created. You would also want a CAS/Hub server at each site, which, if you still use public folders, could also be a public folder server. Exchange CCR clusters need at least one cluster VLAN stretched across your connection, but this shouldn't be an issue as long as it is point-to-point Ethernet; if you were planning MPLS it could be more involved.
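For a rough feel of how log creation translates into replication bandwidth, here is a Python sketch. It assumes the 1 MB Exchange 2007 transaction-log size; the daily log count is a made-up placeholder, since the actual mail flow isn't given.

```python
# Average bandwidth consumed by shipping Exchange transaction logs.
# Exchange 2007 logs are 1 MB each; logs_per_day is a hypothetical input.

def log_ship_mbit(logs_per_day: int, log_mb: float = 1.0) -> float:
    """Average Mbit/s needed to ship a day's logs over 24 hours."""
    return logs_per_day * log_mb * 8 / 86400

print(round(log_ship_mbit(20000), 2))  # 20,000 logs/day -> ~1.85 Mbit/s average
```

As with the SAN figures, this is an average; mail flow is bursty, so the peak log-generation rate is what decides whether a 10 Mbit line keeps up.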

SQL replication is good, but I've had issues with some applications in the past; Neverfail might be a better solution.

For SharePoint, you just need to set up a "farm".

If you already have the servers and storage, I would build it all locally on gigabit connections, then reduce the inter-site link down to 100 Mbit and see if it still functions, and then down to 10 Mbit. If you have issues at 10 Mbit, then I would speak to Riverbed; you can usually get a trial pair from them for at least a fortnight.

