There are two varieties of SAN (Storage Area Network) to SAN replication: synchronous and asynchronous. In synchronous replication, the primary storage system commits a write I/O only after the replication target acknowledges that the data has been written successfully; that is, data is written to the primary and secondary storage systems simultaneously. In asynchronous replication, on the other hand, the data is replicated to the replication targets with a delay.
I have implemented a three-site replication using IBM DS8k SAN storage to host our Oracle DB 10g: synchronous SAN-to-SAN replication (Metro Mirror) over dark fiber across a distance of 5 km with zero data loss between two sites (SiteA+SiteB), and asynchronous replication (Global Mirror) over an MPLS/IPVPN L3 link at 35 Mbps across a distance of 150 km (SiteB+SiteC). The main challenge in SAN-to-SAN replication is having enough bandwidth on the replication link.
Storage/SAN level data replication: Synchronous
Most storage vendors implement synchronous replication. It is a mechanism for sending a copy of every write I/O to a remote storage system. The key difference between synchronous replication and other modes of operation is that no acknowledgement is returned to the host until both the local and remote systems have processed the I/O. This guarantees data consistency: the host application receives the write-complete acknowledgement only once the remote copy has been committed to the remote storage system and acknowledged by both the local and remote systems. This is illustrated in the diagram below.
Once the acknowledgement of the write has been received by the local system, both the local and remote write I/Os are eligible for de-staging to disk. De-staging from cache to the disk drives on both the local and remote storage systems is performed asynchronously. If the remote write fails, no acknowledgement is returned to the local server, and the host eventually times out the I/O. If an acknowledgement of the remote write is not received within a fixed period of time, the write is considered to have failed and is rendered ineligible for de-staging to disk. At this point the application receives an I/O error and, in due course, the failed write I/O is aged out of each cache.
Synchronous remote replication is ideal for DR solutions involving critical data sources. It ensures that data at the remote site can be used at any time with minimal manual intervention, allowing instant access to remote data in a failover scenario. However, such solutions are costly because they require a low-latency, high-bandwidth communication link to minimise the impact on host performance.
- The local server sends a write I/O to the local storage system.
- The local storage system places the I/O into cache and then forwards the write to a remote storage system. At this stage, the local system waits for acknowledgement of the write from the remote system.
- The remote system receives the write I/O and writes it to cache. The remote system then acknowledges the write as complete to the local storage system.
- When the local system receives an acknowledgement from the remote system, it will finally send a write complete acknowledgement back to the host.
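The four steps above can be sketched in Python. This is a minimal illustration, not any vendor's implementation: the `StorageSystem` class, its method names and the `reachable` flag are hypothetical, and a missing acknowledgement is modelled as an immediate failure instead of a real timeout.

```python
class StorageSystem:
    """Hypothetical storage array: a write cache plus an optional replication peer."""

    def __init__(self, name, peer=None):
        self.name = name
        self.peer = peer          # remote system that receives replicated writes
        self.reachable = True     # set to False to simulate a failed replication link
        self.cache = []           # writes held in cache, eligible for later de-staging

    def remote_write(self, block):
        """Remote side: cache the write and acknowledge it (step 3)."""
        if not self.reachable:
            return False          # no acknowledgement is returned
        self.cache.append(block)
        return True

    def host_write(self, block):
        """Local side: handle a write I/O sent by the host (steps 1, 2 and 4)."""
        self.cache.append(block)                # step 2: place the I/O into local cache
        acked = self.peer.remote_write(block)   # step 2: forward it and wait for the ack
        if not acked:
            self.cache.remove(block)            # failed write must not be de-staged
            raise IOError("remote write not acknowledged")
        return "write complete"                 # step 4: acknowledge to the host


remote = StorageSystem("SiteB")
local = StorageSystem("SiteA", peer=remote)
print(local.host_write("block-1"))
```

The host sees "write complete" only after both caches hold the block, which is why link latency adds directly to every write.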
Synchronous replication has the following advantages:
- No data loss
- Support for an RPO* of zero and a very low RTO**
- Guaranteed in-order delivery
- Easy management of the recovery process
Synchronous replication has the following disadvantages:
- High bandwidth requirement
- Latency will directly impact host performance
- Limited to short distances (typically below 50 km), because latency grows with distance
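To see why latency and distance are linked, the propagation delay added to each synchronous write can be estimated. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s and counts one round trip per write. It ignores switch, protocol and array processing overhead, so real figures will be higher. The function name and its parameters are illustrative only.

```python
# Assumption: light travels through optical fiber at roughly 200,000 km/s,
# i.e. about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def added_write_latency_ms(distance_km, round_trips=1):
    """Propagation delay added to each synchronous write, in milliseconds."""
    one_way_ms = distance_km / FIBER_KM_PER_MS
    # Each round trip covers the distance twice: write out, acknowledgement back.
    return 2 * one_way_ms * round_trips


for km in (5, 50, 150):
    print(f"{km:>4} km: +{added_write_latency_ms(km):.2f} ms per write")
```

On these assumptions a 5 km link adds about 0.05 ms to every write and a 50 km link about 0.5 ms, before any equipment delay is counted.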
Storage/SAN level data replication: Asynchronous
Asynchronous replication is a lower-cost alternative to a synchronous solution because it depends less on high-bandwidth communications. The disadvantage is that asynchronous solutions carry a risk of data loss. In an asynchronous solution, when a write I/O is received by the local storage system, it is immediately acknowledged to the host and only subsequently sent to the remote storage system, so the transfer does not impact the local host. If the host sends I/Os faster than the local system can transfer them, they are queued on the local system to await transfer. The length of this transfer queue determines the potential data loss in an asynchronous solution. The following diagram illustrates the operation of an asynchronous replication process.
- The local server sends a write I/O to the local storage system.
- The local storage system performs the I/O and sends a write complete acknowledgement to the host.
- The local storage system sends the I/O to the remote system.
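The asynchronous write path above can be sketched in the same way. This is a simplified model under stated assumptions, not a vendor implementation: the `AsyncReplicator` class and its methods are hypothetical, and ordering is preserved by tagging each write with a sequence number and sending in first-in, first-out order.

```python
from collections import deque


class AsyncReplicator:
    """Hypothetical local array that acknowledges first and replicates later."""

    def __init__(self):
        self.local_cache = []
        self.remote_cache = []
        self.queue = deque()   # writes acknowledged to the host but not yet replicated
        self.seq = 0           # sequence number used to preserve write order

    def host_write(self, block):
        """Perform the I/O locally and acknowledge the host immediately."""
        self.local_cache.append(block)
        self.seq += 1
        self.queue.append((self.seq, block))   # transfer happens later
        return "write complete"

    def drain(self, max_ios):
        """Send up to max_ios queued writes to the remote system, oldest first."""
        sent = 0
        while self.queue and sent < max_ios:
            _seq, block = self.queue.popleft()
            self.remote_cache.append(block)
            sent += 1
        return sent

    def pending(self):
        """Writes that would be lost if the local site failed right now."""
        return [block for _seq, block in self.queue]


rep = AsyncReplicator()
for name in ("a", "b", "c"):
    rep.host_write(name)
rep.drain(max_ios=2)
print(rep.remote_cache, rep.pending())
```

Whatever `pending()` returns at the moment of a failure is the data at risk, which is the exposure this section describes.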
If the local system is unable to send the I/O to the remote system, it is placed in a transfer queue. The use of transfer queues means that asynchronous solutions cannot handle workloads in which the local hosts sustain a high write I/O rate. In these situations the queue grows continuously and the remote system never catches up with the local system. In the worst case, the local queue fills up and causes the replication pair relationship to break or fracture. An additional complexity of transfer queues is that the solution must ensure that write operations are applied to the remote system in the same order as they were received at the local system. This is most often accomplished by sending I/Os in the order they were received or by time-stamping each I/O. Each vendor’s solution handles transfer queues differently. In an asynchronous solution, the storage user has no control over the point (transfer trigger) at which the write is transferred; the local system determines the priority and the point at which writes are sent to the remote system.
Asynchronous remote replication is ideal for DR solutions where data sources are not critical and the risk of data loss is acceptable. It is aimed at solutions where an RPO of minutes to hours is acceptable.
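The queue behaviour described above can be estimated with simple arithmetic. The sketch below uses a 35 Mbit/s link, the figure quoted earlier for the asynchronous leg. The 10 MB/s sustained write rate and the 4500 MB queue capacity are made-up values for illustration. Protocol overhead and compression are ignored, and the function names are hypothetical.

```python
def link_mb_per_s(link_mbit_s):
    """Convert link speed from megabits per second to megabytes per second."""
    return link_mbit_s / 8


def backlog_mb(write_mb_s, link_mbit_s, seconds):
    """Data waiting in the transfer queue after a period of sustained writes."""
    growth = write_mb_s - link_mb_per_s(link_mbit_s)
    return max(0.0, growth * seconds)


def seconds_until_full(write_mb_s, link_mbit_s, queue_capacity_mb):
    """Time until the queue fills, or None if the link keeps up with the writes."""
    growth = write_mb_s - link_mb_per_s(link_mbit_s)
    if growth <= 0:
        return None
    return queue_capacity_mb / growth


# Hypothetical workload: 10 MB/s of sustained writes over a 35 Mbit/s link.
queued = backlog_mb(write_mb_s=10, link_mbit_s=35, seconds=60)
lag_s = queued / link_mb_per_s(35)   # time the remote system needs to catch up
print(f"backlog after 60 s: {queued:.1f} MB, replication lag: {lag_s:.0f} s")
print("queue full after:", seconds_until_full(10, 35, queue_capacity_mb=4500), "s")
```

When the sustained write rate stays below the link rate, `seconds_until_full` returns `None`: the queue only absorbs bursts and the remote system catches up between them.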
Asynchronous replication has the following advantages:
- No distance limitations
- Lower bandwidth requirement than synchronous replication
- Latency does not immediately impact host performance
Asynchronous replication has the following disadvantages:
- Potential loss of data still held in the transfer queue
- Limited ability to handle sustained I/O load
- No controls over transfer trigger or replication lag
*The recovery point objective (RPO) defines how current the data must be or how much data an organization can afford to lose. The greater the RPO, the more tolerant the process is to interruption.
**The recovery time objective (RTO) specifies the maximum elapsed time to recover an application at an alternate site. The greater the RTO, the longer the process can take to be restored.