What We Can Learn From the AWS S3 Outage

Experts Exchange: The Original Technology Community.
In the wake of AWS' S3 outage, we want to discuss the importance of storage and data diversification in the event of a hack, crash, or system disruption. We spoke with Experts Exchange’s COO Gene Richardson for a deeper understanding.
Whether your data is stored in the cloud or not, there are many proactive steps you can take to make sure your data and systems are diversified to protect business operations in an unforeseen event.

It all begins with placing load balancers in front of the infrastructure, no matter where you’re hosting and storing your information. Load balancers are devices that scale the performance of a web server-based program by distributing client requests across multiple servers and networks.
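As a rough illustration of how requests get distributed, here is a minimal round-robin sketch in Python. Round-robin is only one of several possible distribution strategies, and the class name and server names are hypothetical, chosen for this example rather than taken from any particular product.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend servers in a fixed rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(list(servers))

    def pick(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


# Hypothetical server names used only for illustration.
balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
assignments = [balancer.pick() for _ in range(5)]
print(assignments)
```

Five consecutive requests land on web-1, web-2, web-3, web-1, and web-2, so no single server absorbs all of the traffic.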

Load balancers sit in front of roughly 90% or more of web servers, and diversification can exist even within the load balancers themselves, says Gene Richardson, COO at Experts Exchange.

Multi-site and Multi-zone Diversification

According to Richardson, many systems use a “three-tier architecture,” and how you implement these tiers can significantly shape your approach to diversification.
  1. Web Tier—Usually facing external customers and protected and managed by a load balancer and firewall.
  2. Application Tier—Often protected and managed by a load balancer inside a company’s firewall.
  3. Database Tier—Inside a company’s network, behind a firewall, usually highly protected, containing critical customer and company data, and housed on large amounts of storage.
In the database tier, you’re managing massive amounts of data, so diversification can be more complex and involved than in the other tiers. Company IT teams need to decide how they will replicate that data to a standby server, how often data should be replicated, what latency exists between the two environments, and how quickly they would want to switch over to the standby server in the event of an outage.

Some solutions complete this switchover in seconds, while others take minutes or hours. It’s important to understand your company’s availability requirements for each application so you can design the appropriate system. In some cases, you may even need a disaster recovery system that replicates to another part of the country or the world, so you can “switch” to the other site if the primary environment is nearly unrecoverable.
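One way to reason about these questions is to write the requirements down as numbers and check a replication setup against them. The sketch below does that in Python. The class names, field names, and figures are all hypothetical, and the data-loss estimate is a deliberate simplification (one replication interval plus the latency between environments), not a formula from any vendor.

```python
from dataclasses import dataclass


@dataclass
class ReplicationPlan:
    replication_interval_s: float  # how often data is copied to the standby
    replication_latency_s: float   # delay between the two environments
    switchover_s: float            # time needed to switch to the standby


@dataclass
class AppRequirement:
    max_data_loss_s: float  # seconds of recent writes the app may lose
    max_downtime_s: float   # how long the app may be unavailable


def worst_case_data_loss_s(plan):
    # Simplifying assumption: a failure just before the next copy loses one
    # full interval of writes plus whatever was still in transit.
    return plan.replication_interval_s + plan.replication_latency_s


def plan_meets(plan, req):
    return (worst_case_data_loss_s(plan) <= req.max_data_loss_s
            and plan.switchover_s <= req.max_downtime_s)


# Illustrative numbers only.
fast = ReplicationPlan(replication_interval_s=1, replication_latency_s=2, switchover_s=45)
nightly = ReplicationPlan(replication_interval_s=86400, replication_latency_s=60, switchover_s=3600)
checkout = AppRequirement(max_data_loss_s=10, max_downtime_s=60)

print(plan_meets(fast, checkout))     # True
print(plan_meets(nightly, checkout))  # False
```

Running the same check for each application makes it easier to see which ones justify near-real-time replication and which can tolerate a slower, cheaper setup.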

Many companies hold all three tiers in cloud-based data centers spread across different cities, states, and even separate areas of the country. Spreading infrastructure across these locations, known as availability zones, helps companies prepare for power glitches or geographical issues that could affect their ability to access data and process requests.

Data Replication

At Experts Exchange, we can write to a database and within seconds, it copies that data to a high availability server in another zone. If one availability zone goes down, we can switch over to the other high availability server in less than a minute.
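The switchover described here can be sketched as a simple health-check loop in Python. This is a minimal illustration under stated assumptions, not a description of Experts Exchange's actual tooling: the server names are hypothetical, and the health check is a stub standing in for a real connectivity probe.

```python
def choose_database(servers, is_healthy):
    """Return the first healthy server, checking the primary first."""
    for server in servers:
        if is_healthy(server):
            return server
    raise RuntimeError("no healthy database server available")


# Hypothetical server names; the primary is listed first.
servers = ["db-primary.zone-a", "db-standby.zone-b"]

# Simulate an outage of zone A with a stubbed health check.
down = {"db-primary.zone-a"}
active = choose_database(servers, lambda s: s not in down)
print(active)  # db-standby.zone-b
```

The ordering of the list encodes the preference: the standby is used only when the primary fails its check.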

This replication across availability zones is vital, especially when you consider what happened during AWS’ service disruption last week. Over several hours, AWS' Simple Storage Service (S3), the company's object storage service, experienced a region-wide disruption in its US-EAST-1 region on the east coast. Data spread across servers and availability zones in that region was inaccessible. The companies affected had diversified only to other S3 servers in the same region. The disruption brought work to a halt at many of them.

To avoid major disturbances in these situations, Richardson recommends that companies diversify so that not all of their data sits in one zone.

“Say you replicate to a high availability database on the east coast and also to one on the west coast. Having this diversification when you’re already replicating data within seconds means if a significant enough problem occurs, you can switch over from one web tier load to the other in as little as 10 minutes,” Richardson says.

Experts Exchange, for example, relies on multiple web servers and data centers that run at different levels of capacity in different areas. This way, if one area is impacted, the company can shift to another data center and web server and still function on the remaining capacity.

“It’s important to spread capacity across multiple data centers and availability zones,” Richardson explains. “There are multiple layers. You can diversify each tier across multiple availability zones and you can also diversify within an availability zone.”
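A quick way to test whether capacity is spread well enough is to remove each zone in turn and see whether the remainder still covers peak load. The Python sketch below does this. The zone names and request rates are made-up figures for illustration, not Experts Exchange's actual numbers.

```python
def surviving_capacity(capacity_by_zone, failed_zone):
    """Total capacity left when one zone is lost."""
    return sum(c for zone, c in capacity_by_zone.items() if zone != failed_zone)


def survives_any_single_zone_loss(capacity_by_zone, peak_load):
    # True only if every possible single-zone failure leaves enough capacity.
    return all(surviving_capacity(capacity_by_zone, z) >= peak_load
               for z in capacity_by_zone)


# Requests per second per zone; illustrative values only.
capacity = {"east": 600, "central": 500, "west": 400}

print(survives_any_single_zone_loss(capacity, peak_load=900))   # True
print(survives_any_single_zone_loss(capacity, peak_load=1000))  # False
```

With these figures, losing the largest zone leaves 900 requests per second, so a peak of 900 is survivable and a peak of 1000 is not.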

Some companies that have massive on-campus data centers practice extreme zoning inside. From room to room, data servers could be hooked up to completely different power grids and AC units. In situations like these, diversification can become as fine-tuned as a company desires. Some have gone so far as to set up different power grids and generators per cabinet in the same room.

“It doesn’t matter if you’re storing information in the cloud or in an on-premise server room, you can make diversification happen,” Richardson says.

Steps to Take If a Disruption or Outage Occurs

If multiple parts of business operations go down because of an error or issue at a data provider, Phil Phillips, DevOps and engineering director at Experts Exchange, recommends that DevOps directors and managers mitigate the problem in the following ways:
  1. Notify your users. Create a status page, an on-site notification, and post a social media announcement so your users are not in the dark.
  2. Be prepared. Have a plan in place for when vital pieces of your infrastructure become unavailable. Make sure you practice and update this plan periodically.
  3. Strength in code. Sometimes, depending on the situation, you can design code to gracefully handle a failure.
“[This approach] is definitely something to have in mind when integrating with a third party,” says Phillips. “For example, a single component on a web page might stop working if a third party has issues. Instead of having the whole page error, you might be able to hide or disable the single component and serve the rest of the page.”
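The idea Phillips describes, hiding one failing component while serving the rest of the page, might look something like this minimal Python sketch. The function names and the simulated failure are hypothetical. A real application would typically also log the error and apply a timeout to the third-party call.

```python
def render_page(components):
    """Render each component; skip any that fail rather than failing the page."""
    parts = []
    for render in components:
        try:
            parts.append(render())
        except Exception:
            # Assumption: hiding the component is acceptable for this page.
            # A real system would also log the failure here.
            continue
    return "\n".join(parts)


def article_body():
    return "<article>Main content</article>"


def third_party_widget():
    # Simulates a third-party dependency that is currently down.
    raise ConnectionError("third-party service unavailable")


def footer():
    return "<footer>Site footer</footer>"


page = render_page([article_body, third_party_widget, footer])
print(page)
```

The page still renders the article body and footer; only the widget that depends on the unavailable third party is left out.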

For more information on protecting your company’s information in the cloud, check out this article.