Managing Disaster in The Public Cloud


Despite the best intentions of the public cloud infrastructure providers, issues do happen from time to time. Some problems are more significant than others, as we saw with the recent Amazon Web Services (AWS) DynamoDB outage (details here). The impact on AWS customers was widespread, and although it would be easy to dismiss the affected services as trivial (Netflix, IMDB, Tinder, Buffer), these are still clients running production workloads. Do a quick Google search and it’s easy to find many other similar examples, including this one affecting Microsoft’s Azure Storage Service.

It would be naïve, and a mistake, to assume that the infrastructure provided by AWS, Google and Microsoft is inherently less secure than anything a customer could deploy within their own data centre. These three companies are in the data centre business; most end users are not. However, it’s impossible to guarantee 100% availability, so how can we protect ourselves against these kinds of failures and disasters in the public cloud?

Understand the architecture

It’s not enough to say that the cloud service providers provide a “service” and as a result we shouldn’t be concerned with how that service is implemented. In the case of the AWS outage, it appears that the design of DynamoDB was resilient at a hardware level (as it was distributed across multiple servers), but wasn’t logically resilient: a metadata issue affected the entire service. In this instance, there was one failure domain, and it failed. Although it would have been less efficient, having multiple DynamoDB instances might have reduced the impact of the problem for some customers.
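To make the failure domain idea concrete, here is a minimal Python sketch. It is an illustration only: the customer names, the round-robin placement and both helper functions are hypothetical and say nothing about how DynamoDB actually places data. It simply shows that when everything sits in one logical domain, one failure reaches every customer, whereas several independent domains limit the reach of a single failure.

```python
def assign_domains(customers, domain_count):
    """Spread customers round-robin across independent failure domains (hypothetical placement)."""
    return {customer: index % domain_count for index, customer in enumerate(customers)}


def affected_by_failure(assignment, failed_domain):
    """Return the customers whose service lives in the failed domain."""
    return sorted(c for c, domain in assignment.items() if domain == failed_domain)


# Hypothetical customers, for illustration only.
customers = ["customer-a", "customer-b", "customer-c", "customer-d"]

single_domain = assign_domains(customers, 1)  # one shared logical service
four_domains = assign_domains(customers, 4)   # one domain per customer

print(affected_by_failure(single_domain, 0))  # every customer is hit
print(affected_by_failure(four_domains, 0))   # only the customer in domain 0 is hit
```

The trade-off is the one described above: more domains mean less sharing and so less efficiency, in exchange for a smaller blast radius.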

Always look in detail at the architecture and decide what impact it will have on your applications. This may mean revisiting existing implementations with a fresh eye on how to architect for all possible failure scenarios. The best solution for the customer isn’t always the best solution for the service provider. AWS benefits from large-scale, efficient deployments by having one service span many servers; however, this exposes the customer more when problems occur.

Build in extra resiliency

It’s not unreasonable to add more resiliency than initially appears necessary. There’s a trade-off to be made between the potential increased cost this incurs and the lost revenue that occurs through increased downtime. I’m sure Tinder and IMDB are upset with some lost advertising revenue; Netflix might be measuring the cost in more concrete terms such as the service credits they will have to issue, or customers closing their accounts.
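The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope calculation in Python; every figure in it (outage hours, revenue per hour, the yearly cost of the extra resiliency) is an invented assumption for illustration, not data from any of the companies mentioned.

```python
def expected_downtime_cost(outage_hours_per_year, revenue_per_hour):
    """Revenue lost per year for a given amount of downtime."""
    return outage_hours_per_year * revenue_per_hour


def resiliency_pays_off(extra_cost_per_year, hours_without, hours_with, revenue_per_hour):
    """True when the downtime cost avoided exceeds what the extra resiliency costs."""
    saved = (expected_downtime_cost(hours_without, revenue_per_hour)
             - expected_downtime_cost(hours_with, revenue_per_hour))
    return saved > extra_cost_per_year


# Hypothetical figures: resiliency costing 50,000 a year cuts downtime from 8 hours to 1.
print(resiliency_pays_off(50_000, 8, 1, revenue_per_hour=10_000))  # 70,000 saved
print(resiliency_pays_off(50_000, 8, 1, revenue_per_hour=5_000))   # 35,000 saved
```

The same investment is worthwhile for one business and not for another, which is why the calculation has to be done per application. Costs that are harder to quantify, such as customers closing their accounts, would push the balance further towards resiliency.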

From a storage perspective, what does extra resiliency mean? Well, it can mean putting in place redundant backup copies of data, taken either through replication or snapshots. It may also mean implementing a set of standby services that use those replicas and snapshots in the event of a failure in the primary environment.
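As a rough guide to what a standby environment buys, the Python sketch below estimates yearly downtime with and without one. It rests on two strong assumptions that should be stated loudly: the primary and standby fail independently, and failover is instant. The 99.9% availability figure is hypothetical. A shared logical failure domain, as in the DynamoDB example, breaks the independence assumption, so treat the result as a best case.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(primary, standby):
    """Availability of primary plus standby, ASSUMING independent failures and instant failover."""
    return 1 - (1 - primary) * (1 - standby)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figure: each environment is available 99.9% of the time.
single = downtime_hours_per_year(0.999)
paired = downtime_hours_per_year(combined_availability(0.999, 0.999))

print(f"single environment: {single:.2f} hours/year")
print(f"with standby:       {paired:.4f} hours/year")
```

In practice the gain is smaller, because replicas lag behind the primary and failover takes time, but the calculation shows why a second copy in a genuinely separate fault domain is worth considering.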

Zadara Storage VPSA

How can we view Zadara’s Virtual Private Storage Array in the context of this discussion? Zadara’s VPSA provides dedicated hardware for the customer, creating a fault domain that is separate from other customer data. A single customer can choose to further subdivide the disk hardware in order to separate fault domains at a more granular level.

The Zadara VPSA supports a range of data protection features: snapshots for local protection against corruption, and remote replication for disaster recovery protection. Remote replication is a feature not currently provided by the big cloud service providers. For example, an Amazon EBS (Elastic Block Store) volume is protected within its Availability Zone but not across zones; protection across zones would need to be built into the application or implemented manually using snapshots. VPSA replication can move data between AWS Availability Zones, giving greater resiliency in the event of a disaster in a single location. It’s also possible to run multiple VPSAs within a single Availability Zone if required.

In many respects, the Zadara VPSA provides resiliency capabilities that are focused on the way applications have traditionally been developed, something that cloud providers haven’t focused on. By offering many ways to protect data, Zadara’s VPSA provides a “belt and braces” approach to data protection, removing the dependency on any single process to secure data. These features can be used to protect both traditional and new “cloud native” applications.

 
