
Transforming Data Protection: Unveiling Faster, More Reliable Backups and Snapshots


Jawaad Tariq, House Li, Urchin Colley, Jenni Griesmann, and Archana Kamath

Posted: May 15, 2024 | 5 min read

In response to valuable feedback from our customers, we embarked on a journey to improve our Backup and Snapshot product for Droplets, addressing concerns related to Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). Recognizing the need for more frequent backups and quicker recovery, we have invested in revamping both our software and storage hardware stack.

Addressing the need for speed

One of the key challenges highlighted by our customers was the inadequacy of weekly backups, especially for transactional workloads where every moment counts. Understanding the importance of time-sensitive scenarios, we set out to significantly reduce the time required for both backup and restoration processes.

1. Faster backups

Our revamped Backup and Snapshots capability now takes backups in less than one-third of the time*, on average, compared to our previous system, as observed via our monitoring tools in the data centers where the new capabilities have already been rolled out. This improvement helps ensure our customers’ critical data is protected efficiently without disrupting their ongoing operations. Two significant enhancements in the stack made this improvement possible. First, backups now track changed blocks on disk more efficiently, which minimizes the size of each backup. Second, the data is compressed using CPU-backed compression algorithms and then written to a highly resilient, performant flash-based storage backend, replacing the hard disk drive-backed storage we used previously. These optimizations are setting us up to offer additional backup scheduling options on a daily basis in the future.
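To make the two ideas concrete, here is a minimal Python sketch of an incremental backup that stores only changed blocks and compresses each one. It is an illustration under stated assumptions, not our production implementation: the 4 KiB block size, the use of SHA-256 digests to detect changes, the choice of zlib, and the function names are all hypothetical.

```python
import hashlib
import zlib
from typing import Dict, List

BLOCK_SIZE = 4096  # hypothetical block size in bytes


def block_digests(disk: bytes) -> List[str]:
    """Return one SHA-256 digest per fixed-size block of the disk image."""
    return [
        hashlib.sha256(disk[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(disk), BLOCK_SIZE)
    ]


def incremental_backup(disk: bytes, previous_digests: List[str]) -> Dict[int, bytes]:
    """Compress only the blocks whose digest differs from the previous backup.

    Returns a mapping of block index -> zlib-compressed block contents.
    """
    changed: Dict[int, bytes] = {}
    for index, digest in enumerate(block_digests(disk)):
        is_new = index >= len(previous_digests)
        if is_new or previous_digests[index] != digest:
            start = index * BLOCK_SIZE
            changed[index] = zlib.compress(disk[start:start + BLOCK_SIZE])
    return changed
```

If a single block out of thousands changes between two backups, only that block is compressed and uploaded, which is where most of the size and time savings in a scheme like this would come from.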

2. Speedier restores

In catastrophic circumstances where every second matters, we’ve achieved a remarkable two to three times* increase in the speed of restores, on average, as observed via our monitoring tools. This means that our customers can now recover their systems in less time than it would have taken with our previous solution, reducing downtime and helping to maximize operational efficiency. The way we break Droplet images down into blocks now also allows us to download multiple parts of an image in parallel instead of needing to stream the file in order. This gives us much more control over how much bandwidth we trade for speed. Another factor is the all-new, performant flash-based backend, which serves data over a faster, lower-latency network connection.
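The parallel download idea can be sketched in a few lines of Python. This is a simplified illustration, not our restore code: the in-memory `store` dictionary stands in for the storage backend, and `fetch_block`, `restore_image`, and the `max_parallel` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def fetch_block(store: Dict[int, bytes], index: int) -> Tuple[int, bytes]:
    """Stand-in for a network fetch of one image block from the backend."""
    return index, store[index]


def restore_image(store: Dict[int, bytes], max_parallel: int = 4) -> bytes:
    """Fetch blocks with up to max_parallel workers, then reassemble in order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        fetched = dict(pool.map(lambda i: fetch_block(store, i), store))
    # Blocks may arrive in any order; the index restores the original layout.
    return b"".join(fetched[i] for i in sorted(fetched))
```

In a design like this, `max_parallel` is the knob that trades bandwidth for speed: raising it fetches more blocks at once, and lowering it reduces pressure on the network.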

3. Backup scheduler

Our team has undertaken a comprehensive redesign of our current backup scheduler to meet the evolving demands of our users.

Previously, we offered a 23-hour backup window that spanned an entire day. This approach resulted in a concentrated surge of backup activity at the onset of the window, with a potential to overwhelm our network and storage infrastructure. Not only did this pose risks of system overload, but it also hindered our scalability efforts, limiting our ability to onboard more customers.

To address these challenges, we’ve implemented a new strategy. Instead of a single 23-hour window, we’ve divided backups into multiple 4-hour intervals. Customers are now assigned to different start times, allowing for a more distributed and manageable workload. This reduces the strain on our internal systems and also accelerates the backup process within each specific window.

The graph above illustrates a notable decrease in the maximum concurrent requests, as well as a significantly shorter overall pagination time.

Note: Pricing and product information are correct as of May 14, 2024, vary by customer, and are subject to change.

The adoption of the 4-hour window approach provides an impactful improvement to user experience. Customers are now empowered to customize their backup schedules by selecting their preferred time slot. This includes the flexibility to choose both the day and a 4-hour window for weekly backups, along with the specific 4-hour window for daily backups.
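As a rough illustration of how Droplets might be spread across 4-hour windows while still honoring a customer's choice, consider the Python sketch below. The hash-based default assignment and the function names are assumptions made for this example; the actual assignment policy is not described here.

```python
import hashlib
from typing import Optional, Tuple

WINDOW_HOURS = 4
WINDOWS_PER_DAY = 24 // WINDOW_HOURS  # six windows: 00-04, 04-08, ...


def default_window(droplet_id: int) -> int:
    """Pick a stable window index by hashing the Droplet ID (assumed policy)."""
    digest = hashlib.sha256(str(droplet_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % WINDOWS_PER_DAY


def window_bounds(window: int) -> Tuple[int, int]:
    """Return the (start_hour, end_hour) covered by a window index."""
    start = window * WINDOW_HOURS
    return start, start + WINDOW_HOURS


def assign_window(droplet_id: int, preferred: Optional[int] = None) -> int:
    """Use the customer's preferred window if given, else the hashed default."""
    if preferred is None:
        return default_window(droplet_id)
    if not 0 <= preferred < WINDOWS_PER_DAY:
        raise ValueError("preferred window must be between 0 and 5")
    return preferred
```

Hashing the ID keeps the default assignment stable across runs and spreads Droplets roughly evenly over the six windows, which is one simple way to avoid a surge at the start of a single long window.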


Another improvement we’ve implemented is the transition away from the polling mechanism for backup schedule checks. Instead, we’ve introduced a distributed queuing system, allowing for the setting of individual timers for each backup schedule. This overhaul reduces the backend database load and enhances the scalability and flexibility of our system. As a result, we’re now positioned to offer more advanced control options in the future, extending beyond the current daily and weekly backup features available in select data centers.
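The difference between polling and per-schedule timers can be illustrated with a small in-memory priority queue in Python. This is only a single-process sketch of the concept: our system uses a distributed queue whose details are not covered here, and the class and method names below are hypothetical.

```python
import heapq
from datetime import datetime, timedelta
from typing import List, Tuple


class BackupTimerQueue:
    """Min-heap of (fire_time, schedule_id, interval): one timer per schedule."""

    def __init__(self) -> None:
        self._heap: List[Tuple[datetime, str, timedelta]] = []

    def add(self, schedule_id: str, first_run: datetime, interval: timedelta) -> None:
        if interval <= timedelta(0):
            raise ValueError("interval must be positive")
        heapq.heappush(self._heap, (first_run, schedule_id, interval))

    def pop_due(self, now: datetime) -> List[str]:
        """Return schedules whose timer has fired and re-arm each one."""
        due: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            fire_time, schedule_id, interval = heapq.heappop(self._heap)
            due.append(schedule_id)
            # Re-arm strictly after `now` so a late check fires only once.
            next_fire = fire_time + interval
            while next_fire <= now:
                next_fire += interval
            heapq.heappush(self._heap, (next_fire, schedule_id, interval))
        return due
```

Only the earliest timer needs to be inspected to know whether any work is due, so the cost no longer grows with a full scan of every schedule in the database on each check.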

4. Enhanced backup policy monitor

Another key area of investment was our monitoring for tracking and alerting on missed backups. With the improved internal monitoring now available, the turnaround time for alerting and acting on missed backups has been reduced. This new monitoring was designed to be agnostic to the scheduling system, which means we can implement the new changes confidently, knowing that we will be immediately alerted if there are any issues with the new system.
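One way a monitor can stay agnostic to the scheduler is to look only at outcomes: when each Droplet was last backed up, compared with what its policy requires. The Python sketch below shows that idea under stated assumptions; the data shapes, the 4-hour grace period, and the function name are hypothetical and do not describe our internal tooling.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def missed_backups(
    last_backup_at: Dict[str, datetime],
    policy_interval: Dict[str, timedelta],
    now: datetime,
    grace: timedelta = timedelta(hours=4),  # assumed allowance for the window
) -> List[str]:
    """Return Droplet IDs whose latest backup is older than policy allows.

    The check never consults the scheduler; it only compares backup
    timestamps against the expected interval plus a grace period.
    """
    missed = []
    for droplet_id, interval in policy_interval.items():
        last = last_backup_at.get(droplet_id)
        if last is None or now - last > interval + grace:
            missed.append(droplet_id)
    return sorted(missed)
```

Because the check depends only on recorded backups, it would keep working unchanged if the scheduling system underneath were replaced.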

5. Flexible image retention

We changed our image retention logic from count-based to date-based. Every backup taken now has an expiration date attached to it. This new architecture allows more flexibility around image expirations as we work towards providing customers with more control over their backups.
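A date-based retention model can be sketched in Python as follows. The `Backup` record, the helper names, and any retention periods passed in are illustrative assumptions, not our actual schema or retention values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Backup:
    image_id: str
    created_at: datetime
    expires_at: datetime  # set once, when the backup is taken


def take_backup(image_id: str, created_at: datetime, retention: timedelta) -> Backup:
    """Attach an expiration date to the backup at creation time."""
    return Backup(image_id, created_at, created_at + retention)


def expired(backups: List[Backup], now: datetime) -> List[Backup]:
    """Select backups whose expiration date has passed."""
    return [b for b in backups if b.expires_at <= now]
```

With a count-based rule, taking a new backup implicitly decides which old one is deleted. With an expiration date on each image, backups with different retention periods can coexist, and cleanup becomes a simple date comparison.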

Global rollout

We are excited to announce that the new Backup and Snapshots capability is currently being rolled out across all our global data centers. This helps ensure that our customers worldwide can benefit from these improvements, irrespective of their geographic location.

As we continue the rollout, we are discovering additional enhancements that are required to build a resilient system. For example, early in the rollout we had to tune the data block concurrency of our uploads because they were using enough bandwidth to inundate the network. Fortunately, because of how we built the system, that was not a difficult knob to adjust: we reduced the number of concurrent streams.

We’re proud to offer a Backup and Snapshot solution that meets the expectations of DigitalOcean customers. The faster backups and restores, coupled with enhanced control and improved storage reliability, mark a step forward in our commitment to providing top-notch data protection. As technology continues to evolve, we remain dedicated to staying at the forefront, helping to ensure our customers’ data is secure, available, and recoverable at a moment’s notice.

Protect your data, protect your business

Daily Droplet backups are now available in SYD1, TOR1, and SFO2, as well as SGP1, NYC1, AMS3, NYC3, SFO3, with availability in other data centers coming very soon. Daily Droplet backups are priced at 30% of the Droplet cost**. Read more about backup pricing or contact our sales team.

Add daily backups to your Droplet workloads today to help protect your data!

*Actual backup speed gains and performance may vary depending on a variety of factors such as system configuration, I/O load, operating environment, and type of workloads.

**Note: Pricing and product information are correct as of May 14, 2024, and subject to change.
