June 19 2021
Optimizing Retrospect's Algorithm for Resource Scheduling
Optimized algorithms enable Engineering teams to deliver a significantly better, dynamic experience for customers, given the right choice of algorithms for the problem space. However, advanced AI algorithms or machine learning techniques aren’t necessary to vastly improve many solutions for customers.
For Retrospect Backup, one core problem is resource scheduling, specifically when to back up a data source, accommodating environments from a couple computers to a large business with hundreds of different data sources with different availability patterns, bandwidths, and data sizes. The goal is to protect every resource within the backup policy that it’s assigned (i.e. every day).
Let’s say you have a main desktop computer and a laptop. The Retrospect engine is running on the desktop, and the Retrospect client agent is running on the laptop. When does the Retrospect engine back up your laptop? The ideal solution is to start the backup when your laptop becomes available. That is exactly what Retrospect does.
The Retrospect engine waits for agents to contact it to schedule them for backup. Retrospect client agents send out a network message when the computer connects to a network, and the Retrospect engine is listening for that. When it detects a new client, the engine checks if the computer is out of policy, and if so, starts a new backup.
This type of resource scheduling is unique in the backup industry. Whereas other backup solutions schedule backups based on a particular time, like every day at 10am, Retrospect can schedule backups for any time that a data source is available within a specific time window, such as anytime between 9am and 5pm.
Moreover, Retrospect can back up more than just computers. It can protect physical servers, virtual machines, desktops and laptops (endpoints), external hard drives, network attached storage (NAS) volumes, email accounts, and cloud storage. The wide variety of resources translates into a dynamic set of scheduling problems.
We created this resource scheduling feature, named “Backup Server”, over fifteen years ago to handle the sporadic nature of endpoints coming and going from the network, and we’ve steadily expanded it to accommodate more data sources, renaming it to Proactive Backup and most recently to ProactiveAI, as it’s significantly smarter than its predecessors.
This type of resource scheduling has a number of attributes to utilize for optimization:
- Data source type
- Data source availability
- Data source importance
- Last backup date
- Last backup size
- Last backup duration
- Last backup performance
Compared to other backup solutions, the previous iteration of Retrospect’s resource scheduling algorithm–a simple first-in-first-out (FIFO) strategy–worked well. However, the sheer diversity of our customer base exposed the algorithm’s weaknesses:
Start Time: The next backup date was scheduled based on the previous backup’s completion date. This nuance translated into time skew, where the first backup started at 10am and finished at 11am, the second started at 11am, and so on until there is a day when the backup is scheduled to start after the backup window and thus skips a day.
Priority, Availability, and Size: Servers are generally always available for backup, but endpoints come and go. Desktops might be shut off at the end of the day or have energy-saving mode enabled. Laptops typically leave the network each day. NAS volumes are always accessible as are email accounts and cloud storage locations. All vary drastically in how important they are to a business and how much data they typically hold.
We needed an algorithm that made initial assumptions about priority but adjusted based on historical data without significant hysteresis.
Wake-On-Lan (WOL): Computers support an energy-saving feature called Wake-On-Lan or WOL. The computer can be asleep to save on power consumption but still listen on its network interface for a special network packet telling it to wake up, enabling it to save on electricity but also be available for tasks like backups.
Computers (especially older ones) take time to wake up, so Retrospect needs to wait at least three minutes after sending the WOL packet to see if the computer is awake, keeping in mind that Retrospect does not know if the computer is even on the network. The computer will eventually go back to sleep based on the energy settings.
The algorithm sent a WOL packet to every applicable computer, moved the computer to the secondary list, and came back to the list after waiting at least three minutes checking other sources for availability. This process led to unfortunate situations where Retrospect sent WOL packets to a subset of computers but did not get back to them until they had returned to sleep mode and Retrospect would send more WOL packets, leading to energy waste and fewer backups.
Parallelization: Despite allowing up to 16 simultaneous executions, the algorithm ran on a single thread, so when it polled each agent to check availability, the engine might wait for five minutes, because of the Wake-On-Lan support.
All of these drawbacks led to a suboptimal experience for certain subsets of customers, particularly those with larger environments.
In one instance, a corporation with 700 laptops that were configured with Wake-On-Lan attempted daily backups with a single Retrospect engine. Beyond being a far larger environment than a typical backup engine handled, the resource scheduling algorithm simply could not scale to that level.
The Engineering team researched various algorithms to preserve the core approach to dynamic scheduling while resolving the above shortcomings. We considered clustering by data source type, ranking by size, and prioritizing by bandwidth or past availability. Ultimately, we settled on a far simpler algorithm that focused on the last backup date and last backup duration as the main inputs.
Retrospect runs this algorithm, named ProactiveAI, for each of its 16 available execution units (slots) for all of the backup scripts (policies) on that Retrospect engine. Let’s walk through it.
Verify backup window: Retrospect can back up every hour, every day, every Sunday, or any other schedule. As soon as ProactiveAI sees a new backup window (i.e. a new day), it will attempt to back up the sources.
Verify an execution unit is available: ProactiveAI only runs when an execution unit is available.
Prioritize by next backup date: For all available or potentially available sources, Retrospect divides them into buckets for what day they are scheduled to be backed up next.
Using a future date might seem strange, but it can be in the past as well. This sorting algorithm ensure Retrospect prioritizes initial backups and then overdue backups. Think of it as last backup date combined with the script’s schedule. As an example, Script A with weekly backups and Script B with daily backups would calculate the next backup date differently.
Prioritize by last time checked: When Retrospect reaches out to a source, it marks that time in its configuration. ProactiveAI uses this time to ensure it doesn’t re-check sources that it already checked but couldn’t find, so that the script can process the entire list of sources before circling back.
Prioritize by the last backup duration: Now that Retrospect is down to sources within the same day of priority, ProactiveAI sorts them based on the last backup duration. Sources with faster previous backups will be backed up sooner than sources with slower previous backups.
As a real-life example, incremental backups of email services are fast, so those would be prioritized over a longer server backup. Because of this sorting, Retrospect will protect more sources throughout the day, but if a long server backup does not happen on a given day, its backup will be automatically given higher priority because its next backup was the day before. Conversely, a long daily server backup will not consistently crowd out shorter backups for endpoints or other data sources.
The Engineering team experimented with using the mean or median of more data points for each source, but the resulting sort order was too prone to hysteresis. In other words, if Retrospect included more past data, including backup durations that were anomalies, the future prioritization continued to be affected for longer than we thought was useful.
Default to prior order: If there is no duration, ProactiveAI uses the prior order. For instance, if it’s the first set of backups, they will occur as sources are available.
Connect to the next source: Retrospect will attempt to back up the selected source. If it’s not available, Retrospect marks that time and moves on. If Retrospect times out and the client and script have Wake-on-LAN (WOL) set, Retrospect sends a WOL packet, waits three minutes, then tries to connect again. If that connection times out, Retrospect marks the sources as unavailable and moves on.
Record next backup date: After a successful backup, Retrospect marks the next backup date for the source and moves on. As discussed earlier, this future date varies based on the script’s schedule.
Focusing only on last backup date and duration limits the assumptions that the algorithm needs to make about the specific environment.
The algorithm can ensure first backups are prioritized and then subsequent backups are ranked by likely duration, using only the last occurrence to avoid hysteresis. Changing how Retrospect handles Wake-On-Lan eliminates the chance of waking up a computer but not connecting to it, and per-execution unit scheduling allows up to 16 scheduling algorithm to run in parallel. Ignoring backup times means Retrospect never misses a day for a backup.
Finally, customers can solve questions of priority by creating multiple scripts, letting the application handle that rather than complicating a single script’s schedule with it.
Computational advances have made advanced AI algorithms like machine learning and deep learning much more accessible as tools, but Engineering teams still need to choose the algorithm that fits their problem space.
The Retrospect Engineering team used this resource scheduling method to ensure more disparate data sources were backed up faster, vastly improving our customers’ level of data protection.