Contents
- Introduction: Shifting the Focus from Prevention to Recovery
- Traditional Metrics vs. Recovery‑Centric Resilience
- The Cyber Recovery Gap: Lessons from Recent Incidents and Research
- Recovery Speed as the Real Metric of Resilience
- Factors Affecting Recovery Speed
- Selecting the Right Metrics and KPIs
- Why Bacula Excels at Fast, Clean Recovery
- Strategies to Accelerate Recovery and Improve Resilience
- Conclusion: Adopting a Recovery‑First Mindset
- Frequently Asked Questions
Introduction: Shifting the Focus from Prevention to Recovery
For most of the last 20 years, the primary investment case for cybersecurity has centered on prevention: firewalls, endpoint protection, threat intelligence, and keeping attackers out at all costs. That made sense when incidents were less frequent and more containable.
This approach makes far less sense in a world where, for many organizations, the question has shifted from “Will we have a major incident?” to “How fast will we recover after having an incident?”
The business impact of downtime and ransomware attacks
As businesses have become more reliant on uninterrupted information access, the financial and operational impact of unplanned downtime has increased significantly. In industries like healthcare, financial services and critical infrastructure, being offline for a matter of hours can lead to a wide range of detrimental events:
- Postponed operations
- Botched transactions
- Regulatory penalties
- Damaged brand reputation that lasts beyond the actual downtime
Modern ransomware has changed this dynamic significantly. It is now standard practice to attack the backups alongside the primary systems, if only to reduce the recovery options (and leverage) of the attacked organization. Paying a ransom does not guarantee the restoration of business operations, either – decryption keys are often slow or incomplete, and the restored data could still contain dormant malicious code. The recovery process, therefore, isn’t just about reversing the encryption.
Cyber resilience defined: beyond protection and detection
Cyber resilience is commonly treated as a synonym for cybersecurity, even though the two are conceptually different. Cybersecurity concentrates on minimizing the possibility of an incident occurring, whereas cyber resilience describes how a business restores the required functions when preventative controls fail. Given the sophistication of modern threats, the failure of these controls is a question of “when”, not “if”.
A resilient organization is not one that has no incidents; it is one that recovers from incidents faster, more smoothly, and with less sustained impact on operations. This distinction matters for setting strategy, allocating budget, and evaluating whether existing controls are adequate in the first place.
Traditional Metrics vs. Recovery‑Centric Resilience
Most of the metrics commonly used to measure security posture were developed in an era when containment was the primary security goal. They are still valuable, but they give an incomplete picture of how well an organization will perform in a serious incident, because their scope ends once the attacker is removed. Recovery-centric resilience, on the other hand, treats that point as the beginning, not the end, focusing on how efficiently and cleanly a company can return to normal functioning.
Brief overview of MTTD, MTTR, RPO and RTO
MTTD (Mean Time to Detect) is used to define the time between when something has happened and when that fact is discovered.
MTTR (Mean Time to Respond, in security contexts) is used to define the time between detection and containment.
RPO (Recovery Point Objective) defines the maximum acceptable data loss as a point in time; RTO (Recovery Time Objective) defines how quickly systems must be recovered.
These metrics are not new to security, and they themselves are not the problem per se. The problem lies in how much weight organizations give them in relation to each other.
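To make the relationship between the four metrics concrete, the sketch below maps them onto a single incident timeline. All timestamps are hypothetical, chosen purely for illustration:

```python
from datetime import datetime

# Hypothetical incident timeline (all timestamps are illustrative).
compromise  = datetime(2024, 3, 1, 2, 0)   # attacker gains access
detection   = datetime(2024, 3, 1, 14, 0)  # alert fires
containment = datetime(2024, 3, 1, 18, 0)  # attacker removed
restored    = datetime(2024, 3, 2, 6, 0)   # systems back online
last_backup = datetime(2024, 3, 1, 0, 0)   # most recent usable backup

mttd = detection - compromise             # Mean Time to Detect
mttr_security = containment - detection   # Mean Time to Respond (security sense)
data_loss = compromise - last_backup      # actual loss, compared against the RPO
recovery_time = restored - containment    # actual recovery, compared against the RTO

print(f"MTTD: {mttd}, response time: {mttr_security}")
print(f"Data loss window: {data_loss} (check against RPO)")
print(f"Recovery time: {recovery_time} (check against RTO)")
```

Note that the first two deltas measure what happened, while the RPO/RTO comparisons measure it against what the business has declared acceptable – which is exactly the weighting question raised above.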
Limitations of detection speed and prevention spend
Detection speed matters, but only up to a point. Knowing about an intrusion immediately is beneficial in itself, but if the organization’s infrastructure is unable to recover cleanly once the issue has been identified and contained, there is no meaningful reduction in business impact.
Prevention spend faces a similar ceiling: no single preventive control can eliminate risk entirely, and a security budget weighted heavily toward prevention at the expense of recovery capability leaves an organization well-defended and poorly prepared at the same time.
Why mean time to recovery (MTTR/MTCR) matters more
The metric that best reflects an organization’s resilience is how long it actually takes to return from an incident to a verified clean, fully operational state. This goes well beyond the usual definition of MTTR in security operations.
In the context of data recovery, Mean Time to Clean Recovery (MTCR) is the time between incident confirmation and a trusted, malware-free system running at full capacity. The distinction becomes extremely important because it considers the integrity of what is being restored, not merely the restoration speed.
The Cyber Recovery Gap: Lessons from Recent Incidents and Research
The gap between assumed recovery capability and actual recovery performance is often quite substantial. It’s not uncommon for organizations to discover this difference during an incident, not in testing – which is far from the most suitable time to find it out.
High failure rates of ransomware restorations in healthcare and other sectors
Healthcare is one of the primary targets for ransomware, both due to the overall importance of healthcare operations and because of the legacy infrastructures and underfunded IT departments that are both common in the field.
According to the Sophos State of Ransomware in Healthcare 2024 report, only 22% of healthcare organizations were able to recover from ransomware attacks within a week or less, which is a significant drop from the 54% of organizations that reported successful recovery back in 2022.
The same report also revealed that attackers tried to compromise the backups of healthcare organizations in 95% of cases, with two-thirds of those attempts succeeding. Organizations with compromised backups were also more than twice as likely to pay the ransom (63% vs 27%).
Data showing limited recovery practice and compromised backups
The frequency of recovery testing is a persistent weak point. A 2021 study referenced by DR practitioner Dale Shulmistra found that nearly half of businesses test disaster recovery once a year or less, and 7% don’t test it at all.
Attackers have learned to target that vulnerability: dwell time (the period between the intruder gaining access and the ransomware being triggered) can range from days to months, giving malware ample time to enter the backup chain. Unless integrity checking is integrated into the backup process, there is no telling how far back one would have to go to find an uncompromised backup.
Typical recovery speeds for different storage media
The speed of recovery is greatly affected by the storage media being used.
The fastest tier includes NVMe SSDs and storage-class memory (SCM). Traditional SAS/SATA drives are much slower in comparison, object storage performance depends on network and object size, and tape introduces substantial retrieval latency (up to several hours for large data sets).
Precise throughput figures are environment-specific and typically live in vendor benchmarks rather than independent research – but the gap between the tiers is large enough to determine whether a documented RTO is actually achievable.
Recovery Speed as the Real Metric of Resilience
Defining Mean Time to Recovery (MTTR) vs. Mean Time to Clean Recovery (MTCR)
With MTTR and MTCR both defined, their differences are worth examining in more detail, since those differences become most apparent under attack conditions. The table below shows how MTTR diverges from MTCR depending on the incident type:
| Scenario | MTTR | MTCR |
| --- | --- | --- |
| Hardware failure | Time to restore from backup | Same as MTTR — integrity not in question |
| Accidental deletion | Time to restore affected data | Same as MTTR — source is trusted |
| Ransomware (backups intact) | Time to restore clean systems | Close to MTTR — integrity verifiable |
| Ransomware (backups compromised) | Time to restore systems | Significantly longer — clean restore point must first be identified |
| Targeted attack with long dwell time | Time to restore systems | Potentially much longer — compromise may extend deep into backup history |
How fast, clean recovery reduces regulatory exposure and downtime costs
The cost of an incident grows over time, and recovery speed is one of the biggest factors determining the final figure. Published downtime cost estimates vary significantly by sector, organization size, and methodology – from tens of thousands to several hundred thousand dollars per hour in data-intensive industries (the variation partly reflects how rarely organizations disclose actual costs publicly).
All sources agree that every hour of downtime has a measurable financial cost, and that tested, proven recovery processes also reduce regulatory exposure where timely restoration of data availability is a compliance factor.
Regulatory pressures: EU Cyber Resilience Act and other frameworks
The exact coverage of the EU Cyber Resilience Act (Regulation (EU) 2024/2847) is worth examining in detail.
The CRA entered into force on 10 December 2024, with its main obligations applying from 11 December 2027. It applies specifically to products with digital elements – both hardware and software made available in the EU – with manufacturers responsible for cybersecurity across the entire product lifecycle.
The frameworks more directly relevant to organizational recovery capability are NIS2 (the second Network and Information Security directive), which covers critical sectors and supply chains, and DORA (the Digital Operational Resilience Act), which imposes specific operational resilience and testing requirements on financial entities.
Factors Affecting Recovery Speed
Recovery speed is not just a single isolated variable, but the product of several interconnected factors. In order to improve MTCR in a meaningful way, it’s necessary to understand where bottlenecks are most likely to appear.
Infrastructure and storage performance (SCM, SSD, SAS, Object, Tape)
Maximum restoration speeds are primarily dictated by the throughput capability of the media that the recovered data is written to.
Storage tiering (using high-speed media for mission-critical applications while reserving slower storage for the less time-sensitive data) can be employed to achieve an acceptable restoration speed for key data without incurring the costs of high-performance storage across the board.
Similarly, network bandwidth becomes the bottleneck to restoration if a large dataset is restored across a busy network – even data from high-performance media storage would take longer to recover if it’s bottlenecked by the network infrastructure’s capabilities.
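The interaction between media throughput and network bandwidth can be sketched with simple arithmetic: the slower of the two caps the transfer rate. The figures below are illustrative assumptions, not benchmarks:

```python
def restore_hours(dataset_tb, storage_mbps, network_mbps):
    """Rough lower bound on restore time. The effective rate is the slower
    of the target storage's write throughput and the usable network
    bandwidth (both in MB/s). Real restores are slower still, since this
    ignores catalog lookups, decompression, and verification overhead."""
    effective_mbps = min(storage_mbps, network_mbps)
    total_mb = dataset_tb * 1_000_000  # decimal TB -> MB
    return total_mb / effective_mbps / 3600

# 50 TB restore to an NVMe target (~2000 MB/s) over 10 GbE (~1100 MB/s usable):
print(round(restore_hours(50, 2000, 1100), 1))  # network-bound, ~12.6 hours
# Same target over a congested 1 GbE link (~100 MB/s usable):
print(round(restore_hours(50, 2000, 100), 1))   # ~138.9 hours - days, not hours
```

The point of the exercise: upgrading storage alone does nothing here, because in both cases the network sets the ceiling – which is why RTO planning has to consider the whole restore path.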
Data integrity: ensuring clean backups free of malware
Speed without integrity is worse than useless in a cybersecurity context: restoring quickly from a compromised backup only prolongs the incident.
Effective recovery depends on both integrity verification and malware scanning being part of the backup process, not just a one-time check during the restoration process.
Backups to WORM storage cannot be encrypted or modified by ransomware, even if the backup system itself is under the control of an attacker.
All this, combined with versioned retention, creates a recoverable state that is difficult to infect – provided the retention period reaches back far enough to predate the initial infection.
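Conceptually, finding a trustworthy restore point means walking the versioned retention chain from newest to oldest until a copy passes the integrity check. A minimal sketch of that idea, with a hypothetical chain and check function:

```python
def latest_clean_backup(backups, is_clean):
    """Walk the retention chain from newest to oldest and return the most
    recent backup that passes an integrity/malware check. Returns None if
    the retention window holds no clean copy - i.e. retention is too short
    to reach back before the initial infection."""
    for backup in sorted(backups, key=lambda b: b["day"], reverse=True):
        if is_clean(backup):
            return backup
    return None

# Illustrative chain: daily backups, infection entered the chain on day 5.
chain = [{"day": day, "infected": day >= 5} for day in range(1, 10)]
clean = latest_clean_backup(chain, lambda b: not b["infected"])
print(clean["day"])  # 4 - the newest copy that predates the infection
```

The sketch also shows why the MTCR penalty for compromised backups in the table above is so severe: every rejected candidate adds a full verification pass before restoration can even begin.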
Automation, orchestration and prioritization of restore jobs
Manual recovery processes generate variability that is difficult to manage under pressure. Standardized playbooks can help prioritize critical systems, sequence dependencies correctly, and execute restore jobs in parallel where possible – without requiring a human judgment call at each step of an emergency.
The point here is not to remove human oversight, but to ensure that decisions requiring human judgment are made during planning instead of being improvised on the spot.
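Dependency sequencing, in particular, is a problem best solved during planning. One way to encode it is a topological sort over a declared dependency map, which also reveals which restores can safely run in parallel. The system names below are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each system lists what must be up before it.
deps = {
    "auth":      [],                   # identity comes first - everything needs it
    "database":  ["auth"],
    "app":       ["database", "auth"],
    "reporting": ["database"],         # lower priority, same dependency rules
}

ts = TopologicalSorter(deps)
ts.prepare()
order = []
while ts.is_active():
    # Everything whose dependencies are satisfied can be restored in parallel.
    batch = list(ts.get_ready())
    order.append(sorted(batch))
    ts.done(*batch)

print(order)  # [['auth'], ['database'], ['app', 'reporting']]
```

Encoding the map once, in advance, is precisely the kind of decision that should be made during planning rather than improvised mid-incident.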
Human factors: testing recovery plans and skills regularly
A recovery plan that exists only in documentation is less reliable than one that has actually been executed. Tabletop exercises expose communication and decision-making weaknesses within an organization, while full restoration tests surface technical failures: undocumented dependencies, systems that will not restore cleanly, schedules that will not meet initial expectations.
These tests must occur often enough to keep pace with infrastructure changes, and it’s also important for those tests to mimic real threat scenarios as much as possible instead of focusing solely on hardware failures.
Selecting the Right Metrics and KPIs
Combining RPO, RTO, MTTR and MTCR for a holistic view
No single metric can capture the full picture in this case.
- RPO defines acceptable data loss and informs backup frequency
- RTO sets the restoration target
- MTTR tracks actual performance against that target
- MTCR adds the integrity dimension that matters most in cyber recovery scenarios
When combined, these metrics allow an organization to pinpoint specific weaknesses. For example, an MTTR that meets the RTO alongside a poor MTCR points to backup integrity as the biggest issue, while a consistently missed RTO suggests the problem lies in resources or process.
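That diagnostic logic is simple enough to state as code. The thresholds and triage rules below are a hypothetical sketch of the reasoning, not a standard:

```python
def diagnose(rto_hours, mttr_hours, mtcr_hours):
    """Hypothetical triage rule combining the metrics discussed above:
    - raw restore (MTTR) misses the RTO        -> resource or process gap
    - MTTR fits but clean recovery (MTCR) not  -> backup integrity gap
    - both within target                       -> recovery posture is healthy"""
    if mttr_hours > rto_hours:
        return "resource/process gap: raw restore misses the RTO"
    if mtcr_hours > rto_hours:
        return "integrity gap: restores are fast, but clean recovery is slow"
    return "healthy: clean recovery meets the RTO"

print(diagnose(rto_hours=8, mttr_hours=6, mtcr_hours=30))   # integrity gap
print(diagnose(rto_hours=8, mttr_hours=12, mtcr_hours=14))  # resource/process gap
```

A real program would track these values per system tier rather than globally, since a single organization-wide RTO rarely reflects business reality.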
Aligning metrics with business continuity and compliance objectives
Metrics are at their most useful when they can be tied to outcomes that actually matter to the business. RTOs based on business impact analysis (reflecting the real operational cost of downtime) are more actionable than RTOs set to match vendor defaults or copied from generic frameworks.
Similarly, MTCR targets should reflect the practical integrity requirements of the data in question, along with the regulatory obligations that apply to it.
Why Bacula Excels at Fast, Clean Recovery
The problems outlined above – compromised backups, slow restoration, integrity uncertainty, manual process variability – are exactly the problems that solutions like Bacula Enterprise were built to address. Its architecture reflects the idea that backup cleanliness and recovery performance cannot be treated as separable concerns.
Bacula’s modular architecture and scalability
Bacula’s modular design helps ensure that organizations don’t depend on a single point of failure, even when managing large and distributed environments. The platform consists of three main components – the Director, the Storage Daemon, and the File Daemon – each of which can scale independently based on an organization’s throughput and capacity needs.
This design helps support large and complex environments (including hybrid and multi-cloud deployments) without the prerequisite of a monolithic infrastructure that becomes a single point of failure.
Granular recovery: restoring individual files and systems quickly
Not every issue requires a full system restore. More often than not, restoring only certain files, databases, or services is a faster way of returning to an operational state than restoring entire systems from scratch.
Bacula’s granular recovery allows the system administrator to select exactly which items to restore, limiting both restore time and the risk of reintroducing potentially infected data.
Integration with WORM storage, immutability and malware scanning
Bacula Enterprise allows for the integration with WORM storage devices and immutable backup destinations, reducing the risk of both backup tampering and backup encryption. Its malware scanning capabilities verify backup integrity before a restore is performed, thus mitigating the risk of restoring from a corrupted backup point.
These features directly address the MTCR challenge – helping to verify whether recovery will begin from a trusted backup copy.
Prioritizing restore jobs and automating recovery workflows
Bacula’s scripting and API features can facilitate automated restore workflows and sequenced runbooks. Restore jobs can be prioritized by business importance, with system dependencies managed so that everything comes back online in the correct order. This improves practical MTTR and makes documented RTOs achievable when the need arises.
Strategies to Accelerate Recovery and Improve Resilience
Regularly testing backups and verifying data integrity
A successful backup job does not equal a backup that can be restored cleanly. Integrity verification is all about performing periodic restore testing – not simply checking that the backup process is running, but making sure that the data it produces is recoverable, uncorrupted, and malware-free.
Restore test frequency should reflect two primary factors:
- The criticality of the systems involved
- The pace at which infrastructure changes
Using tiered storage and high‑speed media for critical data
Not every piece of data has to be stored on the fastest medium the company has, but data with short RTOs certainly should be. A tiered approach – high-performance, high-throughput media for the applications that demand it, with less critical data placed on slower, cheaper storage – helps organizations optimize recovery speed where it matters most without facing the cost of high-performance storage across the board.
Automating incident response and disaster recovery playbooks
Recovery playbooks that have to be assembled under incident conditions are a lot less reliable than those that were created and tested in advance. Automation as a feature helps reduce the dependence on real-time judgment for decisions that can be pre-defined – be it system restore order, dependency sequencing, or parallel job executions. Automation also results in more predictable outcomes, making post-incident review and improvement significantly more useful.
Measuring and improving MTTR and MTCR over time
Resilience improves when it’s measured in a consistent manner. Monitoring MTTR and MTCR across both tests and live incidents (instead of treating each exercise as a one-off event) allows companies to figure out where time is being lost – be it in detection, backup integrity checking, restore sequencing, or human coordination.
That data is what helps turn recovery planning from a compliance exercise into a useful programme with measurable outcomes.
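As a trivial illustration of that kind of tracking, with hypothetical quarterly figures (hours from incident confirmation to a verified clean state):

```python
from statistics import mean

# Illustrative log of quarterly recovery exercises and live incidents.
# Values are MTCR in hours; all numbers are invented for the example.
mtcr_log = {"Q1": 42.0, "Q2": 35.0, "Q3": 21.0, "Q4": 18.0}

values = list(mtcr_log.values())
avg = mean(values)
improvement = values[0] - values[-1]  # positive means recovery got faster

print(f"average MTCR: {avg:.1f}h, year-over-year improvement: {improvement:.0f}h")
```

Even a record this simple answers the question a compliance report cannot: whether the recovery programme is actually getting faster over time.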
Conclusion: Adopting a Recovery‑First Mindset
Summarizing why recovery speed defines cyber resilience
While prevention and detection are needed, the speed and cleanliness of recovery both dictate the true cost of an incident. MTCR – time to a verified, non-infected, working state – is a much more honest indicator of resilience than security posture metrics alone, and it’s also the most controllable metric within an organization’s reach during an attack.
Encouraging organizations to evaluate and improve their recovery metrics
An organization cannot have an accurate picture of its actual MTCR unless it has recently tested its recovery capabilities against realistic scenarios, such as compromised backup chains or extended dwell time.
Bacula Enterprise offers the architecture, integrity controls, and automation capabilities needed to meaningfully reduce that gap even in the most complex, large-scale environments, while also helping develop a recovery posture that can be demonstrated instead of simply being assumed.
Frequently Asked Questions
Is recovery speed more important than breach prevention?
The two are not mutually exclusive. Prevention minimizes the risk of incidents occurring; strong recovery capability minimizes the impact when one does occur. The practical case for giving recovery more emphasis than it usually gets is that prevention has a ceiling – complex attacks will, at some stage, succeed against even the most robust of targets – while recovery capability directly determines how much an incident ultimately costs.
How do cyber insurance providers evaluate recovery capabilities?
Underwriters have become more rigorous in this area of late. Most now explicitly ask about backup frequency, the availability of offsite and immutable copies, how often the recovery process is tested, and whether backups are isolated from the production network. Organizations with documented, regularly tested recovery processes and verifiable clean backup chains tend to receive more favorable terms than those whose backup strategy exists primarily on paper.
What recovery metrics do regulators and auditors actually care about?
While regulatory scope differs between frameworks and sectors, documented RTO and RPO commitments – and evidence that they are achievable in practice – apply almost universally. GDPR and comparable data protection laws specifically require the ability to restore access to personal data within an acceptable timeframe after a breach, while DORA sets out testing specifics for financial entities. Auditors increasingly want to see test results, not just documented targets.