Hadoop HDFS Backup and Disaster Recovery Strategies & Guide


Updated 22nd December 2023, Rob Morrison

Contents

Definition of Hadoop
HBase
Hadoop and data security
Hadoop misconceptions when it comes to data protection
What is expected from a modern Hadoop data protection solution?
Built-in Hadoop backup tools and measures
Methodology for figuring out the best Hadoop HDFS backup solution
Third-party Hadoop backup solutions
Hadoop HDFS backups and Bacula Enterprise
Conclusion

Definition of Hadoop

Hadoop was originally created to work with massive data sets, something that is commonly referred to as “big data”. It is an open-source software framework that is capable of both storing and processing massive data volumes. Designed with extensive scalability in mind, Hadoop delivers high availability, fault tolerance, and the capability to manage petabytes of data.

Hadoop consists of four main components:

Yet Another Resource Negotiator, or YARN. It is a resource management framework that oversees the allocation of computational resources (CPU, memory, and storage) to applications running on the Hadoop cluster. It simplifies resource allocation and scheduling, making Hadoop application management and scaling more manageable.
MapReduce. It is a programming model which facilitates the processing of large datasets on distributed clusters. The data in question is split into smaller chunks, which are then processed by multiple cluster nodes at the same time. The processing results are then combined to generate the final output.
ZooKeeper. It is a distributed coordination service that provides a centralized registry for naming, configuration, and synchronization across Hadoop cluster nodes. Strictly speaking, ZooKeeper is a separate Apache project rather than a core Hadoop module, but it is commonly deployed alongside Hadoop to keep the cluster state consistent across nodes.
Hadoop Distributed File System, or HDFS. As its name suggests, it is a file system designed specifically to work with large data volumes spread across clusters and nodes. It partitions data across multiple nodes, replicating data blocks for fault tolerance.

Hadoop’s scalability and flexibility make it a compelling choice for organizations that handle massive amounts of data. Its ability to store, process, and analyze large datasets efficiently has made it a cornerstone of modern data infrastructure. Hadoop has plenty of potential use cases, including, but not limited to:

Facilitating the creation of data warehouses for storage and analysis of massive data volumes in a specific structure.
Extracting insights, trends, and patterns from the analysis of these data volumes.
Generating data lakes – repositories for massive amounts of unprocessed data.
Enabling the training and deployment of machine learning models on large datasets.
Collecting, analyzing, and storing large volumes of logs from websites, applications, and servers.

HBase

Hadoop is a capable framework, but on its own it is mainly concerned with storing data. Most users resort to HBase in order to interact with all that data in a meaningful fashion. Apache HBase is a distributed NoSQL database that was created to work with massive data sets, just like Hadoop. HBase is an open-source offering that integrates with Apache Hadoop and its entire ecosystem. It can work with both EMRFS (Amazon EMR’s implementation of the Hadoop file system on top of Amazon S3) and HDFS (Hadoop’s own file system).

Apache Phoenix can be layered on top of HBase to allow SQL-like queries against HBase tables, and HBase also integrates with the MapReduce framework. It is scalable, fast, and fault-tolerant, with a structure that closely follows Hadoop’s approach: data is spread across multiple hosts in a cluster, so that no single host failure can bring down the entire system.

The introduction to HBase is important in this context since Hadoop and HBase are often used in tandem for many different use cases and project types.

At the same time, HBase backup and restore methods are different from what Hadoop uses, a point we return to later in this article.

Hadoop and data security

Hadoop has seen a renewed wave of popularity in recent years, especially with the rise of AI and ML in the form of chatbots and large language models (LLMs) such as ChatGPT, which are trained on massive data pools.

At the same time, Hadoop security as a whole has been rather problematic for a while now. There are several reasons for that, including the average data size (often measured in petabytes or exabytes), the solution’s overall scalability (making it practically impossible to implement something that would work for any data form and size), as well as the included data replication feature.

Data replication is Hadoop’s original alternative to data backups – it creates 3 copies of each data block by default, which makes some users think that there is no need for a backup solution in the first place. What this approach typically lacks is the understanding that Hadoop’s capabilities only work for traditional unstructured data pools in warehouses and such.

So, when it comes to ML models, IoT data, social media data, and other data types that differ from the usual data lakes Hadoop is known for – it may offer little protection for that data, creating a massive security issue for its users.

There is also the problem of accessibility – data replicated by Hadoop is not stored separately from the original, making it just as vulnerable to issues and data breaches as the original data set. As such, there is a demand for Hadoop backup measures – both built-in and third-party ones.

However, before we move on to Hadoop backups specifically, it is important to talk a bit more about Hadoop in the context of data protection.

Hadoop misconceptions when it comes to data protection

The widespread adoption of Hadoop within enterprises has led to the proliferation of hastily implemented, basic Hadoop backup and recovery mechanisms. These rudimentary solutions, often bundled with Hadoop distributions or pieced together by internal development teams, may appear functional at first glance, but they pose significant risks to data integrity and organizational resilience, especially as systems grow in size and complexity.

Any resulting downtime or data loss due to failed recoveries during a disaster can have severe repercussions for businesses, tarnishing reputations, escalating costs, and hindering time-to-market efforts. Most of the disadvantages of such an approach can be explained by looking into some of the biggest misconceptions about Hadoop in terms of data protection.

Misconception #1 – Using HDFS snapshots is a viable data protection strategy

The Hadoop Distributed File System uses snapshots to generate point-in-time copies of either the entire file system or a directory subtree. There are plenty of limitations to this approach to data protection:

Recovering data from HDFS snapshots is a cumbersome process, requiring manual file location, schema reconstruction, and data file recovery.
HDFS snapshots operate at the file level, rendering them ineffective for databases like Hive and HBase, as the associated schema definitions are not captured in the backups.
While it is possible to perform and store multiple snapshots of the system, every single snapshot increases the cluster’s overall requirements in terms of storage, which may turn out to be a massive problem down the line.
Since snapshots reside on the same nodes as the data they protect, a node or disk failure can result in the loss of both snapshots and the protected data.
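
To illustrate how manual this kind of recovery is, the sketch below prints the copy command needed to bring a single file back from a snapshot. It assumes the standard HDFS layout in which snapshots are exposed under a `.snapshot` directory inside the snapshottable directory. The function name `snapshot_restore_cmd`, the `/data/warehouse` directory, and the file path are hypothetical, and the command is only printed, not executed.

```shell
# Print the HDFS command that copies one file out of a snapshot
# back into the live directory.
#   $1 = snapshottable directory
#   $2 = snapshot name
#   $3 = file path relative to that directory
snapshot_restore_cmd() {
  local dir=${1%/}    # drop a trailing slash, if any
  local snap=$2
  local rel=${3#/}    # drop a leading slash, if any
  # -ptopax asks "hdfs dfs -cp" to preserve timestamps, ownership,
  # permissions, ACLs, and extended attributes.
  printf 'hdfs dfs -cp -ptopax %s/.snapshot/%s/%s %s/%s\n' \
    "$dir" "$snap" "$rel" "$dir" "$rel"
}

# Hypothetical example: restore one data file from the snapshot2023 snapshot.
snapshot_restore_cmd /data/warehouse snapshot2023 sales/part-00000
```

Even in this simple case the operator has to know the directory, the snapshot name, and the exact file path in advance, and any Hive or HBase schema still has to be restored separately.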

Misconception #2 – Hadoop commercial distributions offer sufficient backup capabilities

Commercial Hadoop distributions often include integrated backup tools. These tools, while offering a basic level of backup functionality, may not align with an organization’s stringent RPOs and RTOs. Essentially, these tools act as a user interface for HDFS snapshots, inheriting all the limitations associated with HDFS snapshots discussed previously. Moreover, these tools generally lack user-friendly recovery mechanisms, leaving data recovery a manual and error-prone process.

Misconception #3 – File System Replicas are a sufficient data protection measure for Hadoop

While replicas effectively safeguard data against hardware failures, such as node outages or disk drive malfunctions, they fall short in protecting against more prevalent scenarios involving data corruption. User errors, such as accidental table deletion in Hive, and application bugs can lead to data corruption, rendering replicas ineffective in restoring data integrity.

Misconception #4 – Custom scripts for Hadoop are suitable for long-term backup and recovery tasks

In-house development teams within many organizations often resort to developing custom scripts for backing up their Hive and HBase databases, as well as HDFS files. This approach typically involves dedicating several person-months to writing and testing scripts to ensure their functionality under all scenarios.

Unfortunately, this approach as a whole is extremely difficult to maintain, since custom scripts have to be updated and revised on a regular basis – be it because of Hadoop’s updates or some other reason. Similar to snapshots, scripts primarily focus on data replication and lack automated recovery mechanisms. As a result, data recovery remains a manual and error-prone process.

Furthermore, the absence of regular testing can lead to data loss, especially when the team responsible for script development is no longer available.

What is expected from a modern Hadoop data protection solution?

Data recovery strategies are something that every Hadoop-based environment would have to think about sooner or later. A comprehensive and well-defined Hadoop backup and recovery strategy is essential to ensure reliable and swift data recovery while minimizing the burden on engineering and development resources.

A modern Hadoop data protection solution should be able to balance between complex custom scripting and sophisticated data backup capabilities. It should operate autonomously, eliminating the need for dedicated resources and requiring minimal Hadoop expertise. Additionally, it should be exceptionally reliable and scalable to effectively manage petabytes of data, meeting stringent internal compliance requirements for recovery point objectives and recovery time objectives.
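
A rough way to reason about a recovery time objective is to divide the amount of data to restore by the sustained restore throughput. The sketch below does that arithmetic with integer terabytes and rounds up to whole hours. The function names and the example figures are assumptions chosen for illustration, and real restores are also affected by network, metadata, and validation overhead.

```shell
# Hours needed to restore data_tb terabytes at rate_tb_per_hour,
# rounded up to the next whole hour.
restore_hours() {
  local data_tb=$1
  local rate_tb_per_hour=$2
  echo $(( (data_tb + rate_tb_per_hour - 1) / rate_tb_per_hour ))
}

# Succeeds (exit status 0) if the estimated restore time fits in the RTO.
#   $1 = data size in TB, $2 = throughput in TB/hour, $3 = RTO in hours
meets_rto() {
  [ "$(restore_hours "$1" "$2")" -le "$3" ]
}

# Hypothetical example: 1000 TB restored at 68 TB/hour.
restore_hours 1000 68    # prints 15
```

With these numbers, a 24-hour RTO would be met, while an 8-hour RTO would not.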

Furthermore, the solution in question should provide comprehensive ransomware protection, ensuring data integrity in the face of malicious attacks. Cloud storage integration is another crucial feature, enabling cost optimization and flexible data storage. The solution should also preserve multiple point-in-time copies of data for granular recovery, ensuring the availability of historical data when needed.

Moreover, modern Hadoop backup and recovery software has to prioritize recovery efficiency, employing intelligent data awareness to deduplicate big data formats and streamline recovery processes. By leveraging advanced technologies and automation, such a solution can safeguard critical data assets and minimize the impact of data loss or corruption.

Built-in Hadoop backup tools and measures

As we have mentioned before, Hadoop does not offer any way to perform a “traditional” data backup, for a number of reasons. One of the biggest reasons for that is the sheer amount of data Hadoop usually operates with: petabytes and exabytes of largely unstructured information distributed across many nodes.

Fortunately, that is not to say that Hadoop is completely defenseless. Its own data structure with 3x replication by default makes it relatively safe against small parts of the cluster being out of commission – since the data itself is stored in multiple locations at the same time.
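
The storage cost of replication is simple multiplication: every logical terabyte occupies as many raw terabytes as the replication factor. The sketch below assumes integer terabytes and the default factor of 3; the function name `raw_capacity_tb` is hypothetical.

```shell
# Raw disk space consumed in HDFS for a given logical data size.
#   $1 = logical data size in TB
#   $2 = replication factor (defaults to 3, the HDFS default)
raw_capacity_tb() {
  local logical_tb=$1
  local replication=${2:-3}
  echo $(( logical_tb * replication ))
}

# Example: 500 TB of data at the default replication factor.
raw_capacity_tb 500    # prints 1500
```

All of those copies live inside the same cluster, which is why the extra capacity buys fault tolerance rather than a backup.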

The aforementioned data replication is one of the biggest reasons why not all of the Hadoop users bother with backup measures in the first place – while completely forgetting that replication on its own cannot protect from cluster loss or other large-scale issues like natural disasters.

DistCp

Speaking of data replication, there is also a manual data replication tool that plenty of Hadoop users work with – DistCp, or Distributed Copy. It is a relatively simple CLI tool that offers the ability to replicate data from one cluster to another, creating a “backup” of sorts that acts as one more safeguard against potential data loss.

DistCp can be used to perform cluster copying with a relatively simple command:

hadoop distcp hdfs://fns1:8020/path/loc hdfs://fns2:8020/loc/path

This command expands the namespace under /path/loc on the fns1 NameNode into a temporary file. The directory’s contents are then divided among a set of map tasks, which copy the data to the /loc/path location on the fns2 cluster.
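
For repeated copies to the same target, DistCp is often run with a few extra options. The sketch below only prints such a command. `-update`, `-p`, and `-m` are standard DistCp options, while the function name `distcp_backup_cmd` and the default of 20 map tasks are illustrative assumptions.

```shell
# Print a DistCp command for a repeatable cluster-to-cluster copy.
#   -update  copies only files that are missing or differ on the target
#   -pugp    preserves user, group, and permission attributes
#   -m       caps the number of simultaneous map tasks
#   $1 = source URI, $2 = target URI, $3 = map task limit (optional)
distcp_backup_cmd() {
  local src=$1
  local dst=$2
  local maps=${3:-20}   # illustrative value, tune for the cluster
  printf 'hadoop distcp -update -pugp -m %s %s %s\n' "$maps" "$src" "$dst"
}

# Same source and target as in the command above.
distcp_backup_cmd hdfs://fns1:8020/path/loc hdfs://fns2:8020/loc/path
```

Note that `-update` changes how the source directory is mapped onto the target path, so it is worth trying the command on a non-critical directory first.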

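It should be noted that there are two commonly used versions of DistCp out there: the original/legacy version and the “second” version called DistCp2. In current Hadoop releases, DistCp2 is what runs when hadoop distcp is invoked; the separate distcp2 command name dates back to Hadoop 1.x. There are two large differences between these tool versions: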

The legacy version of DistCp was not capable of creating empty root directories in the target folder, but DistCp2 can do that just fine.
The legacy version of DistCp did not update any file attributes of the files that were skipped during the copying process – that is not the case with DistCp2, since it would update all of the values such as permissions and owner group information even if the file in question was not copied.

HDFS Snapshots

The alternative to data replication for Hadoop when it comes to built-in measures is snapshotting. HDFS Snapshots are point-in-time copies of data with a read-only status that are fast and efficient – but not without their own caveats.

Snapshot creation is instant and does not affect regular operations of HDFS – since the reverse chronological order is used to record data modifications. Snapshots themselves only require additional memory when there are modifications that are made relative to a snapshot. Additionally, the Snapshot function does not copy blocks in data nodes – the only data that gets recorded is the file size and the block list.

There are a few basic commands that are associated with HDFS Snapshot creation, including:

HDFS Snapshot creation

hdfs dfs -createSnapshot hdfs://fns1:8020/path/loc

This command also accepts an optional custom name for the snapshot as a second argument. If no name is supplied, a default name based on the creation timestamp is used.

HDFS Snapshot deletion

hdfs dfs -deleteSnapshot hdfs://fns1:8020/path/loc snapshot2023

Unlike the previous command, the snapshot name is a non-optional argument in this case.

Permitting the creation of a Snapshot for a directory

hdfs dfsadmin -allowSnapshot hdfs://fns1:8020/path/loc

Disallowing the creation of a Snapshot for a directory

hdfs dfsadmin -disallowSnapshot hdfs://fns1:8020/path/loc
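
Put together, a basic snapshot routine could look like the sketch below. It prints commands rather than running them. Note that `-allowSnapshot` is an administrator operation issued through `hdfs dfsadmin`, and it has to be applied to a directory before the first snapshot of that directory can be created. The `backup-YYYYMMDD` naming scheme, the function names, and the retention logic are assumptions made for illustration.

```shell
# Build a date-stamped snapshot name such as backup-20231222.
#   $1 = date as YYYYMMDD (optional, defaults to today)
snapshot_name() {
  printf 'backup-%s\n' "${1:-$(date +%Y%m%d)}"
}

# Print the commands for one snapshot cycle on a directory.
#   $1 = directory, $2 = date as YYYYMMDD (optional)
snapshot_cycle_cmds() {
  local dir=$1
  local name
  name=$(snapshot_name "${2:-}")
  echo "hdfs dfsadmin -allowSnapshot $dir"
  echo "hdfs dfs -createSnapshot $dir $name"
  echo "hdfs lsSnapshottableDir"
}

# Read snapshot names (one per line, named as above) on stdin and print
# every name beyond the newest $1, i.e. the candidates for -deleteSnapshot.
snapshots_to_prune() {
  local keep=$1
  sort -r | tail -n +"$(( keep + 1 ))"
}

snapshot_cycle_cmds hdfs://fns1:8020/path/loc
```

Because the names sort chronologically, piping a list of existing snapshot names into `snapshots_to_prune 7` would list everything except the seven most recent ones.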

Of course, there are other approaches that can be used to safeguard Hadoop’s data in one way or another, such as dual load – the data management approach that loads all information to two different clusters at the same time. However, such approaches are often extremely nuanced and require extensive knowledge on the subject (as well as plenty of resources) to perform properly.

It should also be noted that HBase backup and restore operations are not identical to the Hadoop backup measures mentioned in this article, even though HBase itself runs on top of HDFS (part of Hadoop). HBase uses different CLI commands, a different approach to backup creation, and more.
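
As a brief illustration of how different the HBase side looks, the sketch below prints two commands: one that takes a table snapshot through the HBase shell and one that exports it to another cluster with the ExportSnapshot tool. Both are standard HBase features, but the function name, table name, snapshot name, and target URI are hypothetical, and nothing is executed.

```shell
# Print the commands for snapshotting an HBase table and exporting
# the snapshot to another cluster.
#   $1 = table name, $2 = snapshot name, $3 = target HBase root URI
hbase_snapshot_cmds() {
  local table=$1
  local snap=$2
  local target=$3
  # Step 1: take the snapshot via the non-interactive HBase shell.
  echo "echo \"snapshot '$table', '$snap'\" | hbase shell -n"
  # Step 2: copy the snapshot to the target cluster.
  echo "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $snap -copy-to $target"
}

# Hypothetical table and target cluster.
hbase_snapshot_cmds orders orders-20231222 hdfs://fns2:8020/hbase
```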

Methodology for figuring out the best Hadoop HDFS backup solution

Third-party backup solution providers can offer quite a lot in terms of Hadoop data backup. There are multiple different backup solutions that offer HDFS backup support in some way or another, but choosing one solution can be rather tricky. To make the comparison easier, we evaluate every solution against the same set of factors.

Customer ratings

Customer ratings represent the average opinion about the subject matter, which is a backup solution in our case. We have used sites such as Capterra, TrustRadius, and G2 to gather this kind of information.

Capterra is a review aggregator platform that uses thorough checks on all of its customers to ensure review authenticity. It does not allow for vendors to remove customer reviews whatsoever. The overall review count for Capterra is over 2 million now, with almost a thousand different categories to choose from.

TrustRadius is a review platform that uses extensive multi-step processes to make sure that each review is authentic and real, and there is also a separate in-house Research Team to go through reviews for them to be detailed and thorough. The platform does not allow any kind of tampering with user reviews from the vendor side.

G2 is a notable review platform with over 2.4 million reviews to date. It has a library of more than 100,000 vendors to choose from, and its own review validation system to make sure that every review is real and genuine. G2 also has a number of other services to choose from, including tracking, investing, marketing, and more.

Key features and advantages/disadvantages

This is a rather complex category, including both the features and the advantages/disadvantages of the solution. In a sense, they are relatively similar, with some of the more prominent key features of an average Hadoop HDFS backup being:

Extensive scalability due to the sheer amount of data Hadoop deployments are dealing with.
High performance of backup/restore operations to ensure fast backups and quick recoveries, when necessary.
Flexibility in terms of data types that can be backed up, be it HDFS files and directories, Hive tables, HBase tables, etc.
Snapshot consistency should always be present in a Hadoop solution to ensure minimal data loss risk and easier recovery operations down the road.
Detailed analytics are recommended, since they can greatly simplify the overall backup management task by providing useful insights and other kinds of data.

Pricing

Price is one of the most important factors of a backup solution – or any kind of product or service. When it comes to backup solutions specifically (especially Hadoop HDFS backup solutions) – the price can easily be the deciding factor for a variety of companies. The result depends a lot on the current needs of a customer, as well as plenty of other internal factors. It is highly recommended to always compare the price of the solution with its feature set to ensure the best value for money for your company.

A personal opinion of the author

A completely subjective part of the methodology – the opinion of the author about the topic (Hadoop HDFS backups). This category may include practically anything, from the author’s personal opinion about the subject at hand to some information that may not have been suitable to mention in other parts of the methodology.

Third-party Hadoop backup solutions

There are multiple possible third-party backup options for the Hadoop user, including both popular and lesser-known backup solutions.

Commvault

Commvault attempts to completely change the current field of data management by not requiring any form of on-site administration in order to control the entire data protection system. It operates as a centralized platform with both physical and virtual backups, offering the ability to manage every single aspect of the system from a single location. All of Commvault’s capabilities are packed in an accessible and user-friendly interface with no unnecessary complexity whatsoever.

Support for Hadoop data backups is one of many different capabilities that Commvault can offer. Both HDFS and HBase backup and restore capabilities are included in the overall package – with three different backup types (incremental, full, synthetic full), backup scheduling capabilities, granular data restoration, multiple restoration targets, and so on.

Customer ratings:

Capterra – 4.8/5 points with 11 customer reviews
TrustRadius – 8.0/10 points with 217 customer reviews
G2 – 4.2/5 points with 112 customer reviews

Advantages:

Commvault prioritizes user convenience, ensuring that routine configuration tasks are effortless to execute. This intuitive approach minimizes training requirements and maximizes productivity, fostering a smooth user experience.
Commvault’s scalability extends beyond vertical growth; it seamlessly scales horizontally to meet evolving demands by leveraging diverse integrations and supporting a wide range of storage types.
Commvault’s scalability is fairly good; it adapts well to some intricate and advanced IT infrastructures, providing comprehensive data protection for organizations of all sizes. It can work with some big data frameworks such as Hadoop.

Shortcomings:

Detailed reporting seems to be a rather common challenge for many enterprise data backup solutions – including Commvault. Despite specific integrations offering enhanced reporting, overall reporting shortcomings are evident across the board.
While Commvault boasts extensive support for containers, hypervisors, and databases, it’s crucial to acknowledge that universal compatibility remains elusive. A comprehensive evaluation of supported systems is advised prior to adoption.
Cost considerations are particularly pertinent for small and medium-sized businesses, as Commvault’s pricing often exceeds market averages, potentially straining budgets. A thoughtful assessment of financial implications is essential before investing in Commvault.

Pricing (at time of writing):

There is no official pricing information that can be found on Commvault’s website.
However, unofficial sources suggest pricing of $3,400 to $8,781 per month for a single hardware appliance.

My personal opinion on Commvault:

Commvault’s versatility shines through, with its support for a diverse array of storage solutions, spanning physical and cloud environments. Whether your data resides in traditional on-premises infrastructure or the elastic expanses of the cloud, Commvault ensures protection and accessibility. Its versatility is impressive, with the capability to create HDFS backups in multiple ways, making it a great contender for this list of Hadoop backup and recovery solutions.

NetApp

NetApp’s global reach, spanning over 150 offices worldwide, ensures readily accessible local support, providing prompt assistance whenever and wherever it’s needed. This extensive network of support centers underscores NetApp’s commitment to customer satisfaction. A centralized interface serves as the nerve center of NetApp’s data protection prowess, providing a unified platform for monitoring, scheduling, and logging your backup and recovery operations.

NetApp’s versatility shines through its support for a wide spectrum of data types, encompassing applications, databases, MS Exchange servers, virtual machines, and even data management frameworks such as Hadoop. For Hadoop, NetApp relies on the aforementioned DistCp: an NFS share on NetApp storage is set up as the backup target for a DistCp job, which runs on MapReduce, so that NetApp acts in a way similar to an NFS driver.

Customer ratings:

Capterra – 4.5/5 points with 8 reviews
TrustRadius – 9.2/10 points with 2 reviews
G2 – 3.8/5 points with 2 reviews

Advantages:

A substantial portion of the cloning process is automated, making it remarkably user-friendly with minimal complex settings or menus to navigate – and the same could be said for the rest of the solution, as well.
The solution’s remote backup capabilities are particularly noteworthy, potentially enabling a seamless data protection strategy.
The support for HDFS backup and restore tasks is realized through integration with DistCp – setting up a Network File System from NetApp as a destination for a DistCp backup task.

Shortcomings:

Despite its strengths, the solution can be marred by a notable number of bugs that can hinder its overall performance.
The solution lacks remote restoration capabilities for Linux servers, a significant drawback for some users.
Additionally, customer support is somewhat limited, leaving users to rely more heavily on self-service resources.

Pricing (at time of writing):

NetApp solutions tend to vary drastically in price and capabilities.
To obtain any kind of pricing information, potential customers must contact NetApp directly to initiate a free trial or demo.
Unofficial sources suggest that NetApp SnapCenter’s annual subscription fee starts at $1,410.

My personal opinion on NetApp:

NetApp can offer centralized backup management, a multitude of scheduling options, extensive backup-oriented features, and the capability to work with plenty of storage types. Backups generated with the solution are readily accessible from virtually any device equipped with a web browser, including laptops and mobile phones. NetApp stands out among its competitors by providing a global network of offices, which will likely help towards localized support for businesses in specific regions. It’s important to acknowledge that there was no single solution chosen as a description for NetApp’s Hadoop backup capabilities, since this particular feature utilizes a number of NetApp’s technologies that are not all bound to a single solution.

Veritas NetBackup

A stalwart in the realm of data protection, Veritas stands as a venerable entity with a rich legacy in the backup and recovery industry. Veritas can offer information governance, multi-cloud data management, backup and recovery solutions, and more. Furthermore, its flexible deployment model empowers clients to tailor their data protection strategies to their unique requirements. Veritas can offer a choice between a hardware appliance for seamless integration or software deployable on a client’s own hardware for maximum flexibility and control.

Veritas NetBackup can also offer Hadoop backup operations with its agentless plugin that can offer a multitude of features. This plugin offers both full and incremental backups, allowing for point-in-time data copies to be created at a moment’s notice. There are very few limitations when it comes to restoring said data, as well – an administrator is able to choose the restoration location, and the plugin also supports granular restore if necessary.

Customer ratings:

Capterra – 4.1/5 points with 8 reviews
TrustRadius – 6.3/10 points with 159 reviews
G2 – 4.1/5 points with 234 reviews

Advantages:

The overall number of features that Veritas can offer is strong in comparison to other vendors in the backup and recovery market.
Users commend the solution’s user-friendly interface, which effectively presents its comprehensive feature set without hindering accessibility.
Veritas’s customer support service fares reasonably well in its efficiency and responsiveness.
The overall versatility of the solution is another praise-worthy argument, with the software being capable of working with all kinds of environment types, including Hadoop (via a separate plugin for NetBackup).

Shortcomings:

Despite being an enterprise-class solution, Veritas falls short in certain areas regarding automation capabilities.
Moreover, its pricing can be considered expensive compared to some of its competitors.
There is no way to save backup reports in a custom location, and the overall reporting capability of Veritas is rather rigid.
The integration of tape library features is hindered by existing unresolved issues.

Pricing (at time of writing):

Veritas intentionally omits specific pricing information from its official website, opting instead for a personalized approach.
Potential customers must engage directly with Veritas to obtain pricing details that align with their specific requirements and deployment needs.
This individualized strategy allows Veritas to carefully curate its offerings, ensuring a perfect fit for each customer’s unique circumstances and preferences.

My personal opinion on Veritas:

Veritas stands as a venerable and trustworthy powerhouse in the realm of data management and backup solutions. With a proven track record spanning over several decades, Veritas has garnered widespread acclaim as a preferred backup vendor, particularly among industries that place high value on a company’s rich history and comprehensive portfolio. Renowned for its performance, Veritas offers a diverse array of backup solutions and features, complemented by a user interface that caters to a broad spectrum of users. It can even support complex structures such as Hadoop, including SSL support and Kerberos Authentication support.

Dell PowerProtect DD

PowerProtect DD stands as a comprehensive data protection and storage solution, encompassing backup, disaster recovery, and data deduplication capabilities. Its modular design caters to organizations of all sizes, making it a solution suitable for a wide variety of use cases. There are appliances for all business types available, from entry-level companies to large enterprises, boasting up to 150 Petabytes of logical capacity and a throughput of roughly 68 Terabytes an hour.

PowerProtect DD integrates seamlessly with Hadoop environments through a dedicated driver, DDHCFS, offering comprehensive data protection and a host of other advantages. The solution itself requires little to no prior configuration, and it uses a combination of its own technology (DD Boost, for faster data transfer) and Hadoop’s data replication/snapshotting capabilities in order to create and transfer backups to be stored in the PowerProtect DD appliance.

Customer ratings:

TrustRadius – 8.0/10 points with 44 customer reviews

Advantages:

Some customers praise the reliability of the appliance that can operate 24/7 and be accessible at all times.
The first-time installation process seems to be relatively simple.
There are plenty of different frameworks and storage types that are supported – some even have dedicated drivers, such as Hadoop, offering plenty of features to choose from, combined with effortless configuration.

Shortcomings:

Most of the offerings seem to be rather expensive when compared with an average market price.
Data restoration speed from an actual appliance seems to be relatively slow. This could become untenable for large data sets.
While the hardware management solution operates within acceptable limits, it does seem to be somewhat simplistic in its structure.

Pricing:

There is no official pricing information for most Dell EMC products on the official website, and PowerProtect DD appliances are no exception.

My personal opinion on Dell:

PowerProtect DD is slightly different from the rest of the third-party options, mostly because it is a physical piece of hardware instead of software or a platform. It is a comprehensive data protection and storage solution encompassing backup, disaster recovery, and data deduplication capabilities. It can work with both large enterprises and small companies, if necessary. It even has a dedicated driver for Hadoop disaster recovery tasks called DDHCFS (DD Hadoop Compatible File System), offering comprehensive data protection along with plenty of other advantages.

Cloudera

Cloudera is an American software company that specializes in enterprise data management and analytics. Their flagship platform is the only cloud-native platform specifically designed to operate seamlessly across all major public cloud providers and on-premises private cloud environments. Cloudera’s platform is built for enterprises that are looking into different ways of managing their massive data pools, generating insights and making informed decisions afterwards.

This management platform is not focused on backup and recovery, and it does not offer a traditional backup solution. However, Hadoop is the core framework for Cloudera as a whole, which is why the platform can offer some HDFS disaster recovery capabilities by replicating data from one cluster to another. Cloudera’s backup capabilities are not particularly comprehensive on their own, but they do add a number of useful features on top of basic DistCp-like functionality, such as scheduling and data verification. Setting up replication is a rather complex process in itself, but Cloudera provides a step-by-step guide on this exact topic, making it a lot easier to perform.
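The basic DistCp-style copy that this kind of replication builds on can be sketched as follows. This is a minimal illustration and not Cloudera's own tooling: the NameNode addresses and paths are hypothetical, and the function only assembles a `hadoop distcp` command line (using the standard `-update`, `-delete`, `-p`, and `-m` options) that a scheduler such as cron could then run.

```python
from typing import List


def build_distcp_command(
    source: str,
    target: str,
    delete_missing: bool = False,
    preserve: str = "",
    max_maps: int = 0,
) -> List[str]:
    """Assemble a `hadoop distcp` command line for cluster-to-cluster copying.

    The function only builds the argument list; it does not execute anything.
    """
    # -update skips files that already match on the target.
    cmd = ["hadoop", "distcp", "-update"]
    if delete_missing:
        # -delete removes target files that no longer exist on the source.
        cmd.append("-delete")
    if preserve:
        # -p<flags> preserves attributes, e.g. "ugp" = user, group, permission.
        cmd.append("-p" + preserve)
    if max_maps > 0:
        # -m caps the number of parallel map tasks used for the copy.
        cmd += ["-m", str(max_maps)]
    return cmd + [source, target]


if __name__ == "__main__":
    # Hypothetical NameNode addresses and paths, for illustration only.
    command = build_distcp_command(
        "hdfs://nn-primary:8020/data/warehouse",
        "hdfs://nn-dr:8020/backup/warehouse",
        delete_missing=True,
        preserve="ugp",
        max_maps=20,
    )
    print(" ".join(command))
```

Building the command as a list rather than a single string keeps it safe to pass to `subprocess.run` without shell quoting problems. Scheduling and verification, which Cloudera adds on top, are deliberately left out of this sketch.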

Customer ratings:

G2 – 4.0/5 points with 38 customer reviews

Advantages:

The customer support is fast and efficient, offering extensive knowledge about the solution’s capabilities.
A sizable community around the solution makes it easier to find answers for various questions online, including some of the more unconventional capabilities of the software.
The solution can scale extremely well, making it applicable for small-scale businesses, large enterprises, and everything in-between.

Shortcomings:

The overall cost of the solution is rather high, and the cheapest possible offering is still considered quite expensive for most small businesses.
The solution’s documentation is rather lackluster, leaving a lot of topics and functions unexplained for the average user.
The solution’s user interface does not receive a lot of praise; plenty of users consider it rigid and unresponsive.

Pricing:

There is no official pricing information available on the Cloudera website.
Only contact information and a demo request form are publicly available.

My personal opinion on Cloudera:

Technically speaking, Cloudera is not a backup solution in itself – it is an enterprise data management platform. However, the platform in question uses Hadoop as its main framework, and data retention capabilities are included in the package, even though they mostly mirror the capabilities of DistCp. Luckily, Cloudera can create data replication schedules, and even data restoration schedules for potentially problematic data-related events in the future. Nevertheless, by itself it lacks many features, which makes true backup and recovery operations limited at best and can lead to business continuity, compliance, and operational efficiency difficulties in some organizations.

Hadoop HDFS backups and Bacula Enterprise

Bacula Enterprise is a highly secure, scalable backup solution that offers its flexible capabilities via a system of modules. There is a separate HDFS backup module that offers efficient HDFS cluster backup and restore with multiple backup types (incremental, differential, full) and automatic snapshot management.

The module is capable of filtering data based on its creation date, making it extremely convenient to work with for an end user. Plenty of other backup functionality is also there, as well as almost complete freedom when it comes to choosing the restoration directory for HDFS backups.

The way this module works is also simple – a backup operation prompts a connection between a Hadoop FS and a Hadoop module in order to generate a snapshot of the system before sending it to the Bacula File Daemon. The full backup does not need to access previous snapshots, while both differential and incremental backups need to do so to take note of any differences between the last and the current snapshots.
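The snapshot logic described above can be illustrated with a short sketch. It is not Bacula's implementation: the `BackupRecord` type, the function names, and the snapshot names are hypothetical, and the sketch assumes the conventional definitions in which a differential backup compares against the snapshot of the last full backup, while an incremental backup compares against the snapshot of the most recent backup of any level. The commands it assembles, `hdfs dfs -createSnapshot` and `hdfs snapshotDiff`, are standard HDFS tools and require the directory to be snapshottable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupRecord:
    level: str     # "full", "differential" or "incremental"
    snapshot: str  # name of the HDFS snapshot taken for that backup


def reference_snapshot(history: List[BackupRecord], level: str) -> Optional[str]:
    """Return the earlier snapshot a new backup must be compared against.

    Returns None when no comparison is needed (a full backup) or when no
    suitable earlier backup exists, in which case a full backup is required.
    """
    if level == "full":
        return None
    if level == "differential":
        candidates = [r for r in history if r.level == "full"]
    elif level == "incremental":
        candidates = list(history)
    else:
        raise ValueError(f"unknown backup level: {level}")
    return candidates[-1].snapshot if candidates else None


def snapshot_commands(path: str, new_snapshot: str,
                      reference: Optional[str]) -> List[List[str]]:
    """Build the HDFS commands for one backup run without executing them."""
    commands = [["hdfs", "dfs", "-createSnapshot", path, new_snapshot]]
    if reference is not None:
        # List what changed between the reference and the new snapshot.
        commands.append(["hdfs", "snapshotDiff", path, reference, new_snapshot])
    return commands


if __name__ == "__main__":
    # Hypothetical backup history, oldest first.
    history = [
        BackupRecord("full", "backup-001"),
        BackupRecord("incremental", "backup-002"),
    ]
    ref = reference_snapshot(history, "incremental")
    for cmd in snapshot_commands("/data/warehouse", "backup-003", ref):
        print(" ".join(cmd))
```

In this sketch a full backup produces only the snapshot creation command, while differential and incremental runs add a `snapshotDiff` call whose output would tell the backup tool which files to read.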

There is also the fact that Bacula Enterprise is distributed using an advantageous subscription licensing model with no data volume limits. This is a massive advantage in the context of Hadoop, since most Hadoop deployments are massive data pools, and backing up these kinds of deployments scales the price up quite heavily in other solutions – but not with Bacula.

Plenty of other enterprise-class capabilities of Bacula are also applicable to backed-up Hadoop data. Bacula Enterprise is an exceptional and versatile solution suitable for many different use cases, including HPC environments, which frequently utilize HDFS.

Bacula’s entire architecture is modular and customizable, making it easy for the solution to adapt to various IT environments – no matter what their size. The support for distributed infrastructures with load balancing via multiple Bacula Director servers helps to avoid overloads during heavy load periods. Generally speaking, Bacula has a track record of working with large data stores with little to no issues – an exceptionally useful quality which contributes to its efficiency in Hadoop deployments. Bacula is also capable of being part of a comprehensive disaster recovery strategy. These are just some of the reasons it is used by the largest military and defense organizations in the world, banks, NASA, and US National Laboratories.

Conclusion

Hadoop is an important framework, especially with so many companies relying on large pools of data to perform ML and AI tasks, among many others. “Big data” has grown in use and the applications for its use have matured into sophisticated, high value business solutions. Similarly, the demand for frameworks that complement it is developing at the same pace.

However, with new data structures and frameworks, new problems also arise – because existing data safety protocols and measures are not always compatible with Hadoop systems. Fortunately, Hadoop has its own capabilities for data replication and snapshotting – and there are also multiple third-party backup solutions and platforms that can offer Hadoop backup capabilities.

Solutions such as Bacula or Veritas would be great for companies looking for an “all-in-one” solution that can cover Hadoop deployments while also protecting a broad range of different data and application types within the same infrastructure to achieve single-pane-of-glass protection. Cloudera or even some of the built-in methods can work for organizations with simple backup and recovery needs, as they offer a somewhat focused solution to a narrow problem, but with very limited capabilities outside of HDFS and HBase coverage.

HDFS and HBase data can be protected to some extent with different methods and approaches within management solutions such as Cloudera. But if backup and recovery are required at any level of sophistication, then specialized solutions such as Bacula are needed to deliver the necessary level of service.

About the author

Rob Morrison is the marketing director at Bacula Systems. He started his IT marketing career with Silicon Graphics in Switzerland, performing strongly in various marketing management roles for almost 10 years. In the next 10 years, Rob also held various marketing management positions at JBoss, Red Hat, and Pentaho, ensuring market share growth for these well-known companies. He is a graduate of Plymouth University, holds an Honours degree in Digital Media and Communications, and completed an Overseas Studies Program.