Data Deduplication Volumes. Variable Block Deduplication.

Save disk space and money with the unique feature from Bacula Systems – Deduplication Volumes.

No other backup software stores data this way (patent pending technology)
Bacula Systems helps you overcome your scaling challenges
Raising the record size limit brings a major positive impact to your storage costs

Deduplication refers to any of a number of methods for reducing the storage requirements of a dataset by eliminating redundant pieces without rendering the data unusable. Unlike compression, deduplication gives each piece of data a unique identifier that is used to reference it within the dataset, and a virtually unlimited number of references can be created for the same piece of data.
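
The idea of identifiers and references can be illustrated with a minimal Python sketch. It is only a toy model, not how Bacula or ZFS store data: the `DedupStore` class, its method names and the choice of SHA-256 as the identifier are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: each distinct piece is kept only once."""

    def __init__(self):
        self.pieces = {}    # identifier -> bytes, stored a single time
        self.refcount = {}  # identifier -> number of references handed out

    def put(self, piece: bytes) -> str:
        # The identifier is derived from the content (an assumption of this sketch).
        ident = hashlib.sha256(piece).hexdigest()
        if ident not in self.pieces:
            self.pieces[ident] = piece
        self.refcount[ident] = self.refcount.get(ident, 0) + 1
        return ident

    def get(self, ident: str) -> bytes:
        return self.pieces[ident]


store = DedupStore()
dataset = [b"header", b"payload", b"payload", b"payload"]
refs = [store.put(piece) for piece in dataset]   # the dataset becomes a list of references
restored = [store.get(ref) for ref in refs]      # the data is still fully usable
```

Here four pieces are written but only two are stored, and the repeated piece simply accumulates references.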

Deduplication is popular in applications that inherently produce many copies of the same data, with each copy differing only slightly from the others or not at all. Many storage systems and applications on the market today implement deduplication. All can be classified into one of two deduplication types depending on how they store their data:

Fixed Block
Deduplication takes place in units of a fixed size (typically 4 kB to 128 kB). Data must be aligned on block boundaries to deduplicate.
Variable Block
Deduplication takes place in variable-length units anywhere from a few bytes to many gigabytes in size. Block boundaries do not exist.

Deduplicating filesystems use fixed block deduplication. The optimal unit of deduplication is the record size, which varies by filesystem: for example, 128 kB by default for ZFS and 4 kB for NetApp. Figures for other filesystems are available on request.

The Traditional Way

Traditional backup programs were designed to work with tapes. When they write to disk, they use the same format, only writing to a container file instead of a tape. The Unix program tar is an example of this, and so is Bacula’s traditional volume format. Files are interspersed with metadata and written one after the other. File boundaries do not align with block boundaries as they do on the filesystem.

For this reason, backup data does not typically deduplicate well on fixed-block systems.
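
The effect of misalignment can be demonstrated with a short Python sketch. It assumes a 4 kB block size and models a backup stream as some metadata bytes followed by file content; the sizes, the seed and the helper names are illustrative only and do not reflect tar's or Bacula's actual formats.

```python
import hashlib
import random

BLOCK = 4096  # assumed fixed deduplication unit, in bytes

rng = random.Random(1)  # seeded so the sketch is repeatable


def rand_bytes(n: int) -> bytes:
    """Pseudo-random bytes standing in for real file or metadata content."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def block_hashes(stream: bytes) -> set:
    """Hash every fixed-size block of the stream, as a fixed-block system would."""
    return {hashlib.sha256(stream[i:i + BLOCK]).digest()
            for i in range(0, len(stream), BLOCK)}


payload = rand_bytes(16 * BLOCK)       # identical file content in both backups
backup_a = rand_bytes(512) + payload   # 512 bytes of metadata precede the file
backup_b = rand_bytes(700) + payload   # 700 bytes of metadata precede the file

shared = block_hashes(backup_a) & block_hashes(backup_b)
print(len(shared))  # number of blocks the two streams have in common
```

Although both streams contain the same 64 kB of file content, the content sits at different offsets within each block, so the two streams have no blocks in common.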

This is the main reason why variable block deduplication was invented and why there is a large market for deduplicating backup appliances.

Storage without deduplication

The new era. Bacula Enterprise Deduplication Volumes.

Deduplication Volumes store data on disk by aligning file boundaries to the block boundaries of the underlying filesystem. Metadata, which does not align, is separated into a special metadata volume. Within the data volume, the space between the end of one file and the next block boundary is left empty; a file with such gaps is known as a sparse or holey file. Since every file begins on a block boundary, redundant data within files will deduplicate well using ZFS’s fixed block deduplication.
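
A rough Python sketch of this layout follows. It is not Bacula's on-disk format: the 4 kB block size, the `build_data_volume` helper and the use of zero bytes to stand in for the holes of a real sparse file are assumptions made for illustration.

```python
import hashlib
import random

BLOCK = 4096  # assumed block size of the underlying filesystem

rng = random.Random(7)  # seeded so the sketch is repeatable


def rand_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


def align_up(offset: int, block: int = BLOCK) -> int:
    """Round offset up to the next multiple of block."""
    return -(-offset // block) * block


def build_data_volume(files):
    """Lay files out so that each one starts on a block boundary.

    Returns the volume bytes and the start offset of every file.
    Zero bytes stand in for the holes of a real sparse file.
    """
    volume = bytearray()
    offsets = []
    for content in files:
        volume.extend(b"\0" * (align_up(len(volume)) - len(volume)))
        offsets.append(len(volume))
        volume.extend(content)
    return bytes(volume), offsets


def block_hashes(data: bytes) -> set:
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}


shared_file = rand_bytes(4 * BLOCK)
vol_a, offsets_a = build_data_volume([rand_bytes(1000), shared_file])
vol_b, offsets_b = build_data_volume([rand_bytes(9000), shared_file])

common = block_hashes(vol_a) & block_hashes(vol_b)
```

The shared file lands at a different offset in each volume (4096 and 12288), but because both offsets are block boundaries, all four of its blocks are identical in the two volumes and can be deduplicated.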

Storage with Deduplication Volumes

What to deduplicate?

The following data types deduplicate well:

Files that change constantly but are only appended to, such as large log files
Large files that change daily but only in small amounts
Monolithic Databases
Some types of email boxes
Identical files that appear in backups from many clients
Operating system data from virtual machines
Email attachments with multiple recipients

Deduplication Volumes are limited only by ZFS itself:

Data will not dedupe across zpools
Deduplication metadata is stored in the ARC/L2ARC
Only 1/4 of the ARC/L2ARC is reserved for metadata
Large dedupe repositories will require a large ARC/L2ARC

Sizing

How big does my L2ARC need to be?
When calculating the size of the L2ARC, it is tempting to start with the total amount of primary data to be backed up, but the space taken by holes in the Bacula volumes needs to be considered too. This space must be subtracted when estimating the total amount of primary data that can be backed up and deduplicated using an L2ARC of a given size. One way to think about this is to picture the data inside a deduplication volume as a set of full and partially full blocks. It is the number of these blocks that determines the size of the L2ARC, not the amount of data they contain.

Files smaller than the block size will consist of one partially full block
Files larger than the block size will consist of one or more full blocks and usually end with one partially full block
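
The two rules above can be written as a small Python sketch that counts blocks per file and derives the average fill of a set of files. The function names and the sample file sizes are hypothetical.

```python
def blocks_for_file(size: int, block: int) -> tuple:
    """Return (full_blocks, partial_blocks) for a file of `size` bytes."""
    full, remainder = divmod(size, block)
    return full, (1 if remainder else 0)


def average_fill(sizes, block: int) -> float:
    """Fraction of allocated block space that actually holds data."""
    total_blocks = sum(sum(blocks_for_file(size, block)) for size in sizes)
    if total_blocks == 0:
        return 0.0
    return sum(sizes) / (total_blocks * block)


BLOCK = 128 * 1024                            # 128 kB record size
sizes = [10 * 1024, 128 * 1024, 300 * 1024]   # hypothetical file sizes in bytes

total_blocks = sum(sum(blocks_for_file(size, BLOCK)) for size in sizes)
fill = average_fill(sizes, BLOCK)
```

In this example 438 kB of data occupies five blocks, so the L2ARC has to account for five blocks even though the data would fit in four if it were packed end to end.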

Deduplication sizing – important parameters

The amount of data to deduplicate
The block size used for deduplication
The average percent full per block vs. empty space
The percentage of the L2ARC reserved for metadata

Examples of the impact of changing the parameters:

Primary data 100 TB
ZFS record size 128 kB
Average block fill percentage 50%
Retention period 90 days

– A typical situation with default values
L2ARC metadata percentage 25%
Daily percent of data changed 2%
L2ARC size needed 560 GB
L2ARC as percentage of primary data 0.547%

– Changing daily percentage: 2% -> 5%
L2ARC metadata percentage 25%
L2ARC size needed 1 100 GB
L2ARC as percentage of primary data 1.074%

– Changing L2ARC metadata percentage: 25% -> 50%
Daily percent of data changed 5%
L2ARC size needed 550 GB
L2ARC as percentage of primary data 0.537%
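
The page does not state the formula behind these figures. The Python sketch below is one simple model that happens to reproduce them, and it should be read as a guess rather than as the method used by Bacula Systems. It assumes that the stored data equals the primary data plus the daily change accumulated over the retention period, that the number of blocks equals the stored data divided by the filled part of a record, and that each block costs a fixed number of metadata bytes. The value `bytes_per_entry=32` is a hypothetical constant fitted to the examples, not a documented ZFS figure, and units are treated as binary (1 TB = 1024 GB).

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def l2arc_size_gib(primary_tib, record_kib, fill, retention_days,
                   daily_change, metadata_fraction, bytes_per_entry=32):
    """Estimate the L2ARC size in GiB under the assumptions stated above."""
    # Assumed: every day adds `daily_change` of the primary data as new data.
    stored_bytes = primary_tib * TIB * (1 + daily_change * retention_days)
    # Partially full blocks hold `fill` of a record on average.
    blocks = stored_bytes / (record_kib * KIB * fill)
    # bytes_per_entry is a fitted, hypothetical per-block metadata cost.
    metadata_bytes = blocks * bytes_per_entry
    # Only `metadata_fraction` of the L2ARC is available for metadata.
    return metadata_bytes / metadata_fraction / GIB


common = dict(primary_tib=100, record_kib=128, fill=0.5, retention_days=90)

typical = l2arc_size_gib(daily_change=0.02, metadata_fraction=0.25, **common)
more_change = l2arc_size_gib(daily_change=0.05, metadata_fraction=0.25, **common)
more_metadata = l2arc_size_gib(daily_change=0.05, metadata_fraction=0.50, **common)

for label, size in [("typical", typical),
                    ("5% daily change", more_change),
                    ("50% metadata", more_metadata)]:
    print(f"{label}: {size:.0f} GB, {size / (100 * 1024):.3%} of primary data")
```

Under these assumptions the three scenarios come out at about 560 GB, 1,100 GB and 550 GB. The model makes the two dependencies visible: the estimate grows linearly with the amount of stored data and shrinks in proportion to the share of the L2ARC that metadata may use. For real deployments the vendor's sizing calculator remains the reference.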

How to size?

Accurate sizing is difficult in practice. Oversizing and using conservative estimates are recommended. To help in sizing your infrastructure for deduplication, Bacula Systems provides an online deduplication sizing calculator.

Sizing Recommendations

Install as much RAM as possible (ARC)
Use only SSDs for your L2ARC
Create a much larger L2ARC than you think you need

How does sizing impact your costs?

Current storage pricing trends bode well for ZFS deduplication. Solid State Drive (SSD) performance continues to increase, and prices have come down significantly in recent years, making large L2ARCs economically feasible. The combination of high I/O operations per second (IOPS) and large capacity is essential to maintain performance as the amount of data stored in the filesystem increases.

Deduplication Volumes are supported with:

Nexenta Systems OpenStorage Appliances
NetApp Data ONTAP 8.0.1 and higher
Oracle / Sun ZFS Storage Appliances
White Bear Solutions WBSAirback Appliances
ZFS on Linux (64-bit only)

Further help:

Need to see all our plugins?
BWeb™ Management Suite is a comprehensive GUI management suite for Bacula Enterprise that provides the data reports, core metrics and analysis that system administrators need to provide to managers.
Training is available in different locations, depending on the Certified Bacula Systems Training Center you choose.