Save disk space and money with the unique feature from Bacula Systems – Deduplication Volumes.
- No other backup software stores data this way (patent pending technology)
- Bacula Systems helps you overcome your scaling challenges
- Raising the record size limit brings a major positive impact to your storage costs
Deduplication refers to any of a number of methods for reducing the storage requirements of a dataset through the elimination of redundant pieces without rendering the data unusable. Unlike compression, each redundant piece of data receives a unique identifier that is used to reference it within the dataset and a virtually unlimited number of references can be created for the same piece of data.
It is popular in applications that inherently produce many copies of the same data with each copy differing only slightly from the others, or even not at all. There are many storage systems and applications on the market today which implement deduplication. All can be classified into one of two deduplication types depending on how they store their data:
- Fixed Block
Deduplication takes place in units of a fixed size (typically 4 kB to 128 kB). Data must be aligned on block boundaries to deduplicate.
- Variable Block
Deduplication takes place in variable-length units anywhere from a few bytes to many gigabytes in size. Block boundaries do not exist.
Deduplicating filesystems use fixed-block deduplication. The optimal unit of deduplication is the record size, which varies by filesystem: for example, 128 kB by default for ZFS and 4 kB for NetApp. Figures for other filesystems are available on request.
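To make the alignment requirement concrete, here is a minimal sketch of fixed-block deduplication. It is an illustration only, not how ZFS or NetApp implement it: the 4 kB block size, the use of SHA-256 and the helper name `unique_blocks` are all assumptions chosen for the example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # illustrative; the text cites 4 kB to 128 kB


def unique_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Count the distinct fixed-size blocks in data, compared by SHA-256 digest."""
    digests = {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }
    return len(digests)


payload = os.urandom(8 * BLOCK_SIZE)    # 8 blocks of random data

aligned = payload + payload             # second copy starts on a block boundary
shifted = payload + b"\x00" + payload   # second copy starts one byte late

print(unique_blocks(aligned))  # 8: the second copy adds no new blocks
print(unique_blocks(shifted))  # 17: every block after the shift is new
```

A single stray byte in front of the second copy is enough to stop any of its blocks from matching the first copy, even though the payload itself is identical.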
The Traditional Way
Traditional backup programs were designed to work with tapes. When they write to disk, they use the same format, only writing to a container file instead of a tape. The Unix program tar is an example of this, and so is Bacula’s traditional volume format. Files are interspersed with metadata and written one after the other. File boundaries do not align with block boundaries as they do on the filesystem.
For this reason, backup data does not typically deduplicate well on fixed-block systems.
Storage without deduplication
The new era. Bacula Enterprise Deduplication Volumes.
Deduplication Volumes store data on disk by aligning file boundaries to the block boundaries of the underlying filesystem. Metadata, which does not align, is separated into a special metadata volume. Within the data volume, the space between the end of one file and the next block boundary is left empty; a file containing such gaps is known as a sparse or holey file. Since every file begins on a block boundary, redundant data within files deduplicates well using ZFS’s fixed-block deduplication.
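The difference between the two layouts can be sketched in a few lines. This is a toy model, not Bacula's volume format: the header format, the 4 kB block size and the function names are invented for the illustration, and zero bytes stand in for the holes that a real sparse file would leave unallocated.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # hypothetical record size for the illustration


def count_unique_blocks(stream: bytes) -> int:
    """Count distinct fixed-size blocks, as a fixed-block deduplicator would."""
    return len({
        hashlib.sha256(stream[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(stream), BLOCK_SIZE)
    })


def pack_interleaved(files):
    """Tar-like layout: a header, then the file body, written back to back."""
    out = bytearray()
    for name, body in files:
        out += f"{name}:{len(body)}\n".encode()  # made-up header format
        out += body
    return bytes(out)


def pack_aligned(files):
    """Aligned layout: bodies start on block boundaries, headers kept apart."""
    data, meta = bytearray(), bytearray()
    for name, body in files:
        meta += f"{name}:{len(body)}@{len(data)}\n".encode()
        data += body
        data += b"\x00" * (-len(body) % BLOCK_SIZE)  # gap up to the next boundary
    return bytes(data), bytes(meta)


# The same file content backed up twice under different names.
body = os.urandom(3 * BLOCK_SIZE + 100)
files = [("a.bin", body), ("bb.bin", body)]

data_volume, metadata_volume = pack_aligned(files)
print(count_unique_blocks(pack_interleaved(files)))  # 7: no block is shared
print(count_unique_blocks(data_volume))              # 4: second copy is free
```

In the interleaved stream the headers push the second copy off the block grid, so all seven blocks are unique. In the aligned data volume the second copy maps onto the same four blocks as the first.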
Storage with Deduplication Volumes
What to deduplicate?
The following data types deduplicate well:
- Files that change constantly but are only appended to, such as large log files
- Large files that change daily but only in small amounts
- Monolithic Databases
- Some types of email boxes
- Identical files that appear in backups from many clients
- Operating system data from virtual machines
- Email attachments with multiple recipients
Deduplication Volumes are limited only by ZFS itself:
- Data will not dedupe across zpools
- Deduplication metadata is stored in the ARC/L2ARC
- Only 1/4 of the ARC/L2ARC is reserved for metadata
- Large dedupe repositories will require a large ARC/L2ARC
How big does my L2ARC need to be?
It is tempting to start with the total amount of primary data to be backed up when calculating the size of the L2ARC, but the space taken by holes in the Bacula volumes needs to be considered too. This must be subtracted when trying to estimate the total amount of primary data that can be backed up and deduped using an L2ARC of a given size. One way to think about this is to picture the storage of data inside a deduplication volume in terms of full and partially full blocks. It is the number of these blocks that affects the size of the L2ARC, not the amount of data they contain.
- Files smaller than the block size will consist of one partially full block
- Files larger than the block size will consist of one or more full blocks and usually end with one partially full block
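The block counting described above is a ceiling division. The sketch below shows it for a handful of made-up file sizes; the helper names and the sample sizes are hypothetical, and 128 kB is simply the ZFS default record size mentioned earlier.

```python
def blocks_for_file(size_bytes: int, block_size: int) -> int:
    """Blocks occupied by one aligned file; the last block may be partial."""
    return -(-size_bytes // block_size)  # ceiling division on integers


def average_fill(sizes, block_size: int) -> float:
    """Fraction of the allocated block space that actually holds data."""
    blocks = sum(blocks_for_file(s, block_size) for s in sizes)
    return sum(sizes) / (blocks * block_size)


BLOCK = 128 * 1024
sizes = [10 * 1024, 128 * 1024, 300 * 1024]  # made-up file sizes in bytes

print([blocks_for_file(s, BLOCK) for s in sizes])  # [1, 1, 3]
print(average_fill(sizes, BLOCK))                  # 0.684375
```

The 10 kB file costs as much dedup metadata as the 128 kB file, because each occupies exactly one block.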
Deduplication sizing – important parameters
- The amount of data to deduplicate
- The block size used for deduplication
- The average percentage of each block that is full rather than empty
- The percentage of the L2ARC reserved for metadata
Examples of the impact of changing the parameters. The following values are common to all three examples:
- Primary data: 100 TB
- ZFS record size: 128 kB
- Average block fill percentage: 50%
- Retention period: 90 days
Example 1: a typical situation with default values
- L2ARC metadata percentage: 25%
- Daily percent of data changed: 2%
- L2ARC size needed: 560 GB
- L2ARC as percentage of primary data: 0.547%
Example 2: changing the daily percentage from 2% to 5%
- L2ARC metadata percentage: 25%
- Daily percent of data changed: 5%
- L2ARC size needed: 1 100 GB
- L2ARC as percentage of primary data: 1.074%
Example 3: changing the L2ARC metadata percentage from 25% to 50%
- L2ARC metadata percentage: 50%
- Daily percent of data changed: 5%
- L2ARC size needed: 550 GB
- L2ARC as percentage of primary data: 0.537%
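The page does not state the formula behind these figures. The sketch below is a back-fitted model, not the Bacula Systems calculator. It assumes that the stored data grows as primary data × (1 + daily change × retention days), that every block costs 32 bytes of deduplication metadata in the L2ARC, and that TB and GB are binary units. The 32-byte value is an assumption chosen only because it makes the model agree with the examples; it is not a documented ZFS constant.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# Assumption: fitted so that the model agrees with the examples on this page.
# This is not a documented ZFS constant.
ASSUMED_BYTES_PER_BLOCK = 32


def l2arc_size_gb(primary_tb, record_kb, fill, retention_days,
                  daily_change, metadata_share):
    """Estimate the L2ARC size in GB (binary units) for a dedup repository."""
    # Assumed growth model: one full copy plus the daily changes kept for
    # the whole retention period.
    stored_bytes = primary_tb * TIB * (1 + daily_change * retention_days)
    # A partially full block still counts as a whole block.
    blocks = stored_bytes / (record_kb * 1024 * fill)
    metadata_bytes = blocks * ASSUMED_BYTES_PER_BLOCK
    # Only metadata_share of the L2ARC is available for this metadata.
    return metadata_bytes / metadata_share / GIB


common = dict(primary_tb=100, record_kb=128, fill=0.5, retention_days=90)
examples = [
    dict(daily_change=0.02, metadata_share=0.25),
    dict(daily_change=0.05, metadata_share=0.25),
    dict(daily_change=0.05, metadata_share=0.50),
]
for ex in examples:
    size = l2arc_size_gb(**common, **ex)
    share = size / (common["primary_tb"] * 1024) * 100
    print(f"{size:.0f} GB, {share:.3f}% of primary data")
```

Under these assumptions the script prints 560 GB (0.547%), 1100 GB (1.074%) and 550 GB (0.537%) for the first, second and third example respectively, which matches the percentages quoted above. Treat it as a way to see how the parameters interact rather than as a sizing tool.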
How to size?
Accurate sizing is difficult in practice. Oversizing and using conservative estimates are recommended. To help in sizing your infrastructure for deduplication, Bacula Systems provides an online deduplication sizing calculator.
– Install as much RAM as possible (ARC)
– Use only SSDs for your L2ARC
– Create a much larger L2ARC than you think you need
How does sizing impact your costs?
Current storage pricing trends bode well for ZFS deduplication. Solid State Drive (SSD) performance continues to increase and prices have come down significantly in recent years, making large L2ARCs economically feasible. The combination of high I/O operations per second (IOPS) and large capacity is essential to maintain performance as the amount of data stored in the filesystem increases.
Deduplication Volumes are supported with:
- Nexenta Systems OpenStorage Appliances
- NetApp Data ONTAP 8.0.1 and higher
- Oracle / Sun ZFS Storage Appliances
- White Bear Solutions WBSAirback Appliances
- ZFS on Linux (64-bit only)