Save disk space and money with the unique feature from Bacula Systems – Deduplication Volumes.
Deduplication refers to any of a number of methods for reducing the storage requirements of a dataset by eliminating redundant pieces of data without rendering the dataset unusable. Unlike compression, deduplication gives each redundant piece of data a unique identifier that is used to reference it within the dataset, and a virtually unlimited number of references can be created for the same piece of data.
Deduplication is popular in applications that inherently produce many copies of the same data, with each copy differing only slightly from the others, or not at all. Many storage systems and applications on the market today implement deduplication. All can be classified into one of two types, fixed block or variable block, depending on how they store their data:
Deduplicating filesystems use fixed block deduplication. The optimal unit of deduplication is the record size, which varies between filesystems: 128 kB by default for ZFS, for example, and 4 kB for NetApp. Other record sizes are available on request.
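The principle of fixed block deduplication can be sketched in a few lines of Python: split each stream into fixed-size records, hash every record, and store only the distinct ones. This is an illustrative model, not ZFS's actual on-disk implementation.

```python
import hashlib
import os

RECORD_SIZE = 128 * 1024  # 128 kB, the ZFS default record size


def fixed_block_digests(data: bytes, record_size: int = RECORD_SIZE):
    """Split data into fixed-size records and hash each one."""
    return [
        hashlib.sha256(data[i:i + record_size]).hexdigest()
        for i in range(0, len(data), record_size)
    ]


def unique_records(datasets):
    """Count distinct records across all datasets -- what dedup would store."""
    seen = set()
    logical = 0
    for data in datasets:
        digests = fixed_block_digests(data)
        logical += len(digests)
        seen.update(digests)
    return len(seen), logical


# Two identical 1 MiB "backups": the second copy costs no extra records.
payload = os.urandom(1024 * 1024)          # 8 records of 128 kB each
stored, logical = unique_records([payload, payload])
# stored == 8 unique records, logical == 16 records referenced
```

Backing up the same data twice doubles the number of references but not the number of stored records, which is exactly the property Deduplication Volumes are designed to exploit.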
The Traditional Way
Traditional backup programs were designed to work with tapes. When they write to disk, they use the same format, only writing to a container file instead of to a tape. The Unix program tar is an example of this, and so is Bacula's traditional volume format. Files are interspersed with metadata and written one after the other, so file boundaries do not align with block boundaries as they do on the filesystem.
For this reason, backup data does not typically deduplicate well on fixed-block systems. This is the main reason why variable block deduplication was invented, and why there is a large market for deduplicating backup appliances.
Storage without deduplication
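The alignment problem is easy to demonstrate. In the sketch below (a simplified model, using a hypothetical 512-byte header in the style of tar's block size), prepending a small piece of metadata shifts every byte of the file relative to the fixed record boundaries, so none of the resulting records match the aligned copy:

```python
import hashlib
import os

RECORD = 128 * 1024  # 128 kB fixed deduplication records


def record_digests(data: bytes):
    """The set of record hashes a fixed-block deduplicator would see."""
    return {
        hashlib.sha256(data[i:i + RECORD]).digest()
        for i in range(0, len(data), RECORD)
    }


payload = os.urandom(1024 * 1024)        # 1 MiB of file data
aligned = payload                        # file starts on a record boundary
tar_like = b"\x00" * 512 + payload       # container header shifts everything

shared = record_digests(aligned) & record_digests(tar_like)
# shared is empty: a 512-byte offset is enough to defeat fixed-block dedup
```

Even though both streams contain the identical 1 MiB payload, the shifted copy shares no records with the aligned one, so a fixed-block deduplicator stores everything twice.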
The new era. Bacula Enterprise Deduplication Volumes.
Deduplication Volumes store data on disk by aligning file boundaries to the block boundaries of the underlying filesystem. Metadata, which does not align, is separated into a special metadata volume. Within the data volume, the space between the end of one file and the start of the next block boundary is left empty; a file containing such unallocated gaps is known as a sparse, or holey, file. Since every file begins on a block boundary, redundant data within files deduplicates well using ZFS's fixed block deduplication.
Storage with Deduplication Volumes
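On POSIX systems, a hole is created simply by seeking past a range without writing it. The sketch below (an illustrative helper, not part of Bacula's implementation; the 128 kB record size is assumed) pads each file out to the next record boundary with a hole instead of written zeros:

```python
import os
import tempfile

RECORD = 128 * 1024  # assume a 128 kB filesystem record size


def append_aligned(vol, data: bytes, record: int = RECORD) -> int:
    """Append data, then seek past the remainder of the record.

    The skipped range is never written, so the filesystem can leave it
    unallocated (a hole), and the next file starts on a record boundary."""
    vol.write(data)
    pad = (-vol.tell()) % record
    if pad:
        vol.seek(pad, os.SEEK_CUR)  # create a hole instead of writing zeros
    return vol.tell()


with tempfile.NamedTemporaryFile(delete=False) as vol:
    off1 = append_aligned(vol, b"A" * 1000)      # small file, large hole
    off2 = append_aligned(vol, b"B" * 200_000)   # spans two records
    vol.truncate(off2)                           # extend the apparent size

st = os.stat(vol.name)
os.unlink(vol.name)
# st.st_size is the apparent size (3 full records here), while
# st.st_blocks * 512 is what is actually allocated -- smaller on
# filesystems that keep the holes.
```

Every file in the volume begins at a multiple of the record size, which is the alignment property that makes fixed-block deduplication effective, while the holes cost (almost) no physical space.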
What to deduplicate?
The following data types deduplicate well:
Deduplication Volumes are limited only by ZFS itself:
How big does my L2ARC need to be?
It is tempting to start with the total amount of primary data to be backed up when calculating the size of the L2ARC, but the space taken by holes in the Bacula volumes needs to be considered too: it must be subtracted when estimating the total amount of primary data that can be backed up and deduplicated using an L2ARC of a given size. One way to think about this is to picture the storage of data inside a Deduplication Volume in terms of full and partially full blocks. It is the number of these blocks that determines the size of the L2ARC, not the amount of data they contain.
Deduplication sizing – important parameters
Examples of the impact of changing the parameters:
Common parameters for all examples:
Primary data: 100 TB
ZFS record size: 128 kB
Average block fill percentage: 50%
Retention period: 90 days

Example 1 – a typical situation with default values:
Daily percent of data changed: 2%
L2ARC metadata percentage: 25%
L2ARC size needed: 560 GB
L2ARC as percentage of primary data: 0.547%

Example 2 – changing the daily percentage of data changed: 2% -> 5%:
L2ARC metadata percentage: 25%
L2ARC size needed: 1 100 GB
L2ARC as percentage of primary data: 1.074%

Example 3 – changing the L2ARC metadata percentage: 25% -> 50%:
Daily percent of data changed: 5%
L2ARC size needed: 550 GB
L2ARC as percentage of primary data: 0.537%
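The figures above can be reproduced with a simple model: the data written over the retention period (primary data plus daily changes), divided by the average filled bytes per record, gives the number of records, each of which costs some metadata in the L2ARC. The 32 bytes of L2ARC metadata per record assumed below is a figure chosen because it reproduces the table; Bacula Systems' online calculator may use a more detailed model.

```python
def l2arc_size_gb(primary_tb: float, daily_change: float, retention_days: int,
                  record_kb: int = 128, fill: float = 0.5,
                  meta_fraction: float = 0.25,
                  bytes_per_record: int = 32) -> float:
    """Rough L2ARC sizing model (an illustrative reconstruction).

    meta_fraction is the share of the L2ARC allowed to hold metadata;
    bytes_per_record is the assumed L2ARC metadata cost per stored record."""
    total_kb = primary_tb * 1024**3 * (1 + daily_change * retention_days)
    records = total_kb / (record_kb * fill)       # partially full blocks count
    metadata_bytes = records * bytes_per_record
    return metadata_bytes / meta_fraction / 1024**3


s1 = round(l2arc_size_gb(100, 0.02, 90))                     # Example 1: 560 GB
s2 = round(l2arc_size_gb(100, 0.05, 90))                     # Example 2: 1100 GB
s3 = round(l2arc_size_gb(100, 0.05, 90, meta_fraction=0.5))  # Example 3: 550 GB
```

Note that the result scales with the number of records, not the amount of data they contain: halving the average block fill percentage doubles the L2ARC requirement for the same primary data.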
How to size?
Accurate sizing is difficult in practice, so oversizing and conservative estimates are recommended. To help you size your infrastructure for deduplication, Bacula Systems provides an online deduplication sizing calculator.
How does sizing impact your costs?
Current storage pricing trends bode well for ZFS deduplication. Solid State Drive (SSD) performance continues to increase, and prices have come down significantly in recent years, making large L2ARCs economically feasible. The combination of high I/O operations per second (IOPS) and large capacity is essential to maintain performance as the amount of data stored in the filesystem grows.