Every IT department in the world is feeling the pressure. The mandate now and for the foreseeable future is to reduce capital expenditures, lower operating costs and save energy. This is not just about being green anymore; it is about fiscal common sense in a slow economy.
Now is the time for IT professionals to think outside the box and investigate technologies that can deliver greater efficiency and return on investment. This is nothing new to IT, but it is now a matter of survival. What may have simply been a good idea before is now a mandate, which is why the adoption of deduplication technology has accelerated towards the end of this year. Deduplication has come to be recognized as the next evolutionary step in backup technology. The benefits are tangible and extremely practical: eliminating duplicate data in secondary storage archives can slash media costs, streamline management tasks and minimize the bandwidth required to replicate data. In short, deduplication improves efficiency and saves money, which is just what is required when IT budgets are tight while mission-critical data continues its exponential growth.
There are many providers of deduplication solutions today, so how does one deploy the right one? Each vendor lays claim to having the best approach to data deduplication, leaving customers to face the difficulty of separating hype from reality and determining which factors are really important to their business. With some vendors setting unrealistic expectations by predicting huge reductions in data volume, some customers may find themselves ultimately disappointed with their solution.
Companies must consider a number of key factors in order to select a data deduplication solution that actually delivers cost-effective, high-performance and scalable long-term data storage. This article will provide the background information required to make an informed data deduplication purchasing decision.
Data deduplication is now more than ever an operational requirement
So what caused the proliferation of duplicated data in the first place? Ironically, current industry standard backup practices are the number one cause of duplication. In the interest of data protection, the traditional backup paradigm copies data to a safe secondary-storage repository over and over again, creating a monstrous overkill of backed-up information. Under this scenario, every backup exacerbates the problem.
Because secondary storage volumes are growing exponentially, companies need a way to dramatically reduce these data volumes. Regulatory requirements magnify the challenge, forcing businesses to change the way they look at data protection. By eliminating duplicate data and ensuring that data archives are as compact as possible, companies can keep more data online longer, at significantly lower cost. As a result, data deduplication is now a required technology for any company wanting to optimize the performance, efficiency and cost-effectiveness of its data storage environment.
Although compression technology can deliver an average 2:1 data volume reduction, this is only a fraction of what is required to deal with the data deluge most companies now face. Only data deduplication technology can meet the requirements companies have for far greater reductions in data volumes.
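As a rough illustration of why repeated backups leave so much room for reduction, the sketch below estimates the ratio of data backed up to data actually stored when several full backups of the same data set are retained. The model is an assumption made for this example, not a vendor formula: every full backup after the first is taken to differ from its predecessor by a fixed fraction (`change_rate`), and only the changed portion is stored again. The function name and parameters are hypothetical.

```python
def dedup_ratio(fulls_retained: int, change_rate: float) -> float:
    """Ratio of data backed up to data stored for repeated full backups.

    Assumption for illustration only: each full backup after the first
    differs from the previous one by `change_rate` of its size, and only
    the changed data has to be stored again.
    """
    if fulls_retained < 1:
        raise ValueError("at least one full backup must be retained")
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")
    logical = float(fulls_retained)                      # unit: one full backup
    stored = 1.0 + (fulls_retained - 1) * change_rate    # first full + changes
    return logical / stored


if __name__ == "__main__":
    for fulls, change in [(1, 0.02), (12, 0.02), (12, 0.20), (52, 0.02)]:
        ratio = dedup_ratio(fulls, change)
        print(f"{fulls:>2} fulls retained, {change:.0%} change: {ratio:.2f}:1")
```

Under these assumptions, twelve retained full backups with 2 percent change between them give roughly 9.8:1, while the same retention with 20 percent change gives 3.75:1. A single backup gives 1:1, because nothing has been repeated yet. Real results depend on the data and on the protection policy.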
Data deduplication also can minimize the bandwidth needed to transfer backup data to offsite archives. With the hazards of physically transporting tapes being well-established (damage, theft, loss, etc.), electronic transfer is fast becoming the offsite storage modality of choice for companies concerned about minimizing risks and protecting essential resources.
Eight criteria for a robust data deduplication solution
There are eight important criteria to consider when evaluating data deduplication solutions:
1. Focus on the largest problem
2. Integration with current environment
3. Virtual tape library capability
4. Impact of deduplication on backup performance
5. Scalability
6. Distributed topology support
7. Highly available deduplication repository
8. Efficiency and effectiveness
1. Focus on the largest problem
The first consideration is whether the solution attacks the area where the largest problem exists: backup data in secondary storage. Duplication in backup data can cause its storage requirement to be many times that which would be required if the duplicate data could be eliminated.
Figure 1, courtesy of the Enterprise Strategy Group (ESG), illustrates why a new technology evolution in backup is necessary. Incremental and differential backups were introduced to decrease the amount of data copied compared to a full backup.
However, even within incremental backups, there is significant duplication of data when protection is based on file-level changes. When considered across multiple servers at multiple sites, the opportunity for storage reduction by implementing a data deduplication solution becomes huge.

[Figure 1]
2. Integration with current environment
An effective data deduplication solution should be as non-disruptive as possible. Many companies are turning to virtual tape libraries (VTLs) to improve the quality of their backup without disruptive changes to policies, procedures, or software. This makes VTL-based data deduplication the least disruptive way to implement this technology. It also focuses on the largest pool of duplicated data: backups. Others are deploying a disk-to-disk backup paradigm, which requires a deduplication solution to present a network interface to the backup application. Introducing deduplication into this process simplifies and enhances disk-to-disk backups, performing deduplication without disruption to ongoing operations.
Solutions requiring proprietary appliances tend to be less cost-effective than those providing more openness and deployment flexibility. An ideal solution is one that is available as both software and turnkey appliances in order to provide the maximum opportunity to utilize existing resources.
3. Virtual tape library capability
If data deduplication technology is implemented around a virtual tape library (VTL), the capabilities of the VTL itself must be considered as part of the evaluation process. It is unlikely that the savings from data deduplication will outweigh the difficulties caused by using a sub-standard VTL. Consider the functionality, performance, stability and support of the VTL as well as its deduplication extension.
4. Impact of deduplication on backup performance
It is important to consider where and when data deduplication takes place in relation to the backup process. Some solutions attempt deduplication while data is being backed up. This inline method processes the backup stream as it comes into the deduplication appliance, making performance dependent on the strength of a single node. Such an approach can slow down backups, jeopardize backup windows and degrade VTL performance over time.
By comparison, data deduplication solutions that run after backup jobs complete, or concurrently with backup processes, avoid this problem and have no adverse impact on backup performance. This post-processing method processes the backup data by reading it from the backup repository after backups have been cached to disk, which ensures that backups are not throttled by any storage limitations. An enterprise-class solution that offers this level of flexibility is ideal for organizations looking for a choice of deduplication methods.
For maximum manageability, the solution should allow for granular (tape- or group-level) policy-based deduplication based on a variety of factors: resource utilization, production schedules, time since creation and so on. In this way, storage efficiencies can be achieved while optimizing the use of system resources.
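A policy of this kind can be pictured as a small set of conditions evaluated per tape or tape group. The sketch below is a hypothetical illustration only: the `DedupPolicy` fields, the `should_deduplicate` function and the thresholds are invented for this example and do not describe any particular product's settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DedupPolicy:
    """Hypothetical per-tape (or per-group) deduplication policy."""
    min_age: timedelta        # time since creation before a tape is eligible
    window_start_hour: int    # start of the allowed processing window (0-23)
    window_end_hour: int      # end of the window, exclusive; may wrap past midnight
    max_cpu_percent: float    # skip processing while the node is busier than this


def in_window(policy: DedupPolicy, now: datetime) -> bool:
    """True if `now` falls inside the policy's processing window."""
    start, end = policy.window_start_hour, policy.window_end_hour
    if start <= end:
        return start <= now.hour < end            # equal hours mean an empty window
    return now.hour >= start or now.hour < end    # window wraps past midnight


def should_deduplicate(policy: DedupPolicy, tape_created: datetime,
                       now: datetime, cpu_percent: float) -> bool:
    """Combine age, resource utilization and schedule into one decision."""
    old_enough = now - tape_created >= policy.min_age
    idle_enough = cpu_percent <= policy.max_cpu_percent
    return old_enough and idle_enough and in_window(policy, now)


if __name__ == "__main__":
    nightly = DedupPolicy(timedelta(hours=6), 22, 5, 60.0)
    created = datetime(2009, 1, 5, 12, 0)
    print(should_deduplicate(nightly, created, datetime(2009, 1, 5, 23, 0), 30.0))
```

Keeping the decision in one function makes it easy to add further factors, such as production schedules for specific servers, without touching the deduplication engine itself.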
5. Scalability
Because the solution is being chosen for longer-term data storage, scalability, in terms of both capacity and performance, is an important consideration. Consider growth expectations over five years or more. How much data will you want to keep on disk for fast access? How will the data index system scale to your requirements?
A deduplication solution should provide an architecture that allows economical “right-sizing” for both the initial implementation and the long-term growth of the system. For example, a clustering approach allows organizations to scale to meet growing capacity requirements, even for environments with many petabytes of data, without compromising deduplication efficiency or system performance. Clustering enables a VTL to be managed and used logically as a single data repository, supporting even the largest of tape libraries. Clustering also inherently provides a high-availability environment, protecting the backup repository interface (VTL or file interface) and deduplication nodes by offering failover support.

[Figure 2]
6. Distributed topology support
Data deduplication is a technology that can deliver benefits throughout a distributed enterprise, not just in a single data center. A solution that includes replication and multiple levels of deduplication can achieve maximum benefits from the technology.
For example, a company with a corporate headquarters, three regional offices and a secure disaster recovery (DR) facility should be able to implement deduplication in the regional offices to facilitate efficient local storage and replication to the central site. The solution should only require minimal bandwidth for the central site to determine whether the remote data is contained in the central repository. Only unique data across all sites should be replicated to the central site and subsequently to the DR site, to avoid excessive bandwidth requirements.
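One common way to keep that exchange small is for the remote site to send short fingerprints (hashes) of its data instead of the data itself, and to transfer only the pieces the central site reports as missing. The sketch below shows the idea under simplifying assumptions: SHA-256 fingerprints, an in-memory repository, and hypothetical names (`CentralRepository`, `replicate`) that do not correspond to any specific product.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Short identifier sent over the wire in place of the chunk itself."""
    return hashlib.sha256(chunk).hexdigest()


class CentralRepository:
    """In-memory stand-in for the central deduplication repository."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def missing(self, fingerprints):
        """Return the fingerprints the central site does not hold yet."""
        return [fp for fp in fingerprints if fp not in self.chunks]

    def store(self, new_chunks):
        for chunk in new_chunks:
            self.chunks[fingerprint(chunk)] = chunk


def replicate(site_chunks, central: CentralRepository) -> int:
    """Send only chunks the central site lacks; return payload bytes sent."""
    by_fp = {fingerprint(c): c for c in site_chunks}
    needed = central.missing(by_fp.keys())       # small request: hashes only
    central.store(by_fp[fp] for fp in needed)    # large transfer: unique data only
    return sum(len(by_fp[fp]) for fp in needed)


if __name__ == "__main__":
    central = CentralRepository()
    shared = b"S" * 4096                          # data present at both offices
    print(replicate([shared, b"A" * 4096], central))  # first office sends both chunks
    print(replicate([shared, b"B" * 4096], central))  # second office sends only its own
```

The same exchange can then be repeated between the central site and the DR site, so that data already held at the DR facility is never sent twice.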
7. Highly available deduplication repository
It is extremely important to create a highly available deduplication repository. Since a very large amount of data has been consolidated in one location, risk tolerance for data loss is very low. Access to the deduplicated data repository is critical and should not be vulnerable to a single point of failure. A robust data deduplication solution will include mirroring to protect against local storage failure as well as replication to protect against disaster. The solution should have failover capability in the event of a node failure. Even if multiple nodes in a cluster fail, the company must be able to continue to recover its data and respond to the business.
8. Efficiency and effectiveness
File-based deduplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level. Consider, for example, changing a single line in a 4-megabyte presentation. In a file-based solution, the entire file must be stored again, doubling the storage required. If the presentation is sent to multiple people, as presentations often are, the negative effects multiply. Most sub-file deduplication processes use some sort of “chunking” method to break up a large amount of data, such as a virtual tape cartridge, into smaller pieces to search for duplicate data. Larger chunks of data can be processed at a faster rate, but less duplication is detected. It is easier to detect more duplication in smaller chunks, but the overhead to scan the data is much higher.
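The presentation example can be reproduced with a few lines of code. The sketch below is a simplified illustration, not a description of how any product works: it assumes fixed-size chunks, SHA-256 fingerprints, synthetic filler data in place of a real presentation, and an edit that overwrites bytes in place without changing the file's length. The helper names are hypothetical.

```python
import hashlib


def stored_bytes(files, chunk_size=None):
    """Return (bytes kept, fingerprints tracked) after deduplication.

    chunk_size=None treats every file as a single unit (file-level);
    otherwise each file is split into fixed-size chunks first.
    """
    seen = set()
    stored = 0
    for data in files:
        size = chunk_size or max(len(data), 1)
        for offset in range(0, len(data), size):
            piece = data[offset:offset + size]
            digest = hashlib.sha256(piece).digest()
            if digest not in seen:          # only unseen pieces consume storage
                seen.add(digest)
                stored += len(piece)
    return stored, len(seen)


def sample_file(size):
    """Deterministic, non-repeating filler bytes standing in for a presentation."""
    return b"".join(hashlib.sha256(i.to_bytes(8, "big")).digest()
                    for i in range(size // 32))


if __name__ == "__main__":
    original = sample_file(4 * 1024 * 1024)
    edited = bytearray(original)
    edited[2_000_000] ^= 0xFF               # one small in-place change
    versions = [original, bytes(edited)]
    for label, size in [("file-level", None), ("64 KiB chunks", 64 * 1024),
                        ("4 KiB chunks", 4 * 1024)]:
        kept, entries = stored_bytes(versions, size)
        print(f"{label:>14}: {kept:>9} bytes stored, {entries} index entries")
```

With these inputs, file-level deduplication keeps both 4 MiB copies, while chunking keeps the original plus one changed chunk. The 4 KiB run stores less than the 64 KiB run but tracks 1,025 fingerprints instead of 65, which is the overhead trade-off described above. Fixed-size chunks are used here only for simplicity; an edit that inserts or removes bytes shifts every later chunk boundary and would defeat this naive scheme.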
If the “chunking” starts at the beginning of a tape (or data stream in other implementations), the deduplication process can be fooled by the metadata created by the backup software, even if the file is unchanged. However, if the solution can segregate the metadata and look for duplication in chunks within actual data files, far more duplication will be detected. Some solutions even adjust chunk size based on information gleaned from the data formats. The combination of these techniques can lead to a 30 to 40 percent increase in the amount of duplicate data detected. This can have a major impact on the cost-effectiveness of the solution.
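The effect of backup metadata on chunk alignment can be shown with a toy stream. The sketch below assumes a made-up stream layout in which each record is a header written by the backup software followed by the file's payload. The header strings, the 4 KiB chunk size and the function names are all hypothetical, and real backup formats are considerably more complex.

```python
import hashlib

CHUNK_SIZE = 4096


def split(data, size=CHUNK_SIZE):
    """Cut data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def new_chunks(data, index):
    """Add data's chunks to the index; return how many were not already there."""
    added = 0
    for chunk in split(data):
        digest = hashlib.sha256(chunk).digest()
        if digest not in index:
            index.add(digest)
            added += 1
    return added


def sample_payload(size):
    """Deterministic, non-repeating filler bytes standing in for file data."""
    return b"".join(hashlib.sha256(i.to_bytes(8, "big")).digest()
                    for i in range(size // 32))


def backup_naive(records, index):
    """Chunk the raw stream from its first byte, headers included."""
    stream = b"".join(header + payload for header, payload in records)
    return new_chunks(stream, index)


def backup_metadata_aware(records, index):
    """Chunk each payload on its own; headers are set aside, not chunked."""
    return sum(new_chunks(payload, index) for _, payload in records)


if __name__ == "__main__":
    payload = sample_payload(256 * 1024)    # the same unchanged file both nights
    monday = [(b"JOB=mon;TS=2009-01-05T01:00;", payload)]
    tuesday = [(b"JOB=tuesday;TS=2009-01-06T01:00;", payload)]

    naive, aware = set(), set()
    backup_naive(monday, naive)
    backup_metadata_aware(monday, aware)
    print("naive, new chunks on Tuesday:", backup_naive(tuesday, naive))
    print("metadata-aware, new chunks on Tuesday:", backup_metadata_aware(tuesday, aware))
```

Because Tuesday's header is a few bytes longer than Monday's, every chunk boundary in the raw stream shifts and the naive approach sees the unchanged file as entirely new. Chunking the payload separately finds nothing new to store on the second night.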
Focus on the total solution
In today’s environment, as stored data volumes continually increase while IT spending decreases, data deduplication is fast becoming a vital technology. Data deduplication is the only way to dramatically reduce data volumes, slash storage requirements and minimize data protection costs and risks.
Although the benefits of data deduplication are dramatic, organizations should not be seduced by the hype sometimes attributed to the technology. No matter the approach, the amount of data deduplication that can occur is driven by the nature of the data and the policies used to protect it.
In order to achieve the maximum benefit of deduplication, organizations should choose data deduplication solutions based on a comprehensive set of quantitative and qualitative factors rather than relying solely on statistics such as theoretical data reduction ratios.
About the Author: Fadi Albatal is the director of marketing at FalconStor Software.
