Organizations create, modify, and maintain huge amounts of data to help their businesses earn money and generally function. Typically, every network user within an organization has a bit of disk space to store in-process items and personal files.
In addition to these user-centric storage areas, departments and projects have disk space on the network reserved just for them. This allows files within a project or department to be accessed by all the employees working on that project or in that department.
Suppose my company has a folder on the network for the Accounting department. One of the employees there creates a workbook in Excel for use by everyone. Time goes by and the new workbook is being used by everyone, but the Accounting manager keeps getting emails from IT saying that there are multiple copies of the file on the network and that he should work with his team to determine which is the most current and accurate.
Another common source of file duplication is the old-fashioned telephone directory: you know, the one that lists the names, extensions, and cell numbers of all the employees company-wide. The document itself isn't too bad, but every time it is changed, the owner of the file emails it to everyone so they can save a copy in their own folder for later reference.
These are two examples of file clutter that can be found in what I would guess is the majority of corporate computer networks in existence today.
So there is clutter and duplication on file servers - what can be done about it?
Microsoft has a product called Windows Storage Server which is currently only available via OEM. In the Windows 2003 R2 release of the product, Single Instance Storage was moved from Remote Installation Services (RIS) to Storage Server. A component of the SIS implementation for Storage Server is called the Groveler, which allows a schedule to be created for duplicates to be found and cleaned up.
The Groveler works by scouring the WSS machine to look for identical copies of files. When multiple copies of a file are found it creates a link back to the stored file in the place where the copy was located. Then it removes the copy. Using links to the "one file" cleans up the significant disk space used by multiple copies while still allowing the known path to be used by the individual who saved the duplicate file.
When Single Instance Storage is enabled, which is done per volume, a common store (called the Single Instance Store) is created to hold the single stored copy of each duplicated file. When the Groveler service finds duplicates, a copy of the file is placed in this common folder and given a .sis extension. Links, or file reparse points, are then created, which allow access to the stored files as if they lived everywhere the users put them.
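The idea of keeping one stored copy and leaving links behind can be sketched in a few lines of Python. This is only an illustration, not how SIS is implemented: it uses ordinary hard links instead of reparse points, the function name single_instance and the store folder are hypothetical, and it assumes the caller has already confirmed that every path holds identical content on the same volume as the store.

```python
import os
import shutil
import uuid


def single_instance(paths, store_dir):
    """Keep one copy in store_dir and leave a hard link at every original path.

    Assumption: all paths are known duplicates and sit on the same volume
    as store_dir. Hard links stand in for the reparse points SIS uses.
    """
    os.makedirs(store_dir, exist_ok=True)
    # Hypothetical naming: a random name with a .sis extension.
    stored = os.path.join(store_dir, uuid.uuid4().hex + ".sis")

    # Move the first copy into the store, then link it back into place.
    shutil.move(paths[0], stored)
    os.link(stored, paths[0])

    # Replace every remaining duplicate with a link to the stored copy.
    for dup in paths[1:]:
        os.remove(dup)
        os.link(stored, dup)
    return stored
```

Each original path still opens the file as before, but the data exists only once on disk. One difference to keep in mind: with hard links, an edit made through any one path shows up at all of them, so this sketch is only suitable for thinking about the space savings.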
As an example, suppose I have saved a copy of the telephone directory in my user folder and made no changes to the file. SIS and the Groveler will notice that the file named D:\corporate\phonelist.xls and the copy in my user folder are the same file, using twice the disk space needed.
Duplicates are determined by examining the file itself and its byte count, not just its name. This allows the Groveler service to determine whether a file is a true duplicate. It rules out false matches such as someone in Accounting using the file name phone list.xls for a list of their children's emergency contacts while someone in Information Technology uses phone list.xls for the on-call rotation. The two files are similar only in name and in the fact that both contain phone numbers, and neither would mean much outside its own department.
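To make the idea concrete, here is a minimal Python sketch of finding true duplicates by size and content instead of by name. It is not the Groveler's actual algorithm; the function names are hypothetical, and the use of a SHA-256 digest to compare content is my own assumption for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups of paths under root with the same size and content."""
    # First pass: group by byte count, which is cheap to check.
    by_size = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: within each size group, compare content digests.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Two files both named phone list.xls with different contents end up in different groups, while an unchanged copy of the corporate directory is matched to the original no matter where it was saved.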
But if a link is stored, won't that use disk space too?
Reparse points do use disk space, but not nearly as much as a complete copy of a file. If you open the properties for a document and note its size, for example 200 KB, and then open the properties for a link left by SIS and look at the Size on Disk value, the link should be much smaller than the original to which it points. I believe the concept is similar to creating a shortcut to a file on the desktop.
The shortcut (or link) serves as a placeholder for the actual document. When a user clicks the shortcut, the stored instance of the file opens and the user works with it just as they intended when they created the duplicate copy.
Keeping one copy of a file is much more efficient for your file servers than keeping two copies of the same file, much less 200 copies of that file.
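A quick back-of-the-envelope calculation shows the scale of the difference. The 4 KB figure for each link is purely an assumption for illustration, not a measured SIS value, and the function name is hypothetical.

```python
def space_used_kb(copies, file_kb, link_kb=None):
    """Disk space in KB for `copies` user-visible paths to one file.

    With link_kb=None every path is a full copy. Otherwise one stored
    copy is kept and every path is a link of link_kb (assumed size).
    """
    if link_kb is None:
        return copies * file_kb
    return file_kb + copies * link_kb


full_copies = space_used_kb(200, 200)                # 200 full copies of a 200 KB file
single_store = space_used_kb(200, 200, link_kb=4)    # 1 stored copy plus 200 assumed 4 KB links
```

Under those assumptions, 200 full copies take 40,000 KB, while one stored copy plus 200 links takes 1,000 KB.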
Configuring SIS in Windows Storage Server 2003 R2
Windows Storage Server on Windows 2003 R2 installs the components for Single Instance Storage by default. If you need to check that the components are installed (or add them), they are located in the Other Network File and Print Services section of Windows Components in Add/Remove Programs.
To determine if SIS is enabled on a volume, open the property sheet for the volume and select the Advanced tab. If enabled, the SIS checkbox will be selected. As noted on the Advanced tab, the SIS service will need to be enabled and started to turn on Single Instance Storage.
You can also check from the command line whether SIS is enabled on the storage server.
If you are more comfortable using the command line than the GUI, or prefer that method of administration, SIS can be managed using the sisadmin command and its switches, which are displayed by entering sisadmin /? at a command prompt.
After getting things turned on, duplicate a few files and run the Groveler to check for duplicates. After it runs you should be able to view the properties of the original and duplicate files to see that copies have been exchanged for links and that the links do indeed lead back to the single instance of the file.
Hopefully Windows Storage Server and Single Instance Storage will help your organization maximize its use of resources. For more information, visit the Windows Storage Server product page.