Rdfind.php
rdfind.php is the ProgClub redundant data processing software: it replaces duplicate files with hard links to save disk space. It's a reimplementation of rdfind with support for a maximum number of hard links per file. For other projects see projects.
Status
Version 0.1 released.
Motivation
Why make this software? Good question! I was using the original rdfind program and lost a bunch of files: I exceeded the maximum number of hard links per file and rdfind choked on that. Also, my program creates hard links per UID/GID/mode, so you don't lose permissions and ownership information when new links are created. Finally, the original rdfind program makes multiple passes to check first bytes, last bytes, etc., and I don't bother with that. I just hash each file and process it once, so my program is O(n) in the number of files.
And in case that wasn't enough, I also create a restore script which can recover the last file being worked on in the case of a software interrupt (e.g. Ctrl+C) or program crash.
Administration
Contributors
Members who have contributed to this project. Newest on top.
All contributors have agreed to the terms of the Contributor License Agreement. This excludes any upstream contributors who tend to have different administrative frameworks.
Copyright
Copyright 2014, Contributors.
License
Licensed under the GPL.
Resources
Downloads
There are no downloads for this software; get your copy from Subversion.
Source code
The repository can be browsed online:
https://www.progclub.org/pcrepo/rdfind.php/branches/0.1
The latest stable released version of the code is available from:
https://www.progclub.org/svn/pcrepo/rdfind.php/tags/latest/0.1
Or if you want the latest version for development purposes:
https://www.progclub.org/svn/pcrepo/rdfind.php/branches/0.1
Links
- See rdfind for the software that inspired our project.
Specifications
Functional specification
The functional specification describes what the project does.
This software processes a number of input directories and looks for descendant files that are duplicates of each other. The software replaces duplicate files with hard links, reclaiming disk space.
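The idea can be sketched in a few lines of PHP. This is only an illustration, not the code in bin/rdfind.inc.php: the function name find_duplicate_groups is hypothetical, and the sketch groups files by content hash alone, without replacing anything.

```php
<?php
// Sketch only: find_duplicate_groups() is a hypothetical helper, not rdfind.php's API.
// It walks each input directory and groups regular files by content hash.
function find_duplicate_groups(array $dirs, string $algo = 'sha256'): array {
    $groups = [];
    foreach ($dirs as $dir) {
        $it = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
        );
        foreach ($it as $info) {
            // Only regular files are candidates; symlinks are skipped.
            if (!$info->isFile() || $info->isLink()) {
                continue;
            }
            $path = $info->getPathname();
            $hash = hash_file($algo, $path);
            if ($hash === false) {
                continue; // unreadable file, leave it alone
            }
            $groups[$hash][] = $path;
        }
    }
    // Keep only the hashes shared by more than one file.
    return array_values(array_filter($groups, function ($paths) {
        return count($paths) > 1;
    }));
}
```

Each returned group is a list of paths with identical content, which is the set a tool like this would collapse into hard links.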
The software determines that files are duplicates by way of a hashing algorithm. A number of algorithms are available with a minimum bit-length of 128 bits (16 bytes). The default algorithm is sha256 which should be relatively safe. If you use a weaker hashing algorithm be sure your inputs are safe.
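To see which algorithms your PHP build offers at or above that bit-length, something like the following works. It is a sketch using PHP's standard hash_algos() and hash() functions; the helper name algos_with_min_bits is made up for this example, and the list rdfind.php actually accepts may differ.

```php
<?php
// Hypothetical helper: list hash algorithms whose digest is at least $min_bits long.
function algos_with_min_bits(int $min_bits = 128): array {
    $result = [];
    foreach (hash_algos() as $algo) {
        // hash() returns lowercase hex; each hex digit encodes 4 bits.
        $bits = strlen(hash($algo, '')) * 4;
        if ($bits >= $min_bits) {
            $result[$algo] = $bits;
        }
    }
    return $result;
}

// Print the candidates, e.g. "sha256 (256 bits)".
foreach (algos_with_min_bits() as $algo => $bits) {
    echo "$algo ($bits bits)\n";
}
```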
Safety first
A sequence of file system operations isn't atomic. That means it can fail partway through. For example, if you want to replace a duplicate file with a hard link you have to remove the duplicate first. That's two operations, one to remove the file and then another to create a hard link in its place.
If the program is interrupted or crashes between operations the file system is left in an inconsistent state. If you don't address that you could lose a file because you terminated the program during operation.
So before my program does a bunch of file system operations it writes out a shell script which can accomplish the same actions. If the program is interrupted during operation it will leave the shell script there so you can run it to complete the last operation and restore your data.
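Here is a minimal sketch of that write-the-script-first pattern. The function name replace_with_link, the script path argument and the exact shell commands are assumptions made for illustration; the real restore script written by rdfind.php may look different.

```php
<?php
// Sketch only: replace_with_link() and the script layout are assumptions,
// not the actual rdfind.php implementation.
function replace_with_link(string $keep, string $dup, string $script): void {
    // 1. Record the pending operations as shell commands before touching any file.
    $commands = "#!/bin/sh\n"
        . 'rm -f ' . escapeshellarg($dup) . "\n"
        . 'ln ' . escapeshellarg($keep) . ' ' . escapeshellarg($dup) . "\n";
    if (file_put_contents($script, $commands) === false) {
        throw new RuntimeException("cannot write restore script: $script");
    }
    // 2. Perform the same two operations. If we die between them,
    //    the script is still on disk and can finish the job.
    if (!unlink($dup) || !link($keep, $dup)) {
        throw new RuntimeException("operation failed; run $script to finish it");
    }
    // 3. Both steps succeeded, so the script is no longer needed.
    unlink($script);
}
```

If the process is killed after the unlink but before the link, running the leftover script with sh recreates the missing name as a hard link to the kept file.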
Technical specification
The technical specification describes how the project works.
The PHP software is split into two parts: a library (bin/rdfind.inc.php) and an executable (bin/rdfind.php). The executable just calls the library passing in command-line arguments. This separation allows you to include the library and call the rdfind_php function from your own scripts.
The software enumerates files below the input directories and looks for duplicates with the same UID, GID and mode. When duplicates are discovered, hard links are made. If a file reaches the maximum number of hard links, it is replaced and matching starts over again at 1 hard link for the following files.
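The two rules above (match on UID, GID, mode and content; never exceed the link limit) can be sketched as follows. Both helper names are hypothetical, and plan_links assumes every link target starts out with a link count of 1 and that $max_links is at least 1. The library may track this differently.

```php
<?php
// Sketch only: duplicate_key() and plan_links() are hypothetical helpers.

// Two files count as duplicates only if owner, group, permission bits
// and content hash all agree.
function duplicate_key(string $path, string $algo = 'sha256'): string {
    $st = stat($path);
    return implode(':', [
        $st['uid'],
        $st['gid'],
        decoct($st['mode'] & 07777), // permission bits only
        hash_file($algo, $path),
    ]);
}

// Split a list of duplicate paths so that no link target ends up with more
// than $max_links names. The first path of each chunk becomes the target;
// once a target is full, the next path starts over as a fresh target.
function plan_links(array $paths, int $max_links): array {
    $plan = [];
    foreach (array_chunk($paths, $max_links) as $chunk) {
        $target = array_shift($chunk);
        $plan[$target] = $chunk; // paths to be replaced by links to $target
    }
    return $plan;
}
```

For instance, with five duplicates and a limit of 2, plan_links yields three targets: two with one extra link each and one left on its own.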
Notes
There are more notes in the README file.
Notes for implementers
If you are interested in incorporating this software into your project, here's what you need to know:
Include the bin/rdfind.inc.php file, e.g.:
require_once '/path/to/rdfind-php/bin/rdfind.inc.php';
Then call the rdfind_php function with your parameters.
Notes for developers
If you're looking to set up a development environment for this project, here's what you need to know:
Check out the latest development branches with:
svn co https://www.progclub.org/svn/pcrepo/rdfind.php/branches/ rdfind-php
Then look in your rdfind-php directory for the major.minor version you're interested in (at the time of writing, only v0.1). The bulk of the code is in the library file bin/rdfind.inc.php.
Tasks
TODO
Things to do, in rough order of priority:
N/A -- can't think of anything more to add at this point! (ideas welcome).
Done
Stuff that's done. Latest stuff on top.