August 24 2010
Dedupe Files with 50 Lines of Ruby

Somehow, my iPhoto library contains duplicates. Lots of duplicates. I tried Brattoo Propaganda’s Duplicate Annihilator, and while it took care of many, many photos, some photos and quite a few movies are still duplicated. Ruby to the rescue!
I wrote a 50-line Ruby script to list out the duplicate files. The script uses SQLite3 and SHA-1 digests to identify multiple copies of each file in a directory. The end result was 2K duplicate files and 20 GB of freed disk space. Woot.
Here’s the script:
#!/usr/bin/ruby
# deduplicate_files.rb
require 'rubygems'
require 'sqlite3' # Look at 'http://github.com/luislavena/sqlite3-ruby' then do 'sudo gem install sqlite3-ruby'
require 'digest/sha1'
require 'pathname'
# Pass in the directory or assume the current one.
arg = ARGV[0] || "."
root_path = Pathname.new(arg).realpath.to_s
puts "Examining #{root_path}"
# Create a SQLite3 database in the current directory.
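db = SQLite3::Database.new("deduplicate_files.db")
# Start from an empty table so the script can be re-run without errors or stale rows.
db.execute("drop table if exists files")
db.execute("create table files(digest varchar(40),path varchar(1024))")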
# Recursively generate hash digests of all files.
Dir.chdir("#{root_path}")
current_file = 0
Dir['**/*'].each do |file| # '**/*' also matches files without an extension
  path = "#{root_path}/#{file}"
  # Ignore non-existent files (broken symbolic links) and directories.
  next if !File.exist?(path) || File.directory?(path)
  # Create a hash digest for the current file.
  digest = Digest::SHA1.new
  File.open(file, 'rb') do |handle| # Binary mode: photos and movies are not text.
    while buffer = handle.read(1024)
      digest << buffer
    end
  end
  # Store the hash digest and full path in the database.
  # Bind parameters keep paths containing quote characters from breaking the SQL.
  db.execute("insert into files values(?, ?)", [digest.hexdigest, path])
  # Print out every Nth file.
  puts "[#{digest}] #{path} (#{current_file})" if current_file % 100 == 0
  current_file = current_file + 1
end
# Loop through digests.
db.execute("select digest,count(1) as count from files group by digest order by count desc").each do |row|
  if row[1] > 1 # Skip unique files.
    puts "Duplicates found:"
    digest = row[0]
    # List the duplicate files.
    db.execute("select digest,path from files where digest=?", [digest]).each do |dup_row|
      puts "[#{digest}] #{dup_row[1]}"
    end
  end
end
Update: I’ve committed a slightly modified version of this Ruby script to my GitHub repo.
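For comparison, here is a minimal sketch of the same digest-grouping idea without the SQLite3 database, keeping everything in an in-memory Hash. It is not the version in the GitHub repo. The method name find_duplicates is a hypothetical helper introduced only for illustration, and unlike the script above it skips symbolic links entirely, which is an assumption about what you would want.

```ruby
require 'digest/sha1'
require 'find'

# Hypothetical helper (not part of the original script): returns a Hash that
# maps each SHA-1 hex digest to the paths sharing it, keeping only groups
# that contain two or more files.
def find_duplicates(root)
  by_digest = Hash.new { |hash, key| hash[key] = [] }
  Find.find(root) do |path|
    # Assumption: skip directories, symbolic links, and other non-regular files.
    next unless File.file?(path) && !File.symlink?(path)
    # Digest::SHA1.file reads the file in chunks, so large movies are not
    # loaded into memory all at once.
    by_digest[Digest::SHA1.file(path).hexdigest] << path
  end
  by_digest.select { |_digest, paths| paths.size > 1 }
end
```

Calling find_duplicates(ARGV[0] || ".") and printing each digest with its paths gives roughly the same listing as the script. The trade-off is that nothing is persisted, so an interrupted run over a large library has to start again from scratch, whereas the database version leaves its results on disk.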



