August 24 2010
Dedupe Files with 50 Lines of Ruby
by Brian Dunagan

Somehow, my iPhoto library contains duplicates. Lots of duplicates. I tried Brattoo Propaganda’s Duplicate Annihilator, but while it took care of many, many photos, there are some photos and quite a few movies duplicated. Ruby to the rescue!

I wrote a 50-line Ruby script to list out the duplicate files. The script uses SQLite3 and SHA-1 digests to identify multiple copies of each file in a directory. The end result was 2K duplicate files and 20 GB of freed disk space. Woot.

Here’s the script:

#!/usr/bin/ruby

# deduplicate_files.rb

require 'rubygems'
require 'sqlite3' # Look at 'http://github.com/luislavena/sqlite3-ruby' then do 'sudo gem install sqlite3-ruby'
require 'digest/sha1'
require 'pathname'

# Pass in the directory or assume the current one.
arg = ARGV[0] || "."
root_path = Pathname.new(arg).realpath.to_s
puts "Examining #{root_path}"

# Create a SQLite3 database in the current directory.
db = SQLite3::Database.new("deduplicate_files.db")
db.execute("create table files(digest varchar(40),path varchar(1024))")

# Recursively generate hash digests of all files.
Dir.chdir("#{root_path}")
current_file = 0
Dir['**/*.*'].each do |file|
  path = "#{root_path}/#{file}"
  # Ignore non-existent files (symbolic links) and directories.
  next if !File.exists?("#{path}") || File.directory?("#{path}")
  # Create a hash digest for the current file.
  digest = Digest::SHA1.new
  File.open(file, 'r') do |handle|
    while buffer = handle.read(1024)
      digest << buffer
    end
  end
  # Store the hash digest and full path in the database.
  db.execute("insert into files values(\"#{digest}\",\"#{path}\")")
  # Print out every Nth file.
  puts "[#{digest}] #{path} (#{current_file})" if current_file % 100 == 0
  current_file = current_file + 1
end

# Loop through digests.
db.execute("select digest,count(1) as count from files group by digest order by count desc").each do |row|
  if row[1] > 1 # Skip unique files.
    puts "Duplicates found:"
    digest = row[0]
    # List the duplicate files.
    db.execute("select digest,path from files where digest='#{digest}'").each do |dup_row|
      puts "[#{digest}] #{dup_row[1]}"
    end
  end
end

Update: I’ve committed a slightly modified version of this ruby script to my GitHub repo.

August 24 2010 Dedupe Files with 50 Lines of Ruby by Brian Dunagan

August 24 2010
Dedupe Files with 50 Lines of Ruby
by Brian Dunagan