fill the void - bdunagan

24 Aug 2010
Dedupe Files with 50 Lines of Ruby

Somehow, my iPhoto library contains duplicates. Lots of duplicates. I tried Brattoo Propaganda's Duplicate Annihilator, and while it took care of many, many photos, it still left some photos and quite a few movies duplicated. Ruby to the rescue!

I wrote a 50-line Ruby script to list the duplicate files. The script uses SQLite3 and SHA-1 digests to identify multiple copies of each file in a directory tree. The end result was 2K duplicate files and 20 GB of freed disk space. Woot.
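As a quick illustration of the fingerprinting idea (my own aside, not part of the script): byte-identical content always produces an identical SHA-1 digest, whether hashed in one shot or streamed in chunks, which is what makes the digest usable as a duplicate-detection key.

```ruby
require 'digest/sha1'

# Hash the same bytes two ways: all at once, and streamed in pieces
# (the script streams each file in chunks so large movies fit in memory).
whole = Digest::SHA1.hexdigest("the same bytes")

streamed = Digest::SHA1.new
streamed << "the same " << "bytes"

# Both approaches yield the same 40-character hex fingerprint.
puts whole == streamed.hexdigest
```

Two files with the same fingerprint are duplicates (barring an astronomically unlikely SHA-1 collision), no matter what they're named or where they live.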

Here’s the script:


# deduplicate_files.rb

require 'rubygems'
require 'sqlite3' # 'sudo gem install sqlite3-ruby' if it's missing
require 'digest/sha1'
require 'pathname'

# Pass in the directory or assume the current one.
arg = ARGV[0] || "."
root_path = Pathname.new(arg).realpath.to_s
puts "Examining #{root_path}"

# Create a SQLite3 database in the current directory.
db = SQLite3::Database.new("deduplicate_files.db")
db.execute("create table files(digest varchar(40), path varchar(1024))")

# Recursively generate hash digests of all files.
current_file = 0
Dir['**/*.*'].each do |file|
  path = "#{root_path}/#{file}"
  # Ignore non-existent files (broken symbolic links) and directories.
  next if !File.exist?(path) || File.directory?(path)
  # Create a hash digest for the current file, streaming it in 1K chunks.
  digest = Digest::SHA1.new
  File.open(path, 'r') do |handle|
    while buffer = handle.read(1024)
      digest << buffer
    end
  end
  # Store the hash digest and full path in the database.
  db.execute("insert into files values(?, ?)", digest.hexdigest, path)
  # Print out every Nth file.
  puts "[#{digest.hexdigest}] #{path} (#{current_file})" if current_file % 100 == 0
  current_file += 1
end

# Loop through digests, most duplicated first.
db.execute("select digest, count(1) as count from files group by digest order by count desc").each do |row|
  next if row[1] <= 1 # Skip unique files.
  digest = row[0]
  puts "Duplicates found:"
  # List the duplicate files.
  db.execute("select digest, path from files where digest=?", digest).each do |dup_row|
    puts "[#{digest}] #{dup_row[1]}"
  end
end
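The script only reports duplicates; reclaiming the disk space still means deleting the extra copies. Here's a hedged sketch of what that cleanup step might look like (my own addition, not part of the script above): group paths by digest in memory and delete everything after the first copy of each group.

```ruby
require 'digest/sha1'

# Keep the first copy of each byte-identical group, delete the rest.
# Returns the list of deleted paths.
def delete_duplicates(paths)
  deleted = []
  paths.select { |p| File.file?(p) }
       .group_by { |p| Digest::SHA1.file(p).hexdigest }
       .each_value do |copies|
    copies.drop(1).each do |extra|
      File.delete(extra)
      deleted << extra
    end
  end
  deleted
end
```

You'd call it on the same glob the script scans, e.g. `delete_duplicates(Dir['**/*.*'])`. Which copy survives is arbitrary (whichever the glob lists first), so look over the listing before letting anything delete your photos.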

Update: I’ve committed a slightly modified version of this Ruby script to my GitHub repo.
