Find and Delete Duplicates in SQL

We all know we shouldn’t have duplicates in the database. And despite my best efforts, somehow they sneak in from some legacy code, or from the hyper-active-compulsory-submit-the-form-fifty-times-in-five-seconds-bloke.

So now and then I find myself writing the same SQL to track them down. Most solutions online say that you have to use a temporary table or rename tables to get rid of them. This solution works without either, providing my way of eliminating the duplicates is sufficient for your needs.

Here is my code to find duplicates:

SELECT *
FROM users u JOIN (
   SELECT username, min(userid) AS minid FROM users
   GROUP BY username
   HAVING ( COUNT(username) > 1 )
) AS dupusers
ON u.username = dupusers.username
ORDER BY userid DESC

Let’s first look at the nested Select:

SELECT username, min(userid) AS minid FROM users
GROUP BY username
HAVING ( COUNT(username) > 1 )

Because I identify a record by the same username, I group on that. Once grouped on username, I can use Having to check the count for that username. If the count is > 1, then it is a duplicate of another record. I am taking the viewpoint that the first user inserted is the one we are going to keep. All users with the same username after that one is hence a duplicate. I am returning the minimum userid for each duplicate record so I know which one I will not delete later.

This query by itself would find you all the duplicates, but isn’t as handy because you can’t view all the columns of each record.

Now if we look at the full join, I join the two tables together: The nested table is the set of all users that have a duplicate in the database inner joined with the set of all users. This will give us the set of all users that have duplicates.

Ok, so how do we go about deleting the sibling duplicates of a base record and not every record that has a duplicate? In other words, if I have:

ID: 1
username: Scott
MinID: 1

ID: 2
username: Scott
MinID: 1

ID: 3
username: Scott
MinID: 1

I only want to delete records with ID’s of 2 and 3; not the ID of 1. Let’s look at my Delete syntax:

DELETE users
FROM users u JOIN (
    SELECT username, min(userid) AS minid FROM users
    GROUP BY username
    HAVING ( COUNT(username) > 1 )
) AS dupusers
ON u.username = dupusers.username
WHERE u.userid <> minid

The code is the exact same with two exceptions. One, I switched from the Select statement to a Delete. The second is the filter :

WHERE u.userid <> minid

This will delete every userid that isn’t the smallest userid for that group of duplicates. Using my example set above, ID’s 2 and 3 would not equal the minid of 1, and therefore would be deleted, leaving one non-duplicated record.

This code assumes that a row is a duplicate of another if their username is the same. In most cases, a duplicate is only a duplicate if a combination of fields are the same. This can be accommodated by changing the nested Select statement:

SELECT *
FROM users u JOIN (
    SELECT firstname, lastname, min(userid) AS minid FROM users
    GROUP BY firstname, lastname
    HAVING ( COUNT(*) > 1 )
) AS dupusers
ON u.firstname = dupusers.firstname AND u.lastname = dupusers.lastname
ORDER BY u.lastname, u.firstname

This is a lame example since there can be many people with the same name, but you get the picture. In reality, I have had to use this technique when looking for duplicate addresses. For that I grouped on address, city and zip.

So there you go…a way to nuke duplicates without using temp tables.

March 24, 2006

Scott

Programming, SQL

7 responses to “Find and Delete Duplicates in SQL”

LON says:

April 2, 2010 at 12:09 pm

You convinced me by just reading at your code! Nicew work!
Umar says:

May 12, 2010 at 3:53 am

Very clean. Nice Job!
Peter says:

October 10, 2010 at 8:34 am

Thanks for the well written work.
I also found a design concept to not process burden dups.
http://dev.mysql.com/doc/refman/5.0/en/insert.html
Joshua Curtiss says:

November 29, 2010 at 2:52 pm

ROCKIN’. Thank you very much for this. I had a “group members” type table that was out of control with up to 7 dupes for most groups. With some tweaks, I can use your example to nuke the tens of thousands of extra records. Thanks again!!!
Valerie says:

May 5, 2011 at 1:52 pm

You’re my hero! Your example was easy to understand and follow, and I desperately need to clean some tables.
nck says:

August 2, 2012 at 1:33 pm

Excellent! Thank you very much
Drew_W says:

October 18, 2012 at 2:02 pm

Scott – your find-the-dups query saved my life today. Thanks a million times over.

Scott Elkin

Find and Delete Duplicates in SQL

7 responses to “Find and Delete Duplicates in SQL”

Leave a Reply