Find and Delete Duplicates in SQL


We all know we shouldn’t have duplicates in the database. And despite my best efforts, somehow they sneak in from some legacy code, or from the hyper-active-compulsory-submit-the-form-fifty-times-in-five-seconds-bloke.

So now and then I find myself writing the same SQL to track them down. Most solutions online say that you have to use a temporary table or rename tables to get rid of them. This solution works without either, providing my way of eliminating the duplicates is sufficient for your needs.

Here is my code to find duplicates:

SELECT *
FROM users u JOIN (
   SELECT username, min(userid) AS minid FROM users
   GROUP BY username
   HAVING ( COUNT(username) > 1 )
) AS dupusers
ON u.username = dupusers.username
ORDER BY userid DESC

Let’s first look at the nested Select:

SELECT username, min(userid) AS minid FROM users
GROUP BY username
HAVING ( COUNT(username) > 1 )

Because I identify a record by the same username, I group on that. Once grouped on username, I can use Having to check the count for that username. If the count is > 1, then it is a duplicate of another record. I am taking the viewpoint that the first user inserted is the one we are going to keep. All users with the same username after that one is hence a duplicate. I am returning the minimum userid for each duplicate record so I know which one I will not delete later.

This query by itself would find you all the duplicates, but isn’t as handy because you can’t view all the columns of each record.

Now if we look at the full join, I join the two tables together: The nested table is the set of all users that have a duplicate in the database inner joined with the set of all users. This will give us the set of all users that have duplicates.

Ok, so how do we go about deleting the sibling duplicates of a base record and not every record that has a duplicate? In other words, if I have:

ID: 1
username: Scott
MinID: 1

ID: 2
username: Scott
MinID: 1

ID: 3
username: Scott
MinID: 1

I only want to delete records with ID’s of 2 and 3; not the ID of 1. Let’s look at my Delete syntax:

DELETE users
FROM users u JOIN (
    SELECT username, min(userid) AS minid FROM users
    GROUP BY username
    HAVING ( COUNT(username) > 1 )
) AS dupusers
ON u.username = dupusers.username
WHERE u.userid <> minid

The code is the exact same with two exceptions. One, I switched from the Select statement to a Delete. The second is the filter :

WHERE u.userid <> minid

This will delete every userid that isn’t the smallest userid for that group of duplicates. Using my example set above, ID’s 2 and 3 would not equal the minid of 1, and therefore would be deleted, leaving one non-duplicated record.

This code assumes that a row is a duplicate of another if their username is the same. In most cases, a duplicate is only a duplicate if a combination of fields are the same. This can be accommodated by changing the nested Select statement:

SELECT *
FROM users u JOIN (
    SELECT firstname, lastname, min(userid) AS minid FROM users
    GROUP BY firstname, lastname
    HAVING ( COUNT(*) > 1 )
) AS dupusers
ON u.firstname = dupusers.firstname AND u.lastname = dupusers.lastname
ORDER BY u.lastname, u.firstname

This is a lame example since there can be many people with the same name, but you get the picture. In reality, I have had to use this technique when looking for duplicate addresses. For that I grouped on address, city and zip.

So there you go…a way to nuke duplicates without using temp tables.

,

7 responses to “Find and Delete Duplicates in SQL”

  1. ROCKIN’. Thank you very much for this. I had a “group members” type table that was out of control with up to 7 dupes for most groups. With some tweaks, I can use your example to nuke the tens of thousands of extra records. Thanks again!!!

  2. You’re my hero! Your example was easy to understand and follow, and I desperately need to clean some tables.

Leave a Reply

Your email address will not be published. Required fields are marked *