Friday, 13 July 2018

How To Remove All Duplicate Rows Except One In SQL?

In this article we look at ways to remove all duplicate rows except one in an SQL database. For all examples in this article, we'll be using MySQL and the following table called user with names of users:
+----+--------+
| id |  name  |
+----+--------+
|  1 |  john  |
|  2 |  david |
|  3 |  john  |
|  4 |  john  |
|  5 |  wayne |
|  6 |  david |
+----+--------+
CREATE TABLE user (
    id INT NOT NULL AUTO_INCREMENT,
    name VARCHAR(25) NOT NULL,
    PRIMARY KEY (id)
);

INSERT INTO user(name) VALUES ('john'), ('david'), ('john'), ('john'), ('wayne'), ('david');

Create A New Table With Unique Values Copied From Original Table

You can simply create a new table with unique values copied over from the original table like so:
CREATE TABLE user2 AS (
    SELECT id, name 
    FROM user
    GROUP BY name
)
Result:
+----+--------+
| id |  name  |
+----+--------+
|  1 |  john  |
|  2 |  david |
|  5 |  wayne |
+----+--------+
You will have to add original table's constraints and indexes on the new table as they aren't copied over.

Use Temporary Table To Fill Original Table With Unique Rows

You can use a temporary table to copy over unique values from the original table, and add them back to the original table. This can be useful when a primary key does not exist.
# Step 1: Copy distinct values to temporary table
CREATE TEMPORARY TABLE tmp_user (
    SELECT id, name 
    FROM user
    GROUP BY name
);

# Step 2: Remove all rows from original table
DELETE FROM user;

# Step 3: Remove all rows from original table
INSERT INTO user (SELECT * FROM tmp_user);

# Step 4: Remove temporary table
DROP TABLE tmp_user;
Be careful when using this method because if there are foreign key constraints referencing the original table, rows deleted in step 2 may delete referenced rows in child tables as well to maintain referential integrity.
Result:
+----+--------+
| id |  name  |
+----+--------+
|  1 |  john  |
|  2 |  david |
|  5 |  wayne |
+----+--------+

Add Unique Constraint And Copy Unique Rows To Original Table

To make sure there are no duplicates added in future, we can extend the previous set of queries to add a UNIQUE constraint to the column.
# Step 1: Copy distinct values to temporary table
CREATE TEMPORARY TABLE tmp_user (
    SELECT id, name 
    FROM user
    GROUP BY name
);

# Step 2: Remove all rows from original table
DELETE FROM user;

# Step 3: Add Unique constraint
ALTER TABLE user ADD UNIQUE(name);

# Step 4: Remove all rows from original table
INSERT IGNORE INTO user (SELECT * FROM tmp_user);

# Step 5: Remove temporary table
DROP TABLE tmp_user;
Be careful when using this method because if there are foreign key constraints referencing the original table, rows deleted in step 2 may delete referenced rows in child tables as well to maintain referential integrity.
Result:
+----+--------+
| id |  name  |
+----+--------+
|  1 |  john  |
|  2 |  david |
|  5 |  wayne |
+----+--------+

Remove Duplicates And Keep Row With Lowest ID

Using the methods listed below, row with the lowest id is kept, all other duplicates are removed.
Expected Result:
+----+--------+
| id |  name  |
+----+--------+
|  1 |  john  |
|  2 |  david |
|  5 |  wayne |
+----+--------+

Method #1:

DELETE u1 FROM user u1, user u2 
WHERE u1.id > u2.id AND u1.name = u2.name

Method #2:

DELETE u1 FROM user u1 
INNER JOIN user u2 ON u1.id > u2.id AND u1.name = u2.name;

Method #3:

DELETE FROM user
WHERE id NOT IN (
    SELECT * 
    FROM (
        SELECT MIN(id)
        FROM user
        GROUP BY name
    ) temp
)

Method 4:

# Step 1: Copy all unique rows with lowest id
CREATE TEMPORARY TABLE tmp_user (
    SELECT MIN(id) id 
    FROM user 
    GROUP BY name
);

# Step 2: Delete all rows in original table that are not in the temporary table
DELETE FROM user WHERE id NOT IN (SELECT id FROM tmp_user);

# Step 3: Remove temporary table
DROP TABLE tmp_user;

Remove Duplicates And Keep Row With Highest ID

Using the methods listed below, row with the highest id is kept, all other duplicates are removed.
Expected Result:
+----+--------+
| id |  name  |
+----+--------+
|  4 |  john  |
|  5 |  wayne |
|  6 |  david |
+----+--------+

Method #1:

DELETE u1 FROM user u1, user u2 
WHERE u1.id < u2.id AND u1.name = u2.name

Method #2:

DELETE u1 FROM user u1 
INNER JOIN user u2 ON u1.id < u2.id AND u1.name = u2.name;

Method #3:

DELETE FROM user
WHERE id NOT IN (
    SELECT * 
    FROM (
        SELECT MAX(id)
        FROM user
        GROUP BY name
    ) temp
)

Method 4:

# Step 1: Copy all unique rows with lowest id
CREATE TEMPORARY TABLE tmp_user (
    SELECT MAX(id) id 
    FROM user 
    GROUP BY name
);

# Step 2: Delete all rows in original table that are not in the temporary table
DELETE FROM user WHERE id NOT IN (SELECT id FROM tmp_user);

# Step 3: Remove temporary table
DROP TABLE tmp_user;

0 comments:

Post a Comment