I have a table with the following fields:
id (Unique)
url (Unique)
title
company
site_id
Now, I need to remove rows having same
title, company and site_id
. One way to do it will be using the following SQL along with a script (PHP
):SELECT title, site_id, location, id, count( * )
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1
After running this query, I can remove duplicates using a server side script.
But, I want to know if this can be done only using SQL query.
Answers
A really easy way to do this is to add a
UNIQUE
index on the 3 columns. When you write the ALTER
statement, include the IGNORE
keyword. Like so:ALTER IGNORE TABLE jobs
ADD UNIQUE INDEX idx_name (site_id, title, company);
This will drop all the duplicate rows. As an added benefit, future
INSERTs
that are duplicates will error out. As always, you may want to take a backup before running something like this...
MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:
create temporary table tmpTable (id int);
insert tmpTable
(id)
select id
from YourTable yt
where exists
(
select *
from YourTabe yt2
where yt2.title = yt.title
and yt2.company = yt.company
and yt2.site_id = yt.site_id
and yt2.id > yt.id
);
delete
from YourTable
where ID in (select id from tmpTable);
From Kostanos' suggestion in the comments:
The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:
The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:
DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id
There is another solution :
DELETE t1 FROM my_table t1, my_table t2 WHERE t1.id < t2.id AND t1.my_field = t2.my_field AND t1.my_field_2 = t2.my_field_2 AND ...
I have this query snipet for SQLServer but I think It can be used in others DBMS with little changes:
DELETE
FROM Table
WHERE Table.idTable IN (
SELECT MAX(idTable)
FROM idTable
GROUP BY field1, field2, field3
HAVING COUNT(*) > 1)
I forgot to tell you that this query doesn't remove the row with the lowest id of the duplicated rows. If this works for you try this query:
DELETE
FROM jobs
WHERE jobs.id IN (
SELECT MAX(id)
FROM jobs
GROUP BY site_id, company, title, location
HAVING COUNT(*) > 1)
The faster way is to insert distinct rows into a temporary table. Using delete, it took me a few hours to remove duplicates from a table of 8 million rows. Using insert and distinct, it took just 13 minutes.
CREATE TABLE tempTableName LIKE tableName;
CREATE INDEX ix_all_id ON tableName(cellId,attributeId,entityRowId,value);
INSERT INTO tempTableName(cellId,attributeId,entityRowId,value) SELECT DISTINCT cellId,attributeId,entityRowId,value FROM tableName;
TRUNCATE TABLE tableName;
INSERT INTO tableName SELECT * FROM tempTableName;
DROP TABLE tempTableName;
I keep visiting this page anytime I google "remove duplicates form mysql" but for my theIGNORE solutions don't work because I have an InnoDB mysql tables
this code works better anytime
CREATE TABLE tableToclean_temp LIKE tableToclean;
ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
DROP TABLE tableToclean;
RENAME TABLE tableToclean_temp TO tableToclean;
tableToclean = the name of the table you need to clean
tableToclean_temp = a temporary table created and deleted
I like to be a bit more specific as to which records I delete so here is my solution:
delete
from jobs c1
where not c1.location = 'Paris'
and c1.site_id > 64218
and exists
(
select * from jobs c2
where c2.site_id = c1.site_id
and c2.company = c1.company
and c2.location = c1.location
and c2.title = c1.title
and c2.site_id > 63412
and c2.site_id < 64219
)
I had to do this with text fields and came across the limit of 100 bytes on the index.
I solved this by adding a column, doing a md5 hash of the fields, and the doing the alter.
ALTER TABLE table ADD `merged` VARCHAR( 40 ) NOT NULL ;
UPDATE TABLE SET merged` = MD5(CONCAT(`col1`, `col2`, `col3`))
ALTER IGNORE TABLE table ADD UNIQUE INDEX idx_name (`merged`);
0 comments:
Post a Comment