
Sometimes we may find that there are unwanted duplicate records in one of the tables of our database. Often we notice this issue when trying to create a unique key based on one or several of the fields in the table.
This post explains two simple methods for removing duplicate rows from a MySQL table, keeping only one row from each group of duplicates.
Method 1 – Creating a unique index that uses the IGNORE modifier
Let’s suppose there is a ‘leads’ table in our database, with the following content:
mysql> select * from leads;
+---------+-----------+-----------------------+
| name1   | name2     | email                 |
+---------+-----------+-----------------------+
| john    | smith     | jsmith@gmail.com      |
| john    | smith     | johnsmith@hotmail.com |
| alfonso | diaz      | alfon345@hotmail.com  |
+---------+-----------+-----------------------+
3 rows in set (0.00 sec)
At some point, we might decide to keep only one record per lead, because one email address per user is enough. But the name “john smith” is duplicated. If we try to create a unique key on the fields (name1,name2), we get an error:
mysql> create unique index idx_leads on leads(name1,name2);
ERROR 1062 (23000): Duplicate entry 'john-smith' for key 'idx_leads'
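Before changing anything, it can help to see which groups are actually duplicated. The query below is a minimal sketch, assuming the leads table shown above; the column alias copies is only illustrative and not part of the original schema.

```sql
-- List each (name1, name2) pair that appears more than once,
-- together with the number of rows in that group.
SELECT name1, name2, COUNT(*) AS copies
FROM leads
GROUP BY name1, name2
HAVING COUNT(*) > 1;
```

With the sample data, this returns a single row for john smith with 2 copies.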
To avoid this error, we can delete the duplicated records at the time the unique index is created, using the IGNORE modifier of the “ALTER TABLE” command:
mysql> alter ignore table leads add unique index idx_leads(name1,name2);
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 1 Warnings: 0

mysql> select * from leads;
+---------+-----------+----------------------+
| name1   | name2     | email                |
+---------+-----------+----------------------+
| john    | smith     | jsmith@gmail.com     |
| alfonso | diaz      | alfon345@hotmail.com |
+---------+-----------+----------------------+
2 rows in set (0.00 sec)
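Note that the IGNORE modifier of ALTER TABLE was deprecated in MySQL 5.6.17 and removed in MySQL 5.7.4, so this method is only available on older servers. In some of those versions, the procedure only works for MyISAM tables. If our table uses the InnoDB storage engine, we can convert it to MyISAM, remove the duplicates and convert it back to InnoDB: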
mysql> ALTER TABLE leads ENGINE MyISAM;
mysql> alter ignore table leads add unique index idx_leads(name1,name2);
mysql> ALTER TABLE leads ENGINE InnoDB;
Method 2 – Create a temporary table
In the previous case, we have seen that the IGNORE clause has the desired effect of eliminating duplicates, but it gives us no control over which record is kept and which are deleted in each set of duplicates. Using a temporary table gives us more flexibility. First, we create an empty table with the same structure as the original:
mysql> create table leads_tmp like leads;
Next, we create the unique index on the new table:
mysql> alter table leads_tmp add unique index idx_leads(name1,name2);
Next, we load the new table, choosing only one record from each set of duplicates. In the SELECT statement, we can choose the value we want to keep for each of the fields that are not part of the unique index. For instance, we could load the table always choosing the shortest email in each group of duplicates:
mysql> insert into leads_tmp (name1, name2, email)
(
  select name1, name2, email from (
    select name1, name2, email
    from leads
    order by length(email) asc
  ) table_tmp
  group by name1, name2
);
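The query above depends on MySQL returning the first row of each group from the ordered subquery, which is not guaranteed, and it is rejected when the ONLY_FULL_GROUP_BY SQL mode is active (the default since MySQL 5.7.5). The following is a sketch of a deterministic alternative, not the original author's method; the aliases l, shortest and min_len are illustrative names, and ties between equally short emails are broken alphabetically as an assumption.

```sql
-- Load one row per (name1, name2), keeping the shortest email.
-- Groups whose names or emails are all NULL are not matched by the
-- join and would need separate handling.
INSERT INTO leads_tmp (name1, name2, email)
SELECT l.name1, l.name2, MIN(l.email)   -- alphabetical tie-break
FROM leads AS l
JOIN (
    -- Shortest email length within each group of duplicates.
    SELECT name1, name2, MIN(LENGTH(email)) AS min_len
    FROM leads
    GROUP BY name1, name2
) AS shortest
  ON  shortest.name1 = l.name1
  AND shortest.name2 = l.name2
  AND LENGTH(l.email) = shortest.min_len
GROUP BY l.name1, l.name2;
```

Because every selected column is either grouped or aggregated, the statement does not depend on the row order inside the subquery.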
Finally, the original table is dropped and the new table is renamed:
mysql> drop table leads;
mysql> alter table leads_tmp rename to leads;
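Dropping the original first leaves a short window in which no leads table exists, and no copy to fall back on. A possible variation, sketched here under the assumption that the name leads_old (hypothetical) is free, swaps both tables with a single RENAME TABLE statement, which MySQL performs atomically, and keeps the original until the result has been checked.

```sql
-- Swap the tables in one atomic statement; leads_old is a hypothetical
-- name for the backup copy of the original table.
RENAME TABLE leads TO leads_old, leads_tmp TO leads;

-- After verifying the new contents, remove the backup.
DROP TABLE leads_old;
```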