|
|
Question : Deleting duplicate rows.
|
|
There is one table that could have duplicated rows. The only field on each row that would differ between duplicates would be field that contains a date.
I would like to delete the duplicates but make sure that the dupe that gets deleted is the earlier entry.
How could i achieve this?
Wing
|
Answer : Deleting duplicate rows.
|
|
Here are a few pointers that can help you find the duplicate values in a table and delete those pesky pieces of information.
Lets say you have a table called Employees and its layout is as follows:
create table employee ( EmpId number, EmpName varchar2(10), EmpSSN varchar2(11) );
insert into employee values(1, 'Jack', '555-55-5555'); insert into employee values (2, 'Joe', '555-56-5555'); insert into employee values (3, 'Fred', '555-57-5555'); insert into employee values (4, 'Mike', '555-58-5555'); insert into employee values (5, 'Cathy', '555-59-5555'); insert into employee values (6, 'Lisa', '555-70-5555'); insert into employee values (7, 'Jack', '555-55-5555'); insert into employee values (8, 'Mike', '555-58-5555'); insert into employee values (9, 'Cathy', '555-59-5555'); insert into employee values (10, 'Lisa', '555-70-5555'); insert into employee values (11, 'Lisa', '555-70-5555');
You can see from the data inserted above that the EmpId is unique but the EmpName and EmpSSN have duplicate values, so now you ask yourself, how in the world can I find these duplicates and remove them?
Look no further as I will show you below one way of solving this problem.
First off you will need to get an idea of what records are duplicate and by running the query below it will help you figure that out:
SQL> select count(empssn), empssn from employee 2 group by empssn 3 having count(empssn) > 1;
COUNT(EMPSSN) EMPSSN ------------- ----------- 2 555-55-5555 2 555-58-5555 2 555-59-5555 3 555-70-5555 From the above output you can now tell what employees have duplicate records and now you can move onto the next step and remove them but still retain a unique record for the duplicate employees. Again here is a list all of the records in the table before we remove them.
SQL> select * from employee;
EMPID EMPNAME EMPSSN ---------- ---------- ----------- 1 Jack 555-55-5555 2 Joe 555-56-5555 3 Fred 555-57-5555 4 Mike 555-58-5555 5 Cathy 555-59-5555 6 Lisa 555-70-5555 7 Jack 555-55-5555 8 Mike 555-58-5555 9 Cathy 555-59-5555 10 Lisa 555-70-5555 11 Lisa 555-70-5555 Now, let's get rid of those duplicate records by running this query: delete from employee where (empid, empssn) not in ( select min(empid), empssn from employee group by empssn); After the above query has been executed lets select all the records from the table to see if indeed it actually worked: SQL> select * from employee;
EMPID EMPNAME EMPSSN ---------- ---------- ----------- 1 Jack 555-55-5555 2 Joe 555-56-5555 3 Fred 555-57-5555 4 Mike 555-58-5555 5 Cathy 555-59-5555 6 Lisa 555-70-5555
6 rows selected. BAM! Problem solved, no more duplicates, now finally you can go to bed and get some sleep. WAIT! Just before you do, let me throw a little wrench into the situation, lets say that you have duplicate records once again but this time the empids and empSSNs are the same. For example the data now looks like this:
insert into employee values(1, 'Jack', '555-55-5555'); insert into employee values (2, 'Joe', '555-56-5555'); insert into employee values (3, 'Fred', '555-57-5555'); insert into employee values (4, 'Mike', '555-58-5555'); insert into employee values (5, 'Cathy', '555-59-5555'); insert into employee values (6, 'Lisa', '555-70-5555'); insert into employee values (1, 'Jack', '555-55-5555'); insert into employee values (4, 'Mike', '555-58-5555'); insert into employee values (5, 'Cathy', '555-59-5555'); insert into employee values (6 ,'Lisa', '555-70-5555'); insert into employee values (6, 'Lisa', '555-70-5555'); Now, how can you figure out what data duplicate as now there is no unique identifier in the table. The Ids are the same and the SSN are the same. This is just great! No problem, relax, there is actually a distinct value in the table but you can't see it. This little helper is called the rowid.
Remember there is nothing to uniquely identify the data as you can see by the above insert statements so you will now have to rely on the rowid to get us out of this mess. In order to delete the duplicate values in the employee table you will now need to write a query like this:
delete from employee where (rowid, empssn) not in (select min(rowid), empssn from employee group by empssn); And hence, after the delete query above has been executed your data will appear as: SQL> select * from employee 2 /
EMPID EMPNAME EMPSSN ---------- ---------- ----------- 1 Jack 555-55-5555 2 Joe 555-56-5555 3 Fred 555-57-5555 4 Mike 555-58-5555 5 Cathy 555-59-5555 6 Lisa 555-70-5555
6 rows selected. As you can see all the duplicate values have been removed from the table and now you finally have unique records in the table, so before you call it quits or before someone else slips in and enters more duplicate data make sure you add a primary key to that EMPID or EMPSSN column so this wont happen again! Now finally, you will be able to get that long awaited sleep!
IMPORTANT NOTE: Remember rowid will always uniquely identify a row in a table but please do not use this as a primary key. I only showed you this little trick to help you get out of a bind when the database is poorly designed by allowing duplicate data to be entered. =============== The only way you can automatically remove duplicates is if you have a unique key (that perferably also is chronological). In this case you don't so it is not gonna be easy. You could add an autonumber field. That would be unique and could be used to deleted the duplicates.
The trouble is that if you would select the records to be deleted by the duplicate fields, that all records with that address would be deleted.
If you have a unique key you can prevent the record you want kept to be removed.
OK, now for an example. First create a query that will select the records with duplicates (which actually selects the duplicates you want to keep):
SELECT [Street Address], Max(tblAddress.ID) AS ID FROM tblAddress GROUP BY [Street Address] HAVING (((Count(*))>1));
This selects all duplicate Street Address records. I added the autonumber field and named it ID. The query selects the MAX of the duplicate ID's. So the last one is kept. (If you want to keep the oldest record change it to MIN.
Now save the query as "qryFindDuplicates" and create the following query that will delete the other duplicates:
DELETE tblAddress.ID FROM tblAddress INNER JOIN qryFindDuplicates ON tblAddress.[Street Address] = qryFindDuplicates.[Street Address] WHERE (((tblAddress.ID)<>[qryFindDuplicates].[ID]));
This will delete the duplicates, but leave the ones that were selected for keeping (from the previous query) ============= have often found that poor database design creates the need for inventive SQL solutions. In fact, the whole concept of "normalizing" databases sprang from eliminating certain "anomalies" when querying the database. When faced with a sub-optimal database design, the ideal solution from a technical standpoint is to redesign the database and correct its flaws. However, redesigning the database often is not an option due to time or budget constraints, pre-existing software, or simple lack of authority to change the design. We must often play the hand we are dealt. In this article, I'll show you how to leverage advanced join techniques to get the most from design-challenged databases.
Suppose the person that designed your database fell asleep in database theory class when the professor was discussing entity integrity. As a result, the schema lacks primary keys and is rife with duplicate data. Since its difficult to write a quality piece of software against a database with duplicate data, you decide to create primary keys on the tables. Your only problem is that you must first identify and eliminate the duplicate data.
There are several ways of finding and eliminating duplicate data; some are better suited to specific situations than others. The technique that I'll show you in this article is a relatively straightforward technique that should work in a wide variety of situations. To demonstrate the technique, I'm going to use the following table:
CREATE TABLE Payment( CustomerNumber int NOT NULL, PostedDatetime datetime NOT NULL, PaymentAmt money NOT NULL, PaymentMemo varchar(200) NOT NULL, CCRefNum char(10) NOT NULL )
The first step to a duplicate-free table is coming up with a way of uniquely identifying each row. There's no way you can delete a particular row without being able to distinguish it from the others. The easiest way to accomplish this is to add an identity column to this table and make it the primary key.
ALTER TABLE Payment ADD PaymentID int NOT NULL IDENTITY(1, 1), CONSTRAINT PK_Payment PRIMARY KEY(PaymentID)
Now that you have the ability to specify a particular row to delete, you must decide how to define "duplicate". Which columns should uniquely identify a row? In real-world scenarios, the answer will be based upon how the table has been used and the actual data contained in it; however, for this example, a combination of CustomerNumber, PostedDatetime, and PaymentAmt will serve as the logical key. It makes sense that the same customer would not post two payments of exactly the same amount at exactly the same time.
Having decided what a duplicate is, you need some way of identifying them. To identify the duplicate rows, you need to construct a query which returns each unique CustomerNumber / PostedDatetime / PaymentAmt combination where there is more than one row with the same key values.
SELECT CustomerNumber, PostedDatetime, PaymentAmt FROM Payment GROUP BY CustomerNumber, PostedDatetime, PaymentAmt HAVING COUNT(*) > 1
This query groups by the key values to return one row per unique set of values. The HAVING clause then filters out all groups where there is only one row for that set of values. This leaves a rowset containing only those logical key values where there are duplicates.
The next problem to solve is how to eliminate all but one row per set of key values. The PaymentID column added earlier provides an answer. If we could somehow get exactly one PaymentID for each set of duplicates, we could delete all rows where the CustomerNumber / PostedDatetime / PaymentAmt is returned in the above query, but the PaymentID does not match. This would delete all but one row per set of duplicates. How can we accomplish this?
When grouping data, aggregate functions can be used to return data about the group as a whole. While some of the aggregate functions such as SUM() and AVG() return values which don't exist on any particular row in the group, the extrema functions MIN() and MAX() return values which exist in an actual row in the group. Hence, the extrema functions can be used to return an arbitrary value from some set of values.
Using this knowledge about MIN() and MAX(), we can modify the above query to return the CustomerNumber / PostedDatetime / PaymentAmt for each set of duplicates along with exactly one unique PaymentID.
SELECT MAX(PaymentID) AS PaymentID, CustomerNumber, PostedDatetime, PaymentAmt FROM Payment GROUP BY CustomerNumber, PostedDatetime, PaymentAmt HAVING COUNT(*) > 1
This query returns all the data we need to perform the delete operation, but how do we tell the database engine to actually delete the records? We can't simply change the SELECT to DELETE because DELETE statements don't support grouping and aggregates. However, the DELETE statement does support joining to other tables. We can use a technique called a derived table to treat the above SELECT statement as if it were a table. We can then write a DELETE statement which joins to this derived table and performs the DELETE.
DELETE FROM p1 FROM Payment p1 INNER JOIN ( SELECT MAX(PaymentID) AS PaymentID, CustomerNumber, PostedDatetime, PaymentAmt FROM Payment GROUP BY CustomerNumber, PostedDatetime, PaymentAmt HAVING COUNT(*) > 1 ) p2 ON( p1.CustomerNumber = p2.CustomerNumber AND p1.PostedDatetime = p2.PostedDatetime AND p1.PaymentAmt = p2.PaymentAmt AND p1.PaymentID <> p2.PaymentID )
The most interesting part of this query is the join criteria. We join on CustomerNumber, PostedDatetime, and PaymentAmt to delete rows that match logically to the set of duplicates that we've found. If we left the criteria at this point, we'd end up deleting the duplicates, but we wouldn't have any copies of the duplicate rows remaining. Our requirements stated that we delete all the duplicates, but leave one copy of each duplicated row. To do this, we add the fourth criterion stating the PaymentID not match one of the PaymentID values in the set of duplicates. Thus, one row per set of duplicates is retained while deleting the others.
Hopefully, you'll never need to use these techniques to delete duplicate data. However, if you do, you won't have to struggle to find a technique that precisely and efficiently performs the task. Even if you don't use this technique to delete duplicates, many of the elements of the solution - derived tables, joined deletes, grouping and aggregation - can be applied to other situations that require inventive query techniques.
|
|
|
|
|