Midrange News for the IBM i Community


Posted by: neilrh
A/P
Jackson, MI
Find duplicate records using SQL
has no ratings.
Published: 14 Jul 2014
Revised: 21 Jul 2014 - 3566 days ago
Last viewed on: 25 Apr 2024 (4608 views) 

Using IBM i? Need to create Excel, CSV, HTML, JSON, PDF, SPOOL reports? Learn more about the fastest and least expensive tool for the job: SQL iQuery.

Find duplicate records using SQL Published by: neilrh on 14 Jul 2014 view comments(6)

Many times in my work I'm sent a spreadsheet with a "Please load to production" note. The problem is that frequently they don't check all of the data and we end up trying to add duplicates into the production files. Now while it's possible to run an INSERT INTO ..... WHERE NOT EXISTS deal, in some cases we need to run this multiple times until the records are loaded. And there's no early warning you're just discarding records arbitrarily.

I've begun to use the following SQL to find duplicate entries, prior to attempting to add the records, and I can even go back to the spreadsheet - highlight the duplicates and ask which is correct!

with stuff as (select rrn(mf) as recno, mf.* from mylib/myfile mf)
select * from stuff a left join stuff b on a.f1 = b.f1 [and a.f2 = b.f2....]
where a.recno > b.recno

just remember to join on all of the primary keys.

Return to midrangenews.com home page.
Sort Ascend | Descend

COMMENTS

(Sign in to Post a Comment)
Posted by: DaleB
Premium member *
Reading, PA
Comment on: Find duplicate records using SQL
Posted: 9 years 9 months 12 days 7 hours 28 minutes ago
Edited: Thu, 17 Jul, 2014 at 08:17:26 (3570 days ago)

What about a simple count?

WITH dups AS (SELECT f1 [, f2, ...]
                FROM mylib/myfile AS mf1
                GROUP BY f1 [, f2, ...]
                HAVING COUNT(*) > 1
             )
SELECT RRN(mf), mf.*
  FROM       mylib/myfile AS mf
  INNER JOIN dups         AS d  ON  d.f1=mf.f1 [AND d.f1=mf.f2, ...]

You could easily rewrite this as an EXISTS instead of the CTE and the INNER JOIN, but it would come out the same.

Posted by: neilrh
Premium member *
Jackson, MI
Comment on: Find duplicate records using SQL
Posted: 9 years 9 months 12 days 16 minutes ago

I'll have to give that one a try.

Posted by: DaleB
Premium member *
Reading, PA
Comment on: Find duplicate records using SQL
Posted: 9 years 9 months 10 days 7 hours 10 minutes ago

We've actually done this in the INSERT TO table, as an up front way to avoid generating a dup key error that would otherwise have to be handled.

If you're loading from a spreadsheet, you probably don't have a huge quantity of data. But, generally speaking, if f1 [, f2, ...] are high order keys in any index, the GROUP BY shold be an index only operation, so fast. The larger the table, the more important this becomes.

Posted by: bobcozzi
Site Admin ****
Chicagoland
Comment on: Find duplicate records using SQL
Posted: 9 years 9 months 9 days 5 hours 3 minutes ago

To avoid duplicate keys/entries on an INSERT, I tend to use the WHERE NOT EXISTS clause.

Something like this:

 EXEC SQL INSERT INTO customer (dbField1, dbField2)           
           (select  :hostVar1, :hostVar2 from sysibm/SYSDUMMY1   
            WHERE NOT exists                                   
       (select * from customer where dbField1 = :hostVar1)); 
Posted by: neilrh
Premium member *
Jackson, MI
Comment on: Find duplicate records using SQL
Posted: 9 years 9 months 8 days 4 hours 23 minutes ago

The problem I've found with WHERE NOT EXISTS is that applies to what is in the destination file at the time you run the insert. In many cases the data I'm given is repleat with duplicates. The result is that I run the insert and get 30 records added and a duplicate at 31, so I run the SQL !--script-- again (first 31 records are skipped and I get another 20 inserted before the next duplicate). So I run the same SQL !--script-- over and over, until I reach the end of the file - and I don't have foreknowledge of which records are duplicate and able to make a predetermination of whether one is incomplete, whether it's a true duplicate (all columns) or only the key is duplicate.

By checking for duplicates before I start, I can make an eyeball predetermination of the obvious bad rows, and can highlight duplicates back to the data owner for clarification.

Posted by: SteveCCNJ
Premium member *
Egg Harbor Twp., NJ
Comment on: Find duplicate records using SQL
Posted: 9 years 9 months 5 days 23 hours 57 minutes ago

Just to mention another coding option to avoid adding dupes, you can do an EXCEPTION JOIN between the source and target tables to INSERT all the rows except the ones that already match on keys.  This tends to perform well when you have many rows to process becasue it does an index scan to see the match.

To avoid adding SOME dupes from the source table, you can add the word to DISTINCT to your SELECT statement.  If everything about the columns in your select list are identical, the SQL processor will automatically coalesce multiple rows into a single row, avoiding dupes for this type of error.