Many times in my work I'm sent a spreadsheet with a "Please load to production" note. The problem is that frequently the sender hasn't checked all of the data, and we end up trying to add duplicates into the production files. Now while it's possible to run an INSERT INTO ... WHERE NOT EXISTS statement, in some cases we need to run it multiple times until all the records are loaded. And there's no early warning; you're just discarding records arbitrarily.
I've begun to use the following SQL to find duplicate entries before attempting to add the records. I can even go back to the spreadsheet, highlight the duplicates, and ask which one is correct!
with stuff as (select rrn(mf) as recno, mf.* from mylib/myfile mf) select * from stuff a left join stuff b on a.f1 = b.f1 [and a.f2 = b.f2....] where a.recno > b.recno
Just remember to join on all of the primary key columns.
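The original runs on IBM i, where RRN() returns a row's relative record number. As a minimal sketch of the same self-join technique, here is a Python version using sqlite3, where rowid plays the role of RRN; the table and column names (myfile, f1, f2) are made up for illustration:

```python
import sqlite3

# Stand-in for mylib/myfile: sqlite's rowid plays the role of RRN().
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myfile (f1 TEXT, f2 TEXT, payload TEXT)")
con.executemany("INSERT INTO myfile VALUES (?, ?, ?)", [
    ("A", "1", "first"),
    ("B", "2", "second"),
    ("A", "1", "dup of first"),   # duplicate key (f1, f2)
    ("C", "3", "third"),
])

# Same shape as the CTE above: self-join on every key column, keeping
# only pairs where the left row sits later in the file than the right.
# (The recno filter makes a plain join equivalent to the LEFT JOIN here.)
sql = """
WITH stuff AS (SELECT rowid AS recno, f1, f2, payload FROM myfile)
SELECT a.recno, a.f1, a.f2
FROM stuff a
JOIN stuff b ON a.f1 = b.f1 AND a.f2 = b.f2
WHERE a.recno > b.recno
"""
dups = con.execute(sql).fetchall()
print(dups)  # each row reported is a later duplicate of an earlier row
```

Each hit gives you the record number of the later copy, so you can go straight back to that row in the spreadsheet.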
What about a simple count?
WITH dups AS (SELECT f1 [, f2, ...] FROM mylib/myfile AS mf1 GROUP BY f1 [, f2, ...] HAVING COUNT(*) > 1 ) SELECT RRN(mf), mf.* FROM mylib/myfile AS mf INNER JOIN dups AS d ON d.f1=mf.f1 [AND d.f2=mf.f2, ...]
You could easily rewrite this as an EXISTS instead of the CTE and the INNER JOIN, but it would come out the same.
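To show the GROUP BY / HAVING variant in runnable form, here is a small sqlite3 sketch (again with rowid standing in for RRN and invented names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myfile (f1 TEXT, payload TEXT)")
con.executemany("INSERT INTO myfile VALUES (?, ?)",
                [("A", "x"), ("B", "y"), ("A", "z"), ("A", "w")])

# The CTE lists every key value occurring more than once; the join then
# pulls back all of the offending rows with their record numbers.
sql = """
WITH dups AS (
  SELECT f1 FROM myfile GROUP BY f1 HAVING COUNT(*) > 1
)
SELECT mf.rowid, mf.f1, mf.payload
FROM myfile mf INNER JOIN dups d ON d.f1 = mf.f1
"""
rows = con.execute(sql).fetchall()
print(rows)  # every copy of each duplicated key, with its row number
```

Unlike the self-join version, this one returns all copies of a duplicated key (including the first), which is handy when you need to decide which copy to keep.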
I'll have to give that one a try.
We've actually done this in the INSERT into the table, as an up-front way to avoid generating a duplicate-key error that would otherwise have to be handled.
If you're loading from a spreadsheet, you probably don't have a huge quantity of data. But, generally speaking, if f1 [, f2, ...] are high-order keys in any index, the GROUP BY should be an index-only operation, and so fast. The larger the table, the more important this becomes.
To avoid duplicate keys/entries on an INSERT, I tend to use the WHERE NOT EXISTS clause.
Something like this:
EXEC SQL INSERT INTO customer (dbField1, dbField2) (select :hostVar1, :hostVar2 from sysibm/SYSDUMMY1 WHERE NOT exists (select * from customer where dbField1 = :hostVar1));
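The embedded-SQL statement above can be sketched outside of RPG as well. Here is a hedged Python/sqlite3 equivalent; sqlite has no SYSDUMMY1, so a FROM-less SELECT plays the same role, and the customer table and column names follow the example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (dbField1 TEXT, dbField2 TEXT)")
con.execute("INSERT INTO customer VALUES ('C001', 'Acme')")

def insert_if_new(key, name):
    # Parameter markers stand in for the host variables; the NOT EXISTS
    # guard skips the insert when the key is already in the table.
    con.execute(
        """INSERT INTO customer (dbField1, dbField2)
           SELECT ?, ?
           WHERE NOT EXISTS
             (SELECT 1 FROM customer WHERE dbField1 = ?)""",
        (key, name, key))

insert_if_new("C001", "Acme again")  # skipped: key already present
insert_if_new("C002", "Globex")      # inserted
```

Note that the guard only checks what is in the table at the moment of each insert, which is exactly the limitation discussed in the reply below.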
The problem I've found with WHERE NOT EXISTS is that it only applies to what is in the destination file at the time you run the insert. In many cases the data I'm given is replete with duplicates. The result is that I run the insert, get 30 records added, and hit a duplicate at 31; so I run the SQL script again (the first 31 records are skipped and I get another 20 inserted before the next duplicate). I end up running the same SQL script over and over until I reach the end of the file, with no foreknowledge of which records are duplicates, and no chance to determine in advance whether a record is incomplete, whether it's a true duplicate (all columns match), or whether only the key is duplicated.
By checking for duplicates before I start, I can eyeball the obvious bad rows and highlight the duplicates back to the data owner for clarification.
Just to mention another coding option to avoid adding dupes: you can do an EXCEPTION JOIN between the source and target tables to INSERT all the rows except the ones that already match on the keys. This tends to perform well when you have many rows to process because it uses an index scan to find the matches.
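EXCEPTION JOIN is DB2-specific syntax; in standard SQL the same result is spelled as a LEFT JOIN filtered to the unmatched rows. A minimal sqlite3 sketch, with invented src/tgt table names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (k TEXT, v TEXT)")
con.execute("CREATE TABLE tgt (k TEXT PRIMARY KEY, v TEXT)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("A", "1"), ("B", "2"), ("C", "3")])
con.execute("INSERT INTO tgt VALUES ('B', 'already here')")

# DB2's EXCEPTION JOIN keeps only the left rows with no match; standard
# SQL expresses that as LEFT JOIN ... WHERE <right key> IS NULL.
con.execute("""
  INSERT INTO tgt (k, v)
  SELECT s.k, s.v
  FROM src s
  LEFT JOIN tgt t ON t.k = s.k
  WHERE t.k IS NULL
""")
```

Row 'B' already exists in the target and is excluded; only 'A' and 'C' are inserted, with no duplicate-key error raised.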
To avoid adding SOME dupes from the source table, you can add the word DISTINCT to your SELECT statement. If the rows are identical in every column of your select list, the SQL processor will automatically coalesce them into a single row, avoiding dupes of this type.
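A quick sqlite3 sketch of the DISTINCT approach (made-up src/tgt tables); note it only collapses rows that match in every selected column, so key-only duplicates with differing data still get through:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (f1 TEXT, f2 TEXT)")
con.execute("CREATE TABLE tgt (f1 TEXT, f2 TEXT)")
# Two source rows identical in every selected column collapse to one.
con.executemany("INSERT INTO src VALUES (?, ?)",
                [("A", "1"), ("A", "1"), ("B", "2")])

con.execute("INSERT INTO tgt SELECT DISTINCT f1, f2 FROM src")
print(con.execute("SELECT COUNT(*) FROM tgt").fetchone()[0])
```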