Effective Techniques for Handling Duplicate Records in SQL: A Comprehensive Guide

Introduction

Handling duplicate records is a crucial aspect of data management and integrity. Whether you’re cleaning up a database or ensuring data quality, knowing how to efficiently identify and manage duplicates is essential. In this blog, we will explore various SQL techniques to find duplicate records, complete with practical examples and data.

Creating the Sample Table

To illustrate these techniques, let’s start by creating a sample table and inserting some data:

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    FirstName NVARCHAR(50),
    LastName NVARCHAR(50),
    Email NVARCHAR(100),
    Salary DECIMAL(10, 2)
);

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Salary)
VALUES 
    (1, 'John', 'Doe', 'john.doe@example.com', 50000.00),
    (2, 'Jane', 'Smith', 'jane.smith@example.com', 60000.00),
    (3, 'John', 'Doe', 'john.doe@example.com', 50000.00),  -- Duplicate record
    (4, 'Alice', 'Johnson', 'alice.johnson@example.com', 55000.00),
    (5, 'Bob', 'Brown', 'bob.brown@example.com', 60000.00),
    (6, 'John', 'Doe', 'john.doe@example.com', 50000.00);  -- Duplicate record
1. Finding Exact Duplicate Records

To find records that are identical across all columns, use the GROUP BY and HAVING clauses:

SELECT EmployeeID, FirstName, LastName, Email, Salary, COUNT(*)
FROM Employees
GROUP BY EmployeeID, FirstName, LastName, Email, Salary
HAVING COUNT(*) > 1;
  • GROUP BY: Groups records by all columns.
  • HAVING COUNT(*) > 1: Filters out groups with more than one record, indicating duplicates.
2. Finding Duplicates Based on Specific Columns

If you’re interested in duplicates based on specific columns, you can group by those columns:

SELECT FirstName, LastName, Email, COUNT(*)
FROM Employees
GROUP BY FirstName, LastName, Email
HAVING COUNT(*) > 1;
  • Groups records by FirstName, LastName, and Email to identify duplicates.
3. Selecting All Columns for Duplicate Records

To retrieve full details of duplicate records, use a JOIN with a subquery or a CTE:

-- Common Table Expression (CTE) to find duplicate records
WITH DuplicateRecords AS (
    SELECT FirstName, LastName, Email
    FROM Employees
    GROUP BY FirstName, LastName, Email
    HAVING COUNT(*) > 1
)

-- Selecting all columns from the original table for the duplicate records
SELECT e.*
FROM Employees e
INNER JOIN DuplicateRecords d
ON e.FirstName = d.FirstName
   AND e.LastName = d.LastName
   AND e.Email = d.Email;
  • Common Table Expression (CTE): Identifies duplicates.
  • JOIN: Retrieves full details of the duplicates.
4. Using WHERE EXISTS:
SELECT *
FROM Employees e1
WHERE EXISTS (
    SELECT 1
    FROM Employees e2
    WHERE e1.FirstName = e2.FirstName
      AND e1.LastName = e2.LastName
      AND e1.Email = e2.Email
    GROUP BY e2.FirstName, e2.LastName, e2.Email
    HAVING COUNT(*) > 1
);
  • WHERE EXISTS: Checks if there are records matching the criteria of duplicates.
5. Finding Duplicates with Row Numbering

Use window functions to assign row numbers and filter duplicates:

WITH NumberedRecords AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, Email ORDER BY EmployeeID) AS RowNum
    FROM Employees
)
SELECT *
FROM NumberedRecords
WHERE RowNum > 1;
  • ROW_NUMBER(): Assigns a unique number within each partition of duplicates.
  • PARTITION BY: Identifies duplicates based on specified columns.
6. Finding Duplicates with a Specific Condition

To find duplicates based on a condition, combine GROUP BY with WHERE:

SELECT FirstName, LastName, Email, COUNT(*)
FROM Employees
WHERE Salary > 40000
GROUP BY FirstName, LastName, Email
HAVING COUNT(*) > 1;
  • WHERE: Filters records based on the condition before grouping.
Conclusion

Identifying and managing duplicate records is key for maintaining data quality. By using these SQL techniques, you can effectively handle duplicates and ensure your data remains accurate and reliable. Choose the method that best suits your needs, whether it’s finding exact duplicates, handling specific columns, or using advanced functions.

Have you encountered challenges with duplicate records in your database? Share your experiences and solutions in the comments below!

Thanks for reading! and happy querying!

Leave a Reply

Your email address will not be published. Required fields are marked *