How to find duplicates in excel

Today’s post is about a powerful Excel formula for tracking duplicates in a large set of email metadata.  This routine is not for the faint of heart, but if you are managing a document production, supervising a redaction project, or creating a privilege log, this method will save you many hours of time and reduce the likelihood of mistakes.  Roll up your sleeves and read on!

When the document ID numbers are in Column A, and the hash values (or other deduping criterion) are in Column B, type the word “original” in D2.  The first formula, in cell D3, is (drum roll please):

=IF(COUNTIF($B$2:B3,B3)>1,”Duplicate of “&INDEX($A$2:A3,MATCH(B3,$B$2:B3,0)),”original”)

As explained in more detail below, the formula identifies duplicates based on the hash value.  When it hits a duplicate value, it reports the Document ID number of the first document with the same hash value.

Here is a screen shot of the formula in action, in cell D7, where the first duplicate occurs:

Tracking Duplicates Screenshot

Notice how the formula correctly identifies that the email in row 7, with Document ID ENRON-000004452, is a duplicate of the email in Row 6, with Document ID ENRON-000004451.

Here’s how it works:

The COUNTIF formula looks at all the hash values so far and counts up how many times the current value has occurred.  If Excel is seeing the current value for the first time, the formula returns 1 – that means it is an “original” document.  If it is seeing the hash value for, say, the 2nd time, it will return a 2.  Column C in the screenshot contains just the COUNTIF function.  If this is all you did, you would have a nice way of segregating the originals from the duplicates.  All the 1’s are originals, and all the others are duplicates.  The rest of the formula tells you what each duplicate is a duplicate of.

The MATCH formula finds the first occurrence of a value in a stack of values, and returns the relative row number where the value is found.  Here, MATCH(B7,$B$2:B7,0) asks “In the stack of hash values from B2 (the first value) to B7 (the current value), where does the current hash value first appear?”  The answer is 5: the 41DCD hash value first occurs in B6, which is the 5th row of the range.  The 0 in the 3rd position of the MATCH function means that we want an exact match.

The INDEX formula pulls a value from a specified row of a specified stack of cells.  For example, INDEX(A1:A100,8) would return the 8th value in the range A1:A100, i.e., the value in A8.  The row number can be (and usually is) specified by another cell reference or a formula.  For example, INDEX(A1:A100,C55) would return the value in A1:A100 corresponding to whatever value is in cell C55.  Here, the MATCH function provides the reference.  We want INDEX(A2:A7,n), where A2:A7 is the stack of Document ID numbers so far, and n is the row where the current hash value first appears.  Hence, INDEX($A$2:A7,MATCH(B7,$B$2:B7,0).

Related text  How to make a paypal

Putting these three pieces together pulls the Document ID number of the first occurrence of any given hash value–that is the original document corresponding to any given duplicate!  Of course, we only want to do this if the hash value in question is, in fact, a duplicate.  That’s why this routine is embedded in an IF function that makes sure COUNTIF is greater than 1.  Otherwise, the formula returns the message “original.”

A couple more notes about this routine:

  • The references to A2 and B2 in the range arguments are anchored with $ signs because we don’t want those references to change when we copy the formula down.  You can insert those $ signs by hitting the F4 key after typing or selecting the reference in the formula.
  • As with most complicated formulas, you could break this formula down into several smaller pieces.
  • The combination of INDEX and MATCH is a popular strategy for overcoming the limitations of the VLOOKUP function.
  • The routine described here can also be used to link email attachments with their parent emails, based on the GroupID or FamilyID field.

Using this routine has helped me streamline privilege logs and document reviews, and avoid producing duplicate documents with inconsistent redactions.  I hope you find it helpful, too.

This is quite a bit more complicated than what is covered in my off-the-shelf Excel Esquire courses.  If you would like to learn about setting up a customized training that covers this type of eDiscovery analytics–whether for attorneys, paralegals, or litigation support staff–please get in touch.

Background

Suppose someone has given you a list like the one shown below (which you can download here if you’d like to follow along):

image

While multiple brands are okay here, we need a list that shows only unique SKU numbers.  While the list provided to you was supposed to be duplicate free, you’re not 100% sure that it actually is.  While it would be easy to just hit the SKU column with the Remove Duplicates function, you don’t want to do that.  Instead you’d like to indentify which records have duplicate entries in the list.

Related text  How to tell what iphone you have

So how do we do this?

Naturally, there will be a few different ways to do this.  I’m carving off one method that is the easiest to replicate via the user interface…

Step 1:  Link to the Data

Of course we’ll start by pulling the data in to Power Query

  • Click anywhere in the Products Table
  • Create a new query –> From Table

The data will be loaded in to Power Query, and you’ll see two steps in the Applied Steps window:

  • Source (pointing to your source data)
  • Changed Type (setting the data types for the columns)

This might seem like an odd step right now, but we’re going to add a Index column to this table as well.  The reason will become apparent later, but for now:

  • To to Add Column –> Add Index Column –> From 0

Your data should now look like this:

SNAGHTML1693b894

Now we need to figure out how to flag any repeating SKU as a duplicate.

Step 3: Identify Duplicates and Show Duplicate Records

Let’s fix this.  Remember how we added a new step to show “All Rows” for the ProductDetail column?  That step gave us the ability to do something pretty cool… it gave us the ability to get back all the lost records and product detail information we’re currently missing.

  • Click the Expand button at the top right of the ProductDetail column
  • Uncheck the SKU Number option (as we already have it)
  • Uncheck the option to “Use original column name as prefix”

SNAGHTML16a0c46c

As you can see, this will bring back all the details we lost earlier.

Step 4: Final Cleanup

But hang on a second.  Let’s look at this output a bit more closely…

SNAGHTML16a24427

Notice, that it re-sorted the data.  That’s not exactly a desirable outcome, as we are trying to flag duplicates for a reason. Maybe we want to know where they exist in an inventory count or we have some other reason for wanting to preserve the original sort order of our data.  It’s for this reason that we added the Index column earlier.  That came through with the All Rows step, so let’s put our data back into its original order.

  • Click the drop down arrow on the Index column –> Sort Ascending
  • Right click the Index column –> Remove

And we can now finalize the query:

  • Rename the query to ShowDuplicates
  • Go to Home –> Close & Load
Related text  How to be a real estate agent

Step 5: Make the Duplicates Obvious

With the data now in an Excel table, we can make the duplicates even more obvious by applying some conditional formatting to the table.  To do this:

  • Select all the values in the Duplicates column of the table
  • Go to Home –> Conditional Formatting –> Data Bars –> Choose a colour

I chose blue data bars, which makes the data look like this:

image

Conditional Formatting

Use When:

  • You have a data set that is small enough to be cleansed visually
  • You want to visually identify, analyze, and confirm your duplicates before removing them
  • Your data is complex enough that highlighting of duplicates will help identify them

One of the most useful conditional formatting functions in Excel is the ability to highlight duplicate values.  To use this function, simply highlight the data in question.  Within our sample data set, since we are only concerned about duplicate values in the Unique ID field, we’ll highlight that specific column.  Then under the “Home” tab of the Excel ribbon, click:

  • Conditional Formatting
  • Highlight Cell Rules
  • Duplicate Values

 Remove Duplicates 04

If you use the default setting, your cells will be highlighted in red.  Since conditional formatting doesn’t automatically remove duplicates for you, you’ll have to manually delete them.  Because of this limitation, the conditional formatting approach is primarily useful for smaller data sets, where you can reasonably scan through the entire data set to make edits.

Remove Duplicates 05

Sort the Data

Use When:

  • You have a data set that is small enough to be cleansed visually
  • You want to visually identify, analyze, and confirm your duplicates before removing them
  • Your data is simple enough that you can easily tell the values apart

Sorting your data table is one of the fastest ways to find duplicates in Excel.  Assuming your data set is small and has simple values, a simple sort should allow you to see if any duplicates exist.

However, if you have complex values in your column, such as a ten digit alpha numeric code, it’ll be very difficult for you to see the repeating values visually.  In that case, you should use conditional formatting.

To sort the data, go to the “Data” tab of the Excel ribbon and click any one of the sort buttons.

Remove Duplicates 08

Additionally, to use the shortcut function, you can just select the column in question and type either:

ALT + A + S + D (for sort descending)ALT + A + S + A (for sort ascending)

Like this post? Please share to your friends: