Highlighting duplicates is handy when you are comparing data to see where there’s a match, or going through trying to filter out double entries.
We’ll be taking a look at ways Google sheets gives you for doing this using Conditional formatting.
We’ll start with a simple, single column example and work on to more complex situations.
Our examples will include a small dataset but the principle is the same regardless of whether there are 10 rows or 1000 rows.
Google sheets offers a conditional formatting option which allows the user to create a rules based formatting strategy based on one or more criteria.
There are a number of predefined rules that can be used to test against text, dates and numbers.
You can also define a custom formula, and this is what we will be using.
In our first example, we have a list of email addresses and we want to highlight only those rows that constitute duplicates.
How we make this selection impacts on how we structure our custom formula.
In the screenshot above, we’ve highlighted the range of cells A2 to A25.
Then, we select Conditional formatting, available from Google sheet’s Format menu option.
This will open the Conditional format rules pane on the far right hand side.
The first thing to notice is the Apply to range section, which by default has the range of cells we highlighted previously.
We can manually change this if we wanted to include more cells in the range, but as we’ll see in a moment, this range has to coincide with our custom formula.
Since none of the predefined rules help in finding duplicates, we need to use a formula, by selecting the last option, Custom formula is.
This adds a new field to the Formula rules section, Value or formula.
It is here where we define the formula that will find any duplicates in our email addresses.
Our formula is going to use the COUNTIF function.
The following function definition is taken from the Google sheets documentation.
Returns a conditional count across a range.
- range – The range that is tested against criterion.
- criterion – The pattern or test to apply to range.
- If range contains text to check against, criterion must be a string. criterion can contain wildcards including ? to match any single character or * to match zero or more contiguous characters. To match an actual question mark or asterisk, prefix the character with the tilde (~) character (i.e. ~? and ~*). A string criterion must be enclosed in quotation marks. Each cell in range is then checked against criterion for equality (or match, if wildcards are used).
- If range contains numbers to check against, criterion may be either a string or a number. If a number is provided, each cell in range is checked for equality with criterion. Otherwise, criterion may be a string containing a number (which also checks for equality), or a number prefixed with any of the following operators: =, >, >=, <, or <=, which check whether the range cell is equal to, greater than, greater than or equal to, less than, or less than or equal to the criterion value, respectively.
- COUNTIF can only perform conditional counts with a single criterion. To use multiple criteria, use COUNTIFS or the database functions DCOUNT or DCOUNTA.
- COUNTIF is not case sensitive.
The formula we are going to use is
=COUNTIF ($A$2:$A$25, A2) > 1
COUNTIF takes two parameters, the range ($A$2:$A$25), and the criterion (A2), and what we are looking for is a count that is greater than 1.
In our example, the range of cells used is from A2 to A25 inclusive.
When setting the range parameter in the COUNTIF function call, we need to use absolute cell referencing (we’ll see in a second why), hence we prepend both the column letter and row number with the dollar symbol, $A$2:$A$25.
COUNTIF then uses the criterion parameter, checking it against all the cells in the range, incrementing a counter every time a match (i.e. duplicate) is found.
In our formula, we’ve provided cell A2 as the criterion, but passed it as a relative cell reference (no $ symbol).
The Conditional format rules will use the Apply to range setting, moving down each cell in the range, applying the formula.
This is where absolute and relative referencing are important.
Since the range parameter in the formula uses absolute referencing, the range remains fixed.
The criterion parameter on the other hand, is a relative cell reference, so it will be adjusted with each call as Conditional format rules iterates past each cell. Each of cells A2 to A25 in turn, will be compared against the entire range and tested for duplicates.
Before finishing, we can adjust the nature of the highlighting to be applied, from the Formatting style section.
Here, we can select a different background fill color, or apply text formatting such as bold or underlined.
With the formula entered in the Value or formula field, and the formatting style set, left mouse click on the Done button.
The previous example highlighted a single column, but what if we wanted to highlight duplicates that spanned two or more columns, as in the case of where our email addresses also had name fields.
The aim again is to find any duplicate emails, but this time we’d like rows across all three columns to be highlighted.
The process is exactly the same as before, with just a couple of very minor changes.
The second change is to our formula where, here too, we simply change the column reference $A$25 to $C$25 so that it also covers the three columns.
=COUNTIF ($A$2:$C$25, A2) > 1
The result is seen in the screenshot above where duplicates are highlighted across the three columns.
The duplicates for email@example.com in the EMAIL column (rows 4 and 11) have now lost their highlighting, however the corresponding cells in the FIRST NAME and LAST NAME columns are still hightlighted.
This shows that the conditional formatting is applied to each individual cell independently, irrespective of which column it is under.
In other words, in row 4, the name Vallie in the FIRST NAME column also exists in row 11, and so is a duplicate.
This may be the behavior you were looking for, especially if you are after all duplicates, regardless of where they occur.
However, if instead you wanted to highlight the row but only when it matches on the EMAIL, we need to tweak the formula ever so slightly, taking advantage of cell referencing. Our new formula is now:
=COUNTIF ($A$2:$C$25, $A2) > 1
The difference is in the criterion parameter, which is now $A2. This means that the column (A) is absolute while the row remains relative.
Even though the range covers columns A to C, using an absolute column reference ensures that the search for duplicates will only apply to column A, the EMAIL column.
Yet another variation is checking for duplicate rows (i.e. where all three columns match), and for this we need to again modify our formula.
The aim now is to find duplicates where the EMAIL, FIRST NAME, and LAST NAME columns match as a row and not as individual cells.
The problem is that the Conditional format rules and the COUNTIF function cannot operate on columns as a set.
To get around this, we will use the ARRAYFORMULA function. Here is the definition for ARRAYFORMULA.
Enables the display of values returned from an array formula into multiple rows and/or columns and the use of non-array functions with arrays.
ARRAYFORMULA(SUM(IF(A1:A10>5, A1:A10, 0)))
- array_formula – A range, mathematical expression using one cell range or multiple ranges of the same size, or a function that returns a result greater than one cell.
- Many array formulas will be automatically expanded into neighboring cells, obviating the explicit use of ARRAYFORMULA.
- Pressing Ctrl+Shift+Enter while editing a formula will automatically add ARRAYFORMULA( to the beginning of the formula.
- Note that array formulas cannot be exported.
ARRAYFORMULA will allow us to combine the EMAIL, FIRST NAME, and LAST NAME columns into a single column (or array) and pass that to the COUNTIF function.
Here is the new formula:
=COUNTIF (ARRAYFORMULA ($A$2:$A$25 & $B$2:$B$25 & $C$2:$C$25), $A2 & $B2 & $C2) > 1
Apart from the ARRAYFORMULA function, we’ve also introduced the use of the ampersand (&) which is the concatenation operator. It concatenates or combines its operands.
It is used in the ARRAYFORMULA function to combine the EMAIL, FIRST NAME, and LAST NAME columns.
ARRAYFORMULA then converts the concatenation into a single column (or array) which the COUNTIF function can handle.
We also concatenate the EMAIL, FIRST NAME, and LAST NAME cells for the criterion parameter, $A2 & $B2 & $C2.
Note here too, the use of absolute referencing with respect to the columns, while still using relative referencing for the row.
As we’ve seen, the Conditional formatting option provides a very convenient way to highlight duplicates using custom formulas.
With the use of the COUNTIF and ARRAYFORMULA functions, coupled with subtle changes in cell referencing, a variety of options for determining duplicates become available.