
Decoding Identity Resolution, Part Three: Deterministic Identity Resolution Algorithms

May 27, 2022

Illustration of a person having their data connected through the identity resolution process

Welcome to our blog series on decoding identity resolution. This nine-part series aims to offer a friendly, comprehensive view of how to think about identity resolution, and of how to interpret the way companies across the tech landscape represent it in their marketing and sales materials.

Introduction

Deterministic (or “rules-based”) identity resolution is the most commonly used method for identity resolution. In this entry, we’ll explore what it involves and how to think about it.

Deterministic identity resolution

When we say “deterministic,” we mean that records are matched only when their values match exactly, and that the rules are simple and minimal. The results prioritize predictability over accuracy. This is very important for operational use cases like associating a person with their payments, but ultimately insufficient for most other use cases.

In general, if an application or platform does not offer great detail about its identity resolution features, it usually means it’s using something deterministic. 

Rules-based identity resolution is the most straightforward way of solving an incredibly complex problem. The vast majority of providers are interested in creating marketer tools, analytics features, or workflow features but sidestep providing a robust identity resolution solution because it's “too hard.” These tools all offload the work onto their customers.

When you are looking at data management solutions, look for boxes in their architecture diagrams labeled “ETL,” which stands for “Extract-Transform-Load.” This is a strong indication that your technical teams will be on the hook for writing entirely custom jobs to make data conform to the requirements of the tool, rather than being able to input data in whatever format it naturally occurs and letting the tool sort it out. This is incredibly time-consuming, and results in a “garbage in, garbage out” problem.

Let’s take a look at a couple of rules-based solutions commonly seen in the market.

Basic lookups

This is a common element of rules-based identity resolution. It means that the application chooses one field in the data, or a combination of fields, and declares it to be a “unique identifier.”

Most tools specializing in email marketing use email addresses as unique identifiers. This means that if you write code or load data in, the tool will simply look up profiles by email address, and if there’s a direct match, it will treat the records as the same person and correlate their data.
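As a rough illustration, here is a minimal sketch of an email-keyed lookup in Python. The in-memory `profiles` store, the `ingest` function, and the record fields are all hypothetical and stand in for whatever a real tool does internally.

```python
from typing import Optional

# Hypothetical in-memory profile store, keyed by email address.
profiles = {
    "ada@example.com": {"name": "Ada Lovelace", "events": []},
}

def lookup_by_email(email: str) -> Optional[dict]:
    """Return the profile stored under exactly this email, or None."""
    return profiles.get(email)

def ingest(record: dict) -> dict:
    """Attach the event to a matching profile, or create a new profile."""
    profile = lookup_by_email(record["email"])
    if profile is None:
        profile = {"name": record.get("name"), "events": []}
        profiles[record["email"]] = profile
    profile["events"].append(record["event"])
    return profile

# Exact match: the event lands on the existing profile.
ingest({"email": "ada@example.com", "event": "opened_newsletter"})

# Different capitalization is not an exact match, so a second profile is created.
ingest({"email": "Ada@Example.com", "name": "Ada Lovelace", "event": "clicked_link"})
```

Note how the second call creates a duplicate profile for what is probably the same person. That is the predictability-over-accuracy tradeoff in miniature.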

Cascading rules

Another common technique is a cascade of rules, which allows for more flexibility in matching and is commonly how identity resolution algorithms built in-house handle the problem.

This also shows up in product demos as an easily configurable way to control how identity resolution works. The simplicity of the algorithm means that you can typically control the rules without knowing how to write any code and still get a predictable result.

For example, a simple rule set might be something like:

  1. Look up an email address

  2. If there’s no match, look up a combination of first name, last name, and street address

  3. If there’s still no match, look up a combination of last name and phone number

The perceived advantage is that these are predictable and teams can have a clear discussion on the rules. 
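The three-step rule set above could be sketched in Python roughly as follows. This is an illustrative sketch only: the `PROFILES` data, the field names, and the `resolve` function are assumptions made for the example, not any vendor's implementation.

```python
from typing import Optional

# Hypothetical existing profiles; field names are illustrative.
PROFILES = [
    {"id": 1, "email": "ada@example.com", "first": "Ada", "last": "Lovelace",
     "street": "12 St James Sq", "phone": "555-0100"},
    {"id": 2, "email": "grace@example.com", "first": "Grace", "last": "Hopper",
     "street": "1 Navy Way", "phone": "555-0199"},
]

# Each rule lists the fields that must ALL match exactly. Rules are tried in order.
RULES = [
    ("email",),
    ("first", "last", "street"),
    ("last", "phone"),
]

def resolve(record: dict) -> Optional[int]:
    """Return the id of the first profile matched by the earliest rule, else None."""
    for fields in RULES:
        # Skip a rule if the incoming record lacks any field it needs.
        if not all(record.get(f) for f in fields):
            continue
        for profile in PROFILES:
            if all(profile[f] == record[f] for f in fields):
                return profile["id"]
    return None

print(resolve({"email": "ada@example.com"}))
print(resolve({"email": "new@example.com", "last": "Hopper", "phone": "555-0199"}))
print(resolve({"first": "Ada", "last": "Lovelace"}))
```

Because the rules are just an ordered list of field combinations, changing the behavior means editing `RULES`, which is roughly what a no-code configuration screen exposes.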

Score tables

Another legacy answer to this problem is using a “score table” that weights different pieces of PII and creates a series of rules. This is similar to the cascading rules concept but with more options for fine-tuning.

If you only have a first and last name for someone, that doesn’t count as knowing who they are. Even though my name, Caleb Benningfield, isn't exactly common, there are at least a handful of other people in the United States with the same name. The score table allows you to then assign points for each type of matching data and establish a threshold for a minimum amount of information required to confidently match people.

For example: 

  • First names match - 1 point 

  • Last names match - 1 point

  • Emails match - 4 points

  • Phone numbers match - 1 point

  • Addresses match - 2 points

With the above scores you can set a threshold at five points. That will give you results like the following:

  • First and last name only - 2 points, NO MATCH

  • Last name and email - 5 points, MATCH

  • First, last, phone number and address - 5 points, MATCH

  • First name, phone number - 2 points, NO MATCH

Then you can tune it to the preferences of your organization.
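A minimal Python sketch of the score table above, using the same weights and the five-point threshold. The function names and the sample record (including the email address and phone number) are made up for illustration.

```python
# Points awarded when a field matches exactly, taken from the example above.
WEIGHTS = {"first": 1, "last": 1, "email": 4, "phone": 1, "address": 2}
THRESHOLD = 5

def match_score(candidate: dict, known: dict) -> int:
    """Sum the weights of every field present in the candidate that equals the known value."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if candidate.get(field) is not None and candidate.get(field) == known.get(field)
    )

def is_match(candidate: dict, known: dict, threshold: int = THRESHOLD) -> bool:
    return match_score(candidate, known) >= threshold

# Hypothetical known profile.
known = {"first": "Caleb", "last": "Benningfield", "email": "caleb@example.com",
         "phone": "555-0100", "address": "1 Main St"}

def subset(*fields: str) -> dict:
    """Build a candidate record that carries only the named fields."""
    return {f: known[f] for f in fields}

for fields in [("first", "last"), ("last", "email"),
               ("first", "last", "phone", "address"), ("first", "phone")]:
    candidate = subset(*fields)
    verdict = "MATCH" if is_match(candidate, known) else "NO MATCH"
    print(fields, match_score(candidate, known), verdict)
```

Tuning then amounts to editing `WEIGHTS` or `THRESHOLD`. For example, raising the threshold to six would stop a last name plus email from matching on its own.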

Optimizing and “Probabilistic”

Optimizing deterministic identity resolution

Some platforms make rules-based matching even more robust by allowing for basic “string” matching algorithms (or text matching algorithms) that can account for the different ways people type in their information. This is often referred to as “fuzzy matching” and includes things like counting how many characters differ between two names and treating them as a match if the difference is no more than one or two characters.
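One common way to count differing characters is Levenshtein (edit) distance. The sketch below is a plain Python version for illustration. The `fuzzy_equal` helper and its `max_edits` default are assumptions for the example, and real platforms may use other string metrics.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def fuzzy_equal(a: str, b: str, max_edits: int = 1) -> bool:
    """Treat two strings as the same if they differ by at most max_edits characters."""
    return edit_distance(a.lower(), b.lower()) <= max_edits

print(fuzzy_equal("Benningfield", "Beningfield"))  # one dropped letter
print(fuzzy_equal("Caleb", "Colin"))               # three substitutions apart
```

The tolerance is still a fixed rule: anything within `max_edits` matches and anything beyond it does not.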

One way to make this more effective is to run a process that standardizes the data first. This improves results by eliminating common anomalies, but the extra step also makes processing slower.
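As a sketch of what standardization might look like, the Python helpers below normalize emails and phone numbers before any comparison. The specific rules (lowercasing, stripping punctuation, dropping a leading US country code) are assumptions chosen for illustration, and a real pipeline would need rules suited to its own data.

```python
import re

def standardize_email(value: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return value.strip().lower()

def standardize_phone(value: str) -> str:
    """Keep digits only, so differently punctuated numbers compare equal."""
    digits = re.sub(r"\D", "", value)
    # Assumption: an 11-digit number starting with 1 carries a US country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

print(standardize_email("  Ada@Example.COM "))
print(standardize_phone("+1 (555) 010-0100"))
print(standardize_phone("555.010.0100"))
```

Running every incoming value through functions like these is the extra processing cost mentioned above, paid once per record before matching.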

“Probabilistic”

Be wary of any company claiming “probabilistic” ID resolution. In most cases it means only that some element of probability has been introduced into the equation.

The example of “fuzzy matching” to account for common variances in how data is entered means making some guesses, which technically introduces probability.

While it does add a layer to the process, it’s a minimal improvement framed to make a rules-based solution seem more sophisticated.

Tradeoffs of deterministic identity resolution

Every choice of tool or algorithm has its tradeoffs. Below is how to think about the upsides and downsides of a rules-based identity resolution algorithm.

Upsides

  • Can be the fastest option if implemented correctly

  • Predictability

  • Better for “operational” use cases where the risk of being wrong is high

Downsides

  • Less accurate which can lead to bad customer experiences, inaccurate analytics, etc.

  • Only faster if you choose the simplest rules with the right infrastructure

  • Less transparent

Next up

Next up we’ll be talking about the bleeding edge of identity resolution: advanced data science.