## fuzzy match confidence w/ multiple attributes


allmhuran
Posts: 3
Joined: Mon Feb 10, 2014 11:32 pm UTC

### fuzzy match confidence w/ multiple attributes

This thought is, no doubt, related to a blend of computer science and mathematics.

You may be familiar with the term "master data management". For those who aren't, one of the key capabilities MDM should provide is a way to look at "the same" entity as persisted in two heterogeneous systems (say, your CRM system and your ERP) and tell you whether an instance of an entity in one system is likely to be the same real-world thing as an instance in the other system.

For example, take some business out in the real world called "ACME Corp". In the CRM, someone has erroneously entered the trading name as "ACNE Corp". To resolve this we will use some kind of fuzzy matching algorithm returning a score between 0 (poorest match) and 1 (exact match) - we'll probably get a good score when comparing "ACNE Corp" to "ACME Corp". So far so good.
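As a sketch of one common way to get such a score (an assumption on my part - real MDM tools use various edit-distance variants), a Levenshtein distance normalized by the longer string's length gives exactly this kind of 0-to-1 result:

```python
def levenshtein(a, b):
    # Textbook dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Map edit distance onto a 0..1 score: 1.0 means an exact match.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

`"ACNE Corp"` vs `"ACME Corp"` is one substitution over nine characters, so `similarity` gives 8/9 ≈ 0.89 - close to the 0.8 used in the example below.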

But most of our entities will have many attributes, not just one. And the thing that surprises me is that all of the fuzzy matching operators I can find are binary with respect to two individual attributes.

It seems to me that what we need is a fuzzy match that is n-ary, or perhaps more correctly, binary with respect to two sets of attributes.

Consider: Let's say we also have a phone number attribute. The CRM system says the entity is {"ACNE Corp", "555-1234"}, while the ERP has an entity {"ACME Corp", "555-1234"}. Taken on pairs of attributes, we will get two fuzzy match scores - let's say 0.8 for trading name, and 1.0 for phone number. How, then, should we determine the overall score?

Clearly we don't want to multiply the scores together, since that would give us a result of 0.8. But surely if we match one attribute at 0.8 and another at 1.0, our overall score must be better than 0.8.

OK, what if we average them? Our result in this case would be 0.9. Better than 0.8... that's good, but if you think a little deeper you'll realise why this doesn't really work: what if a user had simply not entered a value for phone number in the ERP? Then our overall score would be the average of 0.8 and, probably, 0.0. But that doesn't properly represent what's going on here.

So, we can add some more to the algorithm. We average the match scores, but ignore nulls. OK, so do we treat blanks as nulls? Many more questions arise.
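A minimal sketch of that rule - the attribute names and weights here are illustrative assumptions, not anything standard:

```python
def combine(scores, weights):
    # Weighted average of per-attribute match scores, skipping
    # attributes with no data (score is None) rather than counting
    # them as zero.
    total = weight_sum = 0.0
    for attr, score in scores.items():
        if score is None:
            continue
        w = weights.get(attr, 1.0)
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```

With the example above, `combine({"name": 0.8, "phone": 1.0}, {"name": 2.0, "phone": 1.0})` weights the name double and gives about 0.87; set the phone score to `None` and it falls back to 0.8 instead of being dragged down to 0.4.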

It seems clear to me that there is some mathematical theory to understand here. Unfortunately, I'm no mathematician. But surely something in information theory covers this: the more information I have, the more confident I should be able to be that two things are the same or not the same. If I perfectly match two entities each with 1000 attributes, surely that is somehow a "stronger" match than perfectly matching two entities each with only 10 attributes.

But there doesn't seem to be any algorithm out there. Or at least, in the searching that I have done, all MDM/fuzzy matching solutions seemed to rest their case at the binary operation; there wasn't even a hint of what I'm talking about here, so I stopped looking.

Can anyone provide input? Either theoretical (confirm/deny my hypothesis), or practical (send me to a site that makes available an n-ary attribute-matching algorithm).

D-503
Posts: 84
Joined: Sun Apr 15, 2012 11:35 pm UTC

### Re: fuzzy match confidence w/ multiple attributes

Fuzzy matching seems like a heuristic approach to me, since it's based on the assumption that if some form of string distance (e.g. Levenshtein) is small then the strings are the same. It fails in a lot of places ("Sam" and "Sammy" have a larger Levenshtein distance than "Sam" and "Pam"); it just happens to work pretty well most of the time.
I think what you're doing is building a sort of meta-heuristic, so I don't think there is a single correct way to do it - it's more or less whatever works. Heuristics have the advantage of allowing you to leverage your domain knowledge to come up with inference rules (e.g. mismatched emails should not be weighted heavily, because many people have multiple emails). However, I think you will find that as you add more intuitions to your heuristic, you will encounter more unexpected problems with how they interact.
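The Sam/Sammy point is easy to check with a plain Levenshtein implementation (a textbook version, not any particular product's):

```python
def levenshtein(a, b):
    # Textbook dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Two insertions for a nickname, but only one substitution
# for a completely different name:
assert levenshtein("Sam", "Sammy") == 2
assert levenshtein("Sam", "Pam") == 1
```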

If you have training data available, you could look into using machine learning algorithms to tackle the problem. For example, maybe you could train a Bayesian classifier to recognize whether two sets of attributes match given the vector of their string distances.
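To make the classifier idea concrete, here's a toy sketch - a hand-rolled logistic regression rather than a real ML library, with entirely made-up training data - that learns how to combine per-attribute similarity scores into a match probability:

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    # Tiny stochastic-gradient-descent logistic regression. X holds
    # per-attribute similarity vectors, y holds 1 (same entity) or 0.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def match_probability(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Entirely made-up training data: [name_sim, phone_sim] -> match?
X = [[0.9, 1.0], [1.0, 1.0], [0.8, 0.9], [0.2, 0.1], [0.3, 0.0], [0.1, 0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
```

The learned weights are, in effect, an answer to the original question of how the per-attribute scores should be combined - the data decides instead of a hand-tuned formula.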

allmhuran
Posts: 3
Joined: Mon Feb 10, 2014 11:32 pm UTC

### Re: fuzzy match confidence w/ multiple attributes

I certainly agree that if you can add semantic rules you can get a much more effective matching algorithm.

But I think that it is also true that, whether or not you include the semantics on a per-data-element basis and create appropriate rules, matching on a combination of related attributes should be, in principle, more reliable than matching on each attribute independently and then trying to figure out how to combine the scores, since you've added more information to the system - even if that information is only the fact that the attributes are, in fact, related.

Similarly, surely I can say that from an information theory perspective, even perfect matches (at an attribute level) become more certain (at an entity level) the more attributes I have. Given two entiies, A{a,b} and B{a,b,c,d}, a perfect match between two A entities (on two attributes) is weaker than a perfect match between two B entities (on four attributes). But it does raise interesting questions - what if there really are only two attributes of A? Then an exact match on two attributes implies identity. I've never really heard of a theory that considers identity to be something quantitative - it's a yes or no answer. And yet, my B entity, with two more attributes, seems to have a higher certainty of identity than my A match.

I suppose I can reconcile this by supposing that the model is unlikely to persist all possible attributes of the entity in the real world, such that the only "true" identity in the real world would be the one where all possible attributes match. The model contains only a proper subset of those attributes, so we can never have perfect identity.

But this, then, somewhat flies in the face of the concept of natural keys... we'd be saying that the only way to be certain of the identity of the entity would be by the superkey containing every possible attribute, whereas natural keys in models necessarily imply that a subset of attributes is all that is required.

End rambling, thanks for the feedback.

lgw
Posts: 437
Joined: Mon Apr 12, 2010 10:52 pm UTC

### Re: fuzzy match confidence w/ multiple attributes

I've worked with "fuzzy matching" in real-world commercial code (email archiving/discovery). You can create an e.g. 64-bit signature for a document such that each minor change in the document is likely to flip 1 bit (for the first few changes). Thus if the number of bits "set" in the XOR of the signatures for the 2 documents is small, they are quite likely to be "similar" as judged by human intuition.

The original work was done at Stanford (damned if I can find a cite). The approach works for any sequence of symbols, which for my case meant treating a document as a sequence of words. You need a body of "related" documents from which to build word frequency counts - e.g., all the emails for a specific company are a good set when comparing two emails for that same company. Each word in a document affects one bit (based on a hash of the word, or some other pseudo-random but deterministic manner) in the signature to a magnitude related to the rarity of the word. So a common word missing or added might not even flip a bit, while a rare word would certainly flip a bit.
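A rough sketch of a simhash-style signature in that spirit - note this is a swap on my part: the scheme described above has each word touching a single bit, while the classic simhash variant sketched here lets each word vote on every bit of its hash, which behaves similarly. `md5` stands in for whatever hash the real system used, and the `idf` dict stands in for the corpus word-frequency weights:

```python
import hashlib
from collections import Counter

def simhash(words, idf, bits=64):
    # Each word votes on every bit of its hash, weighted by its
    # rarity (idf); a bit of the final signature is 1 iff the
    # weighted votes for that bit are positive. Changing a few
    # common words therefore flips only a few bits.
    acc = [0.0] * bits
    for word, count in Counter(words).items():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        weight = count * idf.get(word, 1.0)
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, v in enumerate(acc) if v > 0)

def hamming(a, b):
    # Number of differing bits between two signatures.
    return bin(a ^ b).count("1")
```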

Of course, it's by design in this case that "foo is a bar" and "foo is not a bar" are seen as quite similar. The method ignores the semantics of the document, but that was a feature for the task of legal discovery.

To make use of that you also need a data structure whereby given an input number, you can find all other numbers in the set with a small number of "differing bits", which is a fun problem to itself.
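That last problem has a neat pigeonhole trick - this is a sketch under the assumption of a small fixed distance bound, not the production structure: if two 64-bit signatures differ in at most k bits, they must agree exactly on at least one of k+1 disjoint chunks, so exact-match hash tables on each chunk narrow the candidates before a full bit-count check:

```python
from collections import defaultdict

class HammingIndex:
    # Pigeonhole index for "all signatures within max_dist bits".
    # Assumes bits is divisible by max_dist + 1 (true for 64 bits
    # and max_dist = 3: four 16-bit chunks).
    def __init__(self, max_dist=3, bits=64):
        self.max_dist = max_dist
        self.chunks = max_dist + 1
        self.chunk_bits = bits // self.chunks
        self.tables = [defaultdict(set) for _ in range(self.chunks)]

    def _chunk(self, sig, i):
        return (sig >> (i * self.chunk_bits)) & ((1 << self.chunk_bits) - 1)

    def add(self, sig):
        for i in range(self.chunks):
            self.tables[i][self._chunk(sig, i)].add(sig)

    def query(self, sig):
        # Any signature within max_dist agrees with the query on at
        # least one whole chunk, so only those buckets need checking.
        candidates = set()
        for i in range(self.chunks):
            candidates |= self.tables[i][self._chunk(sig, i)]
        return {c for c in candidates
                if bin(c ^ sig).count("1") <= self.max_dist}
```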
"In no set of physics laws do you get two cats." - doogly

Tub
Posts: 475
Joined: Wed Jul 27, 2011 3:13 pm UTC

### Re: fuzzy match confidence w/ multiple attributes

allmhuran wrote:Consider: Let's say we also have a phone number attribute. The CRM system says the entity is {"ACNE Corp", "555-1234"}, while the ERP has an entity {"ACME Corp", "555-1234"}. Taken on pairs of attributes, we will get two fuzzy match scores - let's say 0.8 for trading name, and 1.0 for phone number. How, then, should we determine the overall score?

I think you need to be smarter than just using a generic mathematical model. At this point I'd stop thinking about probabilities and introduce a different measurement - call it a score if you want.

A perfectly matched email address is a strong indicator for uniqueness, while even one character off can mean a completely different person - the difference between dude13@aol.com and dude93@aol.com.

A perfectly matched first name is not worth much at all. A different first name is a huge problem though - even if address, phone number and last name are identical. You don't want to bill the wrong family member.

For phone numbers, one character off is not a match. But you'll want a custom comparison function that ignores non-numeric characters (555-12345 vs 55512345) and understands international prefixes and the different shorthands.
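A sketch of such a comparison function - the country-prefix handling is a naive assumption, and real code should lean on a library like Google's libphonenumber:

```python
import re

def normalize_phone(raw, default_country="1"):
    # Strip everything but digits, then expand the common shorthands
    # so "+1 (555) 012-3456", "555 012 3456" and "5550123456" all
    # compare equal. The prefix logic is deliberately naive.
    digits = re.sub(r"\D", "", raw)
    stripped = raw.strip()
    if stripped.startswith("00"):
        digits = digits[2:]                # international call prefix
    elif stripped.startswith("+"):
        pass                               # already has country code
    elif not digits.startswith(default_country):
        digits = default_country + digits  # assume domestic number
    return digits

def phones_match(a, b):
    return normalize_phone(a) == normalize_phone(b)
```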

So for each attribute, you'll need a function that assigns a score (positive or negative) based on the difference, with score 0 for missing data.
Then you pick a cutoff point and return all customers whose scores are higher, ideally sorted. And always have a human confirm the match!
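Pulling that together into a sketch - the attribute names, scores and cutoff are all illustrative assumptions:

```python
def score_pair(a, b, scorers):
    # Sum per-attribute scores; missing or blank data contributes 0,
    # exactly as described above.
    total = 0.0
    for attr, scorer in scorers.items():
        va, vb = a.get(attr), b.get(attr)
        if not va or not vb:
            continue  # missing data: score 0
        total += scorer(va, vb)
    return total

# Illustrative rules in the spirit of the post (scores are made up):
scorers = {
    "email": lambda x, y: 3.0 if x.lower() == y.lower() else -1.0,
    "first": lambda x, y: 0.5 if x.lower() == y.lower() else -3.0,
    "last":  lambda x, y: 1.5 if x.lower() == y.lower() else -1.0,
}

def candidates(record, others, cutoff=2.0):
    # Return likely matches above the cutoff, best first -
    # a human should still confirm each one.
    scored = [(score_pair(record, o, scorers), o) for o in others]
    return [o for s, o in sorted(scored, key=lambda p: -p[0]) if s >= cutoff]
```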