Link to home
Start Free TrialLog in
Avatar of hbz
hbzFlag for United States of America

asked on

Computing whether two orders in a database come from the same person


Basically, I have a database of orders.  I now want to find which orders have come from the same person (billing information).  

A seemingly straightforward problem, but consider that fact that people might order using different credit cards, might be using coporate credit cards that don't have their names on them, etc.  I think I'll have to use several functions (or one function) that computes a 'confidence' that two orders are similar.

.. And I get the very strong gut feeling that this will be a matrix problem.

Any suggestions to point me in the right direction?

- hbz
SOLUTION
Avatar of grg99
grg99

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of hbz

ASKER


grg99:

"I would use some ranking function that compares their billing information..."

Unless I can think of something better, I have these three string-based functions:

f1: Leftmost 3 characters of first name and leftmost 4 characters of last name
f2: Leftmost 10 characters of email
f3: Lastname + state

So that if f1(order #1) = f1(order #2), f1 calls them equal.  Now, these functions are not equal.  f2 has a very low false positive rate, for example.  So I will probably (?) use a weighted sum of these f = f1 + f2 + f3

You'll have to forgive the rustiness/nonexistence of my math skills, but here is my reasoning regarding some sort of matrix is that if I'm comparing all pairs of orders against all others, I'd essentially have a 2-d matrix, N x N where N is the number of orders being compared.  Each element Xij would be the confidence returned by this function.  It would be symmetric about its diagonal.  Doing some Googling, I ran across some pages that mention "Cluster Analysis" and "distance matrices" which seem to be related to my problem.  ...But they are too dense for me to read.  Am I heading off the deep end here, perhaps making the problem too complicated?

- hbz
Avatar of grg99
grg99

Me too, my head hurts when I try to picture all this.  I'd stick with your idea of a weighted sum.

BTW there's often a lot of variability involved in domain names parts, for example I get mail at abc@xyz.edu, abc@maincampus.xyz.edu, abc@kremvax.xyz.edu, which are all really the same thing.  You may want to allow some flexibility here.

SOLUTION
Avatar of aburr
aburr
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi hbz,

What about starting with something simple and obvious, like shipping address?

Kent
aburr explained your approach very well, but I don't think it fits here, mostly because it's strange the use of a mathematical method to get an intermediate result: <<have the program spit out all pairs with a high score for human decision>> (aburr), when there is a procedural algorithm to get a final result: have the program do it all for you.

The main problem with your approach is the uncertainty introduced by the use of the hash functions; maybe they are useful for solving fuzzy problems like those described by grg99, but I think you never should take aaaabccccc@d.com = aaaabccccc1@d.com (f2) or Gene Hackman = Genevieve Hackerman (f1)!!

Data Consolidation is a tricky field. In many respects it's similar to Data Compression: usually a less than optimal compression is better than too much compression. Specific knowledge can be very useful too: <<something simple and obvious, like shipping address>> (Kdo).
My cents...
this is not doable as phrased.

> Any suggestions to point me in the right direction?
Yes, but first let me suggest the wrong TA was selected, a database question should go to database TA.  There at least used to be some people in MS-Access TA who worked on similar problems

> I think I'll have to use several functions (or one function) that computes a 'confidence' that two orders are similar.
no

>.. And I get the very strong gut feeling that this will be a matrix problem.
no

Not math, but identification, accuracy, and database.
My IMO is that it is a problem for human, not computer or calculator
Typically, this is resolved by creating a unique key that links the variety of records together.

Unfortunately many people have too similar names, such as the Jans Doe example, John Smith.
I had recent situation at work, person had over twenty letters in each of first and last name. Neith name had I seen before, so thought was the names were uncommon. Yet the two names were identical, as were their zipcodes, city, place of business/work, address, building, floor, hall, etc.  At their own recommendation, what we did was reduce records for the first name of one to be only three letters. Their interest was not same, it was for insuring that other people could tell them apart, such as for eMail and even paper mail.

That is among reasons I say the problem needs human involvement to resolve.  Use the human interest to get the records associated properly. They are interested, more than you can imagine.

I know some peolpe who have similar issue in banking management.  For only a single account at single bank, they would like to participate in handling transactions remotely. But since they have computers at home, and at work, the bank will still reject them based upon such things as different telephone numbers and changing of internet addresses.  They want resolution, the business seems to care less, forcing, restricting their options.

>  might be using coporate credit cards that don't have their names on them, etc

I don't accept that.

> You'll have to resign yourself to some false positive matches..

This is unacceptable,  Defining a final resolution using lawyers and courtrooms is too expensive an option

>  I now want to find which orders have come from the same person (billing information).  

Know well we are not faced with resolving a federal case of handling fraudulent activity and criminals or suspects trying to hide.

Where a person is truely interested in continuing a business relationship, purchasing from a company using multiple IDs, the answer is to ask the persons themselves which IDs to match together.  Where some companies still restrict all billing to one address, others permit separate billings, to separate addresses. A person can, say, have the same 'home address' for each record, the same 'work address' for each, and yet a different 'billing address'. As far as the database goes. The reality could be that some or all addresses may be identical for many individuals.

What ties them all together must be a unique database field for each individual, that can be associated with each transaction, bill, and payment.  The billing company can 'borrow' a number from any of the IDs associated with person to do this.  Better practise is to have the company invent its own ID for cross-referencing, one that is defined to best suit their purposes.

The proof of the record match can include the payment of the bill without question.

For the math part, it is not useful in grouping them together. This is easily accomplished by merel selecting the link to records, or sorting them and finding first best. Now, the math (eg matrix) does come in handy for resolving of duplications, and for making efficiencies inside database by eliminating or at least reducing any redundancies.

But if you assume you can be confident in using computers to consistently match diverse records to individuals, you will fail.
Please stop sending me bills for the purchases of SunShine!                          :-)
Nevermind, I'll just take all future business elsewhere!                                  :-(
(unless you give me address of SunShine, looks like lots of good stuff was ordered)              ;-)
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
RAPUTA >  Nonetheless, the author wanted to use matrices to solve their problem
Not so, see here:
hbz >  strong gut feeling that this will be a matrix problem. Any suggestions to point me in the right direction?

It is not a statement desiring to find way to use matrix, it is a question of whether matrix can produce reasonable solution

> but I think there is no space for changing the boundaries of the problem

Only true in malinformed malconstructed material of classroom exercise.

> consolidation issues happen frequently in information processing

True, witness my reference to elimination of redundancies (although that is not practiced very well at all)
Also there is potential for the finding of missing persons, or rather, criminal investigations of those seeking to hide identity

> Any two records can be put in the same group if they share at least one key information

well, the key is to identify keys. For sure. In the example as given I claim there are none. Two different credit cards have two different numbers.  Neither first name nor last is unique, and combining them does not solve the problem. Addresses are not unique.

In crime solving as well, even upon making of matches, the presumption remains that it is not true, for being innocent until found guilty. So investigators seek further evidence to corroborate the identification process to make the match.  Again, it is a people issue, not one for a formula or a function or an algorithm, but it can be a process. The process, in business, includes the people involved or the business is subject to failure, as witnessed by the dot.com debacle.

Here is a personal example. I had a credit card. I took a vacation and bought presents. In between shopping stops, my card was denied right to purchase. Upon denial, I asked what problem was. I was not informed, but quite likely some formula was used. I was not in my home town. Still, being disaffected, I cancelled the card and all future business with that company.  I changed scheme, and got credit card from bank where I had money deposited, effectively merging both accounts, so that all of the data fields line up perfectly between the two. One year I successfully use card for trip to city spending huge amount of money.  The next year I go to same city, spending money, but only one day instead of several. As I neared home I was low on fuel and stopped to fill it up. Once I had obtained the fuel, I found my card was no longer good for purchase. Fortunately I found enough change to make it out of there. Upon face to face confrontation with banker, a few phone calls were made and my card was reenabled and I was told that the reason it was denied was that my purchases did not match prior 'pattern'. It was either unknown or at least unexplained what that pattern was. But their pattern program denied them their ability to be useful in my fuel transaction.

>  for each email or shipping address there is only one person receiving there (safe hipothesis),

Not true at all in the real world. Many of same name even at same address, and many do share a common eMail.

There are a couple of other similar situations regarding pattern matching in the area of fraud. One is for credit card theft (physical one) and the other is for identity theft.  Now in each there can be a pattern that could lead an algorithm to report 500 orders per day from same home, or 500 people claiming to live in same room, etc.  Now you have a use for the pattern, but the purpose is to identify which ones are different, to distinguish the valid holder of the identity from the fraudulent one, yet still to leave it for the human review to determine, or assess, the probability of match or fraud using other tools.

> Then it's possible to find an algorithm for grouping records together.

You can group, but you cannot succeed.

People who deploy such methods are responsible for having workcrews declared dead, permitting dead people to receive checks and even to vote, and worse.  If done by government, too bad, they are sorry, you still stuck keeping them around,  if done by a business, the business will fail.

I recall some ISP example, father and son having same name, somehow the son got added to a list, and the father was being sued for not paying for internet service. Father had no computer or even telephone line (they would uae a pay phone down the street if needed). Ends up with the government putting ISP out of business for cause of fraudulent billing.

One must be duly careful about making assumptions for nearness of matches.  Personal contact always helps, and many card companies actually require a valid telephone number, and will not permit activation of card until receipt answers that very phone. Then, then they can associate the telephone number as a key, but as in by above example on banking, they may still need to add a flexibility to satisfy market conditions
Avatar of hbz

ASKER


Wow you guys have been busy!  Sorry for the delay -- I've been busy, too.  Here is some more information about my problem.  The goal is to find past orders which belong to the same person so that we can recognize them as high-frequency customers and reward them with a little "holiday bonus" as a token of appreciation.  As I've said before, the buyer may be corporate, may use different credit cards (which may have different billing names/addresses on them), may input different shipping addresses when they buy things as gifts for others, etc.

-----

> Kdo: What about starting with something simple and obvious, like shipping address?

As stated above, the shipping address would not be the solution by itself, since they may ship to different addresses.   However, a match on shipping address should be considered a strong match, I think.

> RAPUTA: have the program do it all for you.

I cannot let it be completely automated, since the client wants to get as close as possible to 100%.  This means that whatever algorithm I use must have a modest false positive rate.  There must be a human at the end of the process.

> ... The main problem with your approach is the uncertainty introduced by the use of the hash functions.... I think you never should take aaaabccccc@d.com = aaaabccccc1@d.com (f2) or Gene Hackman = Genevieve Hackerman (f1)!!

My functions are based on the data I have on hand, and I notice that, w.r.t. to email addresses and even their own names, that they tend to get the first few characters right, and that misspellings occur later on.  For example, they will type joe_schmoe@yahoocom (and leave out the second period in the address).  I realize that my functions are not very pure in the mathmatical sense.  As I've said before, I'm not accustomed to this type of analysis.  If you can suggest better ones, I'd love to hear them!

> SunBow: Yes, but first let me suggest the wrong TA was selected, a database question should go to database TA

Perhaps.  I thought about it, but it seems like a much more theoretical problem.  Tell you what:  I'll crosspost some points to the DB area.

> ... Typically, this is resolved by creating a unique key that links the variety of records together.... Better practise is to have the company invent its own ID for cross-referencing.

This would be the ideal.  Unfortunately, no such field exists in the database.

-----

Anyways, I feel like this has generated a lot of good dialogue, and may be a good intellectual exercise.  On the other hand, I don't feel any closer to any sort of satisfactory solution.  I'll crosspost to the DB area to get their input...

- hbz
I saw your question in the DB area.

I've spent lots of time working of similar projects.  The truth is there is no easy answer, it has to be a combination of some matching algorithms and some human work and you'll end with a somewhat satisfactory outcome.

1. Match orders by name.  As said before you have to have some tolerance for titles, variations in spelling, etc.  There are functions like SOUNDEX() that make a phonetical comparison but they don't work that well in my experience.

2. Find matches using the credit card.  Maybe they gave you a different name but the card matches.

3. Find matches using the phone number or the e-mail.

4. Find matches using the billing address and maybe the shipping address.

5. Have people review all your matches, just make it easy for them, give them a list with all the info so they can compare easily.

And as I said, having done many projects of this sort there are no universal answers, every database is different.  Some times I have a great deal of success with only the name and zip code.  Other times the phone numbers are good.  It all depends on what pieces of your database are reliable and which aren't.
>> but I think there is no space for changing the boundaries of the problem
> Only true in malinformed malconstructed material of classroom exercise.
real life is full of wrong or incomplete stored data, and many people want to use those data

>>  for each email or shipping address there is only one person receiving there (safe hipothesis),
> Not true at all in the real world. Many of same name even at same address, and many do share a common eMail.
I don't know how a postal address works in the USA, but here in Spain an address is a key for those leaving there: first name + last name + street name + build number + flat number + door number + postal code.
Hi,

... this is just a view from behind the curtains ...

My customers have a unique ID ... when they order online they have to logon -> so I know who my customers are ...

With the same unique ID they can order by phone, by fax, by mail ... by CtoB and BtoB and PtoB interfaces ...

... through that sole ID I know how many orders they've actually processed ... if it comes to a certain amount I can treat them like "benefited" clients/customers/partners ...

... I'd like to explain and to help but I'm not getting the point right now ...


Best regards,
Raisor
Hi

Would be really helpful to see some of database structure, table names, field names etc...

Agree with Raisor, if your db is normalised, then I expect there will be a primary key in  the table Customers that is represented as a foreign key in the table Orders , you can use this key to do a join on the two tables, returning some summary info, hmmm, maybe count of orders grouped by customer in descending order

using he the Northwind sample database you would use a query with source SQL like this.

SELECT Customers.ContactName, Count(Orders.OrderID) AS CountOfOrderID
FROM Customers INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
GROUP BY Customers.ContactName
ORDER BY Count(Orders.OrderID) DESC;


Which returns output something like this:
Contact Name      CountOfOrderID
Jose Pavarotti      31
Roland Mendel      30
Horst Kloss                      28
Maria Larsson      19
Patricia McKenna      19
Paula Wilson      18
Carlos Hernández      18
Christina Berglund      18


Probably enough data to mail out the holiday bonus

take care...


Alan (PageEd Databases)
I think you can safely assume that the order came from the same person if the shipping address is the same.|

Another possible combination - name and zipcode. Better, if you had a function to calculate the proximity of zipcodes. This is for the case when I order something from home and then from work for the company.
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
OK perhaps I'm completely off track, but what's wrogn with making your application require a user to first register to use it, and then log in each time they wish to place an order. That would make it easy to associate orders and completely eliminate the need for any "fuzzy logic" algorithms to compute the possibility of two orders coming from the same customer.

You could also requre the use of a valid email address to verify each order. That way the bvast majority of customers would use the same email address each time, certainly giving you at least as good a matching as any confidence match approach with 1/10th the complexity.

In short, you should get customers to identify themselves before allowing them to place an order. That way you could associate credit cards with paritcular customers, easily allowing muktiple cards per user. A sample table structure is as follows:

USERS
userid
name
surname
address

CREDIT_CARDS
cardid
userid -> USERS.userid
carnumber
expiry
cardtype

ORDERS
orderid
cardid CARDS -> cardid
details

As you can see users can have as many cards associated with them as they want, and orders can be associated with different cards and then linked to the owner by finding the owner of that card.

Hope this helps.
I think many of you are missing the point that the database already exists and he wants to detect people that have made several orders.  Obviously he doesn't have a unique customer ID and putting it now would be a good idea but still wouldn't solve the problem.
chanito  >I think many of you are missing the point that the database already exists and he wants to detect people that have made several orders. Obviously he doesn't have a unique customer ID

:-))

hbz > recognize them as high-frequency customers and reward them with a little "holiday bonus" as a token of appreciation

Sorry, still apples & oranges, better to either give them duplicate discounts (or bonuses) or to advertise the coming of bonuses, suggesting they notify the business of their multiple accounts.

Since we do not know these people, we also have to remember other rules of the game as well, such as the separation of work from pleasure.  So it goes, that businesses can and do ask for all such bonuses to be given to the business itself, and not to the worker. They justify by it was the nature of the business transactions that produced the reward, so they are entitles to it.  This has another little sense to it, that businesses do not want employees makinh transactions that may be more costly than necessary, just to enable their own gain through discounts and other persk. This includes the likes of bonus travel miles, for example, from a public carrier, a rather non-tangible asset.   I do wonder, however, for the time when the business offers a free cup of beverage, how best to turn some of these perks over to the boss for proper accounting.

The general rule is, that these two forms of expense cannot be mixed. So it remains incumbent to have the customer perform the declaration that it is alright. How, for example, would a recipient declare it for income tax recordkeeping purposes?

> must have a modest false positive rate.

but guessing wrong just once, can be such a problem for a recipient that they raise a major lawsuit, such as class action.  It could be viewed as discriminatory to one of different religion, race, or gender, for example.  Many first names are not gender specific. Make a mistake on that, well most people are tolerant and will shrug it off, but others may have a chip on shoulder or just simply want to benefit from their friendship with a class action lawyer

OTOH, for discounts or holiday bonus, the problem I raised for shared eMail and shared name and address is not as applicable. For even a grandfather, mother, and child to all share same name, they will typically agree that their bonusses can be pooled.  For one, it is usually the Senior's money that was used to make purchases, and for another, if they did not get along so well they should have moved out on their own. But if you have a choice of a Sr vs a Jr, you'd probably do better with merging benefits to the former, who can pass it on if so desired
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I think SunBow's advice is good and sound.  Just to add to my previous comment, if you want to do some programatic matching use whatever criteria, my advice is to first come up with an analysis.  For example:

Using credit card number you find 400 suspects of making say 900 orders.  First thing you could try to do is see if other data match, for example phone number, e-mail, etc.  And the most important thing is to create a list with the 900 entries and group them by the matching criteria and have someone (or you in the worst case) go through the list and see if you're detecting John P. Smith is the same as Jon Smith which sound right to me, or in the other hand you're matching Jane Smith as the same person too (maybe the daughter of John P.)
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Sorry for being late.  I was away for a while.  I would tend to think that the problem has no satisfactory mathematical solution.  I would make a different suggestion: identify the orders which clearly come from the same person.  For example because the names are identical, or differ by one letter but the addresses are identical, or were done with the same credit card number, etc.

Calculate the number of people who are going to get a reward by dividing your budget for rewards by the cost of each reward (all included: packaging, shipping, what have you).  Then draw that number of names from the list you prepared.

One risk will be that some of them will get more than one reward.  You can limit the damage by saying that only one price per person can be received (e.g. if you already have received a letter like this, please ignore it, etc.).  Some people might still cash in more than one reward, but it cannot then be helped.

The other problem (a certainty) is that some people who ordered several times will not be included in the list.  But then, who said that a promotion must be perfect?  On all your bills or catalogues you could clearly state that if people buy with different means they will risk to miss out on rewards.

I think that you should not [over]engineer a solution.  Ultimately, you do have a budget which you cannot exceed, or?  I would keep it simple.
The question has obviouly been answered but it will be difficult to pick out the relative comtributions of the responders.
Avatar of hbz

ASKER


It's been a while... The project that prompted this question has since been cancelled, so there was little motivation to pursue an answer to this question.  I guess what I was looking for was something a little more theoretical than what was discussed.  The most interesting web page I saw was http://www.dcs.shef.ac.uk/~sam/stringmetrics.html, but again I had no time to dive into it.  Thank you, everyone, however, for the interesting discussion!

- hbz
Hello hbz!!
Thank you for accepting my comments!!