Link to home
Start Free TrialLog in
Avatar of phoffric
phoffric

asked on

Associate lists of names with list of phone numbers from text messages

There are 40 members in a group I belong to. I get group event text messages with either 4,6,8, or 12 phone numbers and a corresponding number of names (call the names N01, N02, ... , N40). So far I have about a dozen event messages.


My goal is to create contact information associating the names with the correct phone number. Pencil and paper is one approach. Hoping that there is a programmatic approach to ease the effort.


Even getting a name down to two possible phone numbers is a good start so that contact info will show two names that can be resolved possibly in a future text message.


I am taking a wild guess that using Excel is an approach that might simplify this task; possibly C++, or Python also. (I don't know Excel; I do know C++, and have tiny exposure to Python.)


A trivial example:

I get a text message #1 with 4 numbers with names A,B,C,D.

I get a text message #2 with 4 numbers with names A,B,C,Z.

Z is associated with a number in #2 that is not in #1

D is associated with a number in #1 that is not in #2.


Avatar of byundt
byundt
Flag of United States of America image

In Excel, this would be a lookup table with formulas to return the results.

You could alternatively use a macro to avoid needing formulas.

If you could produce a sample workbook in Excel that shows the data layout and desired results, that would help us decide whether formulas or a macro would be better. You may use the numbers 1 through xx for the phone numbers and the letters a through z for the names.

Please also post which version of Excel you are using (e.g. Microsoft 365 subscription, Excel 2016, 2019, or 2021 perpetual).


Avatar of skullnobrains
skullnobrains

you have many programmatic approaches to this.

in sql or excel, you would create a simple 3 columns structure : message, user, number and play with a pivot table to grab the numbers that has the more representations or appears in all relevant mesages.

with a real programming language, you would rather use some simple algorithm to loop over each collection and grab the intersections for each user, reduce the collections removing known information and loop.

more advanced techniques such as the one you mention in the question are more complex to program but complementary.

is that an algoritmic question or real life ? for real algorithmic, you would rather work in the opposite way. start by mapping all users to all numbers and remove the combinations you know to be wrong. this is way more efficient. finding a match merely boils to removing the possibility others match the same number.

note that none of these algos will manage to solve all solvable situations. and additionally, there is no guarantee the system is solvable.
Avatar of phoffric

ASKER

Using: Microsoft® Excel® for Microsoft 365 MSO (Version 2206 Build 16.0.15330.20216) 64-bit

>> is that an algoritmic question or real life ?
Real life.

It will take me awhile to get the phone numbers, since all I can do is get Details in the group message and then take a picture. (I didn't see how to copy the phone number text on my Samsung Note 8.) At least I can get a text of the names. I determined that the order of the phone numbers does not correspond to the order of the names.

>> produce a sample workbook in Excel that shows the data layout and desired results, 
I haven't started yet, so could you let me know the best layout. Which one of these do you think is a better start? I'll use 01,...,40 for the 40 phone numbers. Going with my trivial example in the OP:

I get a text message #1 with 4 numbers with names A,B,C,D.

I get a text message #2 with 4 numbers with names A,B,C,Z.


I could have one text message per line; maybe something like this (using CSV notation):
01, 02, 03, 04,=,A,B,C,D
04, 02, 09, 01,=,B,Z,A,C

Or, I could have:
TextMsg1:
01,A
02,B
03,C
04,D

TextMsg2:
04,B
02,Z
09,A
01,C
===
>> desired results:
Any format that gives better information is appreciated.

By manual inspection, if I take the non-intersection of both sides, I get:

03,09 ~ D,Z, and in this case it is clear that:
03 ~ D
09 ~ Z
and
01, 02, 04 ~ A,B,C (still to be determined using future texts). But this association is better than the original unknowns.

As I save contacts on the phone, then future lists will have less unknown numbers, and less data entry into my spreadsheet.
I am going to try using the last 4 digits (maybe sometimes 5 if I find a duplicate) and the first names with a possible initial for the last name (since first names are likely to be duplicates).

Attached is a csv file, phoneNames.txt, that uses this format: 01, 02, 03, 04,=,A,B,C,D

In Notepad++, I can click on a number and see other lines where that number is, so manually, I can start collecting information and, I think there is at least one direct  association.

One wrinkle (minor, I think) is that there may be sometimes more names than phone numbers. Maybe that occurs when a substitute comes in and not part of the initial group broadcast message.

My hope is that as I add new lines, then that new information will help reduce the possibilities of the number of phone numbers against a particular individual.
In one message of 4 names, I had the phone numbers of the 3 others, so that left one, Bob.
I see another way manually to determine a phone number/name pairing. If I double-click on a name and see it in 4 rows, and only one of the numbers in the row appear in the same 4 rows, then that determines the pair, I think.
Here is my updated CSV file. phoneNames_02.txt

Just to be clear... I am not asking you to manually figure out the name/number association for me. I am hoping to be a little more general. Maybe not full automation (although that would be nice; but something that helps organize the data to make it easier to figure out the associations.

Manually, I was able to figure out Bob (which was obvious since I already knew the 3 other players' numbers), and Maureen, since there was only one number in that record that did not apear elsewhere, and her name only appeared once. I also figured out Will since his name appeared 4x, and only one number appeared 4x and also in the same 4 rows.

Possibly, there are other ways to narrow down the unique pairs and even narrow down to two names/two numbers.

Here is the corresponding spreadsheet.
phoneNames_02.xlsm
The attached workbook matches a phone number to a person if every time a person is listed so too is that phone number and there is only one such phone number for that person.

I am running Beta Channel version of Microsoft 365 and have the TEXTSPLIT function as a result. If you don't have TEXTSPLIT, I can offer alternative formulas using FILTERXML (Windows 2013 and later) or a one-line VBA user-defined function.

Weird matching.xlsx
If the approach looks promising and you don't have TEXTSPLIT, I hope you would permit the VBA user-defined function. Using FILTERXML to split text at a delimiter is a gross abuse of that function, and absolutely won't be understandable to anyone not an Excel expert.
>> I hope you would permit the VBA user-defined function. 
Of course that is fine and probably better.

Initially, I used CSV files. Then I switched to excel. If the latter approach is better, then that is fine with me. Keep in mind, this is a non-professional and personal endeavor. No need to be concerned about appearances.

I didn't get any errors when I opened your file. I did not understand what you were saying in your last two posts. Using VBA is fine with me.

No idea how you decided who is "confirmed".
No idea why the 9 (i.e., 0009) column has no X's in it.

I am not sure how to maintain this list as I get new records to add.

Also, I believe Maureen should have been confirmed. At least that is what I wrote above:
Maureen, since there was only one number in that record that did not apear elsewhere, and her name only appeared once.
FYI - As new records come in, I plan on entering the record directly in EXCEL, so no need to deal with parsing txt or csv files.

I see that you identify a unique association between one person and one phone number. (I don't really understand approach yet.)
Even getting a name down to two possible phone numbers is a good start so that contact info will show two names that can be resolved possibly in a future text message. 
The idea is that if I have persons, say, A and B, who are associated with only, say, N1 and N2, then if I need to get in touch with A, at least I have a 50-50 chance of texting N1 and asking if this is A. If not, then I know A's number is N2. Could that also be included?


It may be that I am misunderstanding your problem.

I assumed that the phone numbers and names in each row are jumbled. In other words, the first name doesn't necessarily correspond to the first number. If there are four names and four numbers, each of the four names could have any of the four numbers.

If that is the case, I figure out the correspondence if different groups of people fill out their phone numbers. I count the number of times that Maureen is associated with each of the different phone numbers. If that count for a given phone number agrees with the number of times that Maureen signed in, the formula puts an X in the table in C24:U45. If there is enough data so that there is only one X for Maureen in that row, that must be her phone number. You'll notice that I figured out phone numbers for Abe, Bob, Mike Y, Tim and Will using that method.

John has an X under four phone numbers. Pick one, and you have a 25% change of guessing correctly.

The attached workbook uses a VBA function called Splitter. I also used LAMBDA and LET functions in the worksheet formulas, but you ought to have those in your Microsoft 365.

Weird matching.xlsm
If the most recent workbook looks promising, I can eliminate the need for you to do any formula tweaking by putting the raw data in an Excel Table and modifying the formulas to use structured references. You will still need to copy the IF(COUNTIF formula across and down, but everything else is just data entry.

Weird matching.xlsm



Thanks for you description. Next weekend, I will try to understand  your algorithm, follow some of the code, and test out the results.

>> If there are four names and four numbers, each of the four names could have any of the four numbers. 
Yes, that is the case .

Starting with phoneNames_02.xlsm  located in https://www.experts-exchange.com/questions/29243184/Associate-lists-of-names-with-list-of-phone-numbers-from-text-messages.html#a43443753#a43443718, how do I enter new data and make corrections to existing data, and then get some results? Is it OK to just use .xlsm instead of .csv file? Or do i have to repeat columns H1:H7 and L1:L7 as you do?

Suppose all there was available were these two entries (from the OP):

2,1, 3,4,=,A,B,C,D

1,2,5,3,=,A,B,C,Z
Wouldn't this imply that D is confirmed with 4 and Z is confirmed with 5? If we plug these two lines as the only two lines in the spreadsheet, what would be your results?

re: Maureen. Initially, she is associated with 6 numbers.
6707 appears 4x.
8068 appears 3x.
9695 appears 2x.
7711 appears 2x.
7338 appears 2x.
8022 appears 1x
Would this imply that Maureen is confirmed with 8022 given that all the other 5 numbers appear more than once and Maureen's name appears only once.

Thank you byundt !




How do I enter new data and make corrections to existing data, and then get some results? My most recent Comment was addressed towards making it much easier for you to do data entry.

Is it OK to just use .xlsm instead of .csv file? Great!

Or do i have to repeat columns H1:H7 and L1:L7 as you do? If you put the raw data in a Table as shown in the most recent workbook, the formulas in H1:H7 and L1:L7 will copy down automatically as you enter data. I assumed that you would have one column for names and the other for phone numbers, but if it really come in with the equals sign, I can modify the formulas for the columns for Names and Phone Numbers as needed.

 Your logic with identifying the phone number for Maureen is an interesting alternative to the way I was doing it.
Could you go over in non excel terms the algorithm used to determine Tim and Will. Having trouble understanding your approach.

For Tim, I now see what skull was taking about...
grab the intersections for each user,

I can see barely the spreadsheet on my phone. Will have a larger screen by Friday. 

Here is what I have thought about so far. If the program takes the intersection of two records, and...
  1. only one name/number pair is left, we have one association for that name.
  2. the intersection is discarded leaving only one name/number pair left in the two records,  then we have two associations. (This is the approach in the OP to determine D and Z.)
The approach to get the association with Maureen looks a little more complicated.

Could you use two (or more) algorithms in a single solution? As associations are made,  they can be pruned from the records. I wonder if that will lead sometimes to more discoveries? 
Just thought of this possibility...

If the program takes the intersection of two records, and...

3. Exactly two name/number pairs are left, we have both name associated with two numbers. (And that was a secondary goal mentioned in the OP.)
Even getting a name down to two possible phone numbers is a good start
But to take this one step further... couldn't this 2x2 association be treated as its own pseudo record? And then the intersection of this pseudo record could be made with other real and/or pseudo records thereby increasing the chances that another unique association can be made.

Just guessing here. Will try to come up with an example next weekend if you don't beat me to it.
 As associations are made,  they can be pruned from the records. I wonder if that will lead sometimes to more discoveries?

that is one advantage of the approach that starts with all numbers for everyone and removes non pertinent matches. when you find a match, you just remove the phone number from the possible phone numbers of everyone else.

none of the approaches are sure to find all the matches so yes combining multiple ones make sense. yourinitial approach and the above should work well together.

i do not believe implementing this with excel formulas is going to be easy. i would personally use the format "message,name" and a second table with "message,number" but i tend to think sql rather than excel.


You know Skull, although this question was just a practical problem to keep me from getting dizzy looking at lots of numbers, I'm starting to take an interest in this algorithm. There may be more tricks to come up with. I came up with two approaches, and I will try to understand byundt's approach. I think both of my approaches may require checking all combinations of two rows looking for the intersection and also the non-intersection.

If VBA is not able to get the best solution (or no solution if that is the case), then I am open to other languages. Although I was once good at SQL, I don't want to add that to my environment. I did include Python and C++ since they have useful libraries to handle lists, graphs. But C++ is probably overkill, so I'd be good with Python.

If you and byundt believe that VBA is not the right choice, and if neither of you are versed in Python, then we can proceed with an algorithm pseudo code, where we prove somehow (by hand?) that the particular algorithm trick works all the time. I have a need to learn Python on my job, so in a month, I should be able to help with the Python. I'd be happy with a VBA solution as long as I can understand the algorithm behind it.

Another approach is that I ask the individuals for their numbers (and I have done that), but usually everyone is too tired after a couple of hours in 95 degree humid sunny day (soon to be 100 degrees).
his approach compares the number of occurences of numbers and people. it is very efficient but more error prone. it may produce very good results as a first pass algorithm if the results are confirmed individually or if it is used to remove non matches.

one rather nice and efficient might be this : loop through the collections and create for each user and phone number the list of matching messages. sort each of those lists. hash the results. compare the hashes. ... quite trivial in python and actually a nice coding exercise.
One possible improvement to my approach is to perform the search recursively. Once matched, exclude the phone number and name from the pattern search. This recursive feature can be done using LAMBDA function, VBA or a Data table. You can also do it using Iteration in the calculation mode, but I really hate to go down that path.
>>  but I really hate to go down that path.
I hate hate, so please let's not go down that path (whatever it means :) ).
Looks like VBA is fairly powerful.
So far, I think the algorithm, as I understand it consists of 3 or 4 parts. Hope I got it right:
  1. For every intersection of every pair of rows (i.e., messages), one pair at a time, do:
    1. If intersection produces only one name and one number, then we have a match --> move to match list.
    2. If match, then remove name/number from other messages (i.e., pruning)
    3. If intersection produces two names on each row and two numbers, then:
      1. Create a new message consisting of the two names and two number
      2. Remove the two names and two numbers from the original two messages.
    4. Note: Normally, if there are N messages, this step has C( N, 2) = number of combinations of N things taken two at a time. But, since we are creating new messages with less information, N is sometimes increasing, and the new message has to be compare with all other messages.
  2. For every intersection of every pair of rows, examine the names/numbers not in the intersection; and do:
    1. If the non-intersecting names/number are one per message (as in OP), then we have two matches.
    2. For each of the two matches, remove name/number from other messages.
    3. Example:
      • 1234  ABCD ==>12 AB  and 34 CD
        1278  ABXY ==> 12 AB and 78 XY
      • We replace the original two messages on the LHS with the 3 unique messages on the RHS.
  3. The Maureen Case: For a given message, if every name/number exists in another message except for one, then that name/number is a match.
    1. I have a sense that this can be broken down further (like a name/number exists in only two messages),, but not sure.

The 4th part is byundt's original proposal, which skull understands. Maybe understanding the code helps?? Maybe you can write it down in simple pseudo code steps explaining in simple terms how we can get a match for Mike Y and Tim  using your method.
I forgot..
After going through the 3 main parts of the algorithm, wash, repeat, and rinse until no new records or matches occur.
What we are left with will be, I think, a set of matches (probability of match = 100%); a set of messages having two associationgs (probability of match = 50%);  and so on.
My approach is to tabulate the number of times in the raw data a possible phone number appears with a given person. If that number is the same as the number of times in the raw data that that person appears--we may have a match.

I extracted the unique names and phone numbers from your raw data and built a table with a different name in each row and different phone number in each column. An "X" means that every time a name appears in the raw data, so too does that particular phone number. If there is only one "X" on a row, we definitely have a match.

This approach works when people randomly attend meetings. It is foiled when two or more people are joined at the hip and always attend (or miss) the same meetings. If Jack and Jill always attend together, the best the method can do is say that one of them has phone 1234 and the other has 5678--but we can't tell which is which.

By adding recursion to the solution approach, you are in essence shrinking the number of names and phone numbers on each line of your raw data. That makes it easier to match up the remaining people and phone numbers.
Could you get concrete with your explanation using Will and Tim? That might help. Will review this weekend. Thanks.
your algorithm is imho too complicated.

assuming the data is structured
messages = [ {
  "numbers" : [ ... , ... , ... ] ,
  " persons" : [ ... , ... , ... ]
} ,... ]

lazy algorythm using signatures

foreach messages as message
  foreach message.number as num
    collect [ num ] [ message ] = whatever
  foreach message.persons as who
    collect [ who ] [ message ] = whatever

foreach keys ( collect ) as key
  results [ collect.key.sort().join(':') ] = key

sort(results)

print. whatever user and num have identical signatures are a match or possible match if multiple users have the same sig.

in the above case, i merely join the message ids but you can also hash that.

the above algo will not fail and be very efficient with a few hundred or thousand messages but will choke on huge numbers of messages.

>> lazy algorythm using signatures
I googled  this expression. Closest I got was related to encryption. Do you have a tutorial on this algorithm? 
I looked up your psuedo code and it looks like actual python code. I'll try to run it this weekend to see what happens if I can figure out how to use pycharm, which I have. 
it is merely the second algorithm i suggested which i described as lazy given the trivial implementation and the fact i did not bother using hashes.

if you have many messages, use something like md5(msgs.sort.join)

sorry, i am on a mobile phone and walking in the street so i cannot provide better code
>> My approach is to tabulate the number of times in the raw data a possible phone number appears with a given person. If that number is the same as the number of times in the raw data that that person appears--we may have a match.

===
Is this right?
You confirmed Abe with 9095 because:
Abe's name appeared exactly 2 times.
9095 appeared exactly 2 times, and where it appeared, Abe's name was in the message.
===
Ditto for Tim since he also appeared exactly 2 times with 4647.
===
Mike Y is confirmed because...
Mike Y appeared exactly 3 times, and 8068 appeared exactly 3 times and where it appeared, Mike Y's name was in the message.
===

===
If so, then why isn't Taka confirmed:
Taka's name appeared exactly 3 times.
4750 appeared exactly 3 times, and where it appeared, Taka 's name was in the message. 
===


I added a new message of names/numbers to the latest weird spreadsheet
 and all the X's disappeared.
>> collect [ num ] [ message ] = "Whatever"
What is whatever in multiple places?

>> it is very efficient but more error prone.

Two kinds of errors...
  1. wrong association of name/number pair. -- not good.
  2. missing an association that another algorithm would find. -- not as bad.

Which were you referring to? If #2, what is the alternate or additional algorithm that would find some missing associations? 
Taka wasn't matched because there were two phone numbers (4750 & 6707) that both occurred every time Taka was mentioned and only when Taka was mentioned.

The concept of recursion comes in play, however. Will was matched to 6707, therefore Taka must be 4750.

I'm noodling on how to use a Data Table to make the logic recursive.
Recursive feature added. After 4 iterations, I can now match 10 names to numbers; 12 remain unmatched. 
Weird matching.xlsm
In your July 23 weird spreadsheet, am I supposed to now ignore B41:B62, and just look at 17:34?

In 17:34, I see that Chuck is 1231, but also Paul is 1231. How to interpret?
How to add and/or correct original edits?
whatever can be whatever you want including null, zero, the empty string. i store the mapping between number or people and messages in array keys. the values are not used. this is just the way i designed the pseudo code but you can change this easily.

i meant #1 which is why matches obtained need to be checked.

regarding #2, the problem is not guaranteed to have a solution. i think my last algo won't miss anything but it is only efficient for small numbers of people and numbers and the performance will degrade rapidely past a few hundreds probably.

said algo can be implemented in excel as well quite easily (much easier than the one you are both working on once recursion comes into play with a pivot or match+redirect+index) and it will be efficient but i see no point in using excel for such a task.
In the latest spreadsheet (July 23), I added a new row with names in A and numbers in B.
All tables had all their entries erased. How to get entries repopulated?

About the Python pseudo code. Maybe if I take a Python course, I will understand what you are saying better. That can be another question after the course, I guess. This will be useful, not for this question, but to better understand your comment that Python is the better way to go.
any language including vba is the way to go. you can implement the same in excel. it is just more complex.

just map both names and numbers to the list of messages they are involved in. sort each of the messages lists, possibly hash the lists and compare them.
>> Taka's name appeared exactly 3 times.
>> 4750 appeared exactly 3 times, and where it appeared, Taka 's name was in the message.  

>>  >> Taka wasn't matched because there were two phone numbers (4750 & 6707) that both occurred every time Taka was mentioned and only when Taka was mentioned. 

On the other hand, 6707 was in 4 messages, and so, naturally, in one of those messages, Taka's name was not present. So, shouldn't 6707 have been eliminated based on that evidence without having to do any recursion (which, may or may not produce better results for Taka)?
ASKER CERTIFIED SOLUTION
Avatar of byundt
byundt
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
If you encounter an error message prohibiting you from adding (or deleting) rows from raw data Table1, clear the array formula in the Data Table before trying to edit Table1.
Hi Brad,
Thanks for the sheet. Too late to review now. Will do first thing next weekend. All the explanations you provided in the sheet and here should help a lot. I am not able to understand the language, but I appreciate the sheet!
Talk to you soon.

Hi Skull,
I am certain to have to study Python for my job. At that time, we can revisit your pythonese approach, and I'll give it a try in another question (unless you want to provide a program here using any input you want - I'm flexible). I appreciate your comments, and sometimes think I got the gist of what you are recommending. Until I put it into practice, I won't be certain of the gist. I think I may even have gotten your ideas about pivots, but again, not positive until I think about it myself. You mentioned that my pseudo-code was too complicated. Be that as it may, do you think it does the job? If so, that my be my first Python program (since I am pretty sure I undertand my pseudo-code), and if it does the job, then I can compare it with the Brad's spreadsheet, and then with your more optimal approach. I'll stick with Python since this will be a good starter project for me given your pseudo-code.

Thanks again for your help!

Paul
@Brad,

Thanks for your dedication in solving this unusual problem.

I forgot about my multiple monthly reports (a real drag) that I just finished.

I'll look at the results next weekend. No matter, no rush, because I am temporarily on the DL list.

Paul
@Brad,

Thanks for your dedication in solving this unusual problem.

I forgot about my multiple monthly reports (a real drag) that I just finished.

I'll look at the results next weekend. No matter, no rush, because I am temporarily on the DL list.

Paul
feel free to post here if you try the programmatic way.

note that you can probably also grab the numers and messages from the mailing list directly in python. which may proove trivial if the list is available through imap and regular password auth or painful otherwise