asked on
Associate lists of names with list of phone numbers from text messages
There are 40 members in a group I belong to. I get group event text messages with either 4,6,8, or 12 phone numbers and a corresponding number of names (call the names N01, N02, ... , N40). So far I have about a dozen event messages.
My goal is to create contact information associating the names with the correct phone number. Pencil and paper is one approach. Hoping that there is a programmatic approach to ease the effort.
Even getting a name down to two possible phone numbers is a good start so that contact info will show two names that can be resolved possibly in a future text message.
I am taking a wild guess that using Excel is an approach that might simplify this task; possibly C++, or Python also. (I don't know Excel; I do know C++, and have tiny exposure to Python.)
A trivial example:
I get a text message #1 with 4 numbers with names A,B,C,D.
I get a text message #2 with 4 numbers with names A,B,C,Z.
Z is associated with a number in #2 that is not in #1
D is associated with a number in #1 that is not in #2.
in sql or excel, you would create a simple 3 columns structure : message, user, number and play with a pivot table to grab the numbers that has the more representations or appears in all relevant mesages.
with a real programming language, you would rather use some simple algorithm to loop over each collection and grab the intersections for each user, reduce the collections removing known information and loop.
more advanced techniques such as the one you mention in the question are more complex to program but complementary.
is that an algoritmic question or real life ? for real algorithmic, you would rather work in the opposite way. start by mapping all users to all numbers and remove the combinations you know to be wrong. this is way more efficient. finding a match merely boils to removing the possibility others match the same number.
note that none of these algos will manage to solve all solvable situations. and additionally, there is no guarantee the system is solvable.
ASKER
>> is that an algoritmic question or real life ?
Real life.
It will take me awhile to get the phone numbers, since all I can do is get Details in the group message and then take a picture. (I didn't see how to copy the phone number text on my Samsung Note 8.) At least I can get a text of the names. I determined that the order of the phone numbers does not correspond to the order of the names.
>> produce a sample workbook in Excel that shows the data layout and desired results,
I haven't started yet, so could you let me know the best layout. Which one of these do you think is a better start? I'll use 01,...,40 for the 40 phone numbers. Going with my trivial example in the OP:
I get a text message #1 with 4 numbers with names A,B,C,D.
I get a text message #2 with 4 numbers with names A,B,C,Z.
I could have one text message per line; maybe something like this (using CSV notation):
01, 02, 03, 04,=,A,B,C,D
04, 02, 09, 01,=,B,Z,A,C
Or, I could have:
TextMsg1:
01,A
02,B
03,C
04,D
TextMsg2:
04,B
02,Z
09,A
01,C
===
>> desired results:
Any format that gives better information is appreciated.
By manual inspection, if I take the non-intersection of both sides, I get:
03,09 ~ D,Z, and in this case it is clear that:
03 ~ D
09 ~ Z
and
01, 02, 04 ~ A,B,C (still to be determined using future texts). But this association is better than the original unknowns.
As I save contacts on the phone, then future lists will have less unknown numbers, and less data entry into my spreadsheet.
ASKER
In Notepad++, I can click on a number and see other lines where that number is, so manually, I can start collecting information and, I think there is at least one direct association.
My hope is that as I add new lines, then that new information will help reduce the possibilities of the number of phone numbers against a particular individual.
ASKER
I see another way manually to determine a phone number/name pairing. If I double-click on a name and see it in 4 rows, and only one of the numbers in the row appear in the same 4 rows, then that determines the pair, I think.
Here is my updated CSV file. phoneNames_02.txt
Just to be clear... I am not asking you to manually figure out the name/number association for me. I am hoping to be a little more general. Maybe not full automation (although that would be nice; but something that helps organize the data to make it easier to figure out the associations.
Manually, I was able to figure out Bob (which was obvious since I already knew the 3 other players' numbers), and Maureen, since there was only one number in that record that did not apear elsewhere, and her name only appeared once. I also figured out Will since his name appeared 4x, and only one number appeared 4x and also in the same 4 rows.
Possibly, there are other ways to narrow down the unique pairs and even narrow down to two names/two numbers.
ASKER
phoneNames_02.xlsm
I am running Beta Channel version of Microsoft 365 and have the TEXTSPLIT function as a result. If you don't have TEXTSPLIT, I can offer alternative formulas using FILTERXML (Windows 2013 and later) or a one-line VBA user-defined function.
Weird matching.xlsx
ASKER
Of course that is fine and probably better.
Initially, I used CSV files. Then I switched to excel. If the latter approach is better, then that is fine with me. Keep in mind, this is a non-professional and personal endeavor. No need to be concerned about appearances.
I didn't get any errors when I opened your file. I did not understand what you were saying in your last two posts. Using VBA is fine with me.
No idea how you decided who is "confirmed".
No idea why the 9 (i.e., 0009) column has no X's in it.
I am not sure how to maintain this list as I get new records to add.
Also, I believe Maureen should have been confirmed. At least that is what I wrote above:
Maureen, since there was only one number in that record that did not apear elsewhere, and her name only appeared once.
ASKER
I see that you identify a unique association between one person and one phone number. (I don't really understand approach yet.)
Even getting a name down to two possible phone numbers is a good start so that contact info will show two names that can be resolved possibly in a future text message.The idea is that if I have persons, say, A and B, who are associated with only, say, N1 and N2, then if I need to get in touch with A, at least I have a 50-50 chance of texting N1 and asking if this is A. If not, then I know A's number is N2. Could that also be included?
I assumed that the phone numbers and names in each row are jumbled. In other words, the first name doesn't necessarily correspond to the first number. If there are four names and four numbers, each of the four names could have any of the four numbers.
If that is the case, I figure out the correspondence if different groups of people fill out their phone numbers. I count the number of times that Maureen is associated with each of the different phone numbers. If that count for a given phone number agrees with the number of times that Maureen signed in, the formula puts an X in the table in C24:U45. If there is enough data so that there is only one X for Maureen in that row, that must be her phone number. You'll notice that I figured out phone numbers for Abe, Bob, Mike Y, Tim and Will using that method.
John has an X under four phone numbers. Pick one, and you have a 25% change of guessing correctly.
The attached workbook uses a VBA function called Splitter. I also used LAMBDA and LET functions in the worksheet formulas, but you ought to have those in your Microsoft 365.
Weird matching.xlsm
Weird matching.xlsm
ASKER
>> If there are four names and four numbers, each of the four names could have any of the four numbers.
Yes, that is the case .
Starting with phoneNames_02.xlsm located in https://www.experts-exchange.com/questions/29243184/Associate-lists-of-names-with-list-of-phone-numbers-from-text-messages.html#a43443753#a43443718, how do I enter new data and make corrections to existing data, and then get some results? Is it OK to just use .xlsm instead of .csv file? Or do i have to repeat columns H1:H7 and L1:L7 as you do?
Suppose all there was available were these two entries (from the OP):
Wouldn't this imply that D is confirmed with 4 and Z is confirmed with 5? If we plug these two lines as the only two lines in the spreadsheet, what would be your results?2,1, 3,4,=,A,B,C,D
1,2,5,3,=,A,B,C,Z
re: Maureen. Initially, she is associated with 6 numbers.
6707 appears 4x.
8068 appears 3x.
9695 appears 2x.
7711 appears 2x.
7338 appears 2x.
8022 appears 1x
Would this imply that Maureen is confirmed with 8022 given that all the other 5 numbers appear more than once and Maureen's name appears only once.
Thank you byundt !
Is it OK to just use .xlsm instead of .csv file? Great!
Or do i have to repeat columns H1:H7 and L1:L7 as you do? If you put the raw data in a Table as shown in the most recent workbook, the formulas in H1:H7 and L1:L7 will copy down automatically as you enter data. I assumed that you would have one column for names and the other for phone numbers, but if it really come in with the equals sign, I can modify the formulas for the columns for Names and Phone Numbers as needed.
Your logic with identifying the phone number for Maureen is an interesting alternative to the way I was doing it.
ASKER
For Tim, I now see what skull was taking about...
grab the intersections for each user,
I can see barely the spreadsheet on my phone. Will have a larger screen by Friday.
Here is what I have thought about so far. If the program takes the intersection of two records, and...
- only one name/number pair is left, we have one association for that name.
- the intersection is discarded leaving only one name/number pair left in the two records, then we have two associations. (This is the approach in the OP to determine D and Z.)
Could you use two (or more) algorithms in a single solution? As associations are made, they can be pruned from the records. I wonder if that will lead sometimes to more discoveries?
ASKER
If the program takes the intersection of two records, and...
3. Exactly two name/number pairs are left, we have both name associated with two numbers. (And that was a secondary goal mentioned in the OP.)
Even getting a name down to two possible phone numbers is a good startBut to take this one step further... couldn't this 2x2 association be treated as its own pseudo record? And then the intersection of this pseudo record could be made with other real and/or pseudo records thereby increasing the chances that another unique association can be made.
Just guessing here. Will try to come up with an example next weekend if you don't beat me to it.
As associations are made, they can be pruned from the records. I wonder if that will lead sometimes to more discoveries?
that is one advantage of the approach that starts with all numbers for everyone and removes non pertinent matches. when you find a match, you just remove the phone number from the possible phone numbers of everyone else.
none of the approaches are sure to find all the matches so yes combining multiple ones make sense. yourinitial approach and the above should work well together.
i do not believe implementing this with excel formulas is going to be easy. i would personally use the format "message,name" and a second table with "message,number" but i tend to think sql rather than excel.
ASKER
If VBA is not able to get the best solution (or no solution if that is the case), then I am open to other languages. Although I was once good at SQL, I don't want to add that to my environment. I did include Python and C++ since they have useful libraries to handle lists, graphs. But C++ is probably overkill, so I'd be good with Python.
If you and byundt believe that VBA is not the right choice, and if neither of you are versed in Python, then we can proceed with an algorithm pseudo code, where we prove somehow (by hand?) that the particular algorithm trick works all the time. I have a need to learn Python on my job, so in a month, I should be able to help with the Python. I'd be happy with a VBA solution as long as I can understand the algorithm behind it.
Another approach is that I ask the individuals for their numbers (and I have done that), but usually everyone is too tired after a couple of hours in 95 degree humid sunny day (soon to be 100 degrees).
one rather nice and efficient might be this : loop through the collections and create for each user and phone number the list of matching messages. sort each of those lists. hash the results. compare the hashes. ... quite trivial in python and actually a nice coding exercise.
ASKER
I hate hate, so please let's not go down that path (whatever it means :) ).
Looks like VBA is fairly powerful.
ASKER
- For every intersection of every pair of rows (i.e., messages), one pair at a time, do:
- If intersection produces only one name and one number, then we have a match --> move to match list.
- If match, then remove name/number from other messages (i.e., pruning)
- If intersection produces two names on each row and two numbers, then:
- Create a new message consisting of the two names and two number
- Remove the two names and two numbers from the original two messages.
- Note: Normally, if there are N messages, this step has C( N, 2) = number of combinations of N things taken two at a time. But, since we are creating new messages with less information, N is sometimes increasing, and the new message has to be compare with all other messages.
- For every intersection of every pair of rows, examine the names/numbers not in the intersection; and do:
- If the non-intersecting names/number are one per message (as in OP), then we have two matches.
- For each of the two matches, remove name/number from other messages.
- Example:
- 1234 ABCD ==>12 AB and 34 CD
1278 ABXY ==> 12 AB and 78 XY - We replace the original two messages on the LHS with the 3 unique messages on the RHS.
- 1234 ABCD ==>12 AB and 34 CD
- The Maureen Case: For a given message, if every name/number exists in another message except for one, then that name/number is a match.
- I have a sense that this can be broken down further (like a name/number exists in only two messages),, but not sure.
The 4th part is byundt's original proposal, which skull understands. Maybe understanding the code helps?? Maybe you can write it down in simple pseudo code steps explaining in simple terms how we can get a match for Mike Y and Tim using your method.
ASKER
After going through the 3 main parts of the algorithm, wash, repeat, and rinse until no new records or matches occur.
What we are left with will be, I think, a set of matches (probability of match = 100%); a set of messages having two associationgs (probability of match = 50%); and so on.
I extracted the unique names and phone numbers from your raw data and built a table with a different name in each row and different phone number in each column. An "X" means that every time a name appears in the raw data, so too does that particular phone number. If there is only one "X" on a row, we definitely have a match.
This approach works when people randomly attend meetings. It is foiled when two or more people are joined at the hip and always attend (or miss) the same meetings. If Jack and Jill always attend together, the best the method can do is say that one of them has phone 1234 and the other has 5678--but we can't tell which is which.
By adding recursion to the solution approach, you are in essence shrinking the number of names and phone numbers on each line of your raw data. That makes it easier to match up the remaining people and phone numbers.
ASKER
assuming the data is structured
messages = [ {
"numbers" : [ ... , ... , ... ] ,
" persons" : [ ... , ... , ... ]
} ,... ]
lazy algorythm using signatures
foreach messages as message
foreach message.number as num
collect [ num ] [ message ] = whatever
foreach message.persons as who
collect [ who ] [ message ] = whatever
foreach keys ( collect ) as key
results [ collect.key.sort().join(':') ] = key
sort(results)
print. whatever user and num have identical signatures are a match or possible match if multiple users have the same sig.
in the above case, i merely join the message ids but you can also hash that.
the above algo will not fail and be very efficient with a few hundred or thousand messages but will choke on huge numbers of messages.
ASKER
I googled this expression. Closest I got was related to encryption. Do you have a tutorial on this algorithm?
I looked up your psuedo code and it looks like actual python code. I'll try to run it this weekend to see what happens if I can figure out how to use pycharm, which I have.
if you have many messages, use something like md5(msgs.sort.join)
sorry, i am on a mobile phone and walking in the street so i cannot provide better code
ASKER
===
Is this right?
You confirmed Abe with 9095 because:
Abe's name appeared exactly 2 times.
9095 appeared exactly 2 times, and where it appeared, Abe's name was in the message.
Ditto for Tim since he also appeared exactly 2 times with 4647.
===
Mike Y is confirmed because...
===
===
If so, then why isn't Taka confirmed:
Taka's name appeared exactly 3 times.
4750 appeared exactly 3 times, and where it appeared, Taka 's name was in the message.
===
ASKER
ASKER
What is whatever in multiple places?
ASKER
Two kinds of errors...
- wrong association of name/number pair. -- not good.
- missing an association that another algorithm would find. -- not as bad.
Which were you referring to? If #2, what is the alternate or additional algorithm that would find some missing associations?
The concept of recursion comes in play, however. Will was matched to 6707, therefore Taka must be 4750.
I'm noodling on how to use a Data Table to make the logic recursive.
Weird matching.xlsm
ASKER
In 17:34, I see that Chuck is 1231, but also Paul is 1231. How to interpret?
How to add and/or correct original edits?
i meant #1 which is why matches obtained need to be checked.
regarding #2, the problem is not guaranteed to have a solution. i think my last algo won't miss anything but it is only efficient for small numbers of people and numbers and the performance will degrade rapidely past a few hundreds probably.
said algo can be implemented in excel as well quite easily (much easier than the one you are both working on once recursion comes into play with a pivot or match+redirect+index) and it will be efficient but i see no point in using excel for such a task.
ASKER
All tables had all their entries erased. How to get entries repopulated?
About the Python pseudo code. Maybe if I take a Python course, I will understand what you are saying better. That can be another question after the course, I guess. This will be useful, not for this question, but to better understand your comment that Python is the better way to go.
just map both names and numbers to the list of messages they are involved in. sort each of the messages lists, possibly hash the lists and compare them.
ASKER
>> 4750 appeared exactly 3 times, and where it appeared, Taka 's name was in the message.
>> >> Taka wasn't matched because there were two phone numbers (4750 & 6707) that both occurred every time Taka was mentioned and only when Taka was mentioned.
On the other hand, 6707 was in 4 messages, and so, naturally, in one of those messages, Taka's name was not present. So, shouldn't 6707 have been eliminated based on that evidence without having to do any recursion (which, may or may not produce better results for Taka)?
ASKER
Thanks for the sheet. Too late to review now. Will do first thing next weekend. All the explanations you provided in the sheet and here should help a lot. I am not able to understand the language, but I appreciate the sheet!
Talk to you soon.
Hi Skull,
I am certain to have to study Python for my job. At that time, we can revisit your pythonese approach, and I'll give it a try in another question (unless you want to provide a program here using any input you want - I'm flexible). I appreciate your comments, and sometimes think I got the gist of what you are recommending. Until I put it into practice, I won't be certain of the gist. I think I may even have gotten your ideas about pivots, but again, not positive until I think about it myself. You mentioned that my pseudo-code was too complicated. Be that as it may, do you think it does the job? If so, that my be my first Python program (since I am pretty sure I undertand my pseudo-code), and if it does the job, then I can compare it with the Brad's spreadsheet, and then with your more optimal approach. I'll stick with Python since this will be a good starter project for me given your pseudo-code.
Thanks again for your help!
Paul
ASKER
Thanks for your dedication in solving this unusual problem.
I forgot about my multiple monthly reports (a real drag) that I just finished.
I'll look at the results next weekend. No matter, no rush, because I am temporarily on the DL list.
Paul
ASKER
Thanks for your dedication in solving this unusual problem.
I forgot about my multiple monthly reports (a real drag) that I just finished.
I'll look at the results next weekend. No matter, no rush, because I am temporarily on the DL list.
Paul
note that you can probably also grab the numers and messages from the mailing list directly in python. which may proove trivial if the list is available through imap and regular password auth or painful otherwise
You could alternatively use a macro to avoid needing formulas.
If you could produce a sample workbook in Excel that shows the data layout and desired results, that would help us decide whether formulas or a macro would be better. You may use the numbers 1 through xx for the phone numbers and the letters a through z for the names.
Please also post which version of Excel you are using (e.g. Microsoft 365 subscription, Excel 2016, 2019, or 2021 perpetual).