Leo Torres
asked on
Python dataframe join select where not exists
I have 2 dataframes below I would like to set the NxtCase value
The NxtCase value has to Equal the missing letter Case based on ID and not what currently exists in Case column.
For example ID =1 and Case= 'A' then the value NxtCase = 'B' Any ideas on how I may set this value?
dfObj = pd.DataFrame(columns=['ID', 'Case','NxtCase'])
dfObj = dfObj.append({'ID': 1, 'Case': 'A', 'NxtCase': ''}, ignore_index=True) dfObj = dfObj.append({'ID': 2, 'Case': 'C', 'NxtCase': ''}, ignore_index=True) dfObj = dfObj.append({'ID': 3, 'Case': 'E', 'NxtCase': ''}, ignore_index=True)
dfObj2 = pd.DataFrame(columns=['CaseID', 'Case'])
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'A' }, ignore_index=True) dfObj2 = dfObj2.append({'ID': 1, 'Case': 'B'}, ignore_index=True) dfObj2 = dfObj2.append({'ID': 2, 'Case': 'C'}, ignore_index=True) dfObj2 = dfObj2.append({'ID': 2, 'Case': 'D'}, ignore_index=True) dfObj2 = dfObj2.append({'ID': 3, 'Case': 'E'}, ignore_index=True) dfObj2 = dfObj2.append({'ID': 3, 'Case': 'F'}, ignore_index=True)
My objective is to compare dfObj to dfObj2 and set the NxtCase value.The NxtCase value has to Equal the missing letter Case based on ID and not what currently exists in Case column.
For example ID =1 and Case= 'A' then the value NxtCase = 'B' Any ideas on how I may set this value?
Here's the code re-formatted.
However I don't think it's quite right as this is what the 2 data frames look like when output.
dfObj:
ID Case NxtCase
0 1 A
1 2 C
2 3 E
dfObj2:
CaseID Case ID
0 NaN A 1.0
1 NaN B 1.0
2 NaN C 2.0
3 NaN D 2.0
4 NaN E 3.0
5 NaN F 3.0
import pandas as pd
dfObj = pd.DataFrame(columns=['ID', 'Case','NxtCase'])
dfObj = dfObj.append({'ID': 1, 'Case': 'A', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 2, 'Case': 'C', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 3, 'Case': 'E', 'NxtCase': ''}, ignore_index=True)
dfObj2 = pd.DataFrame(columns=['CaseID', 'Case'])
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'A' }, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'B'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'C'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'D'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'E'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'F'}, ignore_index=True)
print(dfObj)
print(dfObj2)
However I don't think it's quite right as this is what the 2 data frames look like when output.
dfObj:
ID Case NxtCase
0 1 A
1 2 C
2 3 E
dfObj2:
CaseID Case ID
0 NaN A 1.0
1 NaN B 1.0
2 NaN C 2.0
3 NaN D 2.0
4 NaN E 3.0
5 NaN F 3.0
ASKER
Your right I messed up the column names thank you.
import pandas as pd
dfObj = pd.DataFrame(columns=['ID', 'Case','NxtCase'])
dfObj = dfObj.append({'ID': 1, 'Case': 'A', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 2, 'Case': 'C', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 3, 'Case': 'E', 'NxtCase': ''}, ignore_index=True)
dfObj2 = pd.DataFrame(columns=['ID', 'Case'])
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'A' }, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'B'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'C'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'D'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'E'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'F'}, ignore_index=True)
print(dfObj)
print(dfObj2)
This was What I intended
ID Case NxtCase
1 A
2 C
3 E
This is my desired result
This was What I intended
ID Case NxtCase
1 A B
2 C D
3 E F
ASKER
Your right I messed up the column names thank you.
import pandas as pd
dfObj = pd.DataFrame(columns=['ID', 'Case','NxtCase'])
dfObj = dfObj.append({'ID': 1, 'Case': 'A', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 2, 'Case': 'C', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 3, 'Case': 'E', 'NxtCase': ''}, ignore_index=True)
dfObj2 = pd.DataFrame(columns=['ID', 'Case'])
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'A' }, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'B'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'C'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'D'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'E'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'F'}, ignore_index=True)
print(dfObj)
print(dfObj2)
This was What I intended
ID Case NxtCase
1 A
2 C
3 E
This is my desired result
ID Case NxtCase
1 A B
2 C D
3 E F
Leo
What should happen if there isn't a 'next case'?
For example, if the ID was 1 and the case was 'B'?
What should happen if there isn't a 'next case'?
For example, if the ID was 1 and the case was 'B'?
ASKER
Hello Norie.. Thank you
Not Sure I understand your question or concern.
Let me try a different approach
1. There will always be a matching case
2. Each case has 1 ID I have 1 case A i need the query to return the other part of the ID. Lets not use letters
This is the initial table you have to find there partner
1 Jake
2 Cary
3 Mike
4 Nicole
Her is the second table with Partner and Key
1 Jake
2 Jordan
3 Mike
1 Susan
2 Cary
4 Nicole
3 Ana
4 Henry
If you were given these names and the ID represent a couple let say the similar ID would represent a couple.
So that being said these couples are together
Jake/Susan
Jordan/Cary
Mike/Ana
Henry/Nicole
Same as other example just using people instead. The IDs could be any number. See in the first table I already have 1 partner I need the logic to find the other partner and ignore the partner I already have.
I hope that explains it better.
Not Sure I understand your question or concern.
Let me try a different approach
1. There will always be a matching case
2. Each case has 1 ID I have 1 case A i need the query to return the other part of the ID. Lets not use letters
This is the initial table you have to find there partner
1 Jake
2 Cary
3 Mike
4 Nicole
Her is the second table with Partner and Key
1 Jake
2 Jordan
3 Mike
1 Susan
2 Cary
4 Nicole
3 Ana
4 Henry
If you were given these names and the ID represent a couple let say the similar ID would represent a couple.
So that being said these couples are together
Jake/Susan
Jordan/Cary
Mike/Ana
Henry/Nicole
Same as other example just using people instead. The IDs could be any number. See in the first table I already have 1 partner I need the logic to find the other partner and ignore the partner I already have.
I hope that explains it better.
Leo
So are do you mean the ID should be disregarded?
So are do you mean the ID should be disregarded?
ASKER
Not sure I understand by disregarded . You need the ID to match the couple and then compare who you have from the couple to return the other person.
ASKER
I think this is easy I may just not be explaining it correctly. I am not that versed in python but I can do this is SQL. The final select table is what I am trying to accomplish in python
Create table #tmp1(
ID int,
[Case] varchar(20),
[NxtCase] varchar(20)
)
Create table #tmp2(
ID int,
[Case] varchar(20),
)
Insert into #tmp1(ID,[Case])
Select 23, 'A'
Union
Select 51, 'H'
Union
Select 63, 'L'
Insert into #tmp2(ID,[Case])
Select 23, 'A'
Union
Select 51, 'H'
Union
Select 63, 'L'
union
Select 23, 'D'
Union
Select 51, 'O'
Union
Select 63, 'E'
update t1
Set [NxtCase] = t2.[Case]
from #tmp1 t1
join #tmp2 t2
on t1.ID = t2.ID
and t1.[Case] <> t2.[Case]
Select * from #tmp1
Have a look at this SQL.
Create table #tmp1(
ID int,
[Case] varchar(20),
[NxtCase] varchar(20)
)
Create table #tmp2(
ID int,
[Case] varchar(20),
)
Insert into #tmp1(ID,[Case])
Select 23, 'D'
Union
Select 51, 'H'
Union
Select 63, 'L'
Select * from #tmp1
Insert into #tmp2(ID,[Case])
Select 23, 'A'
Union
Select 51, 'H'
Union
Select 63, 'L'
union
Select 23, 'D'
Union
Select 51, 'O'
Union
Select 63, 'E'
Union
Select 63, 'G'
Union
Select 23, 'Z'
Select * from #tmp2
update t1
Set [NxtCase] = t2.[Case]
from #tmp1 t1
join #tmp2 t2
on t1.ID = t2.ID
and t1.[Case] <> t2.[Case]
Select * from #tmp1
ASKER
Ah ok I see what your concern is now for 63. There are multiple options to return. This is easy. In my case this will not exists 63 will only have 1 pair. Everything will only have 2 parts. There is no case where there is only 1 or 3 parts or more. If that happens then there is a data issue. So my example still holds. Thank you I should have made this more clear.
No problem, that definitely clarifies things.
Don't have time to have a proper look at it right now but I'll try and come up with something tomorrow.
Don't have time to have a proper look at it right now but I'll try and come up with something tomorrow.
ASKER
any updates?
I've been looking at it but I can't find a simple way to do it.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thank you Norie just saw comment. I thought it was just me I could not figure this out in python. I will test this
I was thinking are there addon out there that provide the ability to write SQL on python that way this type of manipulation would be easier(at least for me). Thank you I will test today. Thank you!
I was thinking are there addon out there that provide the ability to write SQL on python that way this type of manipulation would be easier(at least for me). Thank you I will test today. Thank you!
You can use SQL in Python but I didn't look into that as I thought you wanted a 'pure' Python method, plus you would need a bit of setup depending on which database you wanted to use.
This is what I came up with using sqlite which comes as standard with Python.
Wasn't able to get the UPDATE bit to work but the correct results appear to be returned.
Wasn't able to get the UPDATE bit to work but the correct results appear to be returned.
import pandas as pd
import sqlite3
dfObj = pd.DataFrame(columns=['ID', 'Case','NxtCase'])
dfObj = dfObj.append({'ID': 1, 'Case': 'A', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 2, 'Case': 'C', 'NxtCase': ''}, ignore_index=True)
dfObj = dfObj.append({'ID': 3, 'Case': 'E', 'NxtCase': ''}, ignore_index=True)
dfObj2 = pd.DataFrame(columns=['CaseID', 'Case'])
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'A' }, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 1, 'Case': 'B'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'C'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 2, 'Case': 'D'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'E'}, ignore_index=True)
dfObj2 = dfObj2.append({'ID': 3, 'Case': 'F'}, ignore_index=True)
conn = sqlite3.connect('test.db')
dfObj.to_sql('Table1', conn, index=False)
dfObj2.to_sql('Table2', conn, index=False)
query = """
SELECT table1.[Case], table2.[Case]
from table1
Left join table2
on table1.[ID] = table2.[ID]
and table1.[Case] <> table2.[Case];
"""
curr = conn.cursor()
data = curr.execute(query).fetchall()
print(data)
Copy paste of the code above provokes syntax errors, please clean up.
and name pd is not defined... I can Guess what you mean, but this was not a guessing game. You would like an answer... Code debugging is a different subject.