
Python: how to remove duplicate rows from output

Newton asked:
Hi,
I have two files, File1 and File2, as shown below:

File1
A | APPLE
B | ORANGE

File2
A | 10
B | 15
D | 20
A | 10

I need the following output:
Output 1
A | APPLE | 10
B | ORANGE | 15

But I am getting the output below:
A | APPLE | 10
B | ORANGE | 15
A | APPLE | 10

How can I remove the duplicate rows from the output and direct only the duplicate rows to a new file?

My code is as follows:

import pandas as pd

# Read both pipe-delimited files and left-join df2 onto df1 on column 'A'.
df1 = pd.read_csv('file1.txt', sep='|')
df2 = pd.read_csv('file2.txt', sep='|')
Merge12 = pd.merge(df1, df2, how='left', on='A')
Merge12.to_csv('output.txt')

Subodh Tiwari (Neeraj) (Excel & VBA Expert) commented:

You may either remove the duplicate rows from df1 and df2 first and then merge them, or remove the duplicate rows from the resulting dataframe:


Merge12.drop_duplicates(keep='first', inplace=True)
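
In full context, the second option (deduplicating the merged result) would look roughly like this; a minimal sketch assuming the same file names and '|' separator as in your code:

import pandas as pd

df1 = pd.read_csv('file1.txt', sep='|')
df2 = pd.read_csv('file2.txt', sep='|')

Merge12 = pd.merge(df1, df2, how='left', on='A')

# Drop any row that repeats an earlier row of the merged result, keeping the first occurrence.
Merge12.drop_duplicates(keep='first', inplace=True)

Merge12.to_csv('output.txt')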

Newton (Author) commented:
The code below worked.

import pandas as pd

df1 = pd.read_csv('file1.txt', sep='|')
df2 = pd.read_csv('file2.txt', sep='|')
# Drop duplicate rows from df2 before merging.
df2.drop_duplicates(keep='first', inplace=True)
Merge12 = pd.merge(df1, df2, how='left', on='A')
Merge12.to_csv('output.txt')

Now I want to write only the duplicated rows to a new file. Is the code below the correct way of doing this?

import pandas as pd
df1 = pd.read_csv('file1.txt', sep='|')
df2 = pd.read_csv('file2.txt', sep='|')
df2.drop_duplicates(keep='first',inplace=True)
Merge12 = pd.merge(df1, df2, how='left', on='A')
Merge12.to_csv('output.txt')

df3 = pd.read_csv('file.txt', sep='|')
df3.drop_duplicates(keep='first', inplace=False)
df3.to_csv('duplicatedrow.txt')
Subodh Tiwari (Neeraj) (Excel & VBA Expert) commented:

drop_duplicates will remove the duplicate rows from the dataframe, and the resulting dataframe will contain only the unique rows.


If you want to save the duplicate rows from a dataframe, you may try something like this...


df3 = pd.read_csv('file.txt', sep='|')
# Keep only the rows flagged as duplicates of an earlier row.
df3 = df3[df3.duplicated(keep='first')]
df3.to_csv('duplicatedrow.txt')

The saved file will now contain only the rows that were duplicates in the df3 dataframe.
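
As an illustration, the same duplicated() mask can also be used to split the unique and the duplicate rows into two separate files; a sketch, assuming the same file.txt input as above and a hypothetical output name uniquerows.txt:

import pandas as pd

df3 = pd.read_csv('file.txt', sep='|')

# Boolean mask: True for every row that repeats an earlier row, False for first occurrences.
dup_mask = df3.duplicated(keep='first')

# First occurrences go to one file, repeated rows go to another.
# index=False keeps the dataframe's row index out of the output files.
df3[~dup_mask].to_csv('uniquerows.txt', index=False)
df3[dup_mask].to_csv('duplicatedrow.txt', index=False)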

Newton (Author) commented:
Thank You Neeraj.
Subodh Tiwari (Neeraj) (Excel & VBA Expert) commented:

You're welcome, Newton!