maurice cristen
asked on
Windows 7 OS: gawk code to remove duplicates?
I need gawk code to remove this type of duplicate:
john:john
mary:1234
sdfds:sadfsf
john:john
mary:1234
ASKER
I have a huge text file (1.8 GB). How can I remove the duplicates and keep only the unique lines, in their original order? Can you guide me, please?
ASKER
Like in the sample I provided.
Keeping the order is trickier.
Normally you would sort the file, then use uniq to remove duplicates. But the resulting file will not have the original order.
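To illustrate the point about ordering, here is a small sketch with a hypothetical four-line file (`sample.txt` and its contents are invented for the example, not the asker's data). It assumes a POSIX shell such as the one that comes with Cygwin.

```shell
# Hypothetical sample whose first-seen order is not alphabetical.
printf 'zed:1\nadam:2\nzed:1\nmike:3\n' > sample.txt

# sort puts identical lines next to each other so uniq can drop the repeats,
# but the result comes back in sorted order, not in the original order.
sort sample.txt | uniq > sorted_unique.txt
cat sorted_unique.txt
```

The duplicate `zed:1` is gone, but `zed:1` now comes last instead of first.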
ASKER
App.Merge from the hashkiller forum can do that. It gives good results but does not keep the order; it sorts alphabetically. Maybe I will try that if you say that avoiding the sort is tricky.
gawk '!seen[$0]++' file.txt > results.txt
It should work, but I don't know how it behaves on 1 GB+ files. Try it.
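For anyone wondering how the one-liner works: `seen` is an associative array keyed by the whole line (`$0`). The first time a line appears, `seen[$0]` is still empty, so `!seen[$0]++` is true and awk prints the line; every later copy finds a non-zero count and is skipped. The expanded form below should behave the same way. It is only a sketch, run from a POSIX shell against the sample from the question. Since every distinct line is held in memory, memory use grows with the number of unique lines, which is the likely limit on a file of this size.

```shell
# Recreate the sample from the question.
printf 'john:john\nmary:1234\nsdfds:sadfsf\njohn:john\nmary:1234\n' > file.txt

# Expanded equivalent of the one-liner.
awk '{
    if (!($0 in seen)) {   # first time this exact line appears
        seen[$0] = 1       # remember it
        print              # and print it, so the original order is kept
    }
}' file.txt > results.txt
cat results.txt
```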
ASKER
C:\Program Files (x86)\GnuWin32\bin>awk '!seen[$0]++' file.txt > results.txt
awk: '!seen[$0]++'
awk: ^ invalid char ''' in expression
awk: '!seen[$0]++'
awk: ^ invalid char ''' in expression
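This error is what awk reports when the single quotes reach it as part of the program text: cmd.exe does not treat `'` as a quoting character the way a Unix shell does. One possible workaround, sketched here under that assumption, is to keep the program in a file and pass it with `-f`, so no command-line quoting is involved. `dedup.awk` is a made-up file name, and the sketch is written for a POSIX shell.

```shell
# Store the awk program in a file so that no shell quoting is needed.
# (On Windows the file could just as well be created in a text editor.)
cat > dedup.awk <<'EOF'
!seen[$0]++
EOF

# Recreate the sample from the question.
printf 'john:john\nmary:1234\nsdfds:sadfsf\njohn:john\nmary:1234\n' > file.txt

# -f makes awk read the program from the file instead of the command line.
awk -f dedup.awk file.txt > results.txt
cat results.txt
```

In cmd.exe itself, wrapping the program in double quotes instead of single quotes, as in `gawk "!seen[$0]++" file.txt > results.txt`, should also work.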
ASKER
Maybe the portable GNU utilities would work. Can you send me your folder, please, via sendspace.com or something similar? I will delete the link afterwards.
ASKER
If it doesn't work, I will keep your code anyway and try it on my other laptop, and then I will successfully close this thread and give you the points!
1)
Comment from "Dan Craciun" works fine using cygwin
[code]
D:\cygwin\home\murugesandinesh> D:\cygwin\bin\gawk.exe '!seen[$0]++' file.txt
john:john
mary:1234
sdfds:sadfsf
D:\cygwin\home\murugesandinesh>
[/code]
2)
cygwin commands inside C:\Windows\System32\cmd.exe
[code]
D:\cygwin\home\murugesandinesh> D:\cygwin\bin\sort.exe file.txt | D:\cygwin\bin\uniq.exe
john:john
mary:1234
sdfds:sadfsf
D:\cygwin\home\murugesandinesh>
[/code]
3)
cygwin commands inside C:\Windows\System32\cmd.exe and without using D:\cygwin\bin\uniq.exe
[code]
D:\cygwin\home\murugesandinesh> D:\cygwin\bin\sort.exe -u file.txt
john:john
mary:1234
sdfds:sadfsf
D:\cygwin\home\murugesandinesh>
[/code]
4)
C:\Windows\System32\cmd.exe provides help on its own SORT command. Hence "sort -u file.txt" inside C:\Windows\System32\cmd.exe will produce the following error:
Input file specified two times.
So, use the full path when executing commands on any operating system.
5)
Use the redirection that fits the requirement:
a. /full/path/command > outputfile.txt
b. /full/path/command > outputfile.txt 2>error.txt
Output and error going to the same file:
c. /full/path/command > outputfile.txt 2>&1
Run the following in the background:
d. /full/path/command > outputfile.txt 2>&1 &
Display error and output in the terminal as well as in the output file:
e. /full/path/command 2>&1 | /usr/bin/tee -a outputfile.txt
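The redirection forms a. to e. can be tried out with a stand-in command. In the sketch below, `both` is a hypothetical shell function that writes one line to standard output and one to standard error, and all file names are invented for the example. It assumes a POSIX shell (for instance the Cygwin one), not cmd.exe.

```shell
# Hypothetical stand-in for /full/path/command:
# one line goes to stdout, one line goes to stderr.
both() { echo "to stdout"; echo "to stderr" >&2; }

both > out_a.txt                  # a. stdout to the file, stderr still on the terminal
both > out_b.txt 2> err_b.txt     # b. stdout and stderr in separate files
both > out_c.txt 2>&1             # c. stdout and stderr in the same file
both > out_d.txt 2>&1 &           # d. same as c., but run in the background
wait                              # wait for the background job before reading out_d.txt
both 2>&1 | tee -a out_e.txt      # e. shown on the terminal and appended to the file
```

Note that `tee -a` appends, so running form e. again adds two more lines to `out_e.txt` instead of overwriting it.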
ASKER
thank you
Because if so you don't need awk, you need uniq.
HTH,
Dan