yelbow
perl parsing file - conditionally splitting to new file(s)
I have a text file which is as follows:
000001 $$1orange$$2apple2$$3pear
000002 $$1orange$$2apple2$$3pear
000003 $$1orange$$2apple2$$3pear
000002 $$1pear$$2apple2$$3orange
000003 $$1elephant$$2tiger$$3giraffe
000003 $$1elephant$$2tiger$$lion
000003 $$1elephant$$2monkey$$lion
The first column (i.e. 000001) is the record ID. The subfields are then denoted with $$. So, in the first line, subfield 1 is orange, subfield 2 is apple2, and subfield 3 is pear.
I want to split the file (which could be a few hundred lines or just a couple) based on the record ID and subfield 2.
So, the first time it encounters a unique ID/subfield 2 pair (e.g. 000001 / apple2), it should write the line out to one file. The second (and subsequent) times it finds this combination, it should write out to a different file.
Also, if it finds a repeat of the ID but subfield 2 is different, each such occurrence should be written to a separate file.
For example, the file above would be parsed into the following files:
File1 (the first occurrence of ID/Subfield2 being seen):
000001 $$1orange$$2apple2$$3pear
000002 $$1orange$$2apple2$$3pear
File2 (second occurrence of an ID being seen, but subfield2 is different):
000003 $$1elephant$$2tiger$$3giraffe
File3 (third occurrence of an ID being seen, but subfield2 is different):
000003 $$1elephant$$2monkey$$3giraffe
LastFile (2nd occurrence of ID/Subfield2 being seen):
000003 $$1orange$$2apple2$$3pear
000002 $$1pear$$2apple2$$3orange
000003 $$1elephant$$2tiger$$lion
How can this be achieved in Perl?
Any help much appreciated (as always!)
perl -F'/\s*\$\$/' -ape '$f=++$f{"@F[0,2]"};open STDOUT,">>$f" or warn "$f $!"' file
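For readers unfamiliar with the switches: `-a` together with `-F'/\s*\$\$/'` autosplits each input line into `@F` on optional whitespace followed by a literal `$$`, so `$F[0]` should be the record ID and `$F[2]` subfield 2 (still carrying its leading digit). A minimal sketch of the same split in a standalone script, using a sample line from the question (the variable names `$line` and `$key` are only illustrative):

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = '000001 $$1orange$$2apple2$$3pear';

# Same pattern as the -F switch: optional whitespace, then a literal "$$".
my @F = split /\s*\$\$/, $line;

# $F[0] is the record ID; $F[1] to $F[3] are the subfields with their digit prefix.
print "id=$F[0] sub2=$F[2]\n";

# "@F[0,2]" interpolates the slice joined by a space, giving the ID/subfield-2 key.
my $key = "@F[0,2]";
print "key=$key\n";
```

This prints `id=000001 sub2=2apple2` followed by `key=000001 2apple2`, which is the string the one-liner uses as its hash key.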
ASKER
Hi Ozo,
Thanks for this. However, it doesn't quite work (and, looking back, my example may have been a little dodgy).
It gives me the following five files:
'000001$$1orange'
000001 $$1orange$$2apple2$$3pear
cat '000002$$1orange'
000002 $$1orange$$2apple2$$3pear
cat '000003$$1orange'
000003 $$1orange$$2apple2$$3pear
cat '000002$$1pear'
000002 $$1pear$$2apple2$$3orange
cat '000003$$1elephant'
000003 $$1elephant$$2tiger$$3giraffe
000003 $$1elephant$$2tiger$$lion
000003 $$1elephant$$2monkey$$lion
The output of the first 3 files should all be combined in 1, as it's the first time each of them has been seen.
The fourth file is fine, as it's the second time the ID 000002 has appeared, but subfield 2 is different.
The 5th file has "000003 / tiger" which hasn't been seen before, so should be in the first file etc
I think http:#a39719995 is now corrected, but I'm still a little confused about the example.
ASKER
Hi,
That now gives:
cat 2
000002 $$1pear$$2apple2$$3orange
000003 $$1elephant$$2tiger$$lion
cat 1
000001 $$1orange$$2apple2$$3pear
000002 $$1orange$$2apple2$$3pear
000003 $$1orange$$2apple2$$3pear
000003 $$1elephant$$2tiger$$3giraffe
000003 $$1elephant$$2monkey$$lion
which isn't quite right. You're absolutely right about the example though - so, I'll try that again (sorry!)
Original File:
000001 $$1orange$$2apple2$$3pear
000002 $$1orange$$2apple2$$3pear
000003 $$1orange$$2apple2$$3pear
000002 $$1pear$$2apple2$$3orange
000003 $$1elephant$$2tiger$$3giraffe
000003 $$1elephant$$2tiger$$3lion
000003 $$1elephant$$2monkey$$3lion
000003 $$1lion$$2monkey$$3giraffe
Resulting files:
File 1 (the first occurrence of each pair of ID/subfield2):
000001 $$1orange$$2apple2$$3pear
000002 $$1orange$$2apple2$$3pear
000003 $$1orange$$2apple2$$3pear
File 2 (second occurrence of an ID being seen, but subfield2 is different):
000003 $$1elephant$$2tiger$$3giraffe
File 3 (third occurrence of an ID being seen, but subfield2 is different):
000003 $$1elephant$$2monkey$$3lion
File 4 (the second (and subsequent) occurrences of an ID/subfield2 pair):
000002 $$1pear$$2apple2$$3orange
000003 $$1elephant$$2tiger$$3lion
000003 $$1lion$$2monkey$$3giraffe
Does that help? Sorry if I'm being unclear.
The ultimate aim is to only have the ID appear once in each file, unless it's in the final file which is a catchall of the second and subsequent occurrences of the ID/Subfield2 pairings.
Now it sounds like you are describing something like
perl -F'/\s*\$\$/' -ape '$i=++$i{$F[0]};$f=++$f{"@F[0,2]"};open STDOUT,">>$i.$f" or warn "$i.$f $!"'
But I'm still confused
000003 $$1elephant$$2tiger$$3giraffe is the first occurrence of the pair 000003/2tiger, so why isn't it in the first file?
ASKER
Hi,
It's because it's the second occurrence of the ID 000003, and in all files bar the "catchall" file (where it has seen both the ID and subfield2 before), the ID can only exist once.
I'm, roughly, trying to achieve this:
For each line
    if I've seen this ID before then
        if I've seen this ID/Subfield2 combo before then
            write out to a single catchall file
        else
            write out to new filex
        endif
    else
        write out to file1
    endif
end for
Here's another interpretation:
perl -F'/\s*\$\$/' -ape '$f=$f{"@F[0,2]"}++||$F[2];open STDOUT,">>$f" or warn "$f $!"'
ASKER
Ooh, that's nearly there (thanks so much for your patience), only it writes out *all* of the lines to the "catchall" file?
Are you sure catchall file and new file aren't switched in http:#a39720116?
It matches your most recent example better when they are switched.
ASKER
That seems perfect, thanks so much for your help and patience
perl -F'/\s*\$\$/' -ape '$s=$f{"@F[0,2]"}++;$f=$f{$F[0]}++?$s?"catchcall":++$n:"file1";open STDOUT,">>$f" or warn "$f $!"'
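For anyone who finds the one-liner hard to read, here is a rough sketch of what it appears to do, unrolled into a script. The sub name `route`, the hashes `%seen_id` and `%seen_pair`, and the output names `file1` and `catchall` are illustrative assumptions (the one-liner keeps both counts in a single hash `%f` and spells the last file `catchcall`). It follows the asker's pseudocode: first sighting of an ID goes to `file1`, a repeated ID/subfield-2 pair goes to `catchall`, and a known ID with a new subfield 2 gets the next numbered file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%seen_id, %seen_pair);   # how often each ID and each ID/subfield-2 pair was seen
my $n = 0;                   # counter for the numbered files

# Return the name of the output file that a line belongs to.
sub route {
    my ($line) = @_;
    my @F = split /\s*\$\$/, $line;           # same split as -F'/\s*\$\$/'
    my $pair_seen = $seen_pair{"@F[0,2]"}++;  # count before this line
    my $id_seen   = $seen_id{$F[0]}++;
    return 'file1'    unless $id_seen;        # first time this ID appears
    return 'catchall' if $pair_seen;          # ID and subfield 2 both seen before
    return ++$n;                              # known ID, new subfield 2
}

# Only process input when file names are given on the command line.
if (@ARGV) {
    my %fh;   # cache of open handles, one per output file
    while (my $line = <>) {
        my $name = route($line);
        unless ($fh{$name}) {
            open $fh{$name}, '>>', $name or die "Cannot open $name: $!";
        }
        print { $fh{$name} } $line;
    }
    close $_ for values %fh;
}
```

Saved under a hypothetical name such as `split_records.pl`, it would be run as `perl split_records.pl file`. Like the one-liner, it opens the output files in append mode, so leftovers from an earlier run should be removed first.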