Solved

perl parsing file - conditionally splitting to new file(s)

Posted on 2013-12-15
Last Modified: 2013-12-15
I have a text file which is as follows


000001      $$1orange$$2apple2$$3pear
000002      $$1orange$$2apple2$$3pear
000003      $$1orange$$2apple2$$3pear
000002      $$1pear$$2apple2$$3orange
000003      $$1elephant$$2tiger$$3giraffe
000003      $$1elephant$$2tiger$$lion
000003      $$1elephant$$2monkey$$lion


The first column (e.g. 000001) is the record ID.  The subfields are then denoted with $$ followed by the subfield number.  So, in the first line, subfield 1 is orange, subfield 2 is apple2, and subfield 3 is pear.
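
As a rough illustration of the record layout described above, here is a minimal Perl sketch that splits one line into its ID and subfields. The sub name parse_record is hypothetical, and it assumes every subfield is introduced by $$ followed by a single digit; a subfield without a digit (such as $$lion) is simply skipped.

```perl
use strict;
use warnings;

# Split one record into its ID and a hash of subfield number => value.
# Assumption: subfields look like "$$" + one digit + text without "$".
sub parse_record {
    my ($line) = @_;
    chomp $line;
    # The ID is the first run of non-blank characters; the rest holds the subfields.
    my ($id, $rest) = $line =~ /^(\S+)\s+(.*)$/ or return;
    my %sub;
    while ($rest =~ /\$\$(\d)([^\$]*)/g) {
        $sub{$1} = $2;
    }
    return ($id, \%sub);
}

my ($id, $sub) = parse_record('000001      $$1orange$$2apple2$$3pear');
print "$id / $sub->{2}\n";   # prints "000001 / apple2"
```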

I want to split the file (which could be a few hundred lines or just a couple) based on the record ID and subfield 2.

So, the first time it encounters a unique ID/subfield 2 pair (e.g. 000001 / apple2) it should write out to one file.  The second (and subsequent) times it finds this combo, it should write out to a different file.

Also, if it finds a repeat of the ID, but subfield 2 is different - each occurrence of this should write to a separate file.

For example, the file above would be parsed into the following files:

File1 (the first occurrence of ID/Subfield2 being seen):

000001      $$1orange$$2apple2$$3pear
000002      $$1orange$$2apple2$$3pear

File2 (second occurrence of an ID being seen, but subfield2 is different):
000003      $$1elephant$$2tiger$$3giraffe

File3 (third occurrence of an ID being seen, but subfield2 is different):
000003      $$1elephant$$2monkey$$3giraffe

LastFile (2nd occurrence of ID/Subfield2 being seen):
000003      $$1orange$$2apple2$$3pear
000002      $$1pear$$2apple2$$3orange
000003      $$1elephant$$2tiger$$lion

How can this be achieved in perl?

any help much appreciated (as always!)
Question by:yelbow
12 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 39719995
perl -F'/\s*\$\$/' -ape '$f=++$f{"@F[0,2]"};open STDOUT,">>$f" or warn "$f $!"' file
 

Author Comment

by:yelbow
ID: 39720023
Hi Ozo,

Thanks for this - however, it doesn't quite work (and looking back my example may have been a little dodgy)

It gives me the following five files:

'000001$$1orange'
000001      $$1orange$$2apple2$$3pear

cat '000002$$1orange'
000002     $$1orange$$2apple2$$3pear

cat '000003$$1orange'
000003      $$1orange$$2apple2$$3pear

cat '000002$$1pear'
000002      $$1pear$$2apple2$$3orange

cat '000003$$1elephant'
000003      $$1elephant$$2tiger$$3giraffe
000003      $$1elephant$$2tiger$$lion
000003      $$1elephant$$2monkey$$lion

The output of the first 3 files should all be combined in 1, as it's the first time each of them has been seen.

The fourth file is fine, as it's the second time the ID 000002 has appeared, but subfield 2 is different.

The 5th file has "000003 / tiger" which hasn't been seen before, so should be in the first file etc
 
LVL 84

Expert Comment

by:ozo
ID: 39720030
I think http:#a39719995 is now corrected, but I'm still a little confused about the example.
 

Author Comment

by:yelbow
ID: 39720049
Hi,

That now gives:
cat 2
000002      $$1pear$$2apple2$$3orange
000003      $$1elephant$$2tiger$$lion
cat 1
000001      $$1orange$$2apple2$$3pear
000002      $$1orange$$2apple2$$3pear
000003      $$1orange$$2apple2$$3pear
000003      $$1elephant$$2tiger$$3giraffe
000003      $$1elephant$$2monkey$$lion

which isn't quite right.  You're absolutely right about the example though - so, I'll  try that again (sorry!)

Original File:
000001      $$1orange$$2apple2$$3pear
000002      $$1orange$$2apple2$$3pear
000003      $$1orange$$2apple2$$3pear
000002      $$1pear$$2apple2$$3orange
000003      $$1elephant$$2tiger$$3giraffe
000003      $$1elephant$$2tiger$$3lion
000003      $$1elephant$$2monkey$$3lion
000003      $$1lion$$2monkey$$3giraffe

Resulting files:

File 1 (the first occurrence of each pair of ID/subfield2):
000001      $$1orange$$2apple2$$3pear
000002      $$1orange$$2apple2$$3pear
000003      $$1orange$$2apple2$$3pear


File 2 (second occurrence of an ID being seen, but subfield2 is different)
000003      $$1elephant$$2tiger$$3giraffe

File 3 (third occurrence of an ID being seen, but subfield2 is different)
000003      $$1elephant$$2monkey$$3lion

File 4 (the second (and subsequent) occurrences of an ID/subfield2 pair)
000002      $$1pear$$2apple2$$3orange
000003      $$1elephant$$2tiger$$3lion
000003      $$1lion$$2monkey$$3giraffe

Does that help?  Sorry if I'm being unclear.

The ultimate aim is to only have the ID appear once in each file, unless it's in the final file which is a catchall of the second and subsequent occurrences of the ID/Subfield2 pairings.
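
The aim stated above can be checked mechanically once the output files exist. The following is a minimal sketch only: the sub name repeated_ids is hypothetical, and it assumes the files have already been read into a hash of file name => array of lines, with the catchall file's name passed in separately.

```perl
use strict;
use warnings;

# Return "file: id" strings for every ID that repeats inside a file
# other than the catchall. $files maps file name => array ref of lines.
sub repeated_ids {
    my ($files, $catchall) = @_;
    my @problems;
    for my $name (sort keys %$files) {
        next if $name eq $catchall;          # repeats are allowed here
        my %count;
        for my $line (@{ $files->{$name} }) {
            my ($id) = $line =~ /^(\S+)/ or next;
            # Report each repeated ID once, on its second appearance.
            push @problems, "$name: $id" if ++$count{$id} == 2;
        }
    }
    return @problems;
}
```

An empty return list means every non-catchall file holds each ID at most once.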
 
LVL 84

Expert Comment

by:ozo
ID: 39720101
Now it sounds like you are describing something like
perl -F'/\s*\$\$/' -ape '$i=++$i{$F[0]};$f=++$f{"@F[0,2]"};open STDOUT,">>$i.$f" or warn "$i.$f $!"'
But I'm still confused:
000003      $$1elephant$$2tiger$$3giraffe is the first occurrence of the pair 000003/2tiger, so why isn't it in the first file?
 

Author Comment

by:yelbow
ID: 39720116
Hi,

It's because it's the second occurrence of the ID 000003 - and in all files bar the "catchall" file (where it's seen both the ID and subfield 2 before), the ID can only exist once.

I'm, roughly, trying to achieve this:

For each line
	if I've seen this ID before then
		if I've seen this ID/Subfield2 combo before then
			write out to a single catchall file
		else
			write out to new filex
		endif
	else
		Write out to file1
	endif
end for
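
The pseudocode above can be sketched as a small Perl script. This is only an illustration, not the accepted answer: the sub name route_lines and the output names file1, catchall, and new1, new2, ... are assumptions, and subfield 2 is taken to be the third field when splitting on $$ (the same split the one-liners in this thread use). The ID/subfield 2 pair is recorded on every line, including the first time an ID is seen.

```perl
use strict;
use warnings;

# Map each input line to an output file name, following the pseudocode.
# Returns a hash ref of file name => array ref of lines (in input order).
sub route_lines {
    my (@lines) = @_;
    my (%seen_id, %seen_pair, %out);
    my $n = 0;                                   # counter for the "new" files
    for my $line (@lines) {
        chomp(my $rec = $line);
        my ($id, undef, $sub2) = split /\s*\$\$/, $rec;
        next unless defined $sub2;               # skip lines without a subfield 2
        my $pair_seen = $seen_pair{"$id $sub2"}++;   # always record the pair
        my $file = !$seen_id{$id}++ ? 'file1'        # first time this ID appears
                 : $pair_seen       ? 'catchall'     # ID and pair both seen before
                 :                    'new' . ++$n;  # ID seen, pair is new
        push @{ $out{$file} }, $line;
    }
    return \%out;
}

# Only touch the file system when an input file is named on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $out = route_lines(<$in>);
    close $in;
    for my $file (sort keys %$out) {
        open my $fh, '>', $file or die "Cannot write $file: $!";
        print {$fh} @{ $out->{$file} };
        close $fh or die "Cannot close $file: $!";
    }
}
```

Saved under a hypothetical name such as split.pl, it would be run as: perl split.pl file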

 
LVL 84

Expert Comment

by:ozo
ID: 39720124
Here's another interpretation:
perl -F'/\s*\$\$/' -ape '$f=$f{"@F[0,2]"}++||$F[2];open STDOUT,">>$f" or warn "$f $!"'
 

Author Comment

by:yelbow
ID: 39720139
Ooh, that's nearly there (thanks so much for your patience), only it writes out *all* of the lines to the "catchall" file?
 
LVL 84

Accepted Solution

by:
ozo earned 2000 total points
ID: 39720141
This should be a literal translation of http:#a39720116
perl -F'/\s*\$\$/' -ape '$f=$f{$F[0]}++?$f{"@F[0,2]"}++?"catchall":++$n:"file1";open STDOUT,">>$f" or warn "$f $!"'
 
LVL 84

Expert Comment

by:ozo
ID: 39720164
Are you sure catchall file and new file aren't switched in  http:#a39720116?
It matches your most recent example better when they are switched.
 

Author Closing Comment

by:yelbow
ID: 39720167
That seems perfect, thanks so much for your help and patience
 
LVL 84

Expert Comment

by:ozo
ID: 39720170
perl -F'/\s*\$\$/' -ape '$s=$f{"@F[0,2]"}++;$f=$f{$F[0]}++?$s?"catchall":++$n:"file1";open STDOUT,">>$f" or warn "$f $!"'

Question has a verified solution.
