enthuguy
asked on
awk with column based string search
Hi EE,
Please help with below, have partially achieved but would request your help pls :)
Below is the sample file I receive. Would like to split the file into two... based on below criteria:
1. Read line by line
2. Go to column (position) 89
3. if next four char is "GIRR" > take this line by creating a new file
4. Rest of the lines should go into another file
5. From below sample file. first two lines in file1.txt. Last line should be in file2.txt
So far I tried below awk and it works fine by searching the string "GIRR" in the entire line and creates two files. Please help/suggest how to check "GIRR" from column 89
awk -F, '{if(/GIRR/)print > "file2.txt";else print > "file1.txt"}' sample1.txt
Sample1.txt
00003004MRS PL GIRRAAAAAAA 44 WIDFORD ST GLENROY XYZ 1234 12340000094070544 LHOM00000000106599090000000001001 00000000+0000000000000.0033200000000106599090310A L669975878 65434519MA03NAB TAILORED TEST WALK MRS AAAAAAAAAAA RTRPP MRS UUUUU AAAAAAAAAAA AAAAAAAAAAA UUUUU AAAAAA JP AAAAAAAAAAA UUUUU AAAAAA PP AAAAAAAAAAA RRRR RTRTRTR SP +0000000000124.886543456540400083004134534948+000001234.006543456540400N
00003004MR DA AAAAAAAAAAA 44 WIDFORD STREET GLENROY XYZ 1234 12340000094070544 LHOM00000000106599090000000001001 00000000+0000000000000.0033200000000106599090310 L738333503 65434519MA03NAB TAILORED TEST WALK MR AAAAAAAAAAA SCDSP MR RRRR AAAAAAAAAAA AAAAAAAAAAA UUUUU AAAAAA JP AAAAAAAAAAA UUUUU AAAAAA PP AAAAAAAAAAA RRRR RTRTRTR SP +0000000000124.886543456540400083004134534948+000001234.006543456540400N
00002062MR J BBBBBB 44A GIRRAWICK ROAD GIRRAWICK ABC 3214 32140000047005096 LHPO00000000746488504400000001001 00000000+0000000000000.0033100000000746488508220 L401871402 65434519MA04NAB 100% OFFSET WALK TYTYTY PACKAGE MR BBBBBB WCTPP MR YYYYY BBBBBB BBBBBB YYYYY JP BBBBBB YYYYY PP CCCCCCCC RRRRR GGGGGG SP +0000000001420.446543444931900082062152887579+000001420.446543444951900N
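A minimal sketch of the column-position check being asked for (fixture and file names here are illustrative, not from the thread): `substr($0, 89, 4)` extracts the four characters starting at column 89, so no field separator (`-F`) is needed at all for fixed-width data like this.

```shell
# Build a tiny fixture with "GIRR" starting exactly at column 89 on two lines.
# %-88s left-pads the first field to 88 characters, so column 89 follows it.
printf '%-88sGIRR line a\n' 'A' >  sample_demo.txt
printf '%-88sGIRR line b\n' 'B' >> sample_demo.txt
printf '%-88sXXXX line c\n' 'C' >> sample_demo.txt

# substr($0, 89, 4) is the 4-character slice at column 89; compare it to "GIRR".
awk '{ if (substr($0, 89, 4) == "GIRR") print > "demo_girr.txt"; else print > "demo_rest.txt" }' sample_demo.txt
```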
ASKER
Thanks arnold,
any reference to perl please
a simple
Note your awk example splits based on a comma, -F, while the file you provide as an example is a clear Tab-Separated-Values (TSV) and not CSV (Comma separated Values).
the split below will need to have \t instead of the , ..
#!/usr/bin/perl
my @array ; # to use as a place holder of the line data
open FILE, "</path/to/file" or die "Unable to open file for reading: $!\n";
open OUT1, ">/path.to/output1" or die "Unable to open output file 1 for writing: $!\n";
open OUT2, ">/path to/output2" or die "Unable to open the second output file for writing: $!\n";
while (<FILE>) {
chomp(); # strip the line feed / carriage return from the end of the line
@array =split(/,/); # the split when no variable is present $_ is implied which contains the line read in in the while loop
if ( $#array >= 88 && $array[88] =~ /GIRR/ ) {
print OUT1 "$_\n";
}
else {
print OUT2 "$_\n";
}
}
close (OUT1);
close(OUT2);
close (FILE);
ASKER
Thanks arnold for your help and time to provide me a perl solution. Nice learning experience :)
@Bill, thx so much. Like this smarter version of awk :)
ASKER
HI @Bill
could you help how to pass those two file names as an arguments to awk file pls
Tried below but still not able to get it working
splitfile.awk outputfile1.txt outputfile2.txt < sample1.txt
#!/bin/awk -f
BEGIN {
outFile[1] = "ARGV[1]"
outFile[2] = "ARGV[2]"
useFile = 1
}
{
if (substr($0, 89, 4) == "GIRR") {
useFile = 2
}
print $0>>outFile[useFile]
}
ASKER
this one
1. Creates only outputfile1.txt
2. Goes into an infinite loop and created the above file with 77768 rows in just 3 seconds :)
3. I think, it dumps the whole file without evaluating the condition
cat splitfile.awk
./splitfile.awk sample1.txt outputfile1.txt outputfile2.txt
#!/bin/awk -f
BEGIN {
outFile[1] = ARGV[1]
outFile[2] = ARGV[2]
useFile = 1
}
{
if (substr($0, 89, 4) == "GIRR") {
useFile = 2
}
print $0>>outFile[useFile]
}
ASKER
# slightly better
# both outputfile got created.
# outputfile1 with two non GIRR rows
# outputfile2 with infinite loop rows both GIRR and non
gawk -f splitnew.awk sample1.txt outputfile1.txt outputfile2.txt
cat splitnew.awk
BEGIN {
outFile[1] = ARGV[2]
outFile[2] = ARGV[3]
useFile = 1
}
{
if (substr($0, 89, 4) == "GIRR") {
useFile = 2
}
print $0>>outFile[useFile]
}
ASKER
help please :)
ASKER
Looks like this one worked as expected. doing testing with actual file now
splitnew.awk
BEGIN {
outFile[1] = ARGV[2]
outFile[2] = ARGV[3]
ARGV[2]=""
ARGV[3]=""
useFile = 1
}
{
if (substr($0, 89, 4) == "GIRR") {
useFile = 2
}
print $0>>outFile[useFile]
}
ASKER
With a slight modification it worked fine
BEGIN {
outFile[1] = ARGV[2]
outFile[2] = ARGV[3]
ARGV[2]=""
ARGV[3]=""
useFile = 1
}
{
if (substr($0, 89, 4) == "GIRR") {
useFile = 2
}
print $0>>outFile[useFile]
useFile = 1
}
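The working version can be rehearsed end to end with a throwaway fixture. Everything below (script file name, input and output names) is invented for illustration; the script body mirrors the one above, with comments on why blanking out ARGV matters.

```shell
# Same logic as the working script; splitdemo.awk / in.txt / out1.txt / out2.txt
# are hypothetical names for this rehearsal.
cat > splitdemo.awk <<'EOF'
BEGIN {
    outFile[1] = ARGV[2]
    outFile[2] = ARGV[3]
    ARGV[2] = ""   # blank these out, or awk will also READ the output
    ARGV[3] = ""   # files as inputs -- the cause of the earlier infinite loop
    useFile = 1
}
{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }
    print $0 >> outFile[useFile]
    useFile = 1    # reset, so each line is judged on its own
}
EOF

# Fixture: %-88s pads to 88 characters, so "GIRR"/"nope" start at column 89.
printf '%-88sGIRR x\n' 'a' >  in.txt
printf '%-88snope y\n' 'b' >> in.txt
printf '%-88sGIRR z\n' 'c' >> in.txt

awk -f splitdemo.awk in.txt out1.txt out2.txt
```

With `awk -f script input out1 out2`, ARGV[1] is the input file and ARGV[2]/ARGV[3] are the output names, which is why the script indexes ARGV from 2.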
Sorry, it was night time here, I was sleeping... 😴
Glad you worked out passing in the file names.
Another approach I typically use is to pass things in as variables rather than positional parms. Adds readability, etc. So you could do:
gawk -f EE29184290.awk -v fileout1=file1.txt -v fileout2=file2.txt sample1.txt
With this AWK script:
BEGIN {
outFile[1] = fileout1
outFile[2] = fileout2
useFile = 1
}
{
if (substr($0, 89, 4) == "GIRR") {
useFile = 2
}
print $0>>outFile[useFile]
}
»bp
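A quick rehearsal of the `-v` style (fixture and file names here are invented). One thing worth noting about this variant: there is no reset of `useFile`, so once a GIRR line is seen, every later line also goes to the second file.

```shell
# Bill's script body, driven by -v variables instead of positional ARGV entries.
cat > splitvars.awk <<'EOF'
BEGIN {
    outFile[1] = fileout1
    outFile[2] = fileout2
    useFile = 1
}
{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }
    print $0 >> outFile[useFile]
}
EOF

# Fixture: "GIRR"/"nope" start exactly at column 89.
printf '%-88snope a\n' 'x' >  inv.txt
printf '%-88sGIRR b\n' 'x' >> inv.txt
printf '%-88snope c\n' 'x' >> inv.txt

# -v assignments happen before BEGIN runs, so fileout1/fileout2 are visible there.
awk -f splitvars.awk -v fileout1=v1.txt -v fileout2=v2.txt inv.txt
```

Because `useFile` stays 2 after the first match, line "c" lands in v2.txt even though it has no GIRR.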
@enthuguy,
How about a nice cryptic Perl one-liner (possibly prefixed by a "rm...").
rm file2.txt
perl -pe 'open STDOUT, ">>file2.txt" if /^.{88}GIRR/' sample.txt >file1.txt
Note: The "rm..." line is required only if this file2.txt could contain something beforehand, e.g. during a re-run.
@arnold,
"...the file you provide as an example is a clear Tab-Separated-Values (TSV) and not CSV (Comma separated Values)."
How do you know it's tab separated? When I copy and paste it, I don't get any tabs. Has it been changed since you copied it?
I tried running your script and it didn't work for me. Everything ended up in output2. Does it work for you?
spacing, it could be \t for tab
\s+ for any white space
though if white space is used the address column will be broken down into multiple fields: street number, street name, street/road/ave/blvd/etc.
to be a CSV it needs commas (,) between elements to indicate columns
column1,column2,column3
item1,item2,item3
though I recall having a similar pattern of data dump that had a different preset scheme using a FORMATTED extract,
i.e. scanf (10%s30%d etc.)
meaning the first 10 characters are a single string, the next 30 characters have to be numeric, etc...
you are missing the SPLIT
I do not think the pattern you use
if /^.{88}GIRR/ is valid for what you want
not sure the regular expression .(88) means locate the 88th element and ...
note . means any one character
.* means any number of characters
(88) you are trying to match on 88 and if matched to assign it to a match variable, in this case it will be $1
you can still use a one liner, though you still need to split the line into an @ARRAY and then evaluate the 88th element in the array for the pattern.
@arnold.
Questions:
Q1. Does my script work when you run it? It works for me.
Q2. Does your script work when you run it? It doesn't work for me, as mentioned in my last post.
P.S.
"(88) you are trying to match on 88 and if matched to assign it to a match variable, in this case it will be $1"
I didn't use (88). I used {88}.
I do not have the data in the format you have. if I copy and paste, it treats everything as a white space.
this is why I said it did not look like a CSV but a TSV, though after your post and my comment indicated that this might be a different data source where it is based on specific spacing where a specific element is. I think there was a recent thread using a similar output stream, though I can not remember the topic, content which I think had a similar question in terms of loading and changing a specific column to a different value .....
where you do not want to replace a pattern, you want to separate the content based on a content pattern in a specific location, column.
in the small font, {88} inte as (88)
you want to go .{88} 88 characters in, versus the 88 column when splitting on commas means completely different things.
@arnold,
I'm confused. If you could please quote each of my questions below, and put your answers below each, that might help clear up the confusion.
Questions:
"I do not have the data in the format you have. if I copy and paste, it treats everything as a white space."
Q3. What do you mean by "everything"? Are you saying that even the visible characters like alphanumerics convert to a white space when you paste them, or what exactly?
Q4. What do you mean by "a white space"? The definition I'm familiar with for "white space" is: spaces, tabs, CR & LF. Which of those are you referring to?
Q5. How do you know you don't have the data in the format I have? As I said in my first post, "When I copy and paste it, I don't get any tabs". Sounds to me as you might have the same data that I have. If you don't think so, then feel free to post what you have as an attachment when you copy and paste it, so I can compare.
"this is why I said it did not look like a CSV but a TSV, though after your post and my comment indicated that this might be a different data source where it is based on specific spacing where a specific element is. I think there was a recent thread using a similar output stream, though I can not remember the topic, content which I think had a similar question in terms of loading and changing a specific column to a different value .....
where you do not want to replace a pattern, you want to separate the content based on a content pattern in a specific location, column."
Q6. Does that mean that you now realise that you have no good reason to believe this data contains tabs, because you've just mixed this question up with a different one? So you know it's not meant to be a CSV or TSV file, because there are no delimiters, right? Column 1 is character 1, column 2 is character 2, etc.
"in the small font, {88} inte as (88)"
Q7. Please explain what you mean. For starters, "inte" is not a word, and if you mean "into", then I still don't understand your sentence. Also, what small font?
And I repeat:
Q1. Does my script work when you run it? It works for me.
Q2. Does your script work when you run it? It doesn't work for me, as mentioned in my first post.
Hi again arnold.
"you want to go .{88} 88 characters in, versus the 88 column when splitting on commas means completely different things."
Q8. Who said we should be splitting on commas? The sample input data isn't comma separated. Yes, I want to ignore the first 88 characters in my match, which is why I used /^.{88}GIRR/ which does that. If you think it doesn't do that, prove it. I've proved it works because I'm getting the expected output when I run it.
Please answer Q1 - Q8 above, so we can sort out the misunderstandings. Clear unabbreviated answers, please. They don't need to be long, as long as they actually answer the questions.
Hi enthuguy,
Sorry I was late with my solution. I only noticed your question yesterday.
Have you tested my 1-2 line solution?
Did it work for you?
Have you tested arnold's solution?
Did it work for you?
I await your reply.
Thanks.
tel2
For me to be able to test, Please attach the file with the data Sample1.txt
Column is what lead me to figure it was separated and think there was a mention of CSV
Looking at your awk you have the separator as a comma (-F,) means the data fields are separated by a comma, <comma separated values)
The column is on either end of the comma.
Perl one liner you use /^.{88} which means skip the first 88 characters no matter what they are to then evaluate the pattern.
Hi arnold,
That doesn't really answer my questions, but OK, I'll respond.
"For me to be able to test, Please attach the file with the data Sample1.txt"
I can do that if you really need it, but why don't you just copy and paste it like I did? One way to do it is: click "Select all" under the sample data in enthuguy's original post, then do a copy, and paste it on your machine. If you can't do that, tell me why you can't, and I'll do it for you. Don't expect any separators, because there aren't any.
"Column is what lead me to figure it was separated and think there was a mention of CSV"
Are you referring to enthuguy's use of the word "column" in his question title and original post? If so, that can mean different things to different people and in different situations. If you look at the sample data, you can see he wasn't referring to columns which have separators.
Where did enthuguy mention CSV?
"Looking at your awk you have the separator as a comma (-F,) means the data fields are separated by a comma, <comma separated values)
The column is on either end of the comma."
It's not my awk, it's enthuguy's awk, and as he indicated, it's not quite working, and he was asking for us to fix it. His use of "-F," was unnecessary given the data he supplied, so that's one thing which should be fixed (i.e. removed).
"Perl one liner you use /^.{88} which means skip the first 88 characters no matter what they are to then evaluate the pattern."
I know. That's what enthuguy wants, right?
Hi enthuguy,
I don't know much awk, so I didn't study your solution much originally when I posted my Perl solution, but I've had a look at it now, and have tested the following solutions, successfully:
awk '{if(/^.{88}GIRR/)print >"file2.txt";else print >"file1.txt"}' sample.txt
The above solution is a slight tweak of your solution, and works with your sample data, but will only put lines into file2.txt if they contain "GIRR" starting in column 89, which, depending on exactly what you mean, may not meet your requirement "Rest of the lines should go into another file". You want ALL lines starting from the "GIRR" line to go into file2.txt, even if some of those lines don't contain "GIRR", right? If so, you could use my Perl solution (in my first post), or this further tweaked awk version:
awk '{if(/^.{88}GIRR/)found=1;if(found)print >"file2.txt";else print >"file1.txt"}' sample.txt
I used the variable "found" above, to indicate that we found the "GIRR" starting in column 89, so from then on all lines should go to file2.txt, but you could change that variable name to something shorter if you like.
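The two behaviours (per-line split vs. the sticky "found" flag) can be contrasted on a small fixture. Note one caveat: interval expressions like `{88}` are not supported by every awk implementation, so this sketch uses the portable `substr($0, 89, 4)` equivalent of `/^.{88}GIRR/`; file names are invented.

```shell
# Fixture: only line 2 has "GIRR" at column 89.
printf '%-88snope 1\n' 'x' >  t.txt
printf '%-88sGIRR 2\n' 'x' >> t.txt
printf '%-88snope 3\n' 'x' >> t.txt

# Per-line split: only the line with GIRR at column 89 goes to g2a.txt.
awk '{ if (substr($0, 89, 4) == "GIRR") print > "g2a.txt"; else print > "g1a.txt" }' t.txt

# Sticky split with a found flag: the GIRR line and all later lines go to g2b.txt.
awk '{ if (substr($0, 89, 4) == "GIRR") found = 1
       if (found) print > "g2b.txt"; else print > "g1b.txt" }' t.txt
```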
Given the criteria, I find Perl provides more flexibility...