Avatar of enthuguy
enthuguy

asked on

awk with column based string search

Hi EE,

Please help with the below. I have partially achieved it but would appreciate your help please :)

Below is the sample file I receive. I would like to split the file into two, based on the criteria below:

1. Read line by line
2. Go to column (position) 89
3. If the next four characters are "GIRR", write this line to a new file
4. The rest of the lines should go into another file
5. From the sample file below, the first two lines should go into file1.txt and the last line into file2.txt

So far I have tried the awk below, and it works by searching for the string "GIRR" anywhere in the line and creating two files. Please help/suggest how to check for "GIRR" from column 89.

awk -F, '{if(/GIRR/)print > "file2.txt";else print > "file1.txt"}' sample1.txt
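
In other words, I need something that tests only characters 89-92 of each line rather than the whole line. For example, this just prints what sits at that position (assuming 1-based positions, which is how I am counting "column 89"):

# Quick probe only, not the split itself: show the four characters starting at position 89 of each line.
awk '{ print NR ": [" substr($0, 89, 4) "]" }' sample1.txt
# On the sample below, only the last line should show [GIRR] at that position.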



Sample1.txt
00003004MRS PL GIRRAAAAAAA                      44 WIDFORD ST                           GLENROY XYZ 1234                                                                                                        12340000094070544                                                                                                                                                      LHOM00000000106599090000000001001    00000000+0000000000000.0033200000000106599090310A  L669975878                                                            65434519MA03NAB TAILORED TEST WALK                            MRS AAAAAAAAAAA                         RTRPP                               MRS            UUUUU          AAAAAAAAAAA                                                                                         AAAAAAAAAAA         UUUUU               AAAAAA              JP                                                                                 AAAAAAAAAAA         UUUUU               AAAAAA              PP                                                                                 AAAAAAAAAAA         RRRR                RTRTRTR             SP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           +0000000000124.886543456540400083004134534948+000001234.006543456540400N
00003004MR DA AAAAAAAAAAA                       44 WIDFORD STREET                       GLENROY XYZ 1234                                                                                                        12340000094070544                                                                                                                                                      LHOM00000000106599090000000001001    00000000+0000000000000.0033200000000106599090310   L738333503                                                            65434519MA03NAB TAILORED TEST WALK                            MR AAAAAAAAAAA                          SCDSP                               MR             RRRR           AAAAAAAAAAA                                                                                         AAAAAAAAAAA         UUUUU               AAAAAA              JP                                                                                 AAAAAAAAAAA         UUUUU               AAAAAA              PP                                                                                 AAAAAAAAAAA         RRRR                RTRTRTR             SP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           +0000000000124.886543456540400083004134534948+000001234.006543456540400N
00002062MR J BBBBBB                             44A GIRRAWICK ROAD                      GIRRAWICK ABC 3214                                                                                                      32140000047005096                                                                                                                                                      LHPO00000000746488504400000001001    00000000+0000000000000.0033100000000746488508220   L401871402                                                            65434519MA04NAB 100% OFFSET WALK TYTYTY PACKAGE               MR BBBBBB                               WCTPP                               MR             YYYYY          BBBBBB                                                                                              BBBBBB              YYYYY                                   JP                                                                                 BBBBBB              YYYYY                                   PP                                                                                 CCCCCCCC RRRRR      GGGGGG                                  SP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           +0000000001420.446543444931900082062152887579+000001420.446543444951900N


Avatar of arnold
arnold

You would need to use a BEGIN block, through which you would be evaluating the data and dealing with it.

For criteria like this, I find Perl provides more flexibility...
Avatar of enthuguy

ASKER

Thanks arnold,
any reference to Perl, please?
A simple example would do.
Note your awk example splits based on a comma, -F, while the file you provide as an example is a clear Tab-Separated-Values (TSV) and not CSV (Comma separated Values).

The split below will need to have \t instead of the comma.

#!/usr/bin/perl
my @array ; # to use as a place holder of the line data

open FILE, "</path/to/file" || die "Unable to open file for reading: $!\n";
open OUT1, ">/path.to/output1" || die "Unable to open output file 1 for writing: $!\n";
open OUT2, ">/path to/output2" || die "Unable to open the second output file for writing: $!\n";

while (<FILE>) {
chomp(); # strip the line feed / carriage return from the end of the line
@array = split(/,/); # with no string argument, split works on $_, which holds the line read in the while loop
if ( $#array >= 88 && $array[88] =~ /GRRR/ ) {
            print OUT1 "$_\n";
}
else {
          print OUT2 "$_\n";
}
}
close (OUT1);
close(OUT2);
close (FILE);


ASKER CERTIFIED SOLUTION
Avatar of Bill Prew
Bill Prew

Thanks arnold for your help and time to provide me a perl solution. Nice learning experience :)

@Bill, thx so much. Like this smarter version of awk :)
Hi @Bill,
could you help with how to pass those two file names as arguments to the awk file, please?
I tried the below but still can't get it working.

splitfile.awk outputfile1.txt outputfile2.txt < sample1.txt


#!/bin/awk -f
BEGIN {
    outFile[1] = "ARGV[1]"
    outFile[2] = "ARGV[2]"
    useFile = 1
}

{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }

    print $0>>outFile[useFile]
}


This one:
1. Creates only outputfile1.txt
2. Goes into an infinite loop and created the above file with 77768 rows in just 3 seconds :)
3. I think it dumps the whole file without evaluating the condition.


cat splitfile.awk
./splitfile.awk sample1.txt outputfile1.txt outputfile2.txt


#!/bin/awk -f
BEGIN {
    outFile[1] = ARGV[1]
    outFile[2] = ARGV[2]
    useFile = 1
}


{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }


    print $0>>outFile[useFile]
}


# Slightly better:
# both output files got created.
# outputfile1 with the two non-GIRR rows
# outputfile2 with an infinite loop of rows, both GIRR and non-GIRR

gawk -f splitnew.awk sample1.txt outputfile1.txt outputfile2.txt



cat splitnew.awk
BEGIN {
    outFile[1] = ARGV[2]
    outFile[2] = ARGV[3]
    useFile = 1
}

{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }

    print $0>>outFile[useFile]
}


help please :)
Looks like this one worked as expected. Doing testing with the actual file now.

splitnew.awk
BEGIN {
    outFile[1] = ARGV[2]
    outFile[2] = ARGV[3]
    ARGV[2]=""
    ARGV[3]=""
    useFile = 1
}


{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }


    print $0>>outFile[useFile]
}

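If I understand the fix correctly, blanking ARGV[2] and ARGV[3] is what stops the earlier infinite loop: after BEGIN, awk treats every remaining non-empty ARGV entry as an input file, so the previous versions were reading their own output files while appending to them. The invocation stays the same as before:

# Same invocation as earlier; with ARGV[2] and ARGV[3] blanked in BEGIN,
# awk reads only sample1.txt as input and just remembers the two output names.
gawk -f splitnew.awk sample1.txt outputfile1.txt outputfile2.txt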

With a slight modification it worked fine

BEGIN {
    outFile[1] = ARGV[2]
    outFile[2] = ARGV[3]
    ARGV[2]=""
    ARGV[3]=""

    useFile = 1
}

{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }

    print $0>>outFile[useFile]
    useFile = 1
}
Avatar of Bill Prew
Bill Prew

Sorry, it was night time here, I was sleeping... 😴

Glad you worked out passing in the file names.

Another approach I typically use is to pass things in as variables rather than positional parameters.  It adds readability, etc.  So you could do:

gawk -f EE29184290.awk -v fileout1=file1.txt -v fileout2=file2.txt sample1.txt


With this AWK script:

BEGIN {
    outFile[1] = fileout1
    outFile[2] = fileout2
    useFile = 1
}

{
    if (substr($0, 89, 4) == "GIRR") {
        useFile = 2
    }

    print $0>>outFile[useFile]
}



»bp
@enthuguy,
How about a nice cryptic Perl one-liner (possibly prefixed by a "rm...").
rm file2.txt
perl -pe 'open STDOUT, ">>file2.txt" if /^.{88}GIRR/' sample.txt >file1.txt


Note: The "rm..." line is required only if this file2.txt could contain something beforehand, e.g. during a re-run.


@arnold,
"...the file you provide as an example is a clear Tab-Separated-Values (TSV) and not CSV (Comma separated Values)."
How do you know it's tab separated?  When I copy and paste it, I don't get any tabs.  Has it been changed since you copied it?
I tried running your script and it didn't work for me.  Everything ended up in output2.  Does it work for you?
Spacing: it could be \t for tab,
\s+ for any white space,
though if white space is used, the address column will be broken down into multiple fields: street number, street name, street/road/ave/blvd/etc.

To be a CSV it needs commas (,) between elements to indicate columns:
column1,column2,column3
item1,item2,item3

Though I recall having a similar pattern of data dump that had a different preset scheme, using a FORMATTED extract,
i.e. scanf-style ("%10s%30d" etc.),
which means the first 10 characters are a single string, the next 30 characters have to be numeric, etc...


You are missing the SPLIT.

I do not think the pattern you use,
if /^.{88}GIRR/, is valid for what you want.
I am not sure the regular expression .(88) means locate the 88th element and ...

note . means any one character
.* means any number of characters
(88) you are trying to match on 88 and if matched to assign it to a match variable, in this case it will be $1

you can still use a one liner, though you still need to split the line into an @ARRAY and then evaluate the 88th element in the array for the pattern.
@arnold.
Questions:
Q1. Does my script work when you run it?  It works for me.
Q2. Does your script work when you run it?  It doesn't work for me, as mentioned in my last post.
P.S.
"(88) you are trying to match on 88 and if matched to assign it to a match variable, in this case it will be $1"
I didn't use (88).  I used {88}.
I do not have the data in the format you have. if I copy and paste, it treats everything as a white space.

this is why I said it did not look like a CSV but a TSV, though after your post and my comment indicated that this might be a different data source where it is based on specific spacing where a specific element is. I think there was a recent thread using a similar output stream, though I can not remember the topic, content which I think had a similar question in terms of loading and changing a specific column to a different value .....
where you do not want to replace a pattern, you want to separate the content based on a content pattern in a specific location, column.
in the small font, {88} inte as (88)
you want to go .{88} 88 characters in, versus the 88 column when splitting on commas means completely different things.
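A rough awk illustration of that difference, using the sample file from the question (a sketch only, not part of either proposed solution):

# Character-position test: looks at characters 89-92 of the raw line,
# regardless of any delimiters (the same idea as Perl's /^.{88}GIRR/).
awk 'substr($0, 89, 4) == "GIRR"' sample1.txt

# Field-based test: only meaningful if the data really is comma separated;
# with this fixed-width sample there are no commas, so field 89 is empty and nothing matches.
awk -F',' '$89 ~ /GIRR/' sample1.txt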
@arnold,

I'm confused.  If you could please quote each of my questions below, and put your answers below each, that might help clear up the confusion.

Questions:

"I do not have the data in the format you have. if I copy and paste, it treats everything as a white space."
Q3. What do you mean by "everything"?  Are you saying that even the visible characters like alphanumerics convert to a white space when you paste them, or what exactly?
Q4. What do you mean by "a white space"?  The definition I'm familiar with for "white space" is: spaces, tabs, CR & LF.  Which of those are you referring to?
Q5. How do you know you don't have the data in the format I have?  As I said in my first post, "When I copy and paste it, I don't get any tabs".  Sounds to me as you might have the same data that I have.  If you don't think so, then feel free to post what you have as an attachment when you copy and paste it, so I can compare.

"this is why I said it did not look like a CSV but a TSV, though after your post and my comment indicated that this might be a different data source where it is based on specific spacing where a specific element is. I think there was a recent thread using a similar output stream, though I can not remember the topic, content which I think had a similar question in terms of loading and changing a specific column to a different value .....
where you do not want to replace a pattern, you want to separate the content based on a content pattern in a specific location, column."
Q6. Does that mean that you now realise that you have no good reason to believe this data contains tabs, because you've just mixed this question up with a different one?  So you know it's not meant to be a CSV or TSV file, because there are no delimiters, right?  Column 1 is character 1, column 2 is character 2, etc.

"in the small font, {88} inte as (88)"
Q7. Please explain what you mean.  For starters, "inte" is not a word, and if you mean "into", then I still don't understand your sentence.  Also, what small font?

And I repeat:
Q1. Does my script work when you run it?  It works for me.
Q2. Does your script work when you run it?  It doesn't work for me, as mentioned in my first post.
Hi again arnold.
"you want to go .{88} 88 characters in, versus the 88 column when splitting on commas means completely different things."
Q8. Who said we should be splitting on commas?  The sample input data isn't comma separated.  Yes, I want to ignore the first 88 characters in my match, which is why I used  /^.{88}GIRR/  which does that.  If you think it doesn't do that, prove it.  I've proved it works because I'm getting the expected output when I run it.
Please answer Q1 - Q8 above, so we can sort out the misunderstandings.  Clear unabbreviated answers, please.  They don't need to be long, as long as they actually answer the questions.


Hi enthuguy,
Sorry I was late with my solution.  I only noticed your question yesterday.
Have you tested my 1-2 line solution?
Did it work for you?
Have you tested arnold's solution?
Did it work for you?

I await your reply.

Thanks.
tel2
For me to be able to test, Please attach the file with the data Sample1.txt

Column is what lead me to figure it was separated and think there was a mention of CSV

Looking at your awk you have the separator as a comma (-F,) means the data fields are separated by a comma, <comma separated values)
The column is on either end of the comma.

Perl one liner you use /^.{88} which means skip the first 88 characters no matter what they are to then evaluate the pattern.
Hi arnold,

That doesn't really answer my questions, but OK, I'll respond.

"For me to be able to test, Please attach the file with the data Sample1.txt"
I can do that if you really need it, but why don't you just copy and paste it like I did?  One way to do it is: click "Select all" under the sample data in enthuguy's original post, then do a copy, and paste it on your machine.  If you can't do that, tell me why you can't, and I'll do it for you.  Don't expect any separators, because there aren't any.

"Column is what lead me to figure it was separated and think there was a mention of CSV"
Are you referring to enthuguy's use of the word "column" in his question title and original post?  If so, that can mean different things to different people and in different situations.  If you look at the sample data, you can see he wasn't referring to columns which have separators.
Where did enthuguy mention CSV?

"Looking at your awk you have the separator as a comma (-F,) means the data fields are separated by a comma, <comma separated values)
The column is on either end of the comma."
It's not my awk, it's enthuguy's awk, and as he indicated, it's not quite working, and he was asking for us to fix it.  His use of "-F," was unnecessary given the data he supplied, so that's one thing which should be fixed (i.e. removed).

"Perl one liner you use /^.{88} which means skip the first 88 characters no matter what they are to then evaluate the pattern."
I know.  That's what enthuguy wants, right?


Hi enthuguy,
I don't know much awk, so I didn't study your solution much originally when I posted my Perl solution, but I've had a look at it now, and have tested the following solutions successfully:
awk '{if(/^.{88}GIRR/)print >"file2.txt";else print >"file1.txt"}' sample.txt


The above solution is a slight tweak of your solution, and works with your sample data, but will only put lines into file2.txt if they contain "GIRR" starting in column 89, which, depending on exactly what you mean, may not meet your requirement "Rest of the lines should go into another file".  You want ALL lines starting from the "GIRR" line to go into file2.txt, even if some of those lines don't contain "GIRR", right?  If so, you could use my Perl solution (in my first post), or this further tweaked awk version:
awk '{if(/^.{88}GIRR/)found=1;if(found)print >"file2.txt";else print >"file1.txt"}' sample.txt


I used the variable "found" above, to indicate that we found the "GIRR" starting in column 89, so from then on all lines should go to file2.txt, but you could change that variable name to something shorter if you like.
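
As a quick sanity check against the 3-line sample (this applies to either version above): "GIRR" first appears at column 89 only on the last line, so both versions should leave 2 lines in file1.txt and 1 line in file2.txt.

wc -l file1.txt file2.txt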