Solved

How can I format grep results from my input text file to an easy to read output file?

Posted on 2015-01-26
Medium Priority
329 Views
Last Modified: 2015-01-27
I am trying to:

1.) Add some text before the URL ("The following URL:"), before the page number ("on page #"), and before the results ("had the following search results"). The URL must be a clickable link by itself in the output file.
2.) If the URL exists more than once (even with different search criteria), eliminate it from the output file I want to write to.
3.) Get all piped data onto one line (or as close as possible) in the output file.

I have an input file that looks like:


 testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

 testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
Question by:devNOOB
38 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 40572071
From your example input file, could you please provide your desired output.
 

Author Comment

by:devNOOB
ID: 40572075
Sure. Something along the lines of: The URL at testsite.com had the following search results (symb) on page x.

Thanks
 
LVL 48

Expert Comment

by:Tintin
ID: 40572095
How does that fit your requirement for:

3.) Get all piped data onto one line (or as close as possible) in the output file.

Also, where is the information for the page number?
 
LVL 85

Expert Comment

by:ozo
ID: 40572101
perl -00lne 'm/(.*?)\s*\|\s*(.*?)\s*\|\s*(.*)/&&push@{$u{$1}},"The following URL:$1|on page #$2|had the following search results:$3";END{$#$_||print $_->[0] for values %u}' <<END
testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

 testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
END
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572447
My "awk" version, as usual:

awk  'BEGIN {RS="";FS="|"}
           {print "The URL at", $1, "had the following search results: (" $5 ") on page", $3}
           ' inputfile | sort -k4,4 -u
 
LVL 85

Expert Comment

by:ozo
ID: 40572453
I thought you said "if the URL exists more than once (even with different search criteria, eliminate it"
 testsite.com/2015/heo/GH0014.pdf
exists more than once, but http:#a40572447 does not eliminate it.
Which of us misunderstood?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572469
>>  http:#a40572447 does not eliminate it. <<

Sure? Please see "sort -k4,4 -u" at the end!
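To see how the `sort -k4,4 -u` step deduplicates, here is a minimal sketch with made-up URLs shaped like the awk output above; the URL lands in the fourth whitespace-separated field, and (with GNU sort, where `-u` disables the last-resort whole-line comparison) the first input line per key is the one kept:

```shell
# Made-up sample lines in the shape of the awk output; the URL is field 4.
printf '%s\n' \
  'The URL at a.pdf had the following search results: (x) on page 2' \
  'The URL at b.pdf had the following search results: (y) on page 7' \
  'The URL at b.pdf had the following search results: (z) on page 2' |
sort -k4,4 -u
# b.pdf survives only once: the earlier (y) record is kept,
# the later (z) record is dropped.
```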
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572474
Note: The empty lines between the result blocks (as posted in the Q) are mandatory for my solution to work correctly!
 
LVL 85

Expert Comment

by:ozo
ID: 40572497
awk  'BEGIN {RS="";FS="|"}
           {print "The URL at", $1, "had the following search results: (" $5 ") on page", $3}
           ' input | sort -k4,4 -u
prints
The URL at  testsite.com/2015/heo/GH0014.pdf  had the following search results: (hara) on page 7
The URL at testsite.com/2015/heo/LG0010.pdf  had the following search results: (symbo) on page 2
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572523
OK, and where is the duplicate "GH0014" site?

My understanding is that only duplicates should be removed, but the "original" should be kept.
 
LVL 85

Expert Comment

by:ozo
ID: 40572524
The original input contained
testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

in which the URL
 testsite.com/2015/heo/GH0014.pdf
exists more than once
thus "it", meaning the URL testsite.com/2015/heo/GH0014.pdf, should be eliminated from the output,
or so I interpreted the original problem statement

I also interpreted  "Get all piped data onto one line" to mean that the output data, with the text added, should continue to be pipe separated.
 
LVL 85

Expert Comment

by:ozo
ID: 40572555
Or, if we interpret "it" to mean "the URL ... even with different search criteria" then which of
testsite.com/2015/heo/GH0014.pdf |
7 |
hara
or
testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
is the one which exists more than once and should be eliminated?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572572
Good question  :-)
 

Author Comment

by:devNOOB
ID: 40573127
Thanks everyone. The awk command came very close. I am sending over my root input file, which may be easier to work from:

/home/search/testsite.com/2015/heo/LG0010.pdf:2: symbo
/home/search/testsite.com/2015/heo/GH0014:7: hara
/home/search/testsite.com/2015/heo/GH0014:2: sasa
 

Author Comment

by:devNOOB
ID: 40573129
I am fine with just using this file without the pipes if needed. Your help is greatly appreciated.
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573143
awk  'BEGIN {FS=":"}
           {print "The URL at", $1, "had the following search results: (" $3 ") on page", $2}
           ' inputfile | sort -k4,4 -u

My understanding is still that only duplicates should be removed, but the "original" should be kept, since you didn't answer our questions.
 

Author Comment

by:devNOOB
ID: 40573233
To answer the question, I would like to keep just one instance of each URL, with one set of results, even if they have different results.
 

Author Comment

by:devNOOB
ID: 40573240
Hopefully, getting rid of the pipe file eliminates the question above. Apologies for my tardiness.
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573267
OK, that's what my solutions do. But please note that each of the two different input formats you posted will require its appropriate awk code to get the desired output.
 
LVL 85

Expert Comment

by:ozo
ID: 40573271
Which instance should be kept when a URL exists more than once?
Are you changing your input format from | separating fields, blank line separating records, to : separating fields and newline separating records?
 

Author Comment

by:devNOOB
ID: 40573288
woolmilkporc, that worked. I need to get rid of the /home/search/. What is the best way to get that out to make the URL clickable?
 

Author Comment

by:devNOOB
ID: 40573290
ozo, it does not really matter, just as long as I have a link to check one time. Hopefully, using the second file I sent over makes things easier without the pipes.
 
LVL 85

Expert Comment

by:ozo
ID: 40573292
What is the clickable URL you want in your output?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573294
Removing "/home/search" is easy. But how can we know exactly which protocol prefix to use? http://? https://? or even ftp:// or file://?
 

Author Comment

by:devNOOB
ID: 40573297
https:// (no www, please)
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573308
awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
              print "The URL at", $1, "had the following search results: (" $3 ") on page", $2}
            ' inputfile | sort -k4,4 -u
 
LVL 85

Expert Comment

by:ozo
ID: 40573323
perl -lne 'm#(?:/home/search/)?(.*):(.*):\s*(.*)#&&!$dup{$1}++&&print"The URL at https://$1 had the following search results: ($3) on page $2"' < input > output
 

Author Comment

by:devNOOB
ID: 40573574
OK. How can I put a line break between each record?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573583
There is one already (after each page number).
 
LVL 85

Expert Comment

by:ozo
ID: 40573584
In http:#a40573127 it looks like there is a line break between each record of the input, in which case http:#a40573323 would put a line break between each record of the output.
If you don't have a line break between each record of the input, how are the records distinguished?
 

Author Comment

by:devNOOB
ID: 40573686
with the comma?
 
LVL 85

Expert Comment

by:ozo
ID: 40573695
Where is the comma in http:#a40573127 ?
 

Author Comment

by:devNOOB
ID: 40573717
There's not. I was responding to woolmilkporc on that specific response. He stated in the awk statement there was already a line break. I was wondering where that was called from? Still learning.
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573739
There is an automatic line break after each input record has been processed.
The comma inserts a space (or OFS) between output fields.
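A tiny illustration of both points, using throwaway values: the comma becomes OFS (one space by default) between fields, and ORS (a newline by default) is appended after every print:

```shell
# The comma between print arguments emits OFS (default: one space).
echo | awk '{print "on", "page", 7}'
# prints: on page 7

# awk appends ORS (default: newline) after each print; that is the
# automatic line break. Concatenating an extra "\n" adds a blank line.
echo | awk '{print "on", "page", 7 "\n"}'
```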
 

Author Comment

by:devNOOB
ID: 40573749
If I need to add one more, do I do....?

awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
              print "The URL at", $1, "had the following search results: (" $3 ") on page",, $2}
            ' inputfile | sort -k4,4 -u
 
LVL 85

Assisted Solution

by:ozo
ozo earned 1000 total points
ID: 40573763
perl -lne 'm#(?:/home/search/)?(.*):(.*):\s*(.*)#&&!$dup{$1}++&&print"The URL at https://$1 had the following search results: ($3) on page $2\n"' < input > output
 
LVL 68

Accepted Solution

by:
woolmilkporc earned 1000 total points
ID: 40573826
The commas have nothing to do with line breaks.

OK, let's start over. If it should be awk, try this:

awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
             A[$1]="The URL at " $1 " had the following search results: (" $3 ") on page " $2}
             END {for(n in A) print A[n] "\n"}
           '  inputfile

No more "sort -u" - we're using an array.
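For reference, a run against the colon-separated input from http:#a40573127 might look like the sketch below. Two awk details worth noting: with the array, a later duplicate overwrites the earlier one (so the last record per URL is kept, not the first), and for(n in A) makes no ordering guarantee.

```shell
cat > inputfile <<'EOF'
/home/search/testsite.com/2015/heo/LG0010.pdf:2: symbo
/home/search/testsite.com/2015/heo/GH0014:7: hara
/home/search/testsite.com/2015/heo/GH0014:2: sasa
EOF

# Each URL indexes the array once; END prints one formatted line per
# URL, each followed by a blank line from the extra "\n".
awk 'BEGIN {FS=":"}
     {sub("^/home/search/","https://",$1);
      A[$1]="The URL at " $1 " had the following search results: (" $3 ") on page " $2}
     END {for(n in A) print A[n] "\n"}
    ' inputfile
# Two records come out (GH0014 once, carrying the later sasa/page-2
# data); their order depends on the awk implementation.
```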
 

Author Comment

by:devNOOB
ID: 40574449
Thanks to all for the help. You all were great! Solved.