How can I format grep results from my input text file into an easy-to-read output file?

I am trying to:

1.) Add some text before the URL ("The following URL:"), before the page number ("on page #"), and before the results ("had the following search results"). The URL must be a clickable link by itself in the output file.
2.) If the URL exists more than once (even with different search criteria), eliminate it from the output file I want to write to.
3.) Get all piped data onto one line (or as close as possible) in the output file.

I have an input file that looks like:


 testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

 testsite.com/2015/heo/GH0014.pdf |
2 |
sasa

Asked by devNOOB

woolmilkporc commented:
The commas have nothing to do with line breaks.

OK, let's start over. If it should be awk, try this:

awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
             A[$1]="The URL at " $1 " had the following search results: (" $3 ") on page " $2}
             END {for(n in A) print A[n] "\n"}
           '  inputfile

No more "sort -u" - we're using an array.

Tintin commented:
From your example input file, could you please provide your desired output?

devNOOB (author) commented:
Sure. Something along the lines of: "The URL at testsite.com had the following search results (symb) on page x."

Thanks

Tintin commented:
How does that fit your requirement for:

3.) Get all piped data onto one line (or as close as possible) in the output file.

Also, where is the information for the page number?

ozo commented:
perl -00lne 'm/(.*?)\s*\|\s*(.*?)\s*\|\s*(.*)/&&push@{$u{$1}},"The following URL:$1|on page #$2|had the following search results:$3";END{$#$_||print $_->[0] for values %u}' <<END
testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

 testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
END
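
The de-duplication here is stricter than in the awk versions: -00 puts perl into paragraph mode (one record per blank-line-separated block), records are collected per URL, and in the END block $#$_ is the last index of each per-URL array, so $#$_||print fires only when a URL occurred exactly once; URLs that appear more than once are dropped entirely. The test in isolation, on a hypothetical two-key hash:

perl -e '%u = (a => [1], b => [1, 2]); $#$_ || print "kept: @$_\n" for values %u'
# prints only "kept: 1"; the two-element list for "b" is skipped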

woolmilkporc commented:
My "awk" version, as usual:

awk  'BEGIN {RS="";FS="|"}
           {print "The URL at", $1, "had the following search results: (" $5 ") on page", $3}
           ' inputfile | sort -k4,4 -u
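
The $3 and $5 are not typos: with RS="" (paragraph mode) awk treats the newline as a field separator in addition to FS, so each three-line block splits into five fields, with empty fields between each trailing "|" and the following newline. A quick check on a made-up record of the same shape:

printf 'u |\n2 |\nterm\n' |
awk 'BEGIN {RS=""; FS="|"} {printf "NF=%d [%s][%s][%s]\n", NF, $1, $3, $5}'
# NF=5 [u ][2 ][term]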

ozo commented:
I thought you said "if the URL exists more than once (even with different search criteria), eliminate it"
 testsite.com/2015/heo/GH0014.pdf
exists more than once, but http:#a40572447 does not eliminate it.
Which of us misunderstood?

woolmilkporc commented:
>>  http:#a40572447 does not eliminate it. <<

Sure? Please see "sort -k4,4 -u" at the end!
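
For reference: sort -k4,4 -u compares lines on the fourth whitespace-separated field only (the URL, in this output) and keeps exactly one line per distinct key. A minimal sketch with dummy fields:

printf 'The URL at a x\nThe URL at a y\nThe URL at b z\n' | sort -k4,4 -u
# two lines survive: one of the two "a" lines (which one is unspecified) and the "b" line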

woolmilkporc commented:
Note: The empty lines between the result blocks (as posted in the Q) are mandatory for my solution to work correctly!

ozo commented:
awk  'BEGIN {RS="";FS="|"}
           {print "The URL at", $1, "had the following search results: (" $5 ") on page", $3}
           ' input | sort -k4,4 -u
prints
The URL at  testsite.com/2015/heo/GH0014.pdf  had the following search results: (hara) on page 7
The URL at testsite.com/2015/heo/LG0010.pdf  had the following search results: (symbo) on page 2

woolmilkporc commented:
OK, and where is the duplicate "GH0014" site?

My understanding is that only duplicates should be removed, but the "original" should be kept.

ozo commented:
The original input contained
testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

in which the URL
 testsite.com/2015/heo/GH0014.pdf
exists more than once
thus "it", meaning the URL testsite.com/2015/heo/GH0014.pdf, should be emiminated from the output,
or so I interpreted the original problem statement

I also interpreted  "Get all piped data onto one line" to mean that the output data, with the text added, should continue to be pipe separated.

ozo commented:
Or, if we interpret "it" to mean "the URL ... even with different search criteria" then which of
testsite.com/2015/heo/GH0014.pdf |
7 |
hara
or
testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
is the one which exists more than once and should be eliminated?

woolmilkporc commented:
Good question  :-)

devNOOB (author) commented:
Thanks, everyone. The awk command came very close. I am sending over my original input file, which may be easier to work from:

/home/search/testsite.com/2015/heo/LG0010.pdf:2: symbo
/home/search/testsite.com/2015/heo/GH0014:7: hara
/home/search/testsite.com/2015/heo/GH0014:2: sasa

devNOOB (author) commented:
I am fine with just using this file without the pipes if needed. Your help is greatly appreciated.

woolmilkporc commented:
awk  'BEGIN {FS=":"}
           {print "The URL at", $1, "had the following search results: (" $3 ") on page", $2}
           ' inputfile | sort -k4,4 -u

My understanding is still that only duplicates should be removed, but the "original" should be kept, since you didn't answer our questions.

devNOOB (author) commented:
To answer the question, I would like to keep just one instance of each URL, with one set of results, even if they have different results.

devNOOB (author) commented:
Hopefully, getting rid of the piped file eliminates the question above. Apologies for my tardiness.

woolmilkporc commented:
OK, that's what my solutions do. But please note that each of the two different input formats you posted will require its appropriate awk code to get the desired output.

ozo commented:
Which instance should be kept when a URL exists more than once?
Are you changing your input format from | separating fields, blank line separating records, to : separating fields and newline separating records?

devNOOB (author) commented:
woolmilkporc, that worked. I need to get rid of the /home/search/. What is the best way to get that out to make the URL clickable?

devNOOB (author) commented:
ozo, it does not really matter, as long as I have one link to check. Hopefully, using the second file I sent over makes things easier without the pipes.

ozo commented:
What is the clickable URL you want in your output?

woolmilkporc commented:
Removing "/home/search" is easy. But how can we know exactly which protocol prefix to use? http://? https://? or even ftp:// or file://?

devNOOB (author) commented:
https:// (no www, please).

woolmilkporc commented:
awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
              print "The URL at", $1, "had the following search results: (" $3 ") on page", $2}
            ' inputfile | sort -k4,4 -u
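
Run against the colon-separated sample above, this prints one line per unique URL, along the lines of the following (which of the two GH0014 records survives is left to sort; note also that $3 keeps the space that followed the second colon):

The URL at https://testsite.com/2015/heo/GH0014 had the following search results: ( hara) on page 7
The URL at https://testsite.com/2015/heo/LG0010.pdf had the following search results: ( symbo) on page 2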

ozo commented:
perl -lne 'm#(?:/home/search/)?(.*):(.*):\s*(.*)#&&!$dup{$1}++&&print"The URL at https://$1 had the following search results: ($3) on page $2"' < input > output
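
The !$dup{$1}++ test is the usual first-seen filter: the hash value is still false the first time a given URL turns up (so the print runs) and true ever after, meaning the first record per URL is the one that is kept. The idiom in isolation, on throwaway input:

printf 'a\na\nb\n' | perl -lne 'print unless $seen{$_}++'
# a
# b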

devNOOB (author) commented:
OK. How can I put a line break between each record?

woolmilkporc commented:
There is one already (after each page number).

ozo commented:
In http:#a40573127 it looks like there is a line break between each record of the input, in which case http:#a40573323 would put a line break between each record of the output.
If you don't have a line break between each record of the input, how are the records distinguished?

devNOOB (author) commented:
With the comma?

ozo commented:
Where is the comma in http:#a40573127?

devNOOB (author) commented:
There isn't one. I was responding to woolmilkporc's specific comment: he stated the awk statement already produced a line break, and I was wondering where that came from. Still learning.

woolmilkporc commented:
There is an automatic line break after each input record has been processed. The comma inserts a space (or, more precisely, OFS) between output fields.
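
Both defaults can be changed: print appends ORS (default a newline) after each statement, and the comma inserts OFS (default a single space). A quick demonstration on throwaway input:

echo 'a:b' | awk 'BEGIN {FS=":"} {print $1, $2}'
# a b   <- the comma became OFS (a space); print then appended ORS (a newline)
echo 'a:b' | awk 'BEGIN {FS=":"; OFS=" | "; ORS="\n\n"} {print $1, $2}'
# a | b <- followed by a blank line, because ORS is now two newlines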

devNOOB (author) commented:
If I need to add one more line break, do I do this?

awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
              print "The URL at", $1, "had the following search results: (" $3 ") on page",, $2}
            ' inputfile | sort -k4,4 -u

ozo commented:
perl -lne 'm#(?:/home/search/)?(.*):(.*):\s*(.*)#&&!$dup{$1}++&&print"The URL at https://$1 had the following search results: ($3) on page $2\n"' < input > output
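
The double comma in the awk attempt above is a syntax error, so this stays with the perl one-liner. The only change from the earlier version is the trailing \n inside the print; combined with the newline that -l appends, it leaves a blank line after each record. The effect on throwaway input:

printf 'a\nb\n' | perl -lne 'print "$_\n"'
# a
#
# b
#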

devNOOB (author) commented:
Thanks to all for the help. You all were great! Solved.