Solved

How can I format grep results from my input text file into an easy-to-read output file?

Posted on 2015-01-26
38
301 Views
Last Modified: 2015-01-27
I am trying to:

1.) Add some text before the URL ("The following URL:"), before the page number ("on page #"), and before the results ("had the following search results"). The URL must be a clickable link by itself in the output file.
2.) If the URL exists more than once (even with different search criteria), eliminate it from the output file I want to write to.
3.) Get all piped data onto one line (or as close as possible) in the output file.

I have an input file that looks like:


 testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

 testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
Question by:devNOOB
38 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 40572071
From your example input file, could you please provide your desired output?
0
 

Author Comment

by:devNOOB
ID: 40572075
Sure. Something along the lines of... The URL at testsite.com had the following search results (symb) on page x.

Thanks
0
 
LVL 48

Expert Comment

by:Tintin
ID: 40572095
How does that fit your requirement for:

3.) Get all piped data onto one line (or as close as possible) in the output file.

Also, where is the information for the page number?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40572101
perl -00lne 'm/(.*?)\s*\|\s*(.*?)\s*\|\s*(.*)/&&push@{$u{$1}},"The following URL:$1|on page #$2|had the following search results:$3";END{$#$_||print $_->[0] for values %u}' <<END
testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara

 testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
END
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572447
My "awk" version, as usual:

awk  'BEGIN {RS="";FS="|"}
           {print "The URL at", $1, "had the following search results: (" $5 ") on page", $3}
           ' inputfile | sort -k4,4 -u
0
 
LVL 84

Expert Comment

by:ozo
ID: 40572453
I thought you said "if the URL exists more than once (even with different search criteria, eliminate it"
 testsite.com/2015/heo/GH0014.pdf
exists more than once, but http:#a40572447 does not eliminate it.
Which of us misunderstood?
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572469
>>  http:#a40572447 does not eliminate it. <<

Sure? Please see "sort -k4,4 -u" at the end!
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572474
Note: The empty lines between the result blocks (as posted in the Q) are mandatory for my solution to work correctly!
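[Editor's note: the blank lines are mandatory because RS="" puts awk into paragraph mode, where each blank-line-separated block is one record and newlines also act as field separators. A minimal sketch with illustrative data (file name `sample.txt` is hypothetical):]

```shell
# Build a sample input in the same blank-line-separated format
# as the one posted in the question.
printf 'testsite.com/a.pdf |\n2 |\nsymbo\n\ntestsite.com/b.pdf |\n7 |\nhara\n' > sample.txt

# RS="" = paragraph mode: each blank-line-separated block is one
# record. Newlines split fields in addition to FS="|", which is why
# the URL, page, and search term land in $1, $3, and $5.
awk 'BEGIN{RS="";FS="|"} {print NR": URL="$1}' sample.txt
```

Without the blank lines, awk would see each physical line as a separate record and the field positions would no longer line up.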
0
 
LVL 84

Expert Comment

by:ozo
ID: 40572497
awk  'BEGIN {RS="";FS="|"}
           {print "The URL at", $1, "had the following search results: (" $5 ") on page", $3}
           ' input | sort -k4,4 -u
prints
The URL at  testsite.com/2015/heo/GH0014.pdf  had the following search results: (hara) on page 7
The URL at testsite.com/2015/heo/LG0010.pdf  had the following search results: (symbo) on page 2
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572523
OK, and where is the duplicate "GH0014" site?

My understanding is that only duplicates should be removed, but the "original" should be kept.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40572524
The original input contained
testsite.com/2015/heo/LG0010.pdf |
2 |
symbo

 testsite.com/2015/heo/GH0014.pdf |
7 |
hara


in which the URL
 testsite.com/2015/heo/GH0014.pdf
exists more than once
thus "it", meaning the URL testsite.com/2015/heo/GH0014.pdf, should be eliminated from the output,
or so I interpreted the original problem statement

I also interpreted  "Get all piped data onto one line" to mean that the output data, with the text added, should continue to be pipe separated.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40572555
Or, if we interpret "it" to mean "the URL ... even with different search criteria" then which of
testsite.com/2015/heo/GH0014.pdf |
7 |
hara
or
testsite.com/2015/heo/GH0014.pdf |
2 |
sasa
is the one which exists more than once and should be eliminated?
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40572572
Good question  :-)
0
 

Author Comment

by:devNOOB
ID: 40573127
Thanks everyone. The awk command came very close. I am sending over my root input file, which may be easier to work from:

/home/search/testsite.com/2015/heo/LG0010.pdf:2: symbo
/home/search/testsite.com/2015/heo/GH0014:7: hara
/home/search/testsite.com/2015/heo/GH0014:2: sasa
0
 

Author Comment

by:devNOOB
ID: 40573129
I am fine with just using this file without the pipes if needed. Your help is greatly appreciated.
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573143
awk  'BEGIN {FS=":"}
           {print "The URL at", $1, "had the following search results: (" $3 ") on page", $2}
           ' inputfile | sort -k4,4 -u

My understanding is still that only duplicates should be removed, but the "original" should be kept, since you didn't answer our questions.
0
 

Author Comment

by:devNOOB
ID: 40573233
To answer the question, I would like to keep just one instance of each URL, with one set of results, even if they have different results.
0
 

Author Comment

by:devNOOB
ID: 40573240
Hopefully, dropping the pipe-delimited file eliminates the question above. Apologies for my tardiness.
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573267
OK, that's what my solutions do. But please note that each of the two different input formats you posted will require its appropriate awk code to get the desired output.
0

 
LVL 84

Expert Comment

by:ozo
ID: 40573271
Which instance should be kept when a URL exists more than once?
Are you changing your input format from | separating fields, blank line separating records, to : separating fields and newline separating records?
0
 

Author Comment

by:devNOOB
ID: 40573288
woolmilkporc, that worked. I need to get rid of the /home/search/. What is the best way to get that out to make the URL clickable?
0
 

Author Comment

by:devNOOB
ID: 40573290
ozo, it does not really matter, just as long as I have a link to check one time. Hopefully, using the second file I sent over makes things easier without the pipes.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40573292
What is the clickable URL you want in your output?
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573294
Removing "/home/search" is easy. But how can we know exactly which protocol prefix to use? http://? https://? or even ftp:// or file://?
0
 

Author Comment

by:devNOOB
ID: 40573297
https:// (no www, please).
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573308
awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
              print "The URL at", $1, "had the following search results: (" $3 ") on page", $2}
            ' inputfile | sort -k4,4 -u
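[Editor's note: the sub() call here rewrites the first field in place before printing. Isolated, with an illustrative input line:]

```shell
# sub(regexp, replacement, target) replaces the first match in the
# target variable, in place. The ^ anchor ensures only a leading
# /home/search/ prefix is rewritten, nothing mid-string.
echo '/home/search/testsite.com/2015/heo/GH0014:7: hara' |
awk 'BEGIN{FS=":"} {sub("^/home/search/","https://",$1); print $1}'
```

Note this only works with FS=":" because the rewritten URL's "https://" colon is added after field splitting has already happened.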
0
 
LVL 84

Expert Comment

by:ozo
ID: 40573323
perl -lne 'm#(?:/home/search/)?(.*):(.*):\s*(.*)#&&!$dup{$1}++&&print"The URL at https://$1 had the following search results: ($3) on page $2"' < input > output
0
 

Author Comment

by:devNOOB
ID: 40573574
OK. How can I put a line break between each record?
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573583
There is one already (after each page number).
0
 
LVL 84

Expert Comment

by:ozo
ID: 40573584
In http:#a40573127 it looks like there is a line break between each record of the input, in which case http:#a40573323 would put a line break between each record of the output.
If you don't have a line break between each record of the input, how are the records distinguished?
0
 

Author Comment

by:devNOOB
ID: 40573686
with the comma?
0
 
LVL 84

Expert Comment

by:ozo
ID: 40573695
Where is the comma in http:#a40573127 ?
0
 

Author Comment

by:devNOOB
ID: 40573717
There's not. I was responding to woolmilkporc's specific comment: he stated there was already a line break in the awk statement, and I was wondering where that came from. Still learning.
0
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 40573739
There is an automatic line break after each input record has been processed.
The comma inserts a space (or OFS) between output fields.
0
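[Editor's note: the OFS/ORS behavior described above can be seen in isolation. A tiny sketch:]

```shell
# A comma in print emits OFS (default: one space) between items;
# juxtaposed strings are simply concatenated with nothing between.
# Every print statement then ends with ORS (default: a newline).
echo | awk '{print "page", 2; print "page" 2}'
```

So adding a second comma does not add a line break; it would just be a syntax error or an empty field, which is why the `,,` attempt below does not work.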
 

Author Comment

by:devNOOB
ID: 40573749
If I need to add one more, do I do....?

awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
              print "The URL at", $1, "had the following search results: (" $3 ") on page",, $2}
            ' inputfile | sort -k4,4 -u
0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 250 total points
ID: 40573763
perl -lne 'm#(?:/home/search/)?(.*):(.*):\s*(.*)#&&!$dup{$1}++&&print"The URL at https://$1 had the following search results: ($3) on page $2\n"' < input > output
0
 
LVL 68

Accepted Solution

by:
woolmilkporc earned 250 total points
ID: 40573826
The commas have nothing to do with linebreaks.

OK, let's start over. If it should be awk, try this:

awk  'BEGIN {FS=":"}
            {sub("^/home/search/","https://",$1);
             A[$1]="The URL at " $1 " had the following search results: (" $3 ") on page " $2}
             END {for(n in A) print A[n] "\n"}
           '  inputfile

No more "sort -u" - we're using an array.
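[Editor's note: one property of this array approach worth knowing: assigning A[$1] on every record means a duplicate URL overwrites the earlier entry, so the last occurrence wins, and `for (n in A)` iterates in no guaranteed order. Demonstrated with sample data:]

```shell
# Each assignment to A[$1] overwrites any previous value for that
# key, so the LAST record for a duplicate key survives in the array.
printf 'GH0014:7:hara\nGH0014:2:sasa\n' |
awk 'BEGIN{FS=":"} {A[$1]=$3} END{print A["GH0014"]}'
```

If the first occurrence should win instead, guard the assignment with `if (!($1 in A)) A[$1]=...`.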
0
 

Author Comment

by:devNOOB
ID: 40574449
Thanks to all for the help. You all were great! Solved.
0
