Solved

script to use curl to follow links in bash

Posted on 2014-03-27
833 Views
1 Endorsement
Last Modified: 2014-05-01
Hi experts,
 
I am quite new to bash scripting and I need to write a script that will follow links using curl in order to grab information.

The script reads standard input redirected from a text file containing:

CSI3207
CSG5116
CSI3208

The script looks like this so far:

while read id
do
        curl -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup
done


This outputs information which contains links that I need to follow to get more information.

I am unsure how to approach the problem from here.  Can anyone help?

Thanks in advance
Question by:madstylex
8 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 39960734
That's a little trickier, as you have to parse the output to find out what the links are (and what format they are in) in order to perform another curl request.

Do you have an example of a link in the original HTML source that you want to retrieve?
 

Author Comment

by:madstylex
ID: 39960742
Yes I do, here is one of the links:

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">


Just to put things into context, the data in the text file are unit codes.

The aim of the script is to fetch the information from the links and use grep to pull information out in order to automate timetable generation.
 
LVL 26

Expert Comment

by:skullnobrains
ID: 39960997
retrieve the links using sed:

curl ... | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'

this will extract all the links starting with "http://" from your page
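To see what that pipeline does, here it is run over a one-line sample (the example.com URL is just a stand-in for the real page content):

```shell
# A stand-in for the curl output; the sed keeps only the href targets.
printf '<a href="http://example.com/page">link</a> some text\n' |
    tr " " "\n" |
    sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'
# prints: http://example.com/page
```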

then, if you need to spawn extra curl queries based on these links, your existing loop should do:

while read id
do
  curl  ... | sed ...
done \
| while read url
do
  # do another curl and proceed
done

if you need to explore arbitrary depths, you can use a recursive function.
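A minimal sketch of such a recursive function: `follow` and `MAX_DEPTH` are made-up names, not part of the original script, and the depth guard is there to stop the recursion from running forever.

```shell
#!/bin/bash
# Recursive link-follower sketch. "follow" and MAX_DEPTH are illustrative
# names; the depth guard keeps the recursion bounded.
MAX_DEPTH=3

follow() {
    local url=$1 depth=$2
    [ "$depth" -ge "$MAX_DEPTH" ] && return
    curl -s "$url" | tr " " "\n" |
        sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg' |
        while read -r link
        do
            echo "$link"                    # or grep/sed out the data you need
            follow "$link" $((depth + 1))   # descend one level
        done
}

# usage (needs network access):
# follow "http://apps.wcms.ecu.edu.au/semester-timetable/lookup" 0
```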
 

Author Comment

by:madstylex
ID: 39961000
Thanks, I'll give that a bang and get back to this thread asap.
 

Author Comment

by:madstylex
ID: 39964604
Thanks,

The first part of the script works for getting all of the links starting with http.  How would I do it so that it grabs only the links that look like this?  The one below looks like the ones I need to go into.  There should be 3 in total.

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">

 
LVL 26

Expert Comment

by:skullnobrains
ID: 39966228
you don't give enough information for me to answer. you need to figure out what those 3 links have in common

for example, if you want all http links that contain the word "semester", you would replace what is inside the escaped parentheses \( \) in the sed command with
http:\/\/[^"]*semester[^"]*

use the following info to adapt it to your needs.
[^"]* matches any (possibly empty) run of characters other than a double quote
[...]  is a character class (a set of characters to match)
^ at the start of the class negates it
* repeats the preceding item 0 or more times
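Checked against a sample line with the "semester" filter in place (example.com stands in for the real host), the substitution keeps only the matching link:

```shell
# Two candidate links; only the one containing "semester" should survive.
printf '<a href="http://example.com/semester-timetable/x">a</a> <a href="http://example.com/other">b</a>\n' |
    tr " " "\n" |
    sed -n 's/href="\(http:\/\/[^"]*semester[^"]*\)".*/\1/ipg'
# prints: http://example.com/semester-timetable/x
```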

feel free to post if you don't manage but then give me enough information to help (input, code, output, expected output)
 

Author Comment

by:madstylex
ID: 39968948
This is what I have so far.  This script goes into the links that I need to grab the information from:

while read id
do
        curl -s -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup |
                tr " " "\n" |
                sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg' |
                grep 'http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content'
done |
while read url
do
        curl -s "$url" | sed 's/<[^>]*>//g' | grep -E 'CSG5116'
done


The second part (while read url) strips out all of the HTML tags using sed, then greps the fetched pages for the required term.  This seems to be working, except how can I use grep to search for multiple terms?
 
LVL 26

Accepted Solution

by:
skullnobrains earned 500 total points
ID: 39970807
try something like this:
grep 'foo\|bar\|baz'
which would grep for either 'foo', 'bar' or 'baz'
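For example, with the unit codes from the question (note that \| is the basic-regex alternation in GNU grep; grep -E 'foo|bar|baz' is the equivalent extended-regex form):

```shell
# Keep only the lines matching either pattern.
printf 'CSI3207\nXYZ123\nCSG5116\n' | grep 'CSI3207\|CSG5116'
# prints CSI3207 and CSG5116, each on its own line
```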


you might rather use sed if you want to grab some data
for example : sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/pg'
would grab the numbers following the letters CSG
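Checked against a sample line (with the capture group written as \([0-9][0-9]*\)), that substitution pulls out just the digits:

```shell
# Pull out only the digits that follow "CSG".
printf 'Unit CSG5116 - Semester 1\n' |
    sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/pg'
# prints: 5116
```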

it might be easier to help if you posted a sample
