Solved

Script to use curl to follow links in bash

Posted on 2014-03-27
817 Views
1 Endorsement
Last Modified: 2014-05-01
Hi experts,
 
I am quite new to bash scripting and I need to write a script that will follow links using curl in order to grab information.

The script reads standard input from a text file containing:

CSI3207
CSG5116
CSI3208

The script looks like this so far:

while read id
do
        curl -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup
        # need to follow the links in this output somehow
done
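I run it with the year as the first argument and the unit codes piped in on standard input, something like this (timetable.sh and units.txt are just the names I use locally):

bash timetable.sh 2014 < units.txt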



This outputs HTML which contains links that I need to follow to get more information.

I am unsure how to approach the problem from here.  Can anyone help?

Thanks in advance
Question by:madstylex

8 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 39960734
That's a little more tricky, as you have to parse the output to find out what links are there (and what format they are in) in order to perform another curl request.

Do you have an example of a link in the original HTML source that you want to retrieve?
 

Author Comment

by:madstylex
ID: 39960742
Yes I do, here is one of the links:

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">



Just to put things into context, the data in the text file are unit codes.

The aim of the script is to fetch the information from the links and use grep to pull information out in order to automate timetable generation.
 
LVL 26

Expert Comment

by:skullnobrains
ID: 39960997
retrieve the links using sed

curl ... | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'

this will extract all the links starting with "http://" from your page

then if you need to spawn extra curl queries based on this link, your existing code should do :

while read id
do
  curl  ... | sed ...
done \
| while read url
do
  # do another curl and proceed
done

if you need to explore arbitrary depths, you can use a recursive function
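for example, an untested sketch of such a function could look like the following. the function name, the depth limit and the starting url are just placeholders, and you would replace the comment with whatever processing and filtering you actually need :

# recursively extract links and follow them up to a maximum depth
follow() {
  local url=$1 depth=$2
  [ "$depth" -le 0 ] && return
  curl -s "$url" |
    tr " " "\n" |
    sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg' |
    while read link
    do
      # grep/extract whatever you need from $link here
      follow "$link" $((depth - 1))
    done
}

follow 'http://apps.wcms.ecu.edu.au/semester-timetable/lookup' 2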
 

Author Comment

by:madstylex
ID: 39961000
Thanks, I'll give that a bang and get back to this thread asap.

Author Comment

by:madstylex
ID: 39964604
Thanks,

The first part of the script works for getting all of the links starting with http.  How would I do it so that it grabs all of the links that look like the one below?  That is the kind of link I need to go into, and there should be 3 in total.

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">


 
LVL 26

Expert Comment

by:skullnobrains
ID: 39966228
you don't give enough information for me to answer. you need to figure out what those 3 links have in common

for example, if you want all http links that contain the word "semester", you would replace what was inside the parenthesis in the sed with
http:\/\/[^"]*semester[^"]*

use the following info to adapt it to your needs.
[^"]* matches any string that does not contain a double quote
[...]  is a list of characters
^ negates the list
* repeats 0->n times what precedes
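putting it together with the pipeline above, that would give something like

curl ... | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*semester[^"]*\)".*/\1/ipg'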

feel free to post if you don't manage but then give me enough information to help (input, code, output, expected output)
 

Author Comment

by:madstylex
ID: 39968948
This is what I have so far.  This script goes into the links that I need to grab the information from:

while read id
do
        curl -s -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup |
                tr " " "\n" |
                sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg' |
                grep 'http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content'
done \
|
while read url
do
        curl -s "$url" | sed 's/<[^>]*>//g' | grep -E 'CSG5116'
done



The second part (while read url) strips out all of the HTML tags using sed, then greps the fetched pages to match the required term.  This seems to be working, except how can I use grep to search for multiple terms?
 
LVL 26

Accepted Solution

by:
skullnobrains earned 500 total points
ID: 39970807
try something like this
grep 'foo\|bar\|baz'
which would grep for either 'foo', 'bar' or 'baz'


you might rather use sed if you want to grab some data
for example : sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/pg'
would grab the numbers following the letters CSG
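in your case, if you just want to match the three unit codes from your original post, the last grep in your script might become something like

grep -E 'CSI3207|CSG5116|CSI3208'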

it might be easier to help if you posted a sample
