Solved

script to use curl to follow links in bash

Posted on 2014-03-27
8
903 Views
1 Endorsement
Last Modified: 2014-05-01
Hi experts,
 
I am quite new to bash scripting and I need to write a script that will follow links using curl in order to grab information.

The script parses in standard input from a text file containing:

CSI3207
CSG5116
CSI3208

The script looks like this so far:

while read id
do
        curl -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup |
done

Open in new window


This outputs information which contains links that I need to follow to get more information.

I am unsure how to approach the problem from here.  Can anyone help?

Thanks in advance
1
Comment
Question by:madstylex
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
8 Comments
 
LVL 48

Expert Comment

by:Tintin
ID: 39960734
That's a little more tricky as you have to parse the output to find out what links (and the format they are in) in order to perform another curl request.

Do you have an example of a link in the original HTML source that you want to retrieve?
0
 

Author Comment

by:madstylex
ID: 39960742
Yes I do, here is one of the links:

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">

Open in new window


Just to put things into context, the data in the text file are unit codes.

The aim of the script is to fetch the information from the links and use grep to pull information out in order to automate timetable generation.
0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 39960997
retrieve the links using sed

curl ... | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'

this will extract all the links starting with "http://" from your page

then if you need to spawn extra curl queries based on this link, your existing code should do :

while read id
do
  curl  ... | sed ...
done \
| while read url
do
  # do another curl and proceed
done

if you need to explore arbitrary depths, you can use a recursive function
0
Webinar: Aligning, Automating, Winning

Join Dan Russo, Senior Manager of Operations Intelligence, for an in-depth discussion on how Dealertrack, leading provider of integrated digital solutions for the automotive industry, transformed their DevOps processes to increase collaboration and move with greater velocity.

 

Author Comment

by:madstylex
ID: 39961000
Thanks, I'll give that a bang and get back to this thread asap.
0
 

Author Comment

by:madstylex
ID: 39964604
Thanks,

The first part of the script works for getting all of the links starting with http.  How would I do it so that it grabs all of the links that look like this?  The one below looks like the ones I need to go into.  There shoud be 3 in total.

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">

Open in new window

0
 
LVL 27

Expert Comment

by:skullnobrains
ID: 39966228
you don't give enough information for me to answer. you need to figure out what those 3 links have in common

for example, if you want all http links that contain the word "semester", you would replace what was inside the parenthesis in the sed with
http:\/\/[^"]*semester[^"]*

use the following info to adapt it to your needs.
[^"]* matches any string that does not contain a double quote
[...]  is a list of characters
^ negates the list
* repeats 0->n times what precedes

feel free to post if you don't manage but then give me enough information to help (input, code, output, expected output)
0
 

Author Comment

by:madstylex
ID: 39968948
This is what I have so far.  This script goes into the links that I need to grab the information from:

while read id
do
        curl -s -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'| grep http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content

done \
|
while read url
do
        curl -s $url | sed 's/<[^>]*>//g' | grep -E 'CSG5116'

done

Open in new window


The second part (while read url), cuts out all of the metadata using sed, then greps across the links to match the required term.  This seems to be working, except how can I use grep to search for multiple terms?
0
 
LVL 27

Accepted Solution

by:
skullnobrains earned 500 total points
ID: 39970807
try something like this
grep 'foo\|bar\|baz'
which would grep either 'foo' 'bar' or 'baz'


you might rather use sed if you want to grab some data
for example : sed -n 's/.*CSG\(\[0-9][0-9]*).*/\1/pg'
would grab the numbers following the letters CSG

it might be easier to help if you posted a sample
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

It is a general practice to get rid of old user profiles on a computer  in a LAN environment. As I have been working with a company in a LAN environment where users move from one place to some other place at times. This will make many user profil…
This article is meant to give a basic understanding of how to use R Sweave as a way to merge LaTeX and R code seamlessly into one presentable document.
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

733 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question