Solved

script to use curl to follow links in bash

Posted on 2014-03-27
Last Modified: 2014-05-01
Hi experts,
 
I am quite new to bash scripting and I need to write a script that will follow links using curl in order to grab information.

The script reads unit codes on standard input from a text file containing:

CSI3207
CSG5116
CSI3208

The script looks like this so far:

while read -r id
do
        # Look up each unit code; $1 is the year passed to the script
        curl -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup
done


This outputs information which contains links that I need to follow to get more information.

I am unsure how to approach the problem from here.  Can anyone help?

Thanks in advance
Question by:madstylex
 
LVL 48

Expert Comment

by:Tintin
ID: 39960734
That's a little trickier, as you have to parse the output to find the links (and the format they are in) before you can perform another curl request.

Do you have an example of a link in the original HTML source that you want to retrieve?
 

Author Comment

by:madstylex
ID: 39960742
Yes I do, here is one of the links:

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">


Just to put things into context, the data in the text file are unit codes.

The aim of the script is to fetch the information from the links and use grep to pull information out in order to automate timetable generation.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 39960997
retrieve the links using sed

curl ... | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'

this will extract all the links starting with "http://" from your page
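As a rough sketch of the same idea that can be tried offline, the function below pulls absolute http:// links out of HTML read from standard input. It swaps the tr/sed pair for grep -o, which POSIX does not require but GNU and BSD grep both provide. The function name extract_links and the example.com sample are made up for illustration.

```shell
# Hypothetical helper: print every absolute http:// link found in HTML on stdin.
extract_links() {
    # -o prints only the matched part, one match per line; -i ignores case
    grep -oi 'href="http://[^"]*"' |
        # drop everything up to the opening quote, then the closing quote
        sed 's/^[^"]*"//; s/"$//'
}

# Made-up sample page: two absolute links and one relative link
sample='<p><a href="http://example.com/a?x=1#ML">A</a> <a href="/relative">B</a> <a class="c" href="http://example.com/b">C</a></p>'

printf '%s\n' "$sample" | extract_links
```

With the sample above, only the two absolute links are printed; the relative one is skipped because it does not start with http://.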

then if you need to spawn extra curl queries based on this link, your existing code should do :

while read id
do
  curl  ... | sed ...
done \
| while read url
do
  # do another curl and proceed
done

if you need to explore arbitrary depths, you can use a recursive function
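One possible shape for such a recursive function is sketched below. The names fetch_page and crawl are hypothetical, the depth limit is an assumption added to keep the recursion bounded, and nothing here remembers visited URLs, so pages that link to each other are fetched repeatedly until the limit is hit.

```shell
#!/usr/bin/env bash
# Sketch only: follow http:// links recursively up to a maximum depth.

# Hypothetical wrapper so the download step can be replaced or stubbed.
fetch_page() {
    curl -s "$1"
}

# crawl URL DEPTH: print URL, then crawl every link found on that page
# with DEPTH reduced by one. Stops following links when DEPTH reaches 0.
crawl() {
    local url=$1 depth=$2 link
    printf '%s\n' "$url"
    [ "$depth" -le 0 ] && return 0
    fetch_page "$url" |
        grep -oi 'href="http://[^"]*"' |
        sed 's/^[^"]*"//; s/"$//' |
        while read -r link
        do
            crawl "$link" $((depth - 1))
        done
}
```

Calling crawl with a start URL and a depth of 1 would print the start URL followed by every link on that page; a depth of 2 also lists the links found on each of those pages.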
 

Author Comment

by:madstylex
ID: 39961000
Thanks, I'll give that a bang and get back to this thread asap.
 

Author Comment

by:madstylex
ID: 39964604
Thanks,

The first part of the script works for getting all of the links starting with http.  How would I change it so that it grabs only the links that look like this?  The one below looks like the ones I need to follow.  There should be 3 in total.

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">

 
LVL 27

Expert Comment

by:skullnobrains
ID: 39966228
you don't give enough information for me to answer. you need to figure out what those 3 links have in common

for example, if you want all http links that contain the word "semester", you would replace what was inside the parentheses in the sed with
http:\/\/[^"]*semester[^"]*

use the following info to adapt it to your needs.
[^"]* matches any string that does not contain a double quote
[...]  is a list of characters
^ negates the list
* repeats 0->n times what precedes

feel free to post if you don't manage but then give me enough information to help (input, code, output, expected output)
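To see the "semester" pattern in action without hitting the site, here is a small sketch run against a made-up line of HTML. The function name semester_links and the example.edu URLs are assumptions for illustration only; the sed expression is the one described above with the extra word dropped into the group.

```shell
# Hypothetical helper: keep only absolute links whose URL contains "semester".
semester_links() {
    # one token per line, then print the captured URL when the pattern matches
    tr ' ' '\n' |
        sed -n 's/href="\(http:\/\/[^"]*semester[^"]*\)".*/\1/p'
}

# Made-up sample: one link contains "semester", the other does not
sample='<a href="http://example.edu/semester-timetable/lookup?x=1">One</a> <a href="http://example.edu/other">Two</a>'

printf '%s\n' "$sample" | semester_links
# prints: http://example.edu/semester-timetable/lookup?x=1
```

Swapping "semester" for another fixed string, such as sq_content_src, narrows the match in the same way.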
 

Author Comment

by:madstylex
ID: 39968948
This is what I have so far.  This script goes into the links that I need to grab the information from:

while read -r id
do
        # Search for the unit, extract absolute links, keep only the timetable detail links
        curl -s -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                http://apps.wcms.ecu.edu.au/semester-timetable/lookup |
                tr " " "\n" |
                sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg' |
                grep -F 'http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content'
done |
while read -r url
do
        # Fetch each link, strip the HTML tags, and keep lines mentioning the unit
        curl -s "$url" | sed 's/<[^>]*>//g' | grep -E 'CSG5116'
done


The second part (while read url) strips the HTML tags using sed, then greps each fetched page for the required term.  This seems to be working, except how can I use grep to search for multiple terms?
 
LVL 27

Accepted Solution

by:
skullnobrains earned 500 total points
ID: 39970807
try something like this
grep 'foo\|bar\|baz'
which would grep either 'foo' 'bar' or 'baz'
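Two other spellings of the same alternation are worth knowing, since \| in a basic regular expression is a GNU extension. The sketch below uses grep -E and repeated -e options, both of which are standard; the timetable text is invented sample data, not output from the real site.

```shell
# Invented sample lines standing in for a fetched, tag-stripped page
timetable='CSG5116 Lecture Monday
CSI3207 Lab Tuesday
ABC1000 Tutorial Friday'

# Extended regex: | separates the alternatives, no backslash needed
printf '%s\n' "$timetable" | grep -E 'CSG5116|CSI3207'

# One -e per pattern; a line is printed if any pattern matches
printf '%s\n' "$timetable" | grep -e 'CSG5116' -e 'CSI3208'
```

The first command prints the CSG5116 and CSI3207 lines; the second prints only the CSG5116 line, because CSI3208 does not appear in the sample.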


you might rather use sed if you want to grab some data
for example : sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/pg'
would grab the numbers following the letters CSG
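A quick way to check this kind of capture is to feed sed a couple of made-up lines. In the sketch below the group is delimited by \( and \), and [0-9][0-9]* requires at least one digit; the function name csg_numbers and the sample text are assumptions for illustration.

```shell
# Hypothetical helper: print the digits that follow "CSG" on each matching line.
csg_numbers() {
    # -n suppresses default output; p prints only lines where the substitution happened
    sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/p'
}

# Made-up input: the first line has a unit code, the second does not
printf '%s\n' 'Unit CSG5116 Lecture' 'no code here' | csg_numbers
# prints: 5116
```

Because the leading .* is greedy, a line containing several CSG codes yields only the last one.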

it might be easier to help if you posted a sample