Solved

script to use curl to follow links in bash

Posted on 2014-03-27
Last Modified: 2014-05-01
Hi experts,
 
I am quite new to bash scripting and I need to write a script that will follow links using curl in order to grab information.

The script reads unit codes from standard input, redirected from a text file containing:

CSI3207
CSG5116
CSI3208

The script looks like this so far:

while read -r id
do
        curl -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                "http://apps.wcms.ecu.edu.au/semester-timetable/lookup"
done


This outputs information which contains links that I need to follow to get more information.

I am unsure how to approach the problem from here.  Can anyone help?

Thanks in advance
Question by:madstylex
LVL 48

Expert Comment

by:Tintin
ID: 39960734
That's a little trickier, as you have to parse the output to find the links (and the format they are in) before you can perform another curl request.

Do you have an example of a link in the original HTML source that you want to retrieve?
 

Author Comment

by:madstylex
ID: 39960742
Yes I do, here is one of the links:

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">


Just to put things into context, the data in the text file are unit codes.

The aim of the script is to fetch the information from the links and use grep to pull information out in order to automate timetable generation.
 
LVL 27

Expert Comment

by:skullnobrains
ID: 39960997
retrieve the links using sed

curl ... | tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'

this will extract all the links starting with "http://" from your page

then, if you need to run extra curl queries against those links, pipe your existing loop into a second one:

while read id
do
  curl  ... | sed ...
done \
| while read url
do
  # do another curl and proceed
done

if you need to explore arbitrary depths, you can use a recursive function
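
As a minimal sketch of the extraction step above, the following runs the same tr/sed pipeline on a made-up HTML snippet instead of real curl output. The function name extract_links and the example.com URLs are illustrative assumptions, not part of the original page, and the i flag on the s command assumes GNU sed.

```shell
#!/usr/bin/env bash
# extract_links (hypothetical helper name): read HTML on stdin and print
# every href value that starts with http://, one per line.
# Assumption: GNU sed, because the "i" (case-insensitive) flag is a GNU extension.
extract_links() {
    # Put each space-separated token on its own line so that an
    # href="..." attribute sits at the start of a line, then keep
    # only the quoted URL and drop whatever follows it.
    tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg'
}

# Made-up sample standing in for the page that curl would return.
sample_html='<p><a href="http://example.com/a">A</a> <A HREF="http://example.com/b?x=1">B</A> <a href="/relative">C</a></p>'

printf '%s\n' "$sample_html" | extract_links
```

The relative link is skipped because the pattern only accepts values beginning with http://. In the real script, the output of this function would be piped into the second while loop.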

Author Comment

by:madstylex
ID: 39961000
Thanks, I'll give that a bang and get back to this thread asap.
 

Author Comment

by:madstylex
ID: 39964604
Thanks,

The first part of the script works for getting all of the links starting with http.  How would I change it so that it grabs only the links that look like this?  The one below looks like the ones I need to go into.  There should be 3 in total.

<a href="http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content_src=%2BdXJsPWh0dHAlM0ElMkYlMkYxMC42Ny4xMjQuMTMxJTNBNzc4MCUyRmFwcHMlMkZzbXNhcHBzJTJGc2VtZXN0ZXJfdGltZXRhYmxlJTJGdmlld19zZW1fdGFibGVfYWN0aXZpdGllcy5qc3AlM0ZwX3Vvb19pZCUzRDI5MDU3MCZhbGw9MQ%3D%3D#ML">

 
LVL 27

Expert Comment

by:skullnobrains
ID: 39966228
you don't give enough information for me to answer. you need to figure out what those 3 links have in common

for example, if you want all http links that contain the word "semester", you would replace what was inside the parentheses in the sed with
http:\/\/[^"]*semester[^"]*

use the following info to adapt it to your needs.
[^"]* matches any string that does not contain a double quote
[...]  is a list of characters, any one of which may match
^ as the first character inside the brackets negates the list
* repeats whatever precedes it zero or more times

feel free to post if you don't manage but then give me enough information to help (input, code, output, expected output)
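
A small sketch of that narrowed pattern, run against made-up links rather than the real page. The helper name filter_semester_links and the example.com URLs are assumptions for illustration only, and the i flag again assumes GNU sed (it also makes the word "semester" match case-insensitively).

```shell
#!/usr/bin/env bash
# filter_semester_links (hypothetical helper name): like the plain link
# extraction, but only URLs containing "semester" are printed.
# Assumption: GNU sed for the "i" flag.
filter_semester_links() {
    # [^"]*semester[^"]* = any run of non-quote characters, the word
    # "semester", then any further run of non-quote characters.
    tr " " "\n" | sed -n 's/href="\(http:\/\/[^"]*semester[^"]*\)".*/\1/ipg'
}

# Made-up sample input with one link that should be filtered out.
sample_html='<a href="http://example.com/semester-timetable/lookup?id=1">x</a> <a href="http://example.com/news">y</a> <a href="http://example.com/about?ref=semester">z</a>'

printf '%s\n' "$sample_html" | filter_semester_links
```

Only the first and third links are printed, since the middle one has no "semester" anywhere between its quotes.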
 

Author Comment

by:madstylex
ID: 39968948
This is what I have so far.  This script goes into the links that I need to grab the information from:

while read -r id
do
        curl -s -d "p_unit_cd=$id&p_ci_year=$1&cmdSubmit=Search" \
                "http://apps.wcms.ecu.edu.au/semester-timetable/lookup" |
                tr " " "\n" |
                sed -n 's/href="\(http:\/\/[^"]*\)".*/\1/ipg' |
                grep -F 'http://apps.wcms.ecu.edu.au/semester-timetable/lookup?sq_content'
done |
while read -r url
do
        curl -s "$url" | sed 's/<[^>]*>//g' | grep -E 'CSG5116'
done


The second part (while read url) strips the HTML tags from each linked page using sed, then greps the remaining text for the required term.  This seems to be working, but how can I use grep to search for multiple terms?
 
LVL 27

Accepted Solution

by:
skullnobrains earned 2000 total points
ID: 39970807
try something like this
grep 'foo\|bar\|baz'
which would match any line containing 'foo', 'bar' or 'baz'


you might rather use sed if you want to grab some data
for example : sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/p'
would grab the numbers following the letters CSG

it might be easier to help if you posted a sample
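
A minimal sketch of both suggestions, using a made-up table fragment in place of a real timetable page. The helper names strip_tags, match_units and csg_numbers, and the sample rows, are assumptions for illustration. grep -E is used here because it writes the alternation without backslashes; the \| form in a basic regex is a GNU grep extension.

```shell
#!/usr/bin/env bash
# Made-up sample standing in for the HTML that curl -s "$url" would return.
sample_page='<tr><td>CSG5116</td><td>Lecture</td></tr>
<tr><td>CSI3207</td><td>Workshop</td></tr>
<tr><td>MAT1001</td><td>Lecture</td></tr>'

# Remove anything that looks like an HTML tag (same sed expression
# as used in the thread).
strip_tags() {
    sed 's/<[^>]*>//g'
}

# Keep lines mentioning either unit code; with -E the | means "or".
match_units() {
    grep -E 'CSG5116|CSI3207'
}

# Print only the digits that follow "CSG" on lines that contain it.
csg_numbers() {
    sed -n 's/.*CSG\([0-9][0-9]*\).*/\1/p'
}

printf '%s\n' "$sample_page" | strip_tags | match_units
printf '%s\n' "$sample_page" | strip_tags | csg_numbers
```

If the search terms contain no regex syntax at all, grep -F -e 'CSG5116' -e 'CSI3207' is another way to match several fixed strings.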