
Cfhttp array elements

I have an array full of URLs. The URLs are complete and each can be retrieved with cfhttp manually. Inside the loop, the first URL is retrieved by cfhttp, but the connection fails at the next URL. There are several URLs, but only the first can be retrieved. Confused.
<cfset StripLinks = REReplaceNoCase(Mid(cfhttp.FileContent, Match.pos[1], Match.len[1]), "=[""]", "http://www.rspb.org.uk/vacancies/index.asp", "ALL")>
<cfset LinksArray = ListToArray(StripLinks)>
<cfloop from="1" to="#arrayLen(LinksArray)#" index="i">
<cfhttp method="get" url="#LinksArray[i]#">
<cfdump var="#cfhttp#">
</cfloop>

8riaN

While it's likely you have already done this, just to be sure...
have you ruled out a problem with the match, by having it just dump the URLs to the output instead of <cfhttp>-ing them inside the LinksArray loop?

Tacobell777

after you do your rereplace is the string still a list?
when you dump LinksArray is it a valid array?

VHSB

ASKER

8riaN, I have already checked that, see below thanks.

Tacobell, after I do my replace, the string is still a list.
However, when I take the cfhttp out and cfdump var="#LinksArray[i]#", I get the URLs I expect, e.g.:
http://www.rspb.org.uk/vacancies/index.asp?id=3570305
http://www.rspb.org.uk/vacancies/index.asp?id=3510305
http://www.rspb.org.uk/vacancies/index.asp?id=3440305
http://www.rspb.org.uk/vacancies/index.asp?id=3430305

But when I cfdump var="#LinksArray[i]#" with the cfhttp code, I don't get the URLs I expect, e.g.:
http://www.rspb.org.uk/vacancies/index.asp?id=3570305
;id=3570305

VHSB

ASKER

The results I'm getting from cfdump var="#LinksArray[i]#" with the cfhttp code are pretty strange. Is this a problem with the loop?

VHSB

ASKER

I've increased the points, guys. I've made no progress with this for 2 days and I've tried everything I can think of. Cheers

8riaN

Something is definitely off here. I suggest the following 2 ways to try to expose what is really happening. It could still be a problem with the match that is hidden by the browser, so we need to rule that out before considering the messy alternatives.

If you haven't done this yet, do it and in any case tell us what happens when you do.

1) Write a loop to output the list instead of using cfdump, so you can more easily check the HTML source (without all the extra stuff <cfdump> adds), and examine the source to make sure there isn't extra stuff the browser isn't showing you.

2) Change the <cfhttp> to <cflocation> and see what URL you are browsing to, then maybe do one <cfhttp> and do LinksArray[2] as a <cflocation>.

Post the results. And while you're at it, would you please post the Regular Expression you are using? And maybe a little of the raw data (if it's not confidential) which it is running on?

Thanks, we'll get this,

8riaN

VHSB

ASKER

Thanks for the reply 8riaN, I was beginning to get desperate:
1) I output the list without using cfhttp and I got the complete list of links. However, when I output the list with the cfhttp code included I got a strange result. Here is the output:
http://www.rspb.org.uk/vacancies/index.asp?id=3570305 ;id=3570305

That only displays the first link and an additional id?

2) I'm not familiar with <cflocation> as this is my first CF project. Give me a few minutes to look it up and how to use it.

Here is the regex that parses the links from the index page, for rspb.org.uk:
[[:punct:]]+id=(.*?)[[:digit:]]+

The ReReplace is in my first post.

Thanks

Sam

8riaN

can I see the first few links worth of raw data?

VHSB

ASKER

Sorry I forgot to post the html I am parsing:

<tr class="alt"><td><a href="?id=3490305">Centre Assistant</a></td><td>East Scotland</td><td>1/4/2005</td></tr>
<tr><td><a href="?id=3480305">Centre Assistant</a></td><td>East Scotland</td><td>1/4/2005</td></tr>
<tr class="alt"><td><a href="?id=3400305">Visitor & Publicity Officer</a></td><td>Essex</td><td>1/4/2005</td></tr>
<tr><td><a href="?id=3370305">Visitor & Publicity Officer</a></td><td>Lancashire</td><td>31/3/2005</td></tr>

VHSB

ASKER

I used the <cflocation> instead of the <cfhttp> and it gave me the webpage of the first URL, but nothing else.

Regards

Sam

8riaN

ok, I think you have to rework the whole regExp search. First one quick note:

the symbol "?" means "match zero or one of the preceding expression"
"*" means zero or more,
so .*? does not contain a literal question mark at all: the ? is read as a modifier on .* (ColdFusion MX and later treat *? as a non-greedy "zero or more")
which makes the expression confusing to read
you need to escape the ? if you want the literal ?, like this: \?
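
To see what the unescaped ? does in practice, here is a minimal sketch. The sample string is one link copied from the posted HTML, and the variable names are made up for illustration:

```cfm
<!--- One link in the same shape as the page being parsed --->
<cfset sample = '<a href="?id=3490305">Centre Assistant</a>'>

<!--- \? is a literal question mark, so the match starts at the opening quote --->
<cfset withEscape = REFind('"\?id=[[:digit:]]+"', sample, 1, True)>

<!--- Unescaped, ? makes the opening quote optional; the match starts at "id=" instead --->
<cfset noEscape = REFind('"?id=[[:digit:]]+"', sample, 1, True)>

<cfoutput>
escaped: #HTMLEditFormat(Mid(sample, withEscape.pos[1], withEscape.len[1]))#<br>
unescaped: #HTMLEditFormat(Mid(sample, noEscape.pos[1], noEscape.len[1]))#
</cfoutput>
```

Both patterns find something, but they match different text, which is the kind of surprise an unescaped ? tends to cause.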

Instead of trying to do this all in one go, I suggest writing a loop. Since the pattern you want to recognize includes stuff you want and stuff you don't want, just run through and build a list of what you do want. It is much less vulnerable and better suited to how RegExps work.
Since you have well behaved data, I'd suggest taking advantage of it to reduce confusion like so:

<cfset regExp='"\?id=([[:digit:]]+)"'>

then make sure to use only the subexpression (.pos[2], .len[2]) to build a list, as follows:
<cfset idList="">
<cfset findID=REFind(regExp,cfhttp.FileContent,1,True)>
<cfloop condition="findID.pos[1]">
<cfset idList=ListAppend(idList,mid(cfhttp.FileContent,findID.pos[2],cfhttp.FileContent,findID.len[2])>
<cfset findID=REFind(regExp,cfhttp.FileContent,findID.pos[1]+findID.len[1],True)>
</cfloop>

then
<cfloop from="1" to="#ListLen(idList)#" index="i">
<cfhttp method="get" url="http://www.rspb.org.uk/vacancies/index.asp?ID=#ListGetAt(idList,i)#">
<cfdump var="#cfhttp#">
</cfloop>

It's a much cleaner, easier to debug method, even if it takes a few more lines of code.

Tell me if that works for you.
8riaN

VHSB

ASKER

Much appreciated 8riaN.
I'm getting a problem with <cfset idList=ListAppend(idList,mid(cfhttp.FileContent,findID.pos[2],cfhttp.FileContent,findID.len[2]))> as the function takes three parameters? Any ideas?
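
For reference, Mid takes three arguments (string, start, count), so the second cfhttp.FileContent in that call looks like a stray paste. Below is an untested sketch of the loop with that one change. To keep it self-contained, pageContent is a stand-in for cfhttp.FileContent built from two of the posted rows, and the condition is written as an explicit GT 0 comparison; both are my assumptions, not part of the original suggestion:

```cfm
<!--- Stand-in for cfhttp.FileContent: two rows from the HTML posted above --->
<cfset pageContent = '<tr class="alt"><td><a href="?id=3490305">Centre Assistant</a></td></tr><tr><td><a href="?id=3480305">Centre Assistant</a></td></tr>'>

<cfset regExp = '"\?id=([[:digit:]]+)"'>
<cfset idList = "">
<cfset findID = REFind(regExp, pageContent, 1, True)>

<!--- pos[1] is 0 once there are no further matches, which ends the loop --->
<cfloop condition="findID.pos[1] GT 0">
    <!--- pos[2] and len[2] describe the parenthesised subexpression, i.e. just the digits --->
    <cfset idList = ListAppend(idList, Mid(pageContent, findID.pos[2], findID.len[2]))>
    <!--- Resume the search immediately after the previous whole match --->
    <cfset findID = REFind(regExp, pageContent, findID.pos[1] + findID.len[1], True)>
</cfloop>

<cfoutput>#idList#</cfoutput>
```

Copying the page into its own variable before any further <cfhttp> calls also avoids relying on cfhttp.FileContent after a later request has replaced it.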

ASKER CERTIFIED SOLUTION

8riaN


VHSB

ASKER

8riaN, it works a treat mate, thank you very much for your perseverance with this one.

Thanks again

8riaN

Hooray!

VHSB

ASKER

8riaN,
if I needed to add several other different index pages to parse, how flexible would the code be?

8riaN

totally flexible, the question boils down to whether the RegExp works with the new page, i.e. will you always find the id in a link where href="?id=#####" . If so great, if not, you need a new RegExp to use for that file

say the link was href="/search/search.asp?searchID=#####"

you'd make a new RegExp like this:
regExp2='href="/search/search\.asp\?searchID=([[:digit:]]*)"'
remembering to escape things like . and ? and $ in the URL search string (e.g. search\.asp for search.asp)

then all the code is the same until you actually make the URL:
<cfhttp method="get" url="http://www.rspb.org.uk/search/search.asp?ID=#ListGetAt(idList,i)#">

QED

VHSB

ASKER

8riaN, sorry to flog this to death but:

I've added a second website to scrape: http://www.website2/news/jobs.asp

The raw data for this page: <A HREF='job_brief.asp?ID=491'>

A complete job URL should look like this: http://www.website2/news/jobs_brief.asp?id=495

Therefore I used the same regex but changed the URL to:
http://www.english-nature.org.uk/news/job_brief.asp?ID=#Links2[i]#

Looks right? But wait for it... I only get the first id number. Where am I going wrong?

Regards

8riaN

This RegExp:
"\?id=([[:digit:]]+)"
expects a url like this:
href="?id=###"
because it has the quotes.

You need to change it to reflect the fact that the ? is not immediately preceded by " like this, for example:
job_brief.asp\?id=([[:digit:]]+)

Take some time to read both Regular Expressions; you should be able to generalize to your next task from that. Just remember that the more specific you are, the less vulnerable to false matches you are.

Also, I ought to mention that scraping web sites can run afoul of intellectual property rights, so know the policies of the sites you are scraping and, if you don't own the data, get permission if possible.

(I had to add that last bit once you actually used the word "scrape")

Cheers,
8riaN

VHSB

ASKER

8rian,
Thanks for the advice. Three websites have been contacted and agreed to let us perform the "scrapes", as long as we adhere to a couple of conditions.

With regards to the regex, I have tried job_brief.asp\?id=([[:digit:]]+) yet I still only get the first id.

8riaN

Please post enough of the data that I can see at least two IDs, and the code that replaces this:
<cfset idList="">
<cfset findID=REFind(regExp,cfhttp.FileContent,1,True)>
<cfloop condition="findID.pos[1]">
<cfset idList=ListAppend(idList,mid(cfhttp.FileContent,findID.pos[2],findID.len[2]))>
<cfset findID=REFind(regExp,cfhttp.FileContent,findID.pos[1]+findID.len[1],True)>
</cfloop>

VHSB

ASKER

8riaN, please ignore my last post, I've identified my problem. I had used REFindNoCase for <cfset findID=REFind(regExp,cfhttp.FileContent,1,True)> and REFind for <cfset findID=REFind(regExp,cfhttp.FileContent,findID.pos[1]+findID.len[1],True)>. Once I changed them both to REFind it worked, but I'm not quite sure why that would make the difference, any ideas?

Thanks for your advice with this, i appreciate it.

S

8riaN

That sure does seem backwards...

If it were reversed and you had the id in LCase and it worked when you switched to REFindNoCase, I would get it, but you're describing the opposite.

Hmm. Is it possible that case is not consistent throughout the data? In this case, they should both be NoCase. You should eyeball the data to make sure you're not dropping anything if you haven't already done it. This explanation is inconsistent with your description of the symptoms, but eyeballing the data would catch other problems as well, so should be done. Are you getting the same number of links that exist in the original?

But if you know you're getting all the links, this goes in the "curious behaviours which I'll look out for next time" file because, as they say, if it ain't broke, don't fix it.

8riaN

VHSB

ASKER

8riaN, you're right, it does seem backwards, or maybe it's me that's backwards :)
I did switch it to REFindNoCase and it worked, not the other way round.

Thanks again :)

8riaN

That makes much more sense.

I noticed that I used id and the data used ID - my bad, sorry about that.

But it does mean that your data is likely well behaved. I'd still count the links anyway, or at least check the first, last and a couple random spot checks if there are too many to count.

8riaN
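
To illustrate the id/ID point with the second site's markup, here is a sketch that uses REFindNoCase for both the first search and the search inside the loop. The sample string is assembled from the one raw line and the two ID values (491 and 495) that appear in this thread. The "link text" filler, the pageContent name and the escaped dot in the pattern are my additions:

```cfm
<!--- Stand-in for cfhttp.FileContent, built from the posted line: uppercase ID, single quotes --->
<cfset pageContent = "<A HREF='job_brief.asp?ID=491'>link text</A> <A HREF='job_brief.asp?ID=495'>link text</A>">

<!--- Lowercase id in the pattern still matches ID because both calls ignore case --->
<cfset regExp = "job_brief\.asp\?id=([[:digit:]]+)">
<cfset idList = "">
<cfset findID = REFindNoCase(regExp, pageContent, 1, True)>

<cfloop condition="findID.pos[1] GT 0">
    <cfset idList = ListAppend(idList, Mid(pageContent, findID.pos[2], findID.len[2]))>
    <!--- The follow-up search must also be the NoCase variant, or the loop stops after one id --->
    <cfset findID = REFindNoCase(regExp, pageContent, findID.pos[1] + findID.len[1], True)>
</cfloop>

<cfoutput>#idList#</cfoutput>
```

Mixing REFindNoCase for the first call with REFind inside the loop would give exactly the "only the first id" symptom described above, since the case-sensitive call never matches ID.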