ASP.NET with regex help

I have a recursive function listed below. It is used to pull all the links from a page source. The problem is that links on web pages are often relative, like href=/sale.htm, and I need them to be full URLs.

I have the full URL initially going into the app code and want to maintain it through the recursive looping logic.

Here is my link-finding code:
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(?i)(<A.*?>.*?</A>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}
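For reference, the class above can be exercised standalone. This sketch assumes LinkItem is just an Href/Text pair (its definition isn't shown in the post):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class LinkItem
{
    public string Href;
    public string Text;
}

public static class LinkFinderDemo
{
    // Same extraction logic as the Find method in the post.
    public static List<LinkItem> Find(string html)
    {
        var list = new List<LinkItem>();
        foreach (Match m in Regex.Matches(html, @"(?i)(<A.*?>.*?</A>)", RegexOptions.Singleline))
        {
            string value = m.Groups[1].Value;
            var item = new LinkItem();

            // Pull the href attribute out of the anchor tag.
            Match href = Regex.Match(value, @"href=\""(.*?)\""", RegexOptions.Singleline);
            if (href.Success)
                item.Href = href.Groups[1].Value;

            // Strip inner tags, leaving the link text.
            item.Text = Regex.Replace(value, @"\s*<.*?>\s*", "", RegexOptions.Singleline);
            list.Add(item);
        }
        return list;
    }

    public static void Main()
    {
        string html = "<p><a href=\"/sale.htm\">Sale</a> <a href=\"http://example.com/x.htm\">X</a></p>";
        foreach (LinkItem i in Find(html))
            Console.WriteLine(i.Href + " -> " + i.Text);
        // /sale.htm -> Sale
        // http://example.com/x.htm -> X
    }
}
```

Note that the first href comes back exactly as written in the page (/sale.htm), which is the relative-URL problem the question is about.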


Below is my app code that uses this:
private void runEmailScrapeApp() {

        //string pageSource = "";
        string URLToSearch = (string.IsNullOrEmpty(txtSearchText.Text)) ? txtSearchURL.Text : googleURLmaker();

        int iPageDepth = Convert.ToInt32(txtNumbPagesToUse.Text);

        int searchID = saveSeachDataToDB(DateTime.Now.ToString(), rdoSearchType.SelectedValue.ToString(), txtSearchText.Text, txtSearchURL.Text, chkDeepLink.Checked, iPageDepth, txtTags.Text);
        findlinks(searchID, URLToSearch, iPageDepth);


        //redirect to page to pull from db and display data or display below in grid...
        DisplayResults();
    }

    private void findlinks(int searchID, string URL, int reccursiveCycleNumb)
    {
        if (reccursiveCycleNumb == 0 || string.IsNullOrEmpty(URL)){
            return; 
        }
        
        WebClient w = new WebClient();
        string pageSource = w.DownloadString(URL).ToLower();

        //save all emails on this pageSource to the DB
        foreach (EmailAddressItem oEmail in EmailFinder.FindEmails(pageSource))
        {
            saveEmailDataToDB(searchID, oEmail.mailTo.ToString(), URL);
        }


        //recursive action here
        foreach (LinkItem i in LinkFinder.Find(pageSource))
        {
            findlinks(searchID, i.Href, reccursiveCycleNumb - 1);
        }

    }
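The relative-to-absolute conversion being asked about is exactly what System.Uri's combining constructor does: it resolves a relative href against the page's own URL, handling both root-relative (/home.aspx) and document-relative (team.aspx) links. A minimal sketch, with MakeAbsolute as a hypothetical helper name (not part of the original code):

```csharp
using System;

public static class UrlHelper
{
    // Combine a page's base URL with an href that may be relative or absolute.
    // Returns null when the href cannot be resolved or is not an http(s) link.
    public static string MakeAbsolute(string baseUrl, string href)
    {
        Uri baseUri;
        if (!Uri.TryCreate(baseUrl, UriKind.Absolute, out baseUri))
            return null;

        // Uri.TryCreate(baseUri, href, ...) resolves relative hrefs against
        // the base page and passes absolute hrefs through unchanged.
        Uri combined;
        if (!Uri.TryCreate(baseUri, href, out combined))
            return null;

        // Skip mailto:, javascript:, ftp:, etc.
        if (combined.Scheme != Uri.UriSchemeHttp && combined.Scheme != Uri.UriSchemeHttps)
            return null;

        return combined.AbsoluteUri;
    }
}
```

With this, the recursive call can become findlinks(searchID, UrlHelper.MakeAbsolute(URL, i.Href), reccursiveCycleNumb - 1); the existing IsNullOrEmpty guard at the top of findlinks then skips links that don't resolve to http/https.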


supergirl2008Asked:

ambienceCommented:
How about something like this?
string fixURL(string baseUrl, string pageUrl)
{
   if(pageUrl[0] == @"/")
       return baseUrl + pageUrl.Right(1);
return baseUrl + pageUrl;
}
findlinks(searchID, fixURL(URL, i.Href), reccursiveCycleNumb - 1);

ambienceCommented:
Sorry .. rusty memories

if(pageUrl.StartsWith(@"/"))
      return baseUrl + pageUrl.Substring(1);
return baseUrl + pageUrl;
supergirl2008Author Commented:
You need a dynamic way to get the page base, because the starting point for the recursive search can be something like site.com/About-Us/Contact.aspx with a link like href=/home.aspx.

Then your code would break down.

supergirl2008Author Commented:
To be clear, on the first run of this code:

the input param URL can be something like "http://localhost:55588/testing/testingPages/p1.aspx",
and the i.Href param fed into the recursive function call can come back as a relative path link, and I need a fully qualified URL to be fed into the function.


private void findlinks(int searchID, string URL, int reccursiveCycleNumb)
    {
        if (reccursiveCycleNumb == 0 || string.IsNullOrEmpty(URL)){
            return;
        }
       
        WebClient w = new WebClient();
        string pageSource = w.DownloadString(URL).ToLower();

        //save all emails on this pageSource to the DB
        foreach (EmailAddressItem oEmail in EmailFinder.FindEmails(pageSource))
        {
            saveEmailDataToDB(searchID, oEmail.mailTo.ToString(), URL);
        }


        //recursive action here
        foreach (LinkItem i in LinkFinder.Find(pageSource))
        {
            findlinks(searchID,  i.Href, reccursiveCycleNumb - 1);
        }

    }
ambienceCommented:
How about something like this:
 

private void findlinks(int searchID, string URL, int reccursiveCycleNumb)
{
	Uri baseUri = null; 
	if(!Uri.TryCreate(URL, UriKind.Absolute, out baseUri))
		return;

	// i.Href below is the link taken from the enclosing foreach loop
	Uri relUri = new Uri(i.Href, UriKind.RelativeOrAbsolute);
	if(relUri.IsAbsoluteUri)
	{
		if(relUri.Scheme != "http")
			return;
		findlinks(searchID, i.Href, reccursiveCycleNumb - 1);
		return;
	}

	string baseStr = baseUri.GetLeftPart(UriPartial.Path);
	if(baseStr.LastIndexOf(@"/") != -1)
		baseStr = baseStr.Substring(0, baseStr.LastIndexOf(@"/") );

	findlinks(searchID, baseStr + i.Href, reccursiveCycleNumb - 1);
}
 


supergirl2008Author Commented:
Also, how do I exclude links to PDF files, .doc files, and other non-web-page links? I want to gather only web pages.
0
ambienceCommented:
Check if i.Href ends with ".htm" or ".html" and skip it if not.
BTW, that is only a trivial way of checking for non-HTML data and will fail a lot. For example, this sample URL will fail:

http://www.someserver.com/this-is-a-doc-that-i-dont-know-the-extension-of

The important point to keep in mind is that an HTTP URL is a pointer to a "resource" on the web server, not a "file" (in the usual sense). Whether index.html returns an HTML document or a Word document is up to the web server, and the only reliable way to check is to examine the headers of the returned object. Discard the data if it's anything other than HTML. Better still, fetch only the headers before downloading the entire document.
You can examine the "Content-Type" header to find out the type of data returned.
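A sketch of that header check, using the same-era APIs the thread already uses (HttpWebRequest with a HEAD request); the ContentTypeChecker class and its method names are mine, not an existing library:

```csharp
using System;
using System.Net;

public static class ContentTypeChecker
{
    // True when a Content-Type header value denotes an HTML page,
    // e.g. "text/html; charset=utf-8".
    public static bool IsHtml(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return false;
        // Strip any parameters (charset etc.) and normalize case.
        string mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();
        return mediaType == "text/html" || mediaType == "application/xhtml+xml";
    }

    // Fetch only the headers (HEAD request) and report whether the
    // resource claims to be HTML, without downloading the body.
    public static bool LooksLikeHtmlPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return IsHtml(response.ContentType);
        }
    }
}
```

In findlinks, calling LooksLikeHtmlPage(i.Href) before recursing would skip PDFs, Word documents, and other non-page resources regardless of their file extension. (Some servers reject HEAD requests; a fallback is a GET where you check ContentType and discard the body.)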
 

ambienceCommented:
How is that a "no answer to be had"?
- "also how to exclude links to pdf files .." implies there was at least a partial answer
- The last comment is not by the author; he/she didn't come back!
Terry WoodsIT GuruCommented:
I recommend http:#32236260 and http:#32225895 be accepted as the answer, as together they appear to answer the question plus the additional problems raised by the author. The author seems to want to do something that isn't possible through that technique, but they are now much closer to a solution thanks to the experts' contributions.
Question has a verified solution.