Regex to pull only links to web pages

I have a class that pulls links out of a web page. I'm having trouble with the regex because it often pulls in dirty data. I want it to match only real web pages, not mailto links or links to .pdf, .doc, or .ppt files.

Can anyone suggest the correct regex for the way I'm using it below?
using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.Collections.Generic;
using System.Text.RegularExpressions;



public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}



/// <summary>
/// Finds anchor tags in an HTML string and returns each link's href and text.
/// </summary>
public static class LinkFinder
{
    private const string _LINK_REGEX = "href=\"[a-zA-Z./:&\\d_-]+\"";

    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all anchor elements.
        MatchCollection m1 = Regex.Matches(file, @"(?i)(<A.*?>.*?</A>)", RegexOptions.Singleline);
        //MatchCollection m1 = Regex.Matches(file, _LINK_REGEX, RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
                
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}

Asked by: supergirl2008
 
Terry Woods (IT Guru) commented:

<\s*a\s+[^>]*href\s*=\s*[\"']?((?![^\"' >]*(?:doc|pdf|gif)['\" >])[^\"' >]+)[\"' >]
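To show how a pattern of this shape might be dropped into the C# code from the question, here is a minimal sketch. The class and method names (PageLinkFilter, FindPageLinks) are hypothetical. The literal "\." before the extension, the added ppt extension, and the mailto: check are assumptions layered on top of the pattern above to cover the file types named in the question; they are not part of the posted answer. The pattern is written as a verbatim string, so each double quote is doubled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageLinkFilter
{
    // Same shape as the pattern above, as a verbatim string ("" is one quote).
    // Assumed additions: "\." before the extension, "ppt", and the mailto: check.
    private const string Pattern =
        @"<\s*a\s+[^>]*href\s*=\s*[""']?" +
        @"((?!mailto:)(?![^""' >]*\.(?:doc|pdf|ppt|gif)[""' >])[^""' >]+)[""' >]";

    // Returns the href value of every anchor that passes the filter.
    public static List<string> FindPageLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            links.Add(m.Groups[1].Value); // group 1 captures the href value
        }
        return links;
    }
}
```

Because the extension test only looks at the end of the href value, a link such as report.pdf?download=1 would still get through; handling query strings would need a further change.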

 
Terry Woods (IT Guru) commented:
Try this. Do you need to set an ignore case option too?
MatchCollection m1 = Regex.Matches(file, @"(?i)(<A[^>]*href\s*=\s*['"](?!mailto|[^'"]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)", RegexOptions.Singleline);

 
Terry Woods (IT Guru) commented:
Oh, and the double quotes in the pattern do need escaping: inside a C# verbatim string (@"..."), write each one as "".
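For reference, a sketch of the suggested pattern with its double quotes doubled so that it compiles as a verbatim string. The class and method names (AnchorMatcher, FindAnchors) are hypothetical, and the inline (?i) has been swapped for RegexOptions.IgnoreCase, which should behave the same here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorMatcher
{
    // Inside @"...", a literal double quote is written as "".
    private const string AnchorPattern =
        @"(<a[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</a>)";

    // Returns the full text of each anchor element that passes the filter.
    public static List<string> FindAnchors(string html)
    {
        var anchors = new List<string>();
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, AnchorPattern, options))
        {
            anchors.Add(m.Groups[1].Value); // group 1 is the whole <a>...</a> element
        }
        return anchors;
    }
}
```

Each returned string is a whole anchor element, so it can go through the same href and text extraction steps as in the question's loop. Note that the extension check is not tied to the end of the URL, so an href containing ".doc" anywhere (for example my.docs/page.html) would also be skipped.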

 
supergirl2008 (Author) commented:
Also, what about full paths vs. relative paths found in href attributes?
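One possible way to handle this, offered as a sketch rather than as anything proposed in the thread: leave the regex to capture the href value, then resolve it against the address of the page it came from with System.Uri. The class name LinkNormalizer, the method ToPageUrl, and the list of skipped extensions are all hypothetical. Checking the scheme also filters out mailto: links.

```csharp
using System;
using System.IO;

public static class LinkNormalizer
{
    // Assumed list of file types to skip; adjust as needed.
    private static readonly string[] SkippedExtensions = { ".pdf", ".doc", ".ppt", ".gif" };

    // Returns the absolute URL for href, or null if it is not an http(s) page link.
    public static string ToPageUrl(Uri pageUri, string href)
    {
        Uri absolute;
        // Resolves relative hrefs against pageUri; absolute hrefs are kept as they are.
        if (!Uri.TryCreate(pageUri, href, out absolute))
            return null;

        // Drops mailto:, javascript:, ftp: and any other non-web scheme.
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        // AbsolutePath excludes the query string, so report.pdf?x=1 is still caught.
        string extension = Path.GetExtension(absolute.AbsolutePath).ToLowerInvariant();
        if (Array.IndexOf(SkippedExtensions, extension) >= 0)
            return null;

        return absolute.AbsoluteUri;
    }
}
```

With a page address of http://example.com/dir/page.html, an href of ../about.html resolves to http://example.com/about.html, while mailto:a@b.com and /docs/report.PDF both come back as null.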
 
supergirl2008 (Author) commented:
I want to do this:
http://onaje.com/content/working-regular-expressions-href-url-extractor

but I also want to exclude any links to .doc, .pdf, and .gif files.

What regex do I add to this guy's code to exclude the unwanted ones?
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]
 
supergirl2008 (Author) commented:
That regex should read:
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]

Question has a verified solution.
