
Regex to pull only links to web pages

I have code for a class that pulls links out of a web page. I'm having trouble with the regex because it often pulls in dirty data. I want it to match only real web pages, not mailto links or links to .pdf, .doc, or .ppt files.

Can anyone suggest the correct regex for the way I'm using it below?

using System;
using System.Data;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.Collections.Generic;
using System.Text.RegularExpressions;



public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}



/// <summary>
/// Finds anchor tags in an HTML string and extracts each link's href and text.
/// </summary>
public static class LinkFinder
{
    private const string _LINK_REGEX = "href=\"[a-zA-Z./:&\\d_-]+\"";

    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        MatchCollection m1 = Regex.Matches(file, @"(?i)(<A.*?>.*?</A>)", RegexOptions.Singleline);
        //MatchCollection m1 = Regex.Matches(file, _LINK_REGEX, RegexOptions.Singleline);
        

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
                
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}


Terry Woods

Try this. Do you need to set an ignore case option too?

MatchCollection m1 = Regex.Matches(file, @"(?i)(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)", RegexOptions.Singleline);

Terry Woods

Oh, and since the pattern sits in a verbatim string (@"..."), the double quotes inside it need to be doubled (""), as in the line above.
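
To see what the negative lookahead in the pattern above keeps and rejects, here is a minimal sketch that runs it over a few sample anchors. The class name AnchorFilterDemo and the method FindPageAnchors are made up for illustration; the pattern is the one suggested above, with the inner double quotes doubled as a C# verbatim string requires.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AnchorFilterDemo
{
    // The suggested pattern; "" is how a verbatim string spells a double quote.
    private const string Pattern =
        @"(<A[^>]*href\s*=\s*['""](?!mailto|[^'""]*\.(?:pdf|doc|ppt))[^>]*>.*?</A>)";

    // Returns the full text of every anchor the pattern accepts.
    public static List<string> FindPageAnchors(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern,
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

One thing to watch for: the lookahead is not anchored to the end of the href value, so it rejects any URL that contains ".doc", ".pdf" or ".ppt" anywhere, including a host name such as www.docs.example.com.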

Dhanasekaran Sengodan

Try this:

http://stackoverflow.com/questions/1870682/regex-to-pull-specific-links-out-of-rss-feed

supergirl2008

ASKER

Also, what about full paths vs. relative paths found in href attributes?
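
One way to handle both cases is to resolve every href against the URL of the page it came from, so relative and absolute values end up in the same form. The sketch below assumes you know the page's URL; HrefResolver and TryResolve are hypothetical names, while Uri.TryCreate and the Uri.UriSchemeHttp / Uri.UriSchemeHttps constants are part of System.Uri.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class HrefResolver
{
    // Resolves href against the page's URL. Returns true only when the
    // result is an http or https link; mailto: and other schemes return false.
    public static bool TryResolve(Uri pageUri, string href, out Uri absolute)
    {
        // An href that is already absolute is returned as-is;
        // a relative one is combined with pageUri.
        if (!Uri.TryCreate(pageUri, href, out absolute))
            return false;

        return absolute.Scheme == Uri.UriSchemeHttp
            || absolute.Scheme == Uri.UriSchemeHttps;
    }
}
```

Checking the scheme after resolving also filters out mailto links without needing a regex for them.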

supergirl2008

ASKER

I want to do this:
http://onaje.com/content/working-regular-expressions-href-url-extractor

But I also want to exclude any links to .doc, .pdf and .gif ... files.

What regex do I add to this guy's code to exclude the ones I don't want?
<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]

supergirl2008

ASKER

That regex should read...

<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]
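
As an alternative to growing the regex, the pattern above can be left as it is and the unwanted links dropped in code after extraction. This is only a sketch under assumptions: PageLinkFilter, Extract and GetExtension are hypothetical names, and the blocked extension list is a guess at what should be excluded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PageLinkFilter
{
    // The href extractor quoted above, with quotes doubled for a verbatim string.
    private const string HrefPattern =
        @"<\s*a\s+[^>]*href\s*=\s*[""']?([^""' >]+)[""' >]";

    // Assumed block list; extend it as needed.
    private static readonly HashSet<string> BlockedExtensions =
        new HashSet<string> { ".pdf", ".doc", ".ppt", ".gif" };

    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, HrefPattern, RegexOptions.IgnoreCase))
        {
            string href = m.Groups[1].Value;
            if (href.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase))
                continue; // not a web page
            if (BlockedExtensions.Contains(GetExtension(href)))
                continue; // link to a document or image
            links.Add(href);
        }
        return links;
    }

    // Lower-cased extension of the last path segment, ignoring any
    // query string or fragment; empty string when there is none.
    private static string GetExtension(string href)
    {
        int cut = href.IndexOfAny(new[] { '?', '#' });
        string path = cut >= 0 ? href.Substring(0, cut) : href;
        int slash = path.LastIndexOf('/');
        int dot = path.LastIndexOf('.');
        return dot > slash ? path.Substring(dot).ToLowerInvariant() : "";
    }
}
```

Because the extension is read from the end of the path only, a host name that happens to contain ".doc" is not rejected, and adding another file type is a one-line change to the set.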


ASKER CERTIFIED SOLUTION

Terry Woods


This solution is only available to members.
