Stevod2


Regex to extract the company name from a copyright notice

I need to extract the company name from the usual forms of copyright on a web page. The description is as follows:

1. Match the copyright indicator, which is one of:
&copy; as an HTML entity
"copyright" as text
the copyright symbol as a character
"(c)" as text

2. Capture all the text between the copyright symbol and the next closing punctuation mark or tag.

Not really sure how to start!

Thanks
Stevod2
Fernando Soto

Hi Stevod2;

Try this string pattern.

(?:\u00A9|(c)|copyright)\s*?([^\.>]+)

Fernando
Start with your copyright options:

    (?<=&copy;|copyright|©|\(c\))    -- match any of the copyright indicators

Then capture all the text

    [^.?,<>]+                                 -- match one or more ( + ) characters which are NOT ( [^] ) mentioned here ( .?,<> )

Then put it together

    (?:&copy;|copyright|©|\(c\))[^.?,<>]+
(?<=&copy;|copyright|©|\(c\))[^.?,<>]+
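The combined pattern can be tried out with a minimal Python sketch. This is an illustration only: the thread targets .NET, and Python's `re` module rejects the variable-width lookbehind form, so the sketch uses the non-capturing prefix with an added capture group instead. The function name `extract_after_copyright`, the case-insensitive flag, and the sample strings are assumptions, not part of the original posts.

```python
import re

# Non-capturing group for the copyright indicator, then a capture group for
# everything up to the next closing punctuation mark or tag bracket.
# IGNORECASE is an assumption so that "Copyright" and "(C)" also match.
PATTERN = re.compile(r"(?:&copy;|copyright|©|\(c\))([^.?,<>]+)", re.IGNORECASE)

def extract_after_copyright(text):
    """Return the trimmed text after the first copyright indicator, or None."""
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None

print(extract_after_copyright("<p>&copy; Acme Widgets. All rights reserved.</p>"))
print(extract_after_copyright("Copyright Example Corp, 2010"))
```

The captured text is trimmed because the character class also accepts the whitespace that follows the indicator.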


@FernandoSoto

You have to escape your parens in "(c)"  :)
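To see why the escaping matters, here is a small Python sketch (the thread itself is about .NET, but the behavior of unescaped parentheses is the same). The variable names and the sample string are made up for illustration.

```python
import re

# Unescaped: (c) is a capture group that matches a bare letter "c".
unescaped = re.compile(r"(?:\u00A9|(c)|copyright)")
# Escaped: \(c\) matches the literal three characters "(c)".
escaped = re.compile(r"(?:\u00A9|\(c\)|copyright)")

sample = "Acme (c) 2010"

# The unescaped pattern stops at the "c" inside "Acme".
print(unescaped.search(sample).span())
# The escaped pattern finds the actual "(c)" marker.
print(escaped.search(sample).span())
```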
Hi Stevod2;

If you want to add other "closing punctuation marks", add them to the character class in this part of the pattern: ([^\.>]+). For example, to stop at a comma as well, add "," to the list so that the end result is ([^\.>,]+).
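A quick way to check the effect of an extra stop character is to run both variants on the same text. The following Python sketch does that; the sample string and variable names are hypothetical, and the result is trimmed because the character class also accepts leading whitespace.

```python
import re

PREFIX = r"(?:\u00A9|\(c\)|copyright)\s*?"
stop_at_period = re.compile(PREFIX + r"([^\.>]+)")
stop_at_period_or_comma = re.compile(PREFIX + r"([^\.>,]+)")

sample = "(c) Acme Widgets, Inc. All rights reserved."

# Without the comma in the class, the capture runs up to the first period.
print(stop_at_period.search(sample).group(1).strip())
# With the comma added, the capture stops at the comma instead.
print(stop_at_period_or_comma.search(sample).group(1).strip())
```

Note the trade-off: stopping at a comma also cuts off names that contain one, such as "Acme Widgets, Inc".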

Fernando
Stevod2

ASKER

Hi Fernando.

This is definitely along the right lines. I have modified it as follows for .NET:

(?:&copy;|\u00A9|\(c\)|copyright)\s*?(?<Name>[^\.<]+)

How could I get it to ignore a year value if it occurs between the copyright symbol and the company name, and capture just the company name itself?

Thanks,
Stevod
Hmmm...  seems like everyone is in love with capture groups today...  alas.
Yes, as kaufmed pointed out, my pattern needs to escape the ( and ) around the c. The corrected pattern is:

(?:\u00A9|\(c\)|copyright)\s*?([^\.>]+)
Here you go:

(?:&copy;|\u00A9|\(c\)|copyright)\s*?\d{4}\s*?(?<Name>[^\.<]+)
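Note that this pattern requires a four-digit year to be present. If the year should only be skipped when it occurs, one option is to wrap it in an optional group. The Python sketch below shows that idea; it is an assumption about the intent rather than the accepted answer. Python writes the named group as `(?P<Name>...)` where .NET uses `(?<Name>...)`, and the function name `company_name`, the case-insensitive flag, and the sample strings are hypothetical.

```python
import re

PATTERN = re.compile(
    r"(?:&copy;|\u00A9|\(c\)|copyright)"  # copyright indicator
    r"\s*"                                 # optional whitespace
    r"(?:\d{4}\s*)?"                       # optional four-digit year
    r"(?P<Name>[^.<]+)",                   # company name up to '.' or '<'
    re.IGNORECASE,
)

def company_name(text):
    """Return the trimmed company name after a copyright indicator, or None."""
    match = PATTERN.search(text)
    return match.group("Name").strip() if match else None

print(company_name("&copy; 2010 Acme Widgets. All rights reserved."))
print(company_name("Copyright Acme Widgets<br>"))
```

This sketch does not handle year ranges such as "2010-2012"; those would need a wider optional group.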
SOLUTION
kaufmed

ASKER

Hi Guys. Great job, and I've split the points as you both contributed.

Thanks a lot.
Stevod
Not a problem, glad I was able to help.  ;=)
Not sure what I did, but thank you nonetheless  :)
@kaufmed, well, you kept me honest, and that is worth something in getting Stevod2 a solution.  ;=)