Stevod2
asked on
Regex to extract Company Name from copyright
I need to extract the company name from the usual forms of copyright on a web page. The description is as follows:
1. match the copyright symbol comprising one of:
&copy;
"copyright" as text
copyright symbol as character
"(c)" as text
2. Capture all the text between the copyright symbol and the next closing punctuation mark or tag.
Not really sure how to start!
Thanks
Stevod2
Start with your copyright options:
(?<=&copy;|copyright|©|\(c\)) -- match any of the copyright indicators
Then capture all the text
[^.?,<>]+ -- match one or more ( + ) characters which are NOT ( [^] ) mentioned here ( .?,<> )
Then put it together
(?:&copy;|copyright|©|\(c\))[^.?,<>]+
(?<=&copy;|copyright|©|\(c\))[^.?,<>]+
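For anyone who wants to try the combined pattern outside .NET, here is a minimal Python sketch. It is an illustration under stated assumptions, not the pattern from the post itself: Python's re module rejects a lookbehind whose alternatives differ in length, so the markers sit in a non-capturing group and the name is read from capture group 1. The function name extract_after_marker, the sample strings, and the case-insensitive flag are all assumptions added for the example.

```python
import re

# Marker alternatives from the thread. The text after the marker is captured in
# group 1 because Python's re does not accept a variable-width lookbehind.
# re.IGNORECASE is an assumption so that "Copyright" and "(C)" also match.
COPYRIGHT_RE = re.compile(r"(?:&copy;|copyright|©|\(c\))([^.?,<>]+)", re.IGNORECASE)


def extract_after_marker(html):
    """Return the text after the first copyright marker, or None if there is none."""
    match = COPYRIGHT_RE.search(html)
    # strip() removes the whitespace that sits between the marker and the name
    return match.group(1).strip() if match else None


print(extract_after_marker("<p>&copy; Acme Widgets. All rights reserved.</p>"))
print(extract_after_marker("<div>Copyright Example Corp, Inc</div>"))
```

The capture stops at the first character listed in the negated class, so a name that itself contains a comma or a period will be cut short.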
@FernandoSoto
You have to escape your parens in "(c)" :)
Hi Stevod2;
If you want to add other "closing punctuation marks", add them to this part of the pattern: ([^\.>]+). For example, to add the comma, ",", to the list, change that part to ([^\.>,]+).
Fernando
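The idea of growing the negated character class can be sketched in Python as follows. This is only an illustration: build_pattern and the sample text are hypothetical names and data, and the sketch uses a greedy \s* rather than the lazy \s*? seen elsewhere in the thread, so that the whitespace after the marker does not end up inside the captured name.

```python
import re


def build_pattern(terminators):
    """Build the marker pattern with a configurable set of closing characters."""
    # re.escape keeps characters such as "]" or "^" literal inside the class
    char_class = "".join(re.escape(ch) for ch in terminators)
    return re.compile(r"(?:\u00A9|\(c\)|copyright)\s*([^" + char_class + r"]+)")


text = "\u00A9 Acme Widgets, Inc. All rights reserved."
print(build_pattern(".>").search(text).group(1))   # capture ends at the first "."
print(build_pattern(".>,").search(text).group(1))  # capture ends at the first ","
```

Each character added to the list makes the capture stop earlier, so adding the comma trades "Acme Widgets, Inc" for "Acme Widgets".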
ASKER
Hi Fernando.
This is definitely along the right lines. I have modified it as follows for .NET:
(?:&copy;|\u00A9|\(c\)|copyright)\s*?(?<Name>[^\.<]+)
How could I get it to ignore a year value if it occurs between the copyright symbol and the company name, and just capture the company name itself?
Thanks,
Stevod
Hmmm... seems like everyone is in love with capture groups today... alas.
Yes, as kaufmed pointed out, my pattern does need to escape the ( and ) around the c, so to correct my post it should be:
(?:\u00A9|\(c\)|copyright)\s*?([^\.>]+)
Here you go:
(?:&copy;|\u00A9|\(c\)|copyright)\s*?\d{4}\s*?(?<Name>[^\.<]+)
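A possible variation, sketched in Python: the pattern above requires a four-digit year, so a notice without one would not match. The sketch below wraps the year in an optional group, which is an assumption on my part and not part of the posted pattern. Python spells the named group as (?P<Name>...) where .NET uses (?<Name>...), and company_name, the sample strings, and the case-insensitive flag are hypothetical additions.

```python
import re

NAME_RE = re.compile(
    r"(?:&copy;|\u00A9|\(c\)|copyright)"  # any of the copyright markers
    r"\s*(?:\d{4}\s*)?"                    # assumption: at most one optional 4-digit year
    r"(?P<Name>[^.<]+)",                   # company name, up to the first "." or "<"
    re.IGNORECASE,                         # assumption: accept "Copyright" and "(C)" too
)


def company_name(text):
    """Return the company name from a copyright notice, or None if no marker is found."""
    match = NAME_RE.search(text)
    return match.group("Name").strip() if match else None


print(company_name("\u00A9 2011 Acme Corp. All rights reserved."))
print(company_name("Copyright Acme Corp<br>"))
```

Year ranges such as "2009-2011" are not handled by this sketch; only the first four digits would be skipped.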
ASKER
Hi guys. Great job, and I've split the points as you both contributed.
Thanks a lot.
Stevod
Not a problem, glad I was able to help. ;=)
Not sure what I did, but thank you nonetheless :)
@kaufmed, well, you kept me honest, and that is worth something in getting Stevod2 a solution. ;=)
:D
Try this string pattern.
(?:\u00A9|(c)|copyright)\s
Fernando