  • Status: Solved
  • Priority: Medium
  • Security: Public

Regex.Match question C#

Hello,

I need to scrape comments from this page:
http://www.youtube.com/watch_ajax?v=LfLFMf4chBc&action_get_comments=1&p=0&commentthreshold=-5&commentfilter=0&page_size=10

The comments are giving me trouble.
What I have is:
if (!String.IsNullOrEmpty(input))
{
    // This pattern expects the two <div> tags to be directly adjacent.
    MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\"><div >([^']*?)</div>");
    if (collection.Count > 0) // > 1 would skip a page with a single comment
    {
        for (int i = 0; i < collection.Count; i++)
        {
            currentComment = Regex.Match(collection[i].Value, "<div >([^']*?)</div>").Groups[1].Value;
            // && rather than ||: with || the condition is true for every string.
            if (currentComment != "Show" && currentComment != "Hide")
            {
                cache = cache + "#" + currentComment;
                counter();
            }
        }
        //...process data
    }
}

But Regex.Matches always comes back with no matches, because the markup of the page is:
                <div class="watch-comment-body">
                    <div >

                        the comment text I want to get is here
                    </div>

It's split across several lines, and that is my problem. Any ideas on how I could fix this?
Asked by: GVNPublic123
1 Solution
 
Gary Davis (Dir Internet Svcs) commented:
There is white space before the second <div>, but your first pattern does not account for it, so the match fails. Remove the <div > from the pattern:
 Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");
 
Gary Davis

GVNPublic123 (Author) commented:
It didn't work. I think the pattern is still wrong for the markup on the site and for how Regex operates.
 
Gary Davis (Dir Internet Svcs) commented:
I often use a tool called Expresso to help debug and test regular expressions. It is free and available at http://www.ultrapico.com/.
I ran this regex:
<div class=\"watch-comment-body\">([^']*?)</div>
against your sample data and it matched, setting the captured group to everything from the second <div> up to, but not including, the first (and only) </div>. Maybe not what you wanted, but it did match, and the data being on multiple lines did not matter (a negated class such as [^'] matches newlines too).
Changing the regex to this:
<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>
will capture just the string "the comment text I want to get is here", and maybe that's what you want.
One thing to point out: the second <div > in your example has a space after the v, which this pattern does not allow for. Code for it if you need to.
Gary
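
A minimal sketch of the whitespace-tolerant pattern described above, assuming the markup looks like the sample in the question. The names CommentPatternDemo and ExtractFirstComment are made up for illustration, and <div\s*> is one way (an assumption, not something tested against the live page) to also accept the "<div >" form with a space after the v.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class CommentPatternDemo
{
    // \s* absorbs the newlines and indentation between the tags and around the text.
    // <div\s*> accepts both "<div>" and "<div >".
    private const string Pattern =
        "<div class=\"watch-comment-body\">\\s*<div\\s*>\\s*([^']*?)\\s*</div>";

    // Returns the text of the first comment, or an empty string when nothing matches.
    public static string ExtractFirstComment(string html)
    {
        Match match = Regex.Match(html, Pattern);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Calling CommentPatternDemo.ExtractFirstComment on the sample markup from the question should give back only the comment text, without the surrounding blank lines, because the \s* on either side of the group consumes them.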
 
GVNPublic123 (Author) commented:
<div class=\"watch-comment-body\">([^']*?)</div> = works, but results in a lot of white space (empty lines).

<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div> = fails to compile with the error:
Error      1      Unrecognized escape sequence      (underlines the s in \s)
 
Gary Davis (Dir Internet Svcs) commented:
When you need backslashes in strings, use the @ before the 1st quote:
Regex.Matches(input, @"<div class=\"watch-comment-body\">\s*<div>\s*([^']*?)\s*</div>");
Or double up the backslashes.
 

GVNPublic123 (Author) commented:
I tried @, but then the \ before the " doesn't do the trick anymore: \"watch-comment-body\". How do I fix that?
 
Gary Davis (Dir Internet Svcs) commented:
OK. Leave the @ off and double up the backslash for the \s. You are then escaping the backslash (\\) just like you are escaping the quote (\").
 
Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>");
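
For comparison, here is a small sketch of the two ways to write this pattern as a C# string literal. In a regular literal the quote is escaped as \" and each regex backslash is doubled; in a verbatim (@) literal backslashes are left alone and a quote is written as "". The class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original code.
public static class PatternLiteralDemo
{
    // Regular literal: \" escapes the quote, \\ yields one backslash for the regex.
    public const string Regular =
        "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>";

    // Verbatim literal: backslashes are taken as written, a quote is doubled ("").
    public const string Verbatim =
        @"<div class=""watch-comment-body"">\s*<div>\s*([^']*?)\s*</div>";

    // True when both literals produce exactly the same pattern text.
    public static bool LiteralsAgree()
    {
        return string.Equals(Regular, Verbatim, StringComparison.Ordinal);
    }

    // True when the pattern finds at least one comment in the given markup.
    public static bool HasComment(string html)
    {
        return Regex.IsMatch(html, Verbatim);
    }
}
```

Either form can be passed to Regex.Matches; which one to use is mostly a matter of which escaping is easier to read.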

GVNPublic123 (Author) commented:
Yes, but with MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">\\s*<div>\\s*([^']*?)\\s*</div>"); I get no matches.
 
Gary Davis (Dir Internet Svcs) commented:
Did you remove the space after the v in <div > in the data?
Or add a space or \\s* after div in the pattern.
 
GVNPublic123 (Author) commented:
Ah, screw that, it doesn't work either.

Now I have:
MatchCollection collection = Regex.Matches(input, "<div class=\"watch-comment-body\">([^']*?)</div>");

But then, in the end, I get the matches, and when I write them to a file, the file looks like:
empty space                                                                             comment
empty space                                                                             comment

I tried match[i].Trim();, but it didn't work. Please help.
 
GVNPublic123 (Author) commented:
Ah, putting the value into a new string and trimming that solved the problem. Thank you very much, guys!
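
The thread does not show the final code, so the following is only a sketch of what "a new string and trimming it" could look like. Trim() is a method of string, not of Match, and it returns a new string rather than changing the existing one, so its result has to be assigned. The class and method names and the exact pattern are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not the asker's actual code.
public static class CommentTrimDemo
{
    // Returns every captured comment with leading and trailing whitespace removed.
    public static List<string> ExtractComments(string html)
    {
        var comments = new List<string>();
        MatchCollection collection = Regex.Matches(
            html, "<div class=\"watch-comment-body\">\\s*<div\\s*>([^']*?)</div>");

        foreach (Match match in collection)
        {
            // The captured group still contains the newlines and indentation
            // around the text; Trim() returns a cleaned copy, which is kept.
            string comment = match.Groups[1].Value.Trim();
            comments.Add(comment);
        }
        return comments;
    }
}
```

Each entry of the returned list can then be written to the file without the leading empty space seen in the output above.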

GVNPublic123 (Author) commented:
Good answer
