trevor1940


C#: Searching for multiple strings in multiple files using a regular expression

I have a bunch of HTML files in a directory.
I need to locate a number of divs within these HTML files based on the div id.
I don't know which HTML file contains a given div id.

So rather than searching every file multiple times, I thought about creating one regular expression for all the wanted div ids, opening each file once, and then doing a pattern match.
Once I've located which file contains a given div id, I can pass the HTML string into an HtmlAgilityPack.HtmlDocument.

My regex isn't working (a manual search shows the first div in the list is in the first HTML file opened).
[screenshot]
From the pic:
The Regex PidsReg = {post_message_1234|post_message_5678| etc

Sample of the code; I never get a successful match:

..........
                string[] folders = Directory.GetDirectories(RootDir, "*", SearchOption.TopDirectoryOnly);
                String Pids="";
                foreach (string folder in folders)
                {

                    FileInfo FI = new FileInfo(folder);
                    string DirectoryName = FI.Name;
                    Match DirNameMatch = DirNameReg.Match(DirectoryName);
                    if (DirectoryName.Contains(DirTextBox.Text) && DirNameMatch.Success)
                    {
                        string PID = "post_message_" + DirNameMatch.Groups[0];
                        Pids += PID + "|";
                    }
                }
                if (Pids != "")
                {
                    Pids = Pids.Remove(Pids.LastIndexOf("|"), 1);
                }
                MessageBox.Show("Pids " + Pids);
                Regex PidsReg = new Regex(Pids);

                foreach(string HTMLFile in Directory.GetFiles(RootDir + "\\00_html_Files", "*.html") )
                {
                    string HTML = File.ReadAllText(HTMLFile);
                    Match PostMatch = PidsReg.Match(HTML);
                    if (PostMatch.Success)
                    {
                        string post = PostMatch.Groups[0].Value;
                        string[] Parts = post.Split('_');
                        int myID;
                        // "post_message_1234".Split('_') gives { "post", "message", "1234" }, so the number is at index 2
                        int.TryParse(Parts[2], out myID);

                        HtmlAgilityPack.HtmlDocument Hdoc = new HtmlAgilityPack.HtmlDocument();
                        Hdoc.LoadHtml(HTML);
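                        // SelectNodes takes an XPath expression (DescendantsAndSelf expects an element name); it returns null when nothing matches
                        var DivHTML = Hdoc.DocumentNode.SelectNodes("//div[starts-with(@id,'post_message_')]");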
                        foreach (var divNodes in DivHTML)
                        {
                            var DIVs = divNodes.DescendantsAndSelf("div");
                            foreach (var div in DIVs)
                            {
                                try
                                {
                                    if (div.Attributes["id"].Value == post)
                                    {
                                      
                                        string InnerText = div.InnerText;
                                        InnerText = InnerText.Trim();

// Do Stuff with InnerText 
..........




Can anyone see why the RegEx  fails or suggest a better  way?
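
Once the file containing a wanted id is known, the div can be fetched directly by id instead of looping over every div. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the helper name `GetDivText` and the sample markup are hypothetical and not part of the original code.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

class DivLookupSketch
{
    // Returns the trimmed inner text of the div with the given id,
    // or null when the document has no such div.
    static string GetDivText(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode takes an XPath expression and returns null when nothing matches.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='" + divId + "']");
        return div?.InnerText.Trim();
    }

    static void Main()
    {
        // Hypothetical sample markup, not taken from the original files.
        string html = "<html><body><div id=\"post_message_1234\"> Hello </div></body></html>";
        Console.WriteLine(GetDivText(html, "post_message_1234") ?? "(not found)");
    }
}
```

Building the XPath by string concatenation is only reasonable here because the ids are known to be of the form post_message_ followed by digits.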
ASKER CERTIFIED SOLUTION
oBdA

Looking at the documentation, your current pattern isn't doing what you think. Unless you take the time to really dig into regular expressions, arriving at a pattern that actually works usually comes down to luck. In general, I suggest using a tool like RegExr or Regex101 to work out the pattern you're going to use.

Again, looking at the documentation, your pattern should be something similar to the following:
var Pids = "(post_message_1234|post_message_5678| etc)";



I would test that pattern using one of the tools above and several of your HTML files to ensure it works before tearing any more of your hair out in the C# debugger. Once you’ve got the pattern working, copy it back over to your code and then debug it.

Finally, if the DIV ids all conform to a specific format then your RegEx should test for that format, not for individual values. If I’m correct that format is post_message_[some number]. If that’s the case you can do away with the code that builds up the list of Pids and simply use
var pattern = @"(post_message_[0-9]+)";
var options = RegexOptions.IgnoreCase | RegexOptions.Compiled;
var PidsReg = new Regex(pattern, options);
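
If a file can contain several divs of this form and only some ids are wanted, one option is to keep the generic pattern and filter the matches against a set of wanted ids. This is a rough sketch of that idea rather than anything from the original code; the `wanted` set and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class GenericPatternSketch
{
    static void Main()
    {
        // Hypothetical wanted ids and sample HTML.
        var wanted = new HashSet<string> { "1234", "5678" };
        string html = "<div id=\"post_message_1111\">a</div><div id=\"post_message_5678\">b</div>";

        // Capture only the numeric part so it can be compared with the wanted set.
        var regex = new Regex(@"post_message_([0-9]+)",
                              RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Matches returns every occurrence in the file, not just the first one.
        foreach (Match m in regex.Matches(html))
        {
            string id = m.Groups[1].Value;
            if (wanted.Contains(id))
            {
                Console.WriteLine("Found wanted div: post_message_" + id);
            }
        }
    }
}
```

With this approach the pattern stays fixed however many ids are wanted, and the set lookup does the selection.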


This screenshot is proof that the pattern works.
[screenshot]
Good luck!
trevor1940

ASKER

Hi
I used Regex101 to test the regex.

Unfortunately I need to be more specific than var pattern = @"(post_message_[0-9]+)"; because each HTML file has up to 15 divs matching post_message_\d+

I'll try
post_message_(1234|5678)\b


Changing the RegEx to

post_message_(1234|5678)\b



highlighted the spaces in the Pids string. Once I removed them, of course, it worked.
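
A pattern of this shape can be assembled so that stray spaces never reach the regex. The sketch below assumes the ids arrive as strings that may carry leading or trailing whitespace; `rawIds` and the class name are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class PatternBuilderSketch
{
    static void Main()
    {
        // Hypothetical ids as they might come out of the directory names, with stray spaces.
        string[] rawIds = { " 1234", "5678 " };

        // Trim each id, skip empty ones, and escape any regex metacharacters.
        var ids = rawIds.Select(s => s.Trim())
                        .Where(s => s.Length > 0)
                        .Select(s => Regex.Escape(s));

        // string.Join puts "|" only between items, so there is no trailing separator to strip.
        string pattern = @"post_message_(" + string.Join("|", ids) + @")\b";
        var regex = new Regex(pattern);

        Console.WriteLine(pattern);                                        // post_message_(1234|5678)\b
        Console.WriteLine(regex.IsMatch("<div id=\"post_message_5678\">")); // True
    }
}
```

If the id list could be empty, it is worth checking for that before building the pattern, since an empty group would match any text containing post_message_ followed by a word boundary.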

Thanks for your help