Solved

VS 2008 Looking for Regular Expressions

Posted on 2011-03-05
Last Modified: 2012-05-11
Hi, I have a string list with hundreds of elements. I want to find a rule for grouping the elements by similarity. These are the kinds of patterns that occur:


Group 1: "ABC 20", "ABC 20 Dup". The substring "Dup" is in the last position.
Group 2: "saliva 4", "saliva 4_2", "saliva 4_3", etc. An underscore plus a number is in the last position.
Group 3: "sal_1", "sal_1b", "sal_1c", etc. An extra letter is in the last position.
Group 4: "NA222", "NA222b". An extra letter is in the last position.
Group 5: "1", "1_2", "10", "10_2", etc.

I want to put them into a dictionary if they are similar. Thanks for any help with the C# code.
Question by:zhshqzyc
13 Comments
 

Author Comment

by:zhshqzyc
ID: 35044518
Maybe my grouping rules are wrong; group 5 alone is enough. I want to find the similarity between the strings and then put them into a dictionary.

My expected result is:

 
Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
dict["ABC 20"] = {"ABC 20","ABC 20 Dup"};
 dict["1"] = {"1","1_2"};


The question is how to avoid dict["1"]={"1","10"};

LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35045760
Hi!

The code you attached above should look like this:

            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC 20", "ABC 20 Dup" };
            dict["1"] = new List<string> { "1", "1_2" };



The parts to note in this code are the new List<string> { ... } initializers:
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC 20", "ABC 20 Dup" };
            dict["1"] = new List<string> { "1", "1_2" };

Now, can you elaborate a little? Suppose you have these values:
"saliva 4_2", "saliva 4_3"
under the key:
"saliva 4"

What do you want to happen next?
Please clarify this:
>>>I want to put them into a dictionary if they are similar

Author Comment

by:zhshqzyc
ID: 35047219
dict["ABC 20"]=new List<string>{"ABC 20","ABC 20 Dup"};
dict["saliva 4"]=new List<string>{"saliva 4","saliva 4_2","saliva 4_3"};

Rule: each string is either
•digits only, or
•a mix of letters, digits, underscores or white space (not purely digits)

Case 1: digits only
Add the string to the dictionary as a new key. Then search the entire string list:
  if another string starts with the key and continues with a non-digit character,
  add that string to the list referenced by the key (e.g. "10", "10_2", "10_b").
Case 2: a mix of letters, digits, underscores or white space (not purely digits)
Add the string to the dictionary as a new key. Then search the entire string list:
  if another string starts with the key,
  add that string to the list referenced by the key (e.g. "ABC 20", "ABC 20 Dup"; "N2222", "N2222b"; "Sal_1", "Sal_1b"; "saliva 4", "saliva 4_2").
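
The two cases above can be sketched as one anchored regular expression per key. This is only an illustration of the rule as stated, not code from the thread: the class name RepeatPattern and its methods are hypothetical, and Regex.Escape is added on the assumption that sample names may contain regex metacharacters.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class RepeatPattern
{
    // Builds an anchored pattern for the two cases in the rule (hypothetical helper).
    public static string Build(string key)
    {
        string escaped = Regex.Escape(key);   // match the key text literally
        bool digitsOnly = key.Length > 0 && key.All(char.IsDigit);
        return digitsOnly
            ? "^" + escaped + @"\D.*$"   // case 1: the suffix must start with a non-digit
            : "^" + escaped + ".+$";     // case 2: any non-empty suffix
    }

    // True when 'candidate' is 'key' plus a suffix allowed by the rule.
    public static bool IsRepeat(string key, string candidate)
    {
        return Regex.IsMatch(candidate, Build(key));
    }
}
```

With this sketch, RepeatPattern.IsRepeat("10", "10_2") is true, while RepeatPattern.IsRepeat("1", "10_2") is false because the character after the key is a digit.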



Author Comment

by:zhshqzyc
ID: 35047330
We are doing biology experiments. Each bio-sample has a unique ID, but the experiment may be repeated for a sample. For example, sample "1" was tested, then we had to repeat the experiment, and the second time we used "1_1" or "1_b" as the control group name for comparative analysis. Some sample names are stranger, such as "ABC 20"; its repeated experiment is marked "ABC 20 Dup". So the key should be an element of the list or array, which means you cannot split "ABC 20" into "ABC". "saliva 4" is a sample name, and the repeated experiment is named "saliva 4" plus something, such as "saliva 4_2", so you cannot split "saliva 4" into "saliva".

Also, "1" and "10" are different samples; they cannot be grouped together. "1" and "1_1" or "1_b" are the same sample used in different experiments. I think the hard part is when the sample ID is a number: the duplicate is then the number plus a suffix that starts with a non-digit character. We must avoid putting "1" and "10_2" together, even though "10_2" begins with "1", because the leftover portion "0_2" starts with a digit.
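
The requirement described here can also be sketched without regular expressions. The version below is a minimal illustration under assumptions that are not in the thread: the names SampleGrouper, IsRepeatOf and Group are hypothetical, matching is case-sensitive, shorter names are treated as sample IDs, and the "no digit directly after a digit" check is applied to every ID that ends in a digit, not only to purely numeric ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SampleGrouper
{
    // True when 'name' is 'id' plus a suffix that cannot be read as more digits of the id.
    public static bool IsRepeatOf(string id, string name)
    {
        if (id.Length == 0 || name.Length <= id.Length ||
            !name.StartsWith(id, StringComparison.Ordinal))
            return false;
        char last = id[id.Length - 1];
        char next = name[id.Length];
        // "1" vs "10_2": the id ends in a digit and the suffix begins with one, so reject.
        return !(char.IsDigit(last) && char.IsDigit(next));
    }

    public static Dictionary<string, List<string>> Group(IEnumerable<string> names)
    {
        var groups = new Dictionary<string, List<string>>();
        var keys = new List<string>();   // keys in the order they were created
        // Shorter names first, so a sample id is registered before its repeats.
        foreach (string name in names.OrderBy(n => n.Length)
                                     .ThenBy(n => n, StringComparer.Ordinal))
        {
            string key = keys.FirstOrDefault(k => k == name || IsRepeatOf(k, name));
            if (key == null)
            {
                // No existing id fits, so this name starts a new group.
                key = name;
                keys.Add(key);
                groups[key] = new List<string>();
            }
            groups[key].Add(name);
        }
        return groups;
    }
}
```

Calling SampleGrouper.Group on "1", "10", "1_1", "10_2", "ABC 20" and "ABC 20 Dup" gives three groups keyed "1", "10" and "ABC 20", and each key is also the first element of its own list, as in the expected result posted earlier.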

LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35047862
Hi!

As I understand it:
In dict[key, value]
the key refers to the original sample and the value to its duplicates. So what I need to do is group the duplicates under their samples, going from this:
key:                   Value:
dict["ABC"] =   { "ABC"}
dict["ABC Dup"] = {"ABC Dup"}
dict["1"] = {"1"}
dict["1_1"] = {"1_1"}

to this:
dict["ABC"] = {"ABC" , "ABC Dup"}
dict["1"] = {"1" , "1_1"}

Here "ABC Dup" is a duplicate of "ABC", and "1_1" is a duplicate of "1".

If there is something wrong, please correct me.

Author Comment

by:zhshqzyc
ID: 35048222
Exactly.

Author Comment

by:zhshqzyc
ID: 35064693
Here is my preliminary code. I need help with it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace Test
{
    class Program
    {
       
        static void Main(string[] args)
        {
            GetSimilarities();
        }

         static void GetSimilarities()
        {
             Dictionary<string, List<string>> dict1 = new Dictionary<string, List<string>>();

            for (int i = 0; i < header1.Length; i++)
            {
                string key = header1[i];
                bool result = key.All(Char.IsDigit);
                for (int j = 0; j < header1.Length; j++)
                {
                    string value = header1[j];
                    if (value.Length <= key.Length)
                        continue;
                    if (result == true)
                    {
                        string pattern = key + @"[^\d]+\w";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (key == "1")
                                Console.ReadLine();
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);
                            }
                        }
                    }
                    else
                    {
                        string pattern = key + @".+";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);  
                            }
                        }
                    }
                }

            }
        }

        static string[] header1 = new string[]
        { "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1", "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", "1_1", "2", "2_1", "3", "3_1", "10", "10_1" };
    }
    
}


When key ="1", the value is incorrect.
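
One likely cause, judging only from the posted code: the pattern key + @"[^\d]+\w" is not anchored, so for key "1" it also matches inside "Saliva 1_2" (the text "1_2"). Because a value is only added while the key is still absent from the dictionary, that first match is the only one stored and the real repeat "1_1" is skipped. The sketch below contrasts the posted pattern with an anchored variant; the class and method names are hypothetical, and the anchored tail \D.* is one possible choice, not the only one.

```csharp
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // The pattern from the posted code: not anchored, so it can match anywhere in 'value'.
    public static bool Unanchored(string key, string value)
    {
        return Regex.IsMatch(value, key + @"[^\d]+\w");
    }

    // Anchored variant: 'value' must start with the key, followed by a non-digit.
    public static bool Anchored(string key, string value)
    {
        return Regex.IsMatch(value, "^" + Regex.Escape(key) + @"\D.*$");
    }
}
```

AnchorDemo.Unanchored("1", "Saliva 1_2") is true, which is the unwanted match, while AnchorDemo.Anchored("1", "Saliva 1_2") is false and AnchorDemo.Anchored("1", "1_1") is true. To collect every repeat, the Add call for the value would also need to run when the key already exists.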
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35073527
Hi!

Sorry for the late response; I was busy elsewhere. I haven't tested your method yet, but I came up with this one. Here is the full version:

     
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Diagnostics;
using Microsoft.VisualBasic; // for Information.IsNumeric; add a project reference to Microsoft.VisualBasic.dll

namespace ConsoleCS
{
    class Program
    {
        static string[] header1 = new string[] {
                                                 "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", 
                                                 "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1",
                                                 "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", 
                                                 "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", 
                                                 "1_1", "2", "2_1", "3", "3_1", "10", "10_1",
                                                 "XYZ 20 DUP_2"
                                               };
        static void Main(string[] args)
        {
            Dictionary<string, List<string>> dict = GroupSamples();
            PrintValues(dict);
        }

        public static void PrintValues(Dictionary<string, List<string>> dict)
        {
            foreach (string k in dict.Keys)
            {
                Console.WriteLine(k);
                foreach (string v in dict[k])
                {
                    Console.WriteLine("    " + v);
                }
            }
        }

        static Dictionary<string, List<string>> GroupSamples()
        {
            Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();
            // Sort so that a sample name comes before its longer variants.
            Array.Sort(header1);
            for (int i = 0; i < header1.Length; i++)
            {
                string key = header1[i];
                bool isNumeric = Information.IsNumeric(key);
                bool hasUnderscore = key.Contains("_");
                // Escape the key so that any regex metacharacters in it are matched literally.
                string prefix = "(?i)^" + Regex.Escape(key);

                for (int j = i + 1; j < header1.Length; j++)
                {
                    string candidate = header1[j];
                    bool isDuplicate;
                    if (isNumeric && hasUnderscore)
                    {
                        // Kept from the original; IsNumeric is normally false for names containing "_".
                        isDuplicate = Regex.IsMatch(candidate, prefix + @"([\s\da-z]+)$");
                    }
                    else if (isNumeric)
                    {
                        // "6688" -> "6688b", "1" -> "1_1", but never "1" -> "10".
                        isDuplicate = Regex.IsMatch(candidate, prefix + @"(_\d+|[a-z]+)$");
                    }
                    else
                    {
                        // "Sal_1" -> "Sal_1b", "KA" -> "KA_2".
                        isDuplicate = Regex.IsMatch(candidate, prefix + @"(_[a-z]|[a-z]|_\d)+$");
                        // "XYZ 20" -> "XYZ 20 Dup" (only for keys without an underscore).
                        if (!isDuplicate && !hasUnderscore)
                        {
                            isDuplicate = Regex.IsMatch(candidate, prefix + @"[\sa-z\d]+$");
                        }
                    }

                    if (isDuplicate)
                    {
                        // Create the list on first use, then add each duplicate exactly once.
                        if (!dict.ContainsKey(key))
                        {
                            dict[key] = new List<string>();
                        }
                        dict[key].Add(candidate);
                    }
                }
            }
            return dict;
        }
    }
}


Author Comment

by:zhshqzyc
ID: 35074046
I already figured it out. Thank you very much.
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35074064
So is your problem solved?
LVL 19

Accepted Solution

by:
Shahan Ayyub earned 500 total points
ID: 35074089
Did you test my solution?

Author Comment

by:zhshqzyc
ID: 35087506
Your code should work, but I would like a more general solution. Your pattern uses '_', but the separator may be some other character, for example {"KA","KA*2"} instead of {"KA","KA_2"}.
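
One way to generalize the separator, offered as a sketch and not as the accepted solution: treat any single non-alphanumeric character as a separator instead of hard-coding '_'. The class name AnySeparator and the exact suffix grammar (optional letters directly after the key, then zero or more "separator + letters/digits" pieces) are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

static class AnySeparator
{
    // Suffix grammar (an assumption): optional letters right after the key ("b"),
    // then zero or more pieces made of one non-alphanumeric separator followed by
    // letters or digits ("_2", "*2", " Dup"). The lookahead requires a non-empty suffix.
    const string Suffix = @"(?=.)[A-Za-z]*(?:[^A-Za-z0-9][A-Za-z0-9]+)*";

    public static bool IsRepeat(string key, string candidate)
    {
        // Regex.Escape keeps characters such as '*' in the key from acting as operators.
        string pattern = "^" + Regex.Escape(key) + Suffix + "$";
        return Regex.IsMatch(candidate, pattern, RegexOptions.IgnoreCase);
    }
}
```

Under this grammar, AnySeparator.IsRepeat("KA", "KA*2") and AnySeparator.IsRepeat("KA", "KA_2") are both true, while AnySeparator.IsRepeat("1", "10") stays false because a digit directly after the key is neither a letter nor a separator.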

Author Comment

by:zhshqzyc
ID: 35098015
Sorry, I clicked that by mistake.
Question has a verified solution.