Solved

VS 2008 Looking for Regular Expressions

Posted on 2011-03-05
13
408 Views
Last Modified: 2012-05-11
Hi, I have a string list which has hundrens of elements. Now I want to find a rule to distinguish them according to their similarity. There are these kinds of pattern.


Group 1: "ABC 20", "ABC 20 Dup". There is a substring "DUP" in the last position.
Group 2: "saliva 4", "saliva 4_2","siliva 4_3", etc. There is a substring "_" plus a number in the last position.
Group 3: "sal_1", "sal_1b","sal_1c", etc. There is a character in the last popsition.
Group 4: "NA222", "NA222b", there is a character in the last position.
Group 5: "1","1_2","10","10_2" etc.

I want to put them into a dictionary if they are similar. Thanks for help with C# code.
0
Comment
Question by:zhshqzyc
  • 8
  • 5
13 Comments
 

Author Comment

by:zhshqzyc
ID: 35044518
Maybe my group rule is wrong.
 only group 5 is enough. I want to find the similarty of the strings then put into a dictionary.

My expected result is:

 
Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
dict["ABC 20"] = {"ABC 20","ABC 20 Dup"};
 dict["1"] = {"1","1_2"};

Open in new window

The question is how to avoid dict["1"]={"1","10"};

0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35045760
Hi!

Your above attached code should be like this:

            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC 20 DUP" };
            dict["1"] = new List<string> { "1", "1_2" };

Open in new window


In the code these part are considerable:
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC DUP 20","Dup 20" };
            dict["1"] = new List<string> { "1", "1_2" };

Now, Can you elaborate a little bit that if you have these values:
"saliva 4_2","siliva 4_3"
under the key:
"saliva 4"

Then what you want next ??
Please clarify this:
>>>I want to put them into a dictionary if they are similar
0
 

Author Comment

by:zhshqzyc
ID: 35047219
dict["ABC 20"]=new List<string>{"ABC 20","ABC 20 Dup"};
dict["saliva 4"]=new List<string>{"saliva 4","saliva 4_2","siliva 4_3"};

Rule:
•Only digits or
•mix letters and digits or underscore(non pure digits)

Case 1: digits only
add the string to the dictionary as a new key.Search the entire string list,
  if a string is found for the portion before non digit
  add the string to the list referenced by the found key.(ex. "10","10_2","10_b")
Case 2: mix letters and digits or underscore/white space(non pure digits)
 add the string to the dictionary as a new key.Search the entire string list,
 if a string is found for the portion
  just add the string to the list referenced by the found key.(ex. "ABC 20","ABC 20 Dup")("N2222","N2222b")("Sal_1","Sal_1b")("saliva 4","saliva 4_2")


0
Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 

Author Comment

by:zhshqzyc
ID: 35047330
We are doing biology experiments. Each bio-sample has an uniqe id. But the experiment may be repeated for each sample. For example  sample "1" was tested then we have to duplicate the experiment, the second time we used "1_1" or "1_b" as control group name for analysis comparation. These sample names are stranger such as "ABC 20", then duplicated experiment is marked "ABC 20 Dup". So the key should be an element of the list or array. Thus you can not split "ABC 20" to "ABC". "saliva 4" is a sample name, the duplicate experiment should be named as  "saliva 4" plus something such as "saliva 4_2", so you can not split "sliva 4" to "saliva".

And also "1" and "10" are different sample, they can not grouped together. "1" and "1_1" or "1_b" are the same sample but used in different experiments. I think the hard part is when the sample id is a number, then the duplicate one is the number plus nondigit character. But we should prevent to put "1" and "10_2" together although portion "0_2" is a pure digit.

0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35047862
Hi!

As I understood:
In dict[key, value]
Here key refers original sample and value refers the duplicate. So what i need to do is to group duplicates under their samples like this:
key:                   Value:
dict["ABC"] =   { "ABC"}
dict["ABC Dup"] = {"ABC Dup"}
dict["1"] = {"1"}
dict["1_1"] = {"1_1"}

to this:
dict["ABC"] = {"ABC" , "ABC Dup"}
dict["1"] = {"1" , "1_1"}

Here ABC Dup is a duplicate of ABC and 1_1 is a duplicate of 1.

If their is something wrong please correct it.
0
 

Author Comment

by:zhshqzyc
ID: 35048222
Exactly.
0
 

Author Comment

by:zhshqzyc
ID: 35064693
My primary code. Need help.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace Test
{
    class Program
    {
       
        static void Main(string[] args)
        {
            GetSimilarities();
        }

         static void GetSimilarities()
        {
             Dictionary<string, List<string>> dict1 = new Dictionary<string, List<string>>();

            for (int i = 0; i < header1.Length; i++)
            {
                string key = header1[i];
                bool result = key.All(Char.IsDigit);
                for (int j = 0; j < header1.Length; j++)
                {
                    string value = header1[j];
                    if (value.Length <= key.Length)
                        continue;
                    if (result == true)
                    {
                        string pattern = key + @"[^\d]+\w";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (key == "1")
                                Console.ReadLine();
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);
                            }
                        }
                    }
                    else
                    {
                        string pattern = key + @".+";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);  
                            }
                        }
                    }
                }

            }
        }

        static string[] header1 = new string[]
        { "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1", "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", "1_1", "2", "2_1", "3", "3_1", "10", "10_1" };
    }
    
}

Open in new window

When key ="1", the value is incorrect.
0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35073527
Hi!

Sorry for the late response, i was busy somewhere. I didn't test your method yet, but came up with this one. A full version:

     
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Diagnostics;
using Microsoft.VisualBasic; 

namespace ConsoleCS
{
    class Program
    {
        static string[] header1 = new string[] {
                                                 "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", 
                                                 "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1",
                                                 "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", 
                                                 "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", 
                                                 "1_1", "2", "2_1", "3", "3_1", "10", "10_1",
                                                 "XYZ 20 DUP_2"
                                               };
        static void Main(string[] args)
        {
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict = GroupSamples();
            PrintValues(dict);
        }

        public static void PrintValues(Dictionary<string, List<string>> dict)
        {
            foreach (string k in dict.Keys)
            {
                Console.WriteLine(k);
                foreach (string v in dict[k])
                {
                    Console.WriteLine("    " + v);
                }
            }
        }

        static Dictionary<string, List<string>> GroupSamples()
        {
            Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();
            Array.Sort(header1);
            for (int i = 0; i < header1.Length; i++)
            {
                for (int j = i + 1; j < header1.Length; j++)
                {
                    if (!Information.IsNumeric(header1[i]))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + "(_[a-z]|[a-z]|_\\d)+$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if (Information.IsNumeric(header1[i]) && header1[i].Contains("_"))                        
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"([\s\da-z]+)$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if(Information.IsNumeric(header1[i]) && !header1[i].Contains("_"))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"(_\d+|[a-z]+)$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if (!Information.IsNumeric(header1[i]) && !header1[i].Contains("_"))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"[\sa-z\d]+$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }

                }
            }
            return dict;
        }
    }
}

Open in new window

0
 

Author Comment

by:zhshqzyc
ID: 35074046
I already figured it out. Thank you very much.
0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35074064
So your problem solved ???
0
 
LVL 19

Accepted Solution

by:
Shahan Ayyub earned 500 total points
ID: 35074089
Did you test my solution ???
0
 

Author Comment

by:zhshqzyc
ID: 35087506
Your code should work but I would like a general case. In your pattern you used '_', but it may be other char. For example, {"KA","KA*2}  instead of {"KA","KA_2"}.
0
 

Author Comment

by:zhshqzyc
ID: 35098015
Sorry, it is a mistakenly hit.
0

Featured Post

Active Directory Webinar

We all know we need to protect and secure our privileges, but where to start? Join Experts Exchange and ManageEngine on Tuesday, April 11, 2017 10:00 AM PDT to learn how to track and secure privileged users in Active Directory.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Entity Framework is a powerful tool to help you interact with the DataBase but still doesn't help much when we have a Stored Procedure that returns more than one resultset. The solution takes some of out-of-the-box thinking; read on!
This article aims to explain the working of CircularLogArchiver. This tool was designed to solve the buildup of log file in cases where systems do not support circular logging or where circular logging is not enabled
Established in 1997, Technology Architects has become one of the most reputable technology solutions companies in the country. TA have been providing businesses with cost effective state-of-the-art solutions and unparalleled service that is designed…
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…

828 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question