Solved

VS 2008 Looking for Regular Expressions

Posted on 2011-03-05
13
410 Views
Last Modified: 2012-05-11
Hi, I have a string list which has hundrens of elements. Now I want to find a rule to distinguish them according to their similarity. There are these kinds of pattern.


Group 1: "ABC 20", "ABC 20 Dup". There is a substring "DUP" in the last position.
Group 2: "saliva 4", "saliva 4_2","siliva 4_3", etc. There is a substring "_" plus a number in the last position.
Group 3: "sal_1", "sal_1b","sal_1c", etc. There is a character in the last popsition.
Group 4: "NA222", "NA222b", there is a character in the last position.
Group 5: "1","1_2","10","10_2" etc.

I want to put them into a dictionary if they are similar. Thanks for help with C# code.
0
Comment
Question by:zhshqzyc
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 8
  • 5
13 Comments
 

Author Comment

by:zhshqzyc
ID: 35044518
Maybe my group rule is wrong.
 only group 5 is enough. I want to find the similarty of the strings then put into a dictionary.

My expected result is:

 
Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
dict["ABC 20"] = {"ABC 20","ABC 20 Dup"};
 dict["1"] = {"1","1_2"};

Open in new window

The question is how to avoid dict["1"]={"1","10"};

0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35045760
Hi!

Your above attached code should be like this:

            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC 20 DUP" };
            dict["1"] = new List<string> { "1", "1_2" };

Open in new window


In the code these part are considerable:
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC DUP 20","Dup 20" };
            dict["1"] = new List<string> { "1", "1_2" };

Now, Can you elaborate a little bit that if you have these values:
"saliva 4_2","siliva 4_3"
under the key:
"saliva 4"

Then what you want next ??
Please clarify this:
>>>I want to put them into a dictionary if they are similar
0
 

Author Comment

by:zhshqzyc
ID: 35047219
dict["ABC 20"]=new List<string>{"ABC 20","ABC 20 Dup"};
dict["saliva 4"]=new List<string>{"saliva 4","saliva 4_2","siliva 4_3"};

Rule:
•Only digits or
•mix letters and digits or underscore(non pure digits)

Case 1: digits only
add the string to the dictionary as a new key.Search the entire string list,
  if a string is found for the portion before non digit
  add the string to the list referenced by the found key.(ex. "10","10_2","10_b")
Case 2: mix letters and digits or underscore/white space(non pure digits)
 add the string to the dictionary as a new key.Search the entire string list,
 if a string is found for the portion
  just add the string to the list referenced by the found key.(ex. "ABC 20","ABC 20 Dup")("N2222","N2222b")("Sal_1","Sal_1b")("saliva 4","saliva 4_2")


0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:zhshqzyc
ID: 35047330
We are doing biology experiments. Each bio-sample has an uniqe id. But the experiment may be repeated for each sample. For example  sample "1" was tested then we have to duplicate the experiment, the second time we used "1_1" or "1_b" as control group name for analysis comparation. These sample names are stranger such as "ABC 20", then duplicated experiment is marked "ABC 20 Dup". So the key should be an element of the list or array. Thus you can not split "ABC 20" to "ABC". "saliva 4" is a sample name, the duplicate experiment should be named as  "saliva 4" plus something such as "saliva 4_2", so you can not split "sliva 4" to "saliva".

And also "1" and "10" are different sample, they can not grouped together. "1" and "1_1" or "1_b" are the same sample but used in different experiments. I think the hard part is when the sample id is a number, then the duplicate one is the number plus nondigit character. But we should prevent to put "1" and "10_2" together although portion "0_2" is a pure digit.

0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35047862
Hi!

As I understood:
In dict[key, value]
Here key refers original sample and value refers the duplicate. So what i need to do is to group duplicates under their samples like this:
key:                   Value:
dict["ABC"] =   { "ABC"}
dict["ABC Dup"] = {"ABC Dup"}
dict["1"] = {"1"}
dict["1_1"] = {"1_1"}

to this:
dict["ABC"] = {"ABC" , "ABC Dup"}
dict["1"] = {"1" , "1_1"}

Here ABC Dup is a duplicate of ABC and 1_1 is a duplicate of 1.

If their is something wrong please correct it.
0
 

Author Comment

by:zhshqzyc
ID: 35048222
Exactly.
0
 

Author Comment

by:zhshqzyc
ID: 35064693
My primary code. Need help.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace Test
{
    class Program
    {
       
        static void Main(string[] args)
        {
            GetSimilarities();
        }

         static void GetSimilarities()
        {
             Dictionary<string, List<string>> dict1 = new Dictionary<string, List<string>>();

            for (int i = 0; i < header1.Length; i++)
            {
                string key = header1[i];
                bool result = key.All(Char.IsDigit);
                for (int j = 0; j < header1.Length; j++)
                {
                    string value = header1[j];
                    if (value.Length <= key.Length)
                        continue;
                    if (result == true)
                    {
                        string pattern = key + @"[^\d]+\w";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (key == "1")
                                Console.ReadLine();
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);
                            }
                        }
                    }
                    else
                    {
                        string pattern = key + @".+";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);  
                            }
                        }
                    }
                }

            }
        }

        static string[] header1 = new string[]
        { "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1", "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", "1_1", "2", "2_1", "3", "3_1", "10", "10_1" };
    }
    
}

Open in new window

When key ="1", the value is incorrect.
0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35073527
Hi!

Sorry for the late response, i was busy somewhere. I didn't test your method yet, but came up with this one. A full version:

     
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Diagnostics;
using Microsoft.VisualBasic; 

namespace ConsoleCS
{
    class Program
    {
        static string[] header1 = new string[] {
                                                 "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", 
                                                 "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1",
                                                 "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", 
                                                 "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", 
                                                 "1_1", "2", "2_1", "3", "3_1", "10", "10_1",
                                                 "XYZ 20 DUP_2"
                                               };
        static void Main(string[] args)
        {
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict = GroupSamples();
            PrintValues(dict);
        }

        public static void PrintValues(Dictionary<string, List<string>> dict)
        {
            foreach (string k in dict.Keys)
            {
                Console.WriteLine(k);
                foreach (string v in dict[k])
                {
                    Console.WriteLine("    " + v);
                }
            }
        }

        static Dictionary<string, List<string>> GroupSamples()
        {
            Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();
            Array.Sort(header1);
            for (int i = 0; i < header1.Length; i++)
            {
                for (int j = i + 1; j < header1.Length; j++)
                {
                    if (!Information.IsNumeric(header1[i]))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + "(_[a-z]|[a-z]|_\\d)+$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if (Information.IsNumeric(header1[i]) && header1[i].Contains("_"))                        
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"([\s\da-z]+)$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if(Information.IsNumeric(header1[i]) && !header1[i].Contains("_"))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"(_\d+|[a-z]+)$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if (!Information.IsNumeric(header1[i]) && !header1[i].Contains("_"))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"[\sa-z\d]+$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }

                }
            }
            return dict;
        }
    }
}

Open in new window

0
 

Author Comment

by:zhshqzyc
ID: 35074046
I already figured it out. Thank you very much.
0
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35074064
So your problem solved ???
0
 
LVL 19

Accepted Solution

by:
Shahan Ayyub earned 500 total points
ID: 35074089
Did you test my solution ???
0
 

Author Comment

by:zhshqzyc
ID: 35087506
Your code should work but I would like a general case. In your pattern you used '_', but it may be other char. For example, {"KA","KA*2}  instead of {"KA","KA_2"}.
0
 

Author Comment

by:zhshqzyc
ID: 35098015
Sorry, it is a mistakenly hit.
0

Featured Post

Salesforce Has Never Been Easier

Improve and reinforce salesforce training & adoption using WalkMe's digital adoption platform. Start saving on costly employee training by creating fast intuitive Walk-Thrus for Salesforce. Claim your Free Account Now

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

We all know that functional code is the leg that any good program stands on when it comes right down to it, however, if your program lacks a good user interface your product may not have the appeal needed to keep your customers happy. This issue can…
Exception Handling is in the core of any application that is able to dignify its name. In this article, I'll guide you through the process of writing a DRY (Don't Repeat Yourself) Exception Handling mechanism, using Aspect Oriented Programming.
With Secure Portal Encryption, the recipient is sent a link to their email address directing them to the email laundry delivery page. From there, the recipient will be required to enter a user name and password to enter the page. Once the recipient …
This video shows how to use Hyena, from SystemTools Software, to update 100 user accounts from an external text file. View in 1080p for best video quality.

732 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question