Solved

VS 2008 Looking for Regular Expressions

Posted on 2011-03-05
Last Modified: 2012-05-11
Hi, I have a string list with hundreds of elements. I want to find a rule that groups them by similarity. The patterns are as follows.


Group 1: "ABC 20", "ABC 20 Dup". The substring "Dup" is appended at the end.
Group 2: "saliva 4", "saliva 4_2", "siliva 4_3", etc. An underscore plus a number is appended at the end.
Group 3: "sal_1", "sal_1b", "sal_1c", etc. A single letter is appended at the end.
Group 4: "NA222", "NA222b". A single letter is appended at the end.
Group 5: "1", "1_2", "10", "10_2", etc.

I want to put similar strings into the same dictionary entry. Thanks for any help with C# code.
Question by:zhshqzyc
13 Comments
 

Author Comment

by:zhshqzyc
ID: 35044518
Maybe my group rules are wrong. Group 5 alone is enough. I want to find the similarity between the strings and then put them into a dictionary.

My expected result is:

 
Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
dict["ABC 20"] = {"ABC 20","ABC 20 Dup"};
 dict["1"] = {"1","1_2"};


The question is how to avoid dict["1"]={"1","10"};

 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35045760
Hi!

Your code above should look like this:

            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC 20 DUP" };
            dict["1"] = new List<string> { "1", "1_2" };



In the code, these parts are worth noting:
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict["ABC 20"] = new List<string> { "ABC DUP 20","Dup 20" };
            dict["1"] = new List<string> { "1", "1_2" };

Now, can you elaborate a little? Suppose you have these values:
"saliva 4_2","siliva 4_3"
under the key:
"saliva 4"

What do you want to happen next?
Please clarify this:
>>>I want to put them into a dictionary if they are similar
 

Author Comment

by:zhshqzyc
ID: 35047219
dict["ABC 20"]=new List<string>{"ABC 20","ABC 20 Dup"};
dict["saliva 4"]=new List<string>{"saliva 4","saliva 4_2","siliva 4_3"};

Rule: a string is either
•digits only, or
•a mix of letters, digits, or underscores (not purely digits)

Case 1: digits only
Add the string to the dictionary as a new key. Search the entire string list;
  if a string is found whose portion before the first non-digit equals the key,
  add that string to the list referenced by the key (e.g. "10", "10_2", "10_b").
Case 2: a mix of letters, digits, underscores, or white space (not purely digits)
 Add the string to the dictionary as a new key. Search the entire string list;
 if a string is found that begins with the key,
  add that string to the list referenced by the key (e.g. "ABC 20", "ABC 20 Dup"; "N2222", "N2222b"; "Sal_1", "Sal_1b"; "saliva 4", "saliva 4_2").


 

Author Comment

by:zhshqzyc
ID: 35047330
We are doing biology experiments. Each bio-sample has a unique ID, but the experiment may be repeated for a sample. For example, sample "1" was tested and then we had to repeat the experiment; the second time we used "1_1" or "1_b" as the control group name for comparative analysis. Some sample names are stranger, such as "ABC 20", whose repeated experiment is marked "ABC 20 Dup". So the key should be an element of the list or array, and you cannot split "ABC 20" into "ABC". "saliva 4" is a sample name, and the duplicate experiment should be named "saliva 4" plus something, such as "saliva 4_2", so you cannot split "saliva 4" into "saliva".

Also, "1" and "10" are different samples, so they cannot be grouped together. "1" and "1_1" or "1_b" are the same sample used in different experiments. I think the hard part is when the sample ID is a number, because the duplicate is the number plus a non-digit character. We must avoid putting "1" and "10_2" together, even though "10_2" begins with "1".
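
The requirement described here can be sketched without regular expressions. The following is only one possible reading of the rule, not code from this thread: a name is treated as a repeat of a sample ID when it starts with that ID, is longer than it, and the boundary does not fall inside a run of digits (so "10" is not a repeat of "1"). The names SampleGrouper, IsRepeatOf and Group are hypothetical, and the comparison is assumed to be case-sensitive.

```csharp
using System;
using System.Collections.Generic;

static class SampleGrouper
{
    // Hypothetical helper: true when candidate looks like a repeat of key.
    // Assumed rule: candidate starts with key, is longer than key, and the
    // boundary does not split a run of digits ("1" vs "10").
    public static bool IsRepeatOf(string key, string candidate)
    {
        if (key.Length == 0 || candidate.Length <= key.Length) return false;
        if (!candidate.StartsWith(key, StringComparison.Ordinal)) return false;
        bool keyEndsWithDigit = char.IsDigit(key[key.Length - 1]);
        bool nextIsDigit = char.IsDigit(candidate[key.Length]);
        return !(keyEndsWithDigit && nextIsDigit);
    }

    public static Dictionary<string, List<string>> Group(IEnumerable<string> names)
    {
        // Shortest names first, so a sample ID is seen before its repeats.
        List<string> sorted = new List<string>(names);
        sorted.Sort(delegate(string a, string b)
        {
            int byLength = a.Length.CompareTo(b.Length);
            return byLength != 0 ? byLength : string.CompareOrdinal(a, b);
        });

        Dictionary<string, List<string>> groups = new Dictionary<string, List<string>>();
        List<string> keys = new List<string>(); // keys in the order they were created
        foreach (string name in sorted)
        {
            if (groups.ContainsKey(name)) continue; // skip exact duplicates

            string owner = null;
            foreach (string key in keys)
            {
                if (IsRepeatOf(key, name)) { owner = key; break; }
            }

            if (owner == null)
            {
                // No existing sample claims this name: it becomes a new key.
                groups[name] = new List<string>();
                groups[name].Add(name);
                keys.Add(name);
            }
            else
            {
                groups[owner].Add(name);
            }
        }
        return groups;
    }

    static void Main()
    {
        string[] names = { "1", "1_1", "10", "10_1", "XYZ 20", "XYZ 20 Dup", "6688", "6688b" };
        foreach (KeyValuePair<string, List<string>> pair in Group(names))
        {
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value.ToArray()));
        }
    }
}
```

With the sample names in Main, "1_1" lands under "1" and "10_1" under "10", while "10" becomes its own key. A name that extends two existing keys goes to the shortest one, which is a design choice of this sketch rather than something the thread specifies.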

 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35047862
Hi!

As I understand it:
In dict[key, value],
the key refers to the original sample and the value refers to its duplicates. So what I need to do is group the duplicates under their samples, going from this:
key:                   Value:
dict["ABC"] =   { "ABC"}
dict["ABC Dup"] = {"ABC Dup"}
dict["1"] = {"1"}
dict["1_1"] = {"1_1"}

to this:
dict["ABC"] = {"ABC" , "ABC Dup"}
dict["1"] = {"1" , "1_1"}

Here "ABC Dup" is a duplicate of "ABC" and "1_1" is a duplicate of "1".

If there is something wrong, please correct it.
 

Author Comment

by:zhshqzyc
ID: 35048222
Exactly.

Author Comment

by:zhshqzyc
ID: 35064693
My preliminary code. I need help.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace Test
{
    class Program
    {
       
        static void Main(string[] args)
        {
            GetSimilarities();
        }

         static void GetSimilarities()
        {
             Dictionary<string, List<string>> dict1 = new Dictionary<string, List<string>>();

            for (int i = 0; i < header1.Length; i++)
            {
                string key = header1[i];
                bool result = key.All(Char.IsDigit);
                for (int j = 0; j < header1.Length; j++)
                {
                    string value = header1[j];
                    if (value.Length <= key.Length)
                        continue;
                    if (result == true)
                    {
                        string pattern = key + @"[^\d]+\w";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (key == "1")
                                Console.ReadLine();
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);
                            }
                        }
                    }
                    else
                    {
                        string pattern = key + @".+";
                        if (Regex.IsMatch(value, pattern))
                        {
                            if (!dict1.ContainsKey(key))
                            {
                                dict1[key] = new List<string>();
                                dict1[key].Add(key);
                                dict1[key].Add(value);  
                            }
                        }
                    }
                }

            }
        }

        static string[] header1 = new string[]
        { "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1", "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", "1_1", "2", "2_1", "3", "3_1", "10", "10_1" };
    }
    
}


When key ="1", the value is incorrect.
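
A likely cause, judging from the code as posted: the pattern is not anchored, so Regex.IsMatch is free to find "1_2" in the middle of "Saliva 1_2". Because values are only added inside the !ContainsKey block, that first hit is also the only value ever stored for the key. The small check below compares the posted pattern with an anchored, escaped variant. PatternCheck, Original and Anchored are hypothetical names, and the anchored pattern is an assumed fix, not the one the author finally used.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternCheck
{
    // The pattern built in the code above for a digits-only key.
    public static bool Original(string key, string value)
    {
        return Regex.IsMatch(value, key + @"[^\d]+\w");
    }

    // Assumed fix: escape the key and anchor the match at the start of the value.
    public static bool Anchored(string key, string value)
    {
        return Regex.IsMatch(value, "^" + Regex.Escape(key) + @"[^\d]+\w");
    }

    static void Main()
    {
        Console.WriteLine(Original("1", "Saliva 1_2")); // True: matches the "1_2" inside the name
        Console.WriteLine(Anchored("1", "Saliva 1_2")); // False
        Console.WriteLine(Anchored("1", "1_1"));        // True
        Console.WriteLine(Anchored("1", "10_1"));       // False
    }
}
```

Note that [^\d]+\w requires at least two characters after the key, so even the anchored variant does not match "6688b" against the key "6688".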
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35073527
Hi!

Sorry for the late response; I was busy. I haven't tested your method yet, but I came up with this one. A full version:

     
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Diagnostics;
using Microsoft.VisualBasic; 

namespace ConsoleCS
{
    class Program
    {
        static string[] header1 = new string[] {
                                                 "XYZ 20", "XYZ 20 Dup", "Saliva 1_2", "Saliva 1", 
                                                 "Sal_Lb", "Sal_L", "Sal_2b", "Sal_2", "Sal_1b", "Sal_1",
                                                 "KA_2", "KA", "JDT_2", "JDT", "JA_2", "JA", "8_2", 
                                                 "8_2b", "6688", "6688b", "test1a_2", "test1a", "1", 
                                                 "1_1", "2", "2_1", "3", "3_1", "10", "10_1",
                                                 "XYZ 20 DUP_2"
                                               };
        static void Main(string[] args)
        {
            Dictionary<string,List<string>> dict = new Dictionary<string,List<string>>();
            dict = GroupSamples();
            PrintValues(dict);
        }

        public static void PrintValues(Dictionary<string, List<string>> dict)
        {
            foreach (string k in dict.Keys)
            {
                Console.WriteLine(k);
                foreach (string v in dict[k])
                {
                    Console.WriteLine("    " + v);
                }
            }
        }

        static Dictionary<string, List<string>> GroupSamples()
        {
            Dictionary<string, List<string>> dict = new Dictionary<string, List<string>>();
            Array.Sort(header1);
            for (int i = 0; i < header1.Length; i++)
            {
                for (int j = i + 1; j < header1.Length; j++)
                {
                    if (!Information.IsNumeric(header1[i]))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + "(_[a-z]|[a-z]|_\\d)+$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if (Information.IsNumeric(header1[i]) && header1[i].Contains("_"))                        
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"([\s\da-z]+)$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if(Information.IsNumeric(header1[i]) && !header1[i].Contains("_"))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"(_\d+|[a-z]+)$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }
                    if (!Information.IsNumeric(header1[i]) && !header1[i].Contains("_"))
                    {
                        if (Regex.IsMatch(header1[j], "(?i)^" + header1[i] + @"[\sa-z\d]+$"))
                        {
                            if (!dict.ContainsKey(header1[i]))
                            {
                                dict[header1[i]] = new List<string>(new string[] { header1[j] });
                            }
                            else
                            {
                                dict[header1[i]].Add(header1[j]);
                            }
                        }
                    }

                }
            }
            return dict;
        }
    }
}

 

Author Comment

by:zhshqzyc
ID: 35074046
I already figured it out. Thank you very much.
 
LVL 19

Expert Comment

by:Shahan Ayyub
ID: 35074064
So is your problem solved?
 
LVL 19

Accepted Solution

by:
Shahan Ayyub earned 500 total points
ID: 35074089
Did you test my solution?
 

Author Comment

by:zhshqzyc
ID: 35087506
Your code should work, but I would like a more general case. Your pattern uses '_', but the separator may be another character. For example, {"KA","KA*2"} instead of {"KA","KA_2"}.
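
One way to allow an arbitrary separator is sketched below. It is an assumption about what "general" means here, not a tested solution from this thread: the suffix after the key may start with letters, followed by any number of groups made of one non-alphanumeric separator and then letters or digits. The key is passed through Regex.Escape so that characters such as '*' in a sample name are matched literally. SeparatorAgnostic and IsRepeatOf are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class SeparatorAgnostic
{
    // Hypothetical helper: does value equal key plus a "repeat" suffix?
    // Assumed suffix grammar: optional letters, then zero or more groups of
    // one non-alphanumeric separator followed by letters or digits.
    // (?=.) requires the suffix to be non-empty.
    public static bool IsRepeatOf(string key, string value)
    {
        string pattern = "^" + Regex.Escape(key)
            + @"(?=.)[A-Za-z]*(?:[^A-Za-z0-9][A-Za-z0-9]+)*$";
        return Regex.IsMatch(value, pattern);
    }

    static void Main()
    {
        Console.WriteLine(IsRepeatOf("KA", "KA*2"));           // True
        Console.WriteLine(IsRepeatOf("KA", "KA_2"));           // True
        Console.WriteLine(IsRepeatOf("XYZ 20", "XYZ 20 Dup")); // True
        Console.WriteLine(IsRepeatOf("6688", "6688b"));        // True
        Console.WriteLine(IsRepeatOf("1", "10_1"));            // False
    }
}
```

A suffix that starts with a digit is rejected, which keeps "1" and "10_1" apart. The trade-off is that a key such as "KA" would also claim a name like "KAB", so this only works if no sample ID is another sample ID plus letters.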
 

Author Comment

by:zhshqzyc
ID: 35098015
Sorry, I hit that by mistake.