Solved

RegEx MultiLine issue with repeating data.

Posted on 2008-10-15
Last Modified: 2008-10-23
Hi DDrudik :P and All

Ok, the only way I can illustrate this is by attaching all the code, so I will reference everything by line-number (as displayed in Visual C# / text editor) so that you know what I'm referring to.
Apologies for the inconvenience.

GIVEN::

Line 44 - 49:
* TemplateItems are added
* Note especially line 48 in this example for "UNT..."

Line 55 - 63
* Incoming data is added
* Note especially FIRST occurrence of UNT data in lines 58 and 59
* Note again SECOND occurrence of UNT data in lines 61 - 63

Line 178
* my pattern is finalised

Line 183
* A Regular expression is created using the pattern
* The Multiline property is set

RESULT:
If you run the program and look at the output, you will notice that the UNT items fetched are as follows:

[untTotalCode] [A]
[untTotal] [000001546000AA]
[untTotalCode_1] [B]
[untTotal_1] [000001546000BB]
[untTotalCode_2] [4]
[untTotal_2] [00000154600004]
[untTotalCode_3] [5]
[untTotal_3] [00000154600005]
[untTotalCode_4] [6]
[untTotal_4] [00000154600006]

PROBLEM:

I only need the Regex to match all instances of the UNT match until it encounters something other than the matched pattern, i.e. the items in lines 58 and 59, not 61 - 63. I understand why it is doing this: it gets all the matches it finds for the template in line 48. But how do I change this so that it only gets the matched items while the pattern doesn't change?

In other words, for my template, match all sequential matches, but when something OTHER than the pattern is encountered, stop matching??

Thanks








using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
 
namespace MessageTranslationExample
{
    public class TemplateItem
    {
        public Boolean isTemplateItemRequired { get; set; }
        public Boolean areExactFieldsRequired { get; set; }
        public Boolean isRepeatable { get; set; }
        public String templateBody { get; set; }
        public TemplateItem(Boolean IsTemplateItemRequired, Boolean AreExactFieldsRequired, Boolean IsRepeatable, String TemplateBody)
        {
            isTemplateItemRequired = IsTemplateItemRequired;
            areExactFieldsRequired = AreExactFieldsRequired;
            isRepeatable = IsRepeatable;
            templateBody = TemplateBody;
        }
    }
    
    class MessageTranslation
    {
        static String incomingData;
        static List<TemplateItem> templateItems;
        static Dictionary<String, String> incomingDictionary;
        static Boolean bDeleteMatchedDataFromIncoming;
        static String myPattern;
       
        static void Main(string[] args)
        {
            SetTemplateItems();
            SetIncomingData();
            TranslateMessage();
            DisplayDictionary(incomingDictionary);    
            Console.ReadKey();
        }
                
        //Example Data Structure Template to use
        static void SetTemplateItems()
        {
            templateItems = new List<TemplateItem>();
            templateItems.Add(new TemplateItem(true, true, false, "UNH+{unhCode1}+{unhMessageType}:{unhShortCode}:{unhVersion}:{unhControlBody}+{unhType}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UCI+{uciNumber}+{uciCustomer}+{uciOrganisation}+{uciVersion}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UCM+{ucmNumber}+{ucmType}:{ucmShortCode}:{ucmAbbrev}:{ucmOrganisation}:{ucmIndex}+{ucmIndexCode}'"));
            templateItems.Add(new TemplateItem(true, true, true, "UNT+{untTotalCode}+{untTotal}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UNZ+{unzCode}+{unzId}'"));
        }
 
        //Example Incoming Data to use
        static void SetIncomingData()
        {
            incomingData = @"UNH+00000154600001+CONTRL:D:3:UN+CONTRL'
UCI+00000000000443+ETRADEX+SARS+7'
UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'
UNT+A+000001546000AA'
UNT+B+000001546000BB'
UNZ+1+000001546'
UNT+4+00000154600004'
UNT+5+00000154600005'
UNT+6+00000154600006'";
        }
 
        //Translate the Incoming Message
        /* DEV HINTS:
        //  * If start index != 0 (thus not located in first item of incoming string) 
                for a template match and the templateItem isRequired, error, else if
                not required, move on to next templateItem without deleting data from incoming
            *  If repeatable, loop through the incoming data until a match is not found for the 
               templateItem.
         */
        /* DEV ISSUES:
       // 
        */ 
        static void TranslateMessage()
        {
            incomingDictionary = new Dictionary<string, string>();
            
            //Analyze each Template Item against the Incoming Data
            foreach (TemplateItem templateItem in templateItems)
            {
                //Init
                bDeleteMatchedDataFromIncoming = true;
                myPattern = "";
                               
                //Generate RegEx Pattern
                SetIncomingValues(templateItem, incomingData);
                                
                if (templateItem.isRepeatable)
                {
                    //Create Regex to match all back to back occurrences of the template pattern
                }
                else
                {
                    //Process against incoming data
                    if ("RegEx Match is found" == "RegEx Match found")
                    {
 
                    }
                    else
                    {
                        if (templateItem.isTemplateItemRequired)
                        {
                            
                        }
                    }
                                      
                }
            }
            return;
        }
 
        //Generate Regular Expression Pattern
        static void SetIncomingValues(TemplateItem templateItem, String incomingData)
        {
            //Init
            Regex reg;
            String pattern;
            String keyValue, keyName;
            String processedKeyValue, processedKeyName;
            List<String> keys, processedKeys;
            int index;
            GroupCollection groups;
            String lastMatch = "";
            reg = new Regex("(?<text>[^{}]*)({(?<key>[^}]+)})?"); // .NET Regular Expression matching KeyTemplate Grammar
            keys = new List<String>();
            processedKeys = new List<String>();
            
            // Pattern Start Character
            pattern = "^"; 
            
            //For each RegEx Template Item match in the Template
            foreach (Match match in reg.Matches(templateItem.templateBody))
            {
                //Handle whitespaces in the ValueTemplate
                keyValue = "";
                foreach (char c in match.Groups["text"].Value)
                {
                    if (c != ' ' && c != '\t')
                        keyValue += c + "$$SPACE$$";
                    else
                        keyValue += c;
                }
                keyValue = keyValue.Replace("$$SPACE$$ ", "$$SPACE$$");
 
                //Remove the last white space matcher of the pattern
                if (keyValue.EndsWith("$$SPACE$$"))
                {
                    keyValue = keyValue.Substring(0, keyValue.Length - "$$SPACE$$".Length);
                }
                pattern += keyValue.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)").Replace("$$SPACE$$", "\\s*");
                
                //Extract the key from the match
                if (match.Groups["key"].Value != "")
                {
                    keyName = match.Groups["key"].Value.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)");
 
                    //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                    if (keys.Contains(keyName))
                    {
                        index = 1;
                        while (keys.Contains(keyName + "_" + index.ToString())) index++;
                        keyName = keyName + "_" + index.ToString();
                    }
                     keys.Add(keyName);
                   
                    //A value may be omitted so make its matcher optional
                    pattern += string.Format("(?<{0}>.*)", keyName);
                    
                    //Set last match for error messages
                    lastMatch = keyName; 
                }
            }
           
            //Allows pattern to look at new line for the same pattern
            pattern += @"(?=\r\n|$)";
           
            Console.WriteLine("Pattern: " + pattern + "\n");
                       
            //Value Extractor : Uses the generated Regex to extract values from the input
            reg = new Regex(pattern, RegexOptions.Multiline); //Multiline makes ^ and $ match at the start and end of each line
            
            //Match Validation
            if (!reg.IsMatch(incomingData))
            {
                //Only error if templateItem is required, otherwise match not required
                if (templateItem.isTemplateItemRequired)
                {
                    throw new Exception("The Message Data Structure differs from that of the Message Template Structure and thus a conversion can not be done between the two. Last Successful Match Key was: " + lastMatch);
                }
             }
 
            //Build Key Values based on matches
            MatchCollection mc = reg.Matches(incomingData);
            if (mc.Count > 0)
            {
                foreach (Match m in mc)
                {
                    for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                    {
                        //Skip item at index 0 as it contains the full match
                        if (gIdx > 0)
                        {
                            processedKeyName = reg.GetGroupNames()[gIdx];
                            processedKeyValue = m.Groups[gIdx].Value;
                            
                            //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                            if (processedKeys.Contains(processedKeyName))
                            {
                                index = 1;
                                while (processedKeys.Contains(processedKeyName + "_" + index.ToString())) index++;
                                processedKeyName = processedKeyName + "_" + index.ToString();
                            }
                            processedKeys.Add(processedKeyName);
                           
                            //Add to Dictionary
                            incomingDictionary[processedKeyName] = processedKeyValue;
                        }
                    }
                    // Only match first match unless templateItem is repeatable
                    if (!templateItem.isRepeatable)
                    {
                        break;
                    }
                }
            }
            else
            {
                throw new Exception("Pattern did not match: ( + " + templateItem.templateBody + ").");
            }
 
            return;
        }
 
        static void DisplayDictionary(Dictionary<String, String> dictionary)
        {
            Console.WriteLine("---- DICTIONARY DATA ----\n");
            foreach ( String key in dictionary.Keys)
            {
                Console.WriteLine("[" + key + "] [" + dictionary[key] + "]\r");
            }
            Console.WriteLine("\n\n");
        }
    }
}

Question by:djcheeky
 
LVL 27

Expert Comment

by:ddrudik
ID: 22724942
Is this something unique to UNT?

You could try to match the source on:
Regex re = new Regex(@".+(?=^(?!UNT))",RegexOptions.Multiline | RegexOptions.Singleline);

Which would result in:
    [0] => UNH+00000154600001+CONTRL:D:3:UN+CONTRL'
UCI+00000000000443+ETRADEX+SARS+7'
UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'
UNT+A+000001546000AA'
UNT+B+000001546000BB'

Then applying the UNT regex to that result will find only the matches you seek.
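
A minimal sketch of that two-step idea, run against a shortened copy of the sample data. The names `TwoStepDemo`, `Head` and `CountUnt` are made up for illustration, and the simplified UNT line pattern in `CountUnt` is an assumption, not the pattern the program generates.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoStepDemo
{
    // Step 1: keep everything before the last line that does NOT start with "UNT".
    public static string Head(string data)
    {
        var re = new Regex(@".+(?=^(?!UNT))",
            RegexOptions.Multiline | RegexOptions.Singleline);
        return re.Match(data).Value;
    }

    // Step 2: count UNT lines (simplified, hypothetical line pattern).
    public static int CountUnt(string text)
    {
        return Regex.Matches(text, @"^UNT\+[^+\r\n]*\+[^'\r\n]*'",
            RegexOptions.Multiline).Count;
    }

    public static void Main()
    {
        string data = string.Join("\r\n", new[] {
            "UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'",
            "UNT+A+000001546000AA'",
            "UNT+B+000001546000BB'",
            "UNZ+1+000001546'",
            "UNT+4+00000154600004'" });

        Console.WriteLine(CountUnt(data));       // UNT lines in the whole input
        Console.WriteLine(CountUnt(Head(data))); // UNT lines in the isolated head
    }
}
```

Because `.+` is greedy, the cut lands before the last line that does not start with UNT, so this isolates the first group only when a single such line follows it, as in the sample data.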
 

Author Comment

by:djcheeky
ID: 22728926
Basically, with the RegEx that is generated specifically for that templateItem below:
^U\s*N\s*T\s*\+(?<untTotalCode>.*)\+(?<untTotal>.*)'(?=\r\n|$)

The Regular expression goes through all the data:
(A)
 incomingData = @"UNH+00000154600001+CONTRL:D:3:UN+CONTRL'
UCI+00000000000443+ETRADEX+SARS+7'
UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'
UNT+A+000001546000AA'
UNT+B+000001546000BB'
UNZ+1+000001546'
UNT+4+00000154600004'
UNT+5+00000154600005'
UNT+6+00000154600006'";

(B)
And matches these lines:
UNT+A+000001546000AA'
UNT+B+000001546000BB'
UNT+4+00000154600004'
UNT+5+00000154600005'
UNT+6+00000154600006'

But if you look at (A), the line:
UNZ+1+000001546'

this actually breaks the pattern between:
UNT+A+000001546000AA'
UNT+B+000001546000BB'

AND

UNT+4+00000154600004'
UNT+5+00000154600005'
UNT+6+00000154600006'

But the regular expression matches ALL UNT matches - I just want it to match until it encounters something different, in other words, just:
UNT+A+000001546000AA'
UNT+B+000001546000BB'

So it is matching all back-to-back similar data matches UNTIL something else is encountered, and not EVERY single match for that pattern in the whole string.

Thanks


 
 
LVL 27

Expert Comment

by:ddrudik
ID: 22730065
For the repeating pattern you will need to use the process I described in my comment above: first match the entire string, line by line, up to the point where the last UNT in the first group of UNTs is consumed; then perform your match of all UNTs against that new string. As far as I know, there's no other regex way of doing what you describe once a pattern has been applied to a string and the matches are returned.
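
One possible alternative that skips the intermediate string is sketched below. It is a different technique from the one described above: the line pattern also consumes the line break, and the loop stops at the first match that does not begin exactly where the previous one ended. `ConsecutiveRun`, `FirstRun` and the simplified line pattern are hypothetical names and assumptions, not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConsecutiveRun
{
    // Returns the matches of linePattern that sit on back-to-back lines,
    // starting at the first match and stopping at the first gap.
    public static List<Match> FirstRun(string data, string linePattern)
    {
        var run = new List<Match>();
        // Consume the line break so that adjacent matches touch each other.
        var re = new Regex(linePattern + @"(?:\r?\n|\z)", RegexOptions.Multiline);
        int expected = -1;
        foreach (Match m in re.Matches(data))
        {
            if (expected != -1 && m.Index != expected) break; // gap: another line intervened
            run.Add(m);
            expected = m.Index + m.Length;
        }
        return run;
    }

    public static void Main()
    {
        string data = string.Join("\r\n", new[] {
            "UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'",
            "UNT+A+000001546000AA'",
            "UNT+B+000001546000BB'",
            "UNZ+1+000001546'",
            "UNT+4+00000154600004'" });
        // Simplified stand-in for the generated UNT pattern (assumption).
        string linePattern = @"^UNT\+(?<untTotalCode>[^+\r\n]*)\+(?<untTotal>[^'\r\n]*)'";

        foreach (Match m in FirstRun(data, linePattern))
            Console.WriteLine("[" + m.Groups["untTotalCode"].Value + "] [" + m.Groups["untTotal"].Value + "]");
    }
}
```

The adjacency check relies only on `Match.Index` and `Match.Length`, so it does not need to know the segment name in advance.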

 

Author Comment

by:djcheeky
ID: 22730459
I'm not too sure how to do that, because your suggestion seems to hard code "UNT" into the Regex, but it could be any element?
 
LVL 27

Expert Comment

by:ddrudik
ID: 22730837
If that's the case, you will need to loop through the regex patterns one template line at a time, creating a MatchCollection for each. If the count of a MatchCollection is more than 1, use the first three characters (UNT, UMT, etc.) as a variable in the regex pattern shown above to isolate the part of the string you want to take the matches from, then create a MatchCollection for those matches. It's going to take a bit of looping, but if that's what you need, that's how I see it can be done. If you need an example of this from me, unfortunately I cannot provide one until later today.
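
A rough sketch of that loop, assuming the three-character tags are already known. `TagIsolation`, `CountLines` and `FirstBlock` are hypothetical helper names; `Regex.Escape` is used so that a tag containing regex metacharacters would still be treated literally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagIsolation
{
    // Number of lines in data that start with tag.
    public static int CountLines(string data, string tag)
    {
        return Regex.Matches(data, "^" + Regex.Escape(tag), RegexOptions.Multiline).Count;
    }

    // First block of consecutive lines that start with tag.
    public static string FirstBlock(string data, string tag)
    {
        string t = Regex.Escape(tag);
        var re = new Regex("^" + t + @".*?(?=\r?\n(?!" + t + @")|\z)",
            RegexOptions.Multiline | RegexOptions.Singleline);
        return re.Match(data).Value;
    }

    public static void Main()
    {
        string data = string.Join("\r\n", new[] {
            "UNH+00000154600001+CONTRL:D:3:UN+CONTRL'",
            "UNT+A+000001546000AA'",
            "UNT+B+000001546000BB'",
            "UNZ+1+000001546'",
            "UNT+4+00000154600004'" });

        foreach (string tag in new[] { "UNH", "UNT", "UNZ" })
        {
            // Only tags that occur more than once need the isolation step.
            if (CountLines(data, tag) > 1)
                Console.WriteLine(tag + " first block:\r\n" + FirstBlock(data, tag));
        }
    }
}
```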
 

Author Comment

by:djcheeky
ID: 22734256
Hi ddrudik - could you please provide an example? I have tried going through your posts but I don't seem to follow. Thanks
 
LVL 27

Expert Comment

by:ddrudik
ID: 22734268
In about 2 hours I will be able to.  Thanks.
 

Author Comment

by:djcheeky
ID: 22734308
No prob mate! Thanks a lot!
 
LVL 27

Expert Comment

by:ddrudik
ID: 22736490
The solution is not yet finalized, this may have to wait until next day.
 
LVL 27

Expert Comment

by:ddrudik
ID: 22736863
See if this fits what you need (it works for your requirement, but I'm not sure how it fits with the solution option where the entire regex pattern can match the entire string, if that's still something you were checking for).

The block I added starts with the line:
string subPattern...

Maybe that will give you an idea of how to use it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
 
namespace MessageTranslationExample
{
    public class TemplateItem
    {
        public Boolean isTemplateItemRequired { get; set; }
        public Boolean areExactFieldsRequired { get; set; }
        public Boolean isRepeatable { get; set; }
        public String templateBody { get; set; }
        public TemplateItem(Boolean IsTemplateItemRequired, Boolean AreExactFieldsRequired, Boolean IsRepeatable, String TemplateBody)
        {
            isTemplateItemRequired = IsTemplateItemRequired;
            areExactFieldsRequired = AreExactFieldsRequired;
            isRepeatable = IsRepeatable;
            templateBody = TemplateBody;
        }
    }
    
    class MessageTranslation
    {
        static String incomingData;
        static List<TemplateItem> templateItems;
        static Dictionary<String, String> incomingDictionary;
        static Boolean bDeleteMatchedDataFromIncoming;
        static String myPattern;
       
        static void Main(string[] args)
        {
            SetTemplateItems();
            SetIncomingData();
            TranslateMessage();
            DisplayDictionary(incomingDictionary);    
            Console.ReadKey();
        }
                
        //Example Data Structure Template to use
        static void SetTemplateItems()
        {
            templateItems = new List<TemplateItem>();
            templateItems.Add(new TemplateItem(true, true, false, "UNH+{unhCode1}+{unhMessageType}:{unhShortCode}:{unhVersion}:{unhControlBody}+{unhType}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UCI+{uciNumber}+{uciCustomer}+{uciOrganisation}+{uciVersion}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UCM+{ucmNumber}+{ucmType}:{ucmShortCode}:{ucmAbbrev}:{ucmOrganisation}:{ucmIndex}+{ucmIndexCode}'"));
            templateItems.Add(new TemplateItem(true, true, true, "UNT+{untTotalCode}+{untTotal}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UNZ+{unzCode}+{unzId}'"));
        }
 
        //Example Incoming Data to use
        static void SetIncomingData()
        {
            incomingData = @"UNH+00000154600001+CONTRL:D:3:UN+CONTRL'
UCI+00000000000443+ETRADEX+SARS+7'
UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'
UNT+A+000001546000AA'
UNT+B+000001546000BB'
UNZ+1+000001546'
UNT+4+00000154600004'
UNT+5+00000154600005'
UNT+6+00000154600006'";
        }
 
        //Translate the Incoming Message
        /* DEV HINTS:
        //  * If start index != 0 (thus not located in first item of incoming string) 
                for a template match and the templateItem isRequired, error, else if
                not required, move on to next templateItem without deleting data from incoming
            *  If repeatable, loop through the incoming data until a match is not found for the 
               templateItem.
         */
        /* DEV ISSUES:
       // 
        */ 
        static void TranslateMessage()
        {
            incomingDictionary = new Dictionary<string, string>();
            
            //Analyze each Template Item against the Incoming Data
            foreach (TemplateItem templateItem in templateItems)
            {
                //Init
                bDeleteMatchedDataFromIncoming = true;
                myPattern = "";
                               
                //Generate RegEx Pattern
                SetIncomingValues(templateItem, incomingData);
                                
                if (templateItem.isRepeatable)
                {
                    //Create Regex to match all back to back occurrences of the template pattern
                }
                else
                {
                    //Process against incoming data
                    if ("RegEx Match is found" == "RegEx Match found")
                    {
 
                    }
                    else
                    {
                        if (templateItem.isTemplateItemRequired)
                        {
                            
                        }
                    }
                                      
                }
            }
            return;
        }
 
        //Generate Regular Expression Pattern
        static void SetIncomingValues(TemplateItem templateItem, String incomingData)
        {
            //Init
            Regex reg;
            String pattern;
            String keyValue, keyName;
            String processedKeyValue, processedKeyName;
            List<String> keys, processedKeys;
            int index;
            GroupCollection groups;
            String lastMatch = "";
            reg = new Regex("(?<text>[^{}]*)({(?<key>[^}]+)})?"); // .NET Regular Expression matching KeyTemplate Grammar
            keys = new List<String>();
            processedKeys = new List<String>();
            
            // Pattern Start Character
            pattern = "^"; 
            
            //For each RegEx Template Item match in the Template
            foreach (Match match in reg.Matches(templateItem.templateBody))
            {
                //Handle whitespaces in the ValueTemplate
                keyValue = "";
                foreach (char c in match.Groups["text"].Value)
                {
                    if (c != ' ' && c != '\t')
                        keyValue += c + "$$SPACE$$";
                    else
                        keyValue += c;
                }
                keyValue = keyValue.Replace("$$SPACE$$ ", "$$SPACE$$");
 
                //Remove the last white space matcher of the pattern
                if (keyValue.EndsWith("$$SPACE$$"))
                {
                    keyValue = keyValue.Substring(0, keyValue.Length - "$$SPACE$$".Length);
                }
                pattern += keyValue.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)").Replace("$$SPACE$$", "\\s*");
                
                //Extract the key from the match
                if (match.Groups["key"].Value != "")
                {
                    keyName = match.Groups["key"].Value.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)");
 
                    //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                    if (keys.Contains(keyName))
                    {
                        index = 1;
                        while (keys.Contains(keyName + "_" + index.ToString())) index++;
                        keyName = keyName + "_" + index.ToString();
                    }
                     keys.Add(keyName);
                   
                    //A value may be omitted so make its matcher optional
                    pattern += string.Format("(?<{0}>.*)", keyName);
                    
                    //Set last match for error messages
                    lastMatch = keyName; 
                }
            }
           
            //Allows pattern to look at new line for the same pattern
            pattern += @"(?=\r\n|$)";
            Console.WriteLine("Pattern: " + pattern + "\n");
 
            string subPattern = pattern.Substring(0, pattern.IndexOf(@"+") + 1);
            Console.WriteLine("subPattern: " + subPattern + "\n");
            Regex reSub = new Regex(subPattern + @".*?(?=\r\n(?!" + subPattern + @"))", RegexOptions.Multiline | RegexOptions.Singleline);
            Match mm = reSub.Match(incomingData);
            string newData = mm.Groups[0].Value;
       
            //Value Extractor : Uses the generated Regex to extract values from the input
            reg = new Regex(pattern, RegexOptions.Multiline); //Multiline makes ^ and $ match at the start and end of each line
            
            //Match Validation
            if (!reg.IsMatch(newData))
            {
                //Only error if templateItem is required, otherwise match not required
                if (templateItem.isTemplateItemRequired)
                {
                    throw new Exception("The Message Data Structure differs from that of the Message Template Structure and thus a conversion can not be done between the two. Last Successful Match Key was: " + lastMatch);
                }
             }
 
            //Build Key Values based on matches
            MatchCollection mc = reg.Matches(newData);
            if (mc.Count > 0)
            {
                foreach (Match m in mc)
                {
                    for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                    {
                        //Skip item at index 0 as it contains the full match
                        if (gIdx > 0)
                        {
                            processedKeyName = reg.GetGroupNames()[gIdx];
                            processedKeyValue = m.Groups[gIdx].Value;
                            
                            //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                            if (processedKeys.Contains(processedKeyName))
                            {
                                index = 1;
                                while (processedKeys.Contains(processedKeyName + "_" + index.ToString())) index++;
                                processedKeyName = processedKeyName + "_" + index.ToString();
                            }
                            processedKeys.Add(processedKeyName);
                           
                            //Add to Dictionary
                            incomingDictionary[processedKeyName] = processedKeyValue;
                        }
                    }
                    // Only match first match unless templateItem is repeatable
                    if (!templateItem.isRepeatable)
                    {
                        break;
                    }
                }
            }
            else
            {
                throw new Exception("Pattern did not match: ( + " + templateItem.templateBody + ").");
            }
 
            return;
        }
 
        static void DisplayDictionary(Dictionary<String, String> dictionary)
        {
            Console.WriteLine("---- DICTIONARY DATA ----\n");
            foreach ( String key in dictionary.Keys)
            {
                Console.WriteLine("[" + key + "] [" + dictionary[key] + "]\r");
            }
            Console.WriteLine("\n\n");
        }
    }
}
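
The added block (the lines starting at `string subPattern`) can be exercised on its own, roughly as sketched here. `SubPatternDemo`, `SubPattern`, `FirstRun` and the `allowEndOfInput` flag are hypothetical; the `|\z` alternative is an assumption added for illustration and is not part of the posted code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubPatternDemo
{
    // Mirrors the posted code: everything up to and including the first '+'.
    public static string SubPattern(string pattern)
    {
        return pattern.Substring(0, pattern.IndexOf('+') + 1);
    }

    // Isolates the first run of lines matching subPattern.
    // allowEndOfInput adds a hypothetical "|\z" so a run at the very end still matches.
    public static string FirstRun(string data, string subPattern, bool allowEndOfInput)
    {
        string tail = allowEndOfInput ? @"|\z" : "";
        var re = new Regex(subPattern + @".*?(?=\r\n(?!" + subPattern + ")" + tail + ")",
            RegexOptions.Multiline | RegexOptions.Singleline);
        return re.Match(data).Value;
    }

    public static void Main()
    {
        string pattern = @"^U\s*N\s*T\s*\+(?<untTotalCode>.*)\+(?<untTotal>.*)'(?=\r\n|$)";
        string sub = SubPattern(pattern);
        Console.WriteLine("subPattern: " + sub);

        string data = string.Join("\r\n", new[] {
            "UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'",
            "UNT+A+000001546000AA'",
            "UNT+B+000001546000BB'",
            "UNZ+1+000001546'",
            "UNT+4+00000154600004'" });
        Console.WriteLine(FirstRun(data, sub, false));
    }
}
```

Two things may be worth checking against real data: as posted, the lookahead needs a `\r\n` after the run, so a repeating group that ends the input returns no match; and if the generated pattern contains no `+`, `IndexOf` returns -1 and the sub-pattern comes back empty.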

 

Author Comment

by:djcheeky
ID: 22738587
Hi ddrudik

Will get to this one over the weekend and get back to you - ta!

 

Author Comment

by:djcheeky
ID: 22739201
Hi ddrudik

Ok, so I ran the code and it does do what I want, but only for that particular template / incoming string.
This functionality actually has to cater for any message type, not just the EDIFACT example I gave. For example, in the code snippet below there is now XML data, which doesn't contain the + sign used in the previous example.

Thanks
 //Example Data Structure Template to use
        static void SetTemplateItems()
        {
            templateItems = new List<TemplateItem>();
           templateItems.Add(new TemplateItem(true, true, true, "<UNT untTotalCode={untTotalCode} untTotal={untTotal} />"));
        }
 
        //Example Incoming Data to use
        static void SetIncomingData()
        {
          incomingData = @"<UNT untTotalCode={25} untTotal={hello} />";
        }

 
LVL 27

Expert Comment

by:ddrudik
ID: 22744205
You would need to decide what makes a set of repeating rows, then. I was using the first part of the regex pattern to do that, and I just happened to stop at the first index of "+". If you change the input source format, you would need to change the pattern as well. Will the patterns always be three letters with \s* after each of the three letters?
 

Author Comment

by:djcheeky
ID: 22755623
Hi ddrudik.

The pattern may not always start with three characters and \s after the letters.
It will, however:

* Start with an alphanumeric character or special character
* Contain text
* End with a space or special character

For example (These are possible starting characters before the ... ):

<Unbheader ...  // Starts with a '<' and ends with a ' '
UnbHeader ...  // Starts with a 'U' and ends with a ' '
UnbHeader+    // Starts with a 'U' and ends with a '+'
<UnbHeader+  // Starts with a '<' and ends with a '+'

So it can be any one of those four combinations - note that the '<' and '+' characters used in the example could be any special characters. But the similarity is that it will always start with an alphanumeric character or special character and end with a space (or \s) or special character.

Thanks
 
LVL 27

Expert Comment

by:ddrudik
ID: 22756483
In order for a regex solution to be identified I would need to know what specifically you want to consider a "special character" other than " " and "+".
 

Author Comment

by:djcheeky
ID: 22756571
Sure, it could be:

+
-
<
>
{
}
=
`
'
/
\
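
Taking the characters listed above (plus whitespace) as the set of terminators, the leading token of a template could be pulled out along these lines. This is only a sketch: `LeadingTokenDemo` and `LeadingToken` are made-up names, and treating the underscore as part of the token text is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingTokenDemo
{
    // Optional leading special character, then token text, then one terminator
    // from the list: whitespace + - < > { } = ` ' / \
    private static readonly Regex Lead = new Regex(
        @"^[^A-Za-z0-9_\s]?[A-Za-z0-9_]+[\s+\-<>{}=`'/\\]");

    // Returns the leading token including its terminator, or "" if none is found.
    public static string LeadingToken(string template)
    {
        return Lead.Match(template).Value;
    }

    public static void Main()
    {
        string[] templates = {
            "UNT+{untTotalCode}+{untTotal}'",
            "<UNT untTotalCode={untTotalCode} untTotal={untTotal} />"
        };
        foreach (string t in templates)
        {
            string token = LeadingToken(t);
            // Regex.Escape makes the token safe to embed in a larger pattern.
            Console.WriteLine("[" + token + "] -> [" + Regex.Escape(token) + "]");
        }
    }
}
```

Keeping the terminator in the token means a prefix such as `UNT+` cannot accidentally match a longer segment name that merely begins with UNT.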
 
LVL 27

Expert Comment

by:ddrudik
ID: 22757862
You might have UNT{A+000001546000AA' or UNT<A+000001546000AA' ?

The use of < or { as special characters will make it problematic to parse properly since {} is used throughout the pattern and < could be used at the front of the pattern.
 

Author Comment

by:djcheeky
ID: 22757953
No, but I could have:

<UNT>
<name>Blah</name>
</UNT>

OR

<person>
<name>{name}</name>
</person>

OR

UNT+Blah:AnotherBlah-Nothing'

OR

person_{myName}+{anything}*{nothing}'


So I guess the main characters are >, +, - and perhaps a few others, but if I see the code for those few I should be able to modify it for any others that arise :)

Thanks
 
LVL 27

Accepted Solution

by:
ddrudik earned 500 total points
ID: 22758089
Here's an inclusion of the subpattern I was thinking of. If this fails with your other data, please show an extended example of that repeating data.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
 
namespace MessageTranslationExample
{
    public class TemplateItem
    {
        public Boolean isTemplateItemRequired { get; set; }
        public Boolean areExactFieldsRequired { get; set; }
        public Boolean isRepeatable { get; set; }
        public String templateBody { get; set; }
        public TemplateItem(Boolean IsTemplateItemRequired, Boolean AreExactFieldsRequired, Boolean IsRepeatable, String TemplateBody)
        {
            isTemplateItemRequired = IsTemplateItemRequired;
            areExactFieldsRequired = AreExactFieldsRequired;
            isRepeatable = IsRepeatable;
            templateBody = TemplateBody;
        }
    }
 
    class MessageTranslation
    {
        static String incomingData;
        static List<TemplateItem> templateItems;
        static Dictionary<String, String> incomingDictionary;
        static Boolean bDeleteMatchedDataFromIncoming;
        static String myPattern;
 
        static void Main(string[] args)
        {
            SetTemplateItems();
            SetIncomingData();
            TranslateMessage();
            DisplayDictionary(incomingDictionary);
            Console.ReadKey();
        }
 
        //Example Data Structure Template to use
        static void SetTemplateItems()
        {
            templateItems = new List<TemplateItem>();
            templateItems.Add(new TemplateItem(true, true, false, "UNH+{unhCode1}+{unhMessageType}:{unhShortCode}:{unhVersion}:{unhControlBody}+{unhType}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UCI+{uciNumber}+{uciCustomer}+{uciOrganisation}+{uciVersion}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UCM+{ucmNumber}+{ucmType}:{ucmShortCode}:{ucmAbbrev}:{ucmOrganisation}:{ucmIndex}+{ucmIndexCode}'"));
            templateItems.Add(new TemplateItem(true, true, true, "UNT+{untTotalCode}+{untTotal}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UNZ+{unzCode}+{unzId}'"));
        }
 
        //Example Incoming Data to use
        static void SetIncomingData()
        {
            incomingData = @"UNH+00000154600001+CONTRL:D:3:UN+CONTRL'
UCI+00000000000443+ETRADEX+SARS+7'
UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'
UNT+A+000001546000AA'
UNT+B+000001546000BB'
UNZ+1+000001546'
UNT+4+00000154600004'
UNT+5+00000154600005'
UNT+6+00000154600006'";
        }
 
        //Translate the Incoming Message
        /* DEV HINTS:
        //  * If start index != 0 (thus not located in first item of incoming string) 
                for a template match and the templateItem isRequired, error, else if
                not required, move on to next templateItem without deleting data from incoming
            *  If repeatable, loop through the incoming data until a match is not found for the 
               templateItem.
         */
        /* DEV ISSUES:
       // 
        */
        static void TranslateMessage()
        {
            incomingDictionary = new Dictionary<string, string>();
 
            //Analyze each Template Item against the Incoming Data
            foreach (TemplateItem templateItem in templateItems)
            {
                //Init
                bDeleteMatchedDataFromIncoming = true;
                myPattern = "";
 
                //Generate RegEx Pattern
                SetIncomingValues(templateItem, incomingData);
 
                if (templateItem.isRepeatable)
                {
                    //Create Regex to match all back-to-back occurrences of the template pattern
                }
                else
                {
                    //Process against incoming data
                    if ("RegEx Match is found" == "RegEx Match found")
                    {
 
                    }
                    else
                    {
                        if (templateItem.isTemplateItemRequired)
                        {
 
                        }
                    }
 
                }
            }
            return;
        }
 
        //Generate Regular Expression Pattern
        static void SetIncomingValues(TemplateItem templateItem, String incomingData)
        {
            //Init
            Regex reg;
            String pattern;
            String keyValue, keyName;
            String processedKeyValue, processedKeyName;
            List<String> keys, processedKeys;
            int index;
            GroupCollection groups;
            String lastMatch = "";
            reg = new Regex("(?<text>[^{}]*)({(?<key>[^}]+)})?"); // .NET Regular Expression matching KeyTemplate Grammar
            keys = new List<String>();
            processedKeys = new List<String>();
 
            // Pattern Start Character
            pattern = "^";
 
            //For each RegEx Template Item match in the Template
            foreach (Match match in reg.Matches(templateItem.templateBody))
            {
                //Handle whitespaces in the ValueTemplate
                keyValue = "";
                foreach (char c in match.Groups["text"].Value)
                {
                    if (c != ' ' && c != '\t')
                        keyValue += c + "$$SPACE$$";
                    else
                        keyValue += c;
                }
                keyValue = keyValue.Replace("$$SPACE$$ ", "$$SPACE$$");
 
                //Remove the last white space matcher of the pattern
                if (keyValue.EndsWith("$$SPACE$$"))
                {
                    keyValue = keyValue.Substring(0, keyValue.Length - "$$SPACE$$".Length);
                }
                pattern += keyValue.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)").Replace("$$SPACE$$", "\\s*");
 
                //Extract the key from the match
                if (match.Groups["key"].Value != "")
                {
                    keyName = match.Groups["key"].Value.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)");
 
                    //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                    if (keys.Contains(keyName))
                    {
                        index = 1;
                        while (keys.Contains(keyName + "_" + index.ToString())) index++;
                        keyName = keyName + "_" + index.ToString();
                    }
                    keys.Add(keyName);
 
                    //A value may be omitted so make its matcher optional
                    pattern += string.Format("(?<{0}>.*)", keyName);
 
                    //Set last match for error messages
                    lastMatch = keyName;
                }
            }
 
            //Allows pattern to look at new line for the same pattern
            pattern += @"(?=\r\n|$)";
            Console.WriteLine("Pattern: " + pattern + "\n");
 
            Match mmm = Regex.Match(pattern,@".*?[>+ -]");
            string subPattern = mmm.Groups[0].Value;
            Console.WriteLine("subPattern: " + subPattern + "\n");
            Regex reSub = new Regex(subPattern + @".*?(?=\r\n(?!" + subPattern + @"))", RegexOptions.Multiline | RegexOptions.Singleline);
            Match mm = reSub.Match(incomingData);
            string newData = mm.Groups[0].Value;
 
            //Value Extractor : Uses the generated Regex to extract values from the input
            reg = new Regex(pattern, RegexOptions.Multiline); //Allows pattern matching to span one or multiple lines
 
            //Match Validation
            if (!reg.IsMatch(newData))
            {
                //Only error if templateItem is required, otherwise match not required
                if (templateItem.isTemplateItemRequired)
                {
                    throw new Exception("The Message Data Structure differs from that of the Message Template Structure and thus a conversion can not be done between the two. Last Successful Match Key was: " + lastMatch);
                }
            }
 
            //Build Key Values based on matches
            MatchCollection mc = reg.Matches(newData);
            if (mc.Count > 0)
            {
                foreach (Match m in mc)
                {
                    for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                    {
                        //Skip item at index 0 as it contains the full match
                        if (gIdx > 0)
                        {
                            processedKeyName = reg.GetGroupNames()[gIdx];
                            processedKeyValue = m.Groups[gIdx].Value;
 
                            //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                            if (processedKeys.Contains(processedKeyName))
                            {
                                index = 1;
                                while (processedKeys.Contains(processedKeyName + "_" + index.ToString())) index++;
                                processedKeyName = processedKeyName + "_" + index.ToString();
                            }
                            processedKeys.Add(processedKeyName);
 
                            //Add to Dictionary
                            incomingDictionary[processedKeyName] = processedKeyValue;
                        }
                    }
                    // Only match first match unless templateItem is repeatable
                    if (!templateItem.isRepeatable)
                    {
                        break;
                    }
                }
            }
            else
            {
                throw new Exception("Pattern did not match: ( + " + templateItem.templateBody + ").");
            }
 
            return;
        }
 
        static void DisplayDictionary(Dictionary<String, String> dictionary)
        {
            Console.WriteLine("---- DICTIONARY DATA ----\n");
            foreach (String key in dictionary.Keys)
            {
                Console.WriteLine("[" + key + "] [" + dictionary[key] + "]\r");
            }
            Console.WriteLine("\n\n");
        }
    }
}
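
To see the subpattern step in isolation, here is a minimal sketch of how the `reSub` expression above cuts the first unbroken run of matching lines out of the data before the full pattern is applied. The class name `ContiguousRunDemo`, the helper `FirstRun`, and the shortened sample data are assumptions made for illustration; line breaks are written as explicit `\r\n` so the result does not depend on how the source file is saved.

```csharp
using System;
using System.Text.RegularExpressions;

static class ContiguousRunDemo
{
    // Returns the first unbroken run of lines that begin with subPattern.
    // The lazy .*? (with Singleline, so it can cross line breaks) stops at the
    // first "\r\n" that is NOT followed by another line starting with subPattern.
    public static string FirstRun(string data, string subPattern)
    {
        Regex reSub = new Regex(
            subPattern + @".*?(?=\r\n(?!" + subPattern + @"))",
            RegexOptions.Multiline | RegexOptions.Singleline);
        return reSub.Match(data).Value;
    }

    static void Main()
    {
        string data = "UNT+A+000001546000AA'\r\n" +
                      "UNT+B+000001546000BB'\r\n" +
                      "UNZ+1+000001546'\r\n" +
                      "UNT+4+00000154600004'";

        // Only the two UNT lines before UNZ are returned.
        string run = FirstRun(data, @"^UNT\+");

        // Pull the fields out of the isolated run only.
        Regex line = new Regex(@"^UNT\+(?<code>[^+]*)\+(?<total>[^']*)'", RegexOptions.Multiline);
        foreach (Match m in line.Matches(run))
        {
            Console.WriteLine(m.Groups["code"].Value + " -> " + m.Groups["total"].Value);
        }
    }
}
```

Run against this sample, the loop should print the A and B records only; the UNT line after UNZ is never seen by the second regex because it is outside the isolated run.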


Author Comment

by:djcheeky
ID: 22765101
Hi ddrudik.

I have taken your solution and implemented it below, but I get some very strange behaviour. If you run the code, you will notice that the program keeps crashing on the last item in the incoming string / template.

In this case it is:
UNP+P+000001546000PP'

But if I remove that line from the incoming string as well as the item from templateItems, it still does the same.

It just seems to be crashing on the last record and I don't know why.
Would you like me to post this in a separate issue?

Thanks.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
 
namespace MessageTranslationExample
{
    public class TemplateItem
    {
        public Boolean isTemplateItemRequired { get; set; }
        public Boolean areExactFieldsRequired { get; set; }
        public Boolean isRepeatable { get; set; }
        public String templateBody { get; set; }
        public TemplateItem(Boolean IsTemplateItemRequired, Boolean AreExactFieldsRequired, Boolean IsRepeatable, String TemplateBody)
        {
            isTemplateItemRequired = IsTemplateItemRequired;
            areExactFieldsRequired = AreExactFieldsRequired;
            isRepeatable = IsRepeatable;
            templateBody = TemplateBody;
        }
    }
    
    class MessageTranslation
    {
        static String incomingData;
        static List<TemplateItem> templateItems;
        static Dictionary<String, String> incomingDictionary;
        static Boolean bDeleteMatchedDataFromIncoming;
      
        static void Main(string[] args)
        {
            SetTemplateItems();
            SetIncomingData();
            TranslateMessage();
            DisplayDictionary(incomingDictionary);    
            Console.ReadKey();
        }
                
        //Example Data Structure Template to use
        static void SetTemplateItems()
        {
            templateItems = new List<TemplateItem>();
            templateItems.Add(new TemplateItem(true, true, false, "UNH+{unhCode1}+{unhMessageType}:{unhShortCode}:{unhVersion}:{unhControlBody}+{unhType}'"));
           templateItems.Add(new TemplateItem(true, true, false, "UCI+{uciNumber}+{uciCustomer}+{uciOrganisation}+{uciVersion}'"));
           templateItems.Add(new TemplateItem(true, true, false, "UCM+{ucmNumber}+{ucmType}:{ucmShortCode}:{ucmAbbrev}:{ucmOrganisation}:{ucmIndex}+{ucmIndexCode}'"));
            templateItems.Add(new TemplateItem(true, true, true, "UNT+{untTotalCode}+{untTotal}'"));
            templateItems.Add(new TemplateItem(true, true, false, "UNZ+{unzCode}+{unzId}'"));
            templateItems.Add(new TemplateItem(true, true, true, "UNT+{untTotalCode}+{untTotal}'"));
            templateItems.Add(new TemplateItem(true, true, true, "UNP+{unpTotalCode}+{unpTotal}'"));
        }
 
        //Example Incoming Data to use
        static void SetIncomingData()
        {
            incomingData = @"UNH+00000154600001+CONTRL:D:3:UN+CONTRL'                 
UCI+00000000000443+ETRADEX+SARS+7'             
UCM+00000044300001+CUSDEC:D:96B:UN:ZZZ01+7'               
UNT+A+000001546000AA' 
UNT+B+000001546000BB'     
UNT+C+000001546000CC'  
UNZ+1+000001546' 
UNT+D+000001546000DD'     
UNT+E+000001546000EE'
UNP+P+000001546000PP'";
        }
 
        //Translate the Incoming Message
        /* DEV HINTS:
        //  * If start index != 0 (thus not located in first item of incoming string) 
                for a template match and the templateItem isRequired, error, else if
                not required, move on to next templateItem without deleting data from incoming
            *  If repeatable, loop through the incoming data until a match is not found for the 
               templateItem.
         */
        /* DEV ISSUES:
       // 
        */ 
        static void TranslateMessage()
        {
            incomingDictionary = new Dictionary<string, string>();
            
            //Analyze each Template Item against the Incoming Data
            foreach (TemplateItem templateItem in templateItems)
            {
                //Init
                bDeleteMatchedDataFromIncoming = true;
 
                Console.WriteLine("BEFORE: \n" + templateItem.templateBody);
                
                //Trim Preceding and Trailing Indentation and Whitespace
                templateItem.templateBody = templateItem.templateBody.Trim();
                Regex precedingWS = new Regex(@"\n\s+<");
                templateItem.templateBody = precedingWS.Replace(templateItem.templateBody, "\n<");
                Regex trailingWS = new Regex(@"\s+\n");
                templateItem.templateBody = trailingWS.Replace(templateItem.templateBody, "\n");
 
                Console.WriteLine("\n\nAFTER: \n" + templateItem.templateBody);
               
                
                //Generate RegEx Pattern
                SetMessageInDictionary(templateItem);
                                
                if (templateItem.isRepeatable)
                {
                    //Create Regex to match all back-to-back occurrences of the template pattern
                }
                else
                {
                    //Process against incoming data
                    if ("RegEx Match is found" == "RegEx Match found")
                    {
 
                    }
                    else
                    {
                        if (templateItem.isTemplateItemRequired)
                        {
                            
                        }
                    }
                                      
                }
            }
            return;
        }
 
        //Build the Message IN Dictionary keys and values
        static void SetMessageInDictionary(TemplateItem templateItem)
        {
            //Init
            Regex reg;
            String pattern;
            String keyValue, keyName;
            String processedKeyValue, processedKeyName;
            List<String> keys, processedKeys;
            int index;
            String lastMatch = "";
            reg = new Regex("(?<text>[^{}]*)({(?<key>[^}]+)})?"); // .NET Regular Expression matching KeyTemplate Grammar
            keys = new List<String>();
            processedKeys = new List<String>();
            
            // Pattern Start Character
            pattern = "^"; 
            
            //For each RegEx Template Item match in the Template
            foreach (Match match in reg.Matches(templateItem.templateBody))
            {
                //Handle whitespaces in the ValueTemplate
                keyValue = "";
                foreach (char c in match.Groups["text"].Value)
                {
                    if (c != ' ' && c != '\t')
                        keyValue += c + "$$SPACE$$";
                    else
                        keyValue += c;
                }
                keyValue = keyValue.Replace("$$SPACE$$ ", "$$SPACE$$");
 
                //Remove the last white space matcher of the pattern
                if (keyValue.EndsWith("$$SPACE$$"))
                {
                    keyValue = keyValue.Substring(0, keyValue.Length - "$$SPACE$$".Length);
                }
                pattern += keyValue.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)").Replace("$$SPACE$$", "\\s*");
                
                //Extract the key from the match
                if (match.Groups["key"].Value != "")
                {
                    keyName = match.Groups["key"].Value.Replace("+", "\\+").Replace(".", "\\.").Replace("*", "\\*").Replace("?", "\\?").Replace("(", "\\(").Replace("[", "\\[").Replace("]", "\\]").Replace(")", "\\)");
 
                    //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                    if (keys.Contains(keyName))
                    {
                        index = 1;
                        while (keys.Contains(keyName + "_" + index.ToString())) index++;
                        keyName = keyName + "_" + index.ToString();
                    }
                     keys.Add(keyName);
                   
                    //A value may be omitted so make its matcher optional
                    pattern += string.Format("(?<{0}>.*)", keyName);
                    
                    //Set last match for error messages
                    lastMatch = keyName; 
                }
            }
           
            //Allows pattern to look at new line for the same pattern
            pattern += @"(?=\s+|$)";
            Console.WriteLine("Pattern: " + pattern + "\r");
            
            Match mmm = Regex.Match(pattern, @".*?[>+ -]");
            string subPattern = mmm.Groups[0].Value;
            Console.WriteLine("SubPattern: " + subPattern + "\r");
            Regex reSub = new Regex(subPattern + @".*?(?=\r\n(?!" + subPattern + @"))", RegexOptions.Multiline | RegexOptions.Singleline);
            Match mm = reSub.Match(incomingData);
            string newData = mm.Groups[0].Value;
            Console.WriteLine("NewData: " + newData + "\r");
            
            //Value Extractor : Uses the generated Regex to extract values from the input
            reg = new Regex(pattern, RegexOptions.Multiline); //Allows pattern matching to span one or multiple lines
            
            //Trim Rubbish from incoming Data
            incomingData = incomingData.Trim();
            Console.WriteLine("IncomingData before match:\n" + incomingData);
           
            
            
            //Match Validation
            if (!reg.IsMatch(newData))
            {
                //Only error if templateItem is required, otherwise match not required
                if (templateItem.isTemplateItemRequired)
                {
                    throw new Exception("The Message Data Structure differs from that of the Message Template Structure and thus a conversion can not be done between the two. Last Successful Match Key was: " + lastMatch);
                }
            }
 
            //Build Key Values based on matches
            MatchCollection mc = reg.Matches(newData);
            Console.WriteLine("Matches: " + mc.Count.ToString() + "\n");
            if (mc.Count > 0)
            {
                foreach (Match m in mc)
                {
                    for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                    {
                        //Skip item at index 0 as it contains the full match
                        if (gIdx > 0)
                        {
                            processedKeyName = reg.GetGroupNames()[gIdx];
                            processedKeyValue = m.Groups[gIdx].Value;
                            
                            //Find a valid key name for the result dictionary to avoid duplicates when repeating the template
                            if (processedKeys.Contains(processedKeyName))
                            {
                                index = 1;
                                if (templateItem.isRepeatable)
                                {
                                    while (processedKeys.Contains(processedKeyName + "_" + index.ToString() + "|R")) index++;
                                     {
                                        processedKeyName = processedKeyName + "_" + index.ToString() + "|R";
                                     }
                                }
                               else 
                                {
                                    while (processedKeys.Contains(processedKeyName + "_" + index.ToString())) index++;
                                     {
                                        processedKeyName = processedKeyName + "_" + index.ToString();
                                     }
                                }
                            }
                            processedKeys.Add(processedKeyName);
                           
                            //Add to Dictionary
                            incomingDictionary[processedKeyName] = processedKeyValue;
                        }
                     }
                    Console.WriteLine("\nRemoveDataMatchFromIncomingData(" + "0, " + (m.Length) + "\n");
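                    //Strings are immutable, so the result of Trim() must be assigned back
                    incomingData = incomingData.Trim();
                    RemoveDataMatchFromIncomingData(0, m.Length);
                    incomingData = incomingData.Trim();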
 
                    
                    //Remove Matched text from incoming data
                   
                    Console.WriteLine("Pattern to apply to data: " + pattern + "\n");
 
                    
                   
                }
            }
            else
            {
                throw new Exception("Pattern did not match: ( + " + templateItem.templateBody + ").");
            }
 
            return;
        }
 
        static void RemoveDataMatchFromIncomingData(int startPosition, int lengthOfMatch)
        {
            Console.WriteLine("\nIncomingData BEFORE: \r\n");
            Console.WriteLine(incomingData + "\n\n");
            incomingData = incomingData.Remove(startPosition, lengthOfMatch);
            incomingData = incomingData.Trim();
            Console.WriteLine("IncomingData AFTER: \r");
            Console.WriteLine(incomingData + "\n\n");
        }
 
        static void DisplayDictionary(Dictionary<String, String> dictionary)
        {
            Console.WriteLine("---- DICTIONARY DATA ----\n");
            foreach ( String key in dictionary.Keys)
            {
                Console.WriteLine("[" + key + "] [" + dictionary[key] + "]\r");
            }
            Console.WriteLine("\n\n");
        }
    }
}


Author Comment

by:djcheeky
ID: 22774046
Any ideas? I think what I am going to have to do is split this program so that each message type (e.g. XML / EDIFACT etc.) has its own variable extraction function.

I will start posting separate questions for that in the meantime.

Thanks a mill!
Paolo

Expert Comment

by:ddrudik
ID: 22776831
It will take a bit to determine what was changed. A copy-and-paste of the code in 22758089 does not produce the exception about the pattern not matching the data in that last example.

Author Comment

by:djcheeky
ID: 22777010
And what happens when you run the code in 22765101?

Thanks

Expert Comment

by:ddrudik
ID: 22777062
I get an exception stating that the data doesn't match the pattern. That's an exception you throw manually in your code.
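
A possible explanation for the crash on the last record, offered as an assumption rather than something confirmed in this thread: the `reSub` lookahead `(?=\r\n(?!subPattern))` needs a line break after the matched block, and the final record in `incomingData` has none, so `newData` comes back empty and the manual exception fires. The sketch below reproduces that with only the UNP record left. The `RunFinder` class, both method names, and the `\z` alternative are hypothetical and not part of the posted code.

```csharp
using System;
using System.Text.RegularExpressions;

static class RunFinder
{
    const RegexOptions Opts = RegexOptions.Multiline | RegexOptions.Singleline;

    // Lookahead as posted: the run must be followed by "\r\n".
    public static string FirstRunAsPosted(string data, string subPattern)
    {
        Regex reSub = new Regex(subPattern + @".*?(?=\r\n(?!" + subPattern + @"))", Opts);
        return reSub.Match(data).Value;
    }

    // Hypothetical variant: the run may also be ended by the end of input (\z).
    public static string FirstRunOrRest(string data, string subPattern)
    {
        Regex reSub = new Regex(subPattern + @".*?(?=\r\n(?!" + subPattern + @")|\z)", Opts);
        return reSub.Match(data).Value;
    }

    static void Main()
    {
        // What is left of the incoming data once earlier records were removed.
        string remaining = "UNP+P+000001546000PP'";

        // Empty: there is no "\r\n" after the last record, so nothing matches.
        Console.WriteLine("[" + FirstRunAsPosted(remaining, @"^UNP\+") + "]");

        // The whole record: end of input now terminates the run as well.
        Console.WriteLine("[" + FirstRunOrRest(remaining, @"^UNP\+") + "]");
    }
}
```

If this is indeed the cause, it would also explain why removing the UNP record does not help: the second UNT block then becomes the last thing in the data and hits the same missing line break.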

Author Comment

by:djcheeky
ID: 22784209
Yeah, that's what I was trying to avoid, but I am going to close this question now because I am doing things using XML. I will open a new question if anything similar is required.

Thanks for all your help!

Expert Comment

by:ddrudik
ID: 22785271
Thanks for the question and the points.