C#: RegEx for splitting lines of text

trevor1940
Hi, I need some help refining my regex.

I need to split the example text on the sense numbers, e.g. ". 2 ". However, I also need to split on "--n. 5 " and keep the "n."
Note: I first split on ". 1 " because the preceding lexical label doesn't have "--".

What I'm getting is every number, which I don't need!
I do need the lexical label, if there is one, and the sentence.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        Regex SentanceSpit = new Regex(@"\.(\s+\d+\s+)|\.--([a-z]+\.)\s+\d+\s+");
        string Line = @"abandon v. 1 give up or over, yield, surrender, leave, cede, let go, deliver (up), turn over, relinquish: I can see no reason why we should abandon the house to thieves and vandals. 2 depart from, leave, desert, quit, go away from: The order was given to abandon ship. 3 desert, forsake, jilt, walk out on: He even abandoned his fianc,e. 4 give up, renounce; discontinue, forgo, drop, desist, abstain from: She abandoned cigarettes and whisky after the doctor's warning.--n. 5 recklessness, intemperance, wantonness, lack of restraint, unrestraint: He behaved with wild abandon after he received the inheritance.";
        // Output strings
		string Term;
		string Lexical; // not every entry has a different Lexical
        string[] WordsExample;
        string[] Words;
        string Example;
		string[] FirstSecond = Regex.Split(Line, @"\s1\s");
		if (FirstSecond.Length ==2)
		{
			string First = FirstSecond[0];
			int idx = First.LastIndexOf(" ");
			Term = First.Substring(0, idx);
			Lexical = First.Substring(idx + 1);

			Console.WriteLine("Term: {0}, Lexical {1}", Term, Lexical);

			string Second = FirstSecond[1];


			string[] Parts = Regex.Split(Second, SentanceSpit.ToString());
			for (int i = 0; i < Parts.Length; i++)
			{
				// it's a number
				int sInt = 0;
				if (int.TryParse(Parts[i], out sInt))
				{
					continue;
				}
				else if (!Parts[i].ToString().Contains(":"))
				{
					Lexical = Parts[i].ToString();
					Console.WriteLine("New Lex {0}", Lexical);
				}
				else
				{
					WordsExample = Parts[i].Split(":");
					Words = WordsExample[0].Split(",");
					// attach word list to Thesauri

					Example = WordsExample[1];
					Console.WriteLine("Example: {0}", Example);

				}
			}
		} // end FirstSecond

    }
}
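
A side note on why every number shows up: Regex.Split includes the text of any capturing group in the array it returns, and the pattern above wraps the number in one. The sketch below is only an illustration under that assumption; the class and method names (SplitDemo, SplitKeepNumbers, SplitDropNumbers, SplitKeepLexical) are made up for this example and are not part of the original code. It compares a capturing group, a non-capturing group, and a variant that captures only the lexical label after "--".

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Capturing group around the number: Regex.Split returns the number as its own element.
    public static string[] SplitKeepNumbers(string text)
    {
        return Regex.Split(text, @"\.\s+(\d+)\s+");
    }

    // Non-capturing group (?: ... ): the number is part of the separator and is dropped.
    public static string[] SplitDropNumbers(string text)
    {
        return Regex.Split(text, @"\.\s+(?:\d+)\s+");
    }

    // Only the lexical label after "--" sits in a capturing group,
    // so the label is kept while the number is not.
    public static string[] SplitKeepLexical(string text)
    {
        return Regex.Split(text, @"\.(?:--([a-z]+\.))?\s+\d+\s+");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", SplitKeepNumbers("one. 2 two. 3 three")));
        Console.WriteLine(string.Join(" | ", SplitDropNumbers("one. 2 two. 3 three")));
        Console.WriteLine(string.Join(" | ", SplitKeepLexical("one. 2 two.--n. 3 three")));
    }
}
```

With a split pattern shaped like the last one, the int.TryParse check in the loop should no longer be needed, because no number is ever captured.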

Fernando SotoRetired
Distinguished Expert 2017

Commented:
Hi trevor1940;

Can you attach a sample file of the data you wish to parse and what you need from each line.

Thanks
Fernando

Author

Commented:
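I gave an example above, but a sample file is attached. Note: one line in the sample needs treating differently; FYI, I'm handling that in the code below.

I need the term and the first lexical label, then for each number the list of alternate words and the example, then the next lexical label (marked with "--") with its own numbered entries, in this shape:

Term 1st Lexical. 1  List of alternate words: the example.  --next Lexical. 2  List of alternate words: 2nd example.

sample.txt

Eventually I shall organize this into a class list, but for now I just need to sort out the regex above, unless there is a better way.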

....
                } // end FirstSecond
                else
                {
                    int FirstDot = Line.IndexOf(".");
                    int Space = Line.LastIndexOf(" ", FirstDot);
                    string term = Line.Substring(0, Space);
                    Lexical = Line.Substring(Space, FirstDot - Space +1);
                    string Sentance = Line.Substring(FirstDot + 1, Line.Length - 1 - FirstDot).Trim() ;
                   // Sentance.Trim();
                    WordsExample = Sentance.Split(":");
                    Words = WordsExample[0].Split(",");
                    // attach word list to Thesauri

                    Example = WordsExample[1];
                    Console.WriteLine("Term {0}, Lex {1}, Example: {2}",term, Lexical,  Example);
                }


Fernando SotoRetired
Distinguished Expert 2017

Commented:
Hi trevor1940;

The code snippet below should do what you are looking for.
// Regex pattern to parse the file
var pattern = @"(?<Term>.+?)\s+(?<Lexical>(v|adj|adv|n)\.)([^:]+:\s+(?<Example>[^\.]+\.)(?<Lexical2>(--v|--adj|--adv|--n|--prep)\.)?)+";
// Modify this line with the path to the sample.txt file
var input = System.IO.File.ReadLines(@"Path to the sample.txt file");

// Process each line of the file
foreach(string line in input)
{
    // Parse the line of the file
    MatchCollection matches = Regex.Matches(line, pattern);
    // Look for a Lex change
    bool newLex = true;
    
    // Process each match found on the line
    foreach (Match match in matches)
    {
        // Get the parsed values
        GroupCollection groups = match.Groups;
        string lex2 = groups["Lexical2"].Value;
        int lex2Idx = groups["Lexical2"].Index;
        string term = groups["Term"].Value;
        string lex = groups["Lexical"].Value;
        Console.WriteLine("Term {0}, Lex {1}", term, lex);

        // The Examples are a collection 
        foreach (Capture capture in groups["Example"].Captures)
        {
            if ( lex2Idx == 0 || capture.Index < lex2Idx )
            {
                Console.WriteLine("\t\tExample: {0}", capture.Value);
            }
            else
            {
                if (lex2Idx > 0)
                {
                    if (newLex == true && lex2Idx != 0)
                    {
                        newLex = false;
                        Console.WriteLine("\tLex {0}", lex2);
                    }
                    Console.WriteLine("\t\tExample: {0}", capture.Value);
                }
            }
        }
    }
}
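
The snippet above relies on a .NET feature that is easy to miss: when a named group sits inside a repeated construct, Group.Value holds only the last capture, while Group.Captures holds every capture in order. A minimal sketch of that behaviour follows; the "Item" group, the CapturesDemo class and the AllItems helper are names invented for this illustration, not part of the answer's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CapturesDemo
{
    // The Item group is repeated by the outer (?: ... )+ so it captures once per list entry.
    private const string ListPattern = @"^(?:(?<Item>\w+),?)+$";

    // Hypothetical helper: returns every value captured by the repeated Item group.
    public static string[] AllItems(string input)
    {
        Match m = Regex.Match(input, ListPattern);
        CaptureCollection captures = m.Groups["Item"].Captures;
        string[] items = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
        {
            items[i] = captures[i].Value;
        }
        return items;
    }

    // Hypothetical helper: returns what Group.Value reports, which is the last capture only.
    public static string LastItem(string input)
    {
        return Regex.Match(input, ListPattern).Groups["Item"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(LastItem("a,b,c"));                    // c
        Console.WriteLine(string.Join(" ", AllItems("a,b,c")));  // a b c
    }
}
```

The same applies to Lexical2 above: its Value and Index describe only the last "--" label matched on a line. Group.Success is also available as a direct test for whether an optional group took part in the match.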



These are the results for the sample.txt file you posted.
Term abandon, Lex v.
    Example: I can see no reason why we should abandon the house to thieves and vandals.
    Example: The order was given to abandon ship.
    Example: He even abandoned his fianc,e.
    Example: She abandoned cigarettes and whisky after the doctor's warning.
  Lex --n.
    Example: He behaved with wild abandon after he received the inheritance.
Term abandoned, Lex adj.
    Example: An abandoned infant was found on the church steps.
    Example: His abandoned behaviour soon landed him in jail.
Term about, Lex adv.
    Example: Gather about, for I have something to tell you.
    Example: In 1685 London had been, for about half a century, the most populous capital in Europe.
    Example: He wandered about aimlessly for several days.
    Example: My papers were scattered about as if a tornado had struck.
    Example: There is a lot of flu about this year.
    Example: It is about time you telephoned your mother.
  Lex --prep.
    Example: There is a railing about the monument.
    Example: Please look about the room for my hat.
    Example: There were a lot of trees about the garden.
    Example: I am sorry, but I haven't my cheque-book about me.
    Example: He wrote a book about the Spanish,Armada.
Term charter, Lex n.
    Example: This year we again commemorate the signing of the United Nations charter.
    Example: He was given an exclusive charter to export furs in 1679.
    Example: We have the yacht under charter for the summer.
  Lex --v.
    Example: He is a chartered accountant, she a chartered surveyor.
    Example: I chartered the sloop for three weeks.
Term chase, Lex n.
    Example: Police dogs entered the chase and the prisoner was finally caught.
    Example: The police were chasing a man down the street.
    Example: I chased the cat away from the birdcage.
Term chaste, Lex adj.
    Example: Only a knight who was wholly chaste would find the Grail.
    Example: In some respects, modern architecture emulates the chaste style of the ancient Egyptians.
Term zone, Lex n.
    Example: A duty-free zone will allow for quicker transshipment of goods.
Term zoo, Lex n.
    Example: When I was a child, I enjoyed going to the zoo almost as much as I do now.
    Example: When my husband and the three children get ready in the morning the kitchen is like a zoo.



Fernando

Author

Commented:
Hi, thanks.
Erm, you've missed out the comma-separated list of alternate words.

I tried this, but it's too greedy:

(?<Term>.+?)\s+(?<Lexical>(v|adj|adv|n)\.)(?<AltWords>(\D+))([^:]+:\s+(?<Example>[^\.]+\.)(?<Lexical2>(--v|--adj|--adv|--n|--prep)\.)?)+


Fernando SotoRetired
Distinguished Expert 2017

Commented:
Hi trevor1940;

Please post what the output should look like.

Author

Commented:
Something like this

Term abandon, Lex v.

    AltWord(s): give up or over, yield, surrender, leave, cede, let go, deliver (up), turn over, relinquish 
    Example: I can see no reason why we should abandon the house to thieves and vandals.
    AltWord(s): depart from, leave, desert, quit, go away from
    Example: The order was given to abandon ship.
    AltWord(s): desert, forsake, jilt, walk out on
    Example: He even abandoned his fianc,e.
    AltWord(s): give up, renounce; discontinue
    Example: She abandoned cigarettes and whisky after the doctor's warning.



Apologies if this wasn't clear.
Retired
Distinguished Expert 2017
Commented:
Hi trevor1940;

This code snippet should do what you need.
var pattern = @"(?<Term>.+?)\s((?<Lexical>(v|adj|adv|n|--v|--adj|--adv|--n|--prep)\.)?\s\d\s(?<Words>[^:]+):\s(?<Example>[^\.]+).)+";
// Modify this line to point to where your sample.txt file is
var input = System.IO.File.ReadLines(@"C:\Working Directory\sample.txt");

foreach(string line in input)
{
    MatchCollection matches = Regex.Matches(line, pattern);

    foreach (Match match in matches) {
        // Get the collection of matched groups
        GroupCollection groups = match.Groups;
        // Get the Term for the current line
        string term = groups["Term"].Value;
        // Get the collection of the Lexical for this line
        CaptureCollection lex = groups["Lexical"].Captures;
        // Holds the next Lexical index value
        List<int> lexBoundry = new List<int>();
        for(var idx = 0; idx < lex.Count; idx++) {
            // Skip the first Lexical; we only need the positions where each later Lexical starts
            if(idx == 0) continue;
            lexBoundry.Add(lex[idx].Index);
        }
        // Points to the current Lexical string value in the collection
        int currentLexIdx = 0;
        // Get all the alternate word lists
        CaptureCollection words = groups["Words"].Captures;
        // Get all the Examples
        CaptureCollection examples = groups["Example"].Captures;
        
        Console.WriteLine("Term {0}, Lex {1}", term, lex[0].Value);

        for(int idx = 0; idx < words.Count; idx++) {
            // If true we are starting a new Lexical
            if(lexBoundry.Count != 0 && words[idx].Index > lexBoundry[0]) {
                // Remove the current Lex index so that lexBoundry[0] always point to the next one, an int value.
                lexBoundry.RemoveAt(0);
                // Point to the next lexical string value 
                currentLexIdx++;
                Console.WriteLine("\tLex {0}", lex[currentLexIdx].Value);
            }
            Console.WriteLine("\t\tAltWord(s): {0}\n\t\tExample: {1}", words[idx].Value, examples[idx].Value);
        }        
    }
}
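
Since the plan is to organize the results into a class list eventually, here is one rough way that could look. This is a sketch under stated assumptions, not a tested solution for the full file: the Sense class, its fields and ThesaurusParser.Parse are hypothetical names, the pattern is the one above with the final dot escaped and \d+ swapped in for \d so that sense numbers of 10 and higher would also match, and the sample line in Main is a shortened, made-up version of the "abandon" entry.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one numbered sense; names and fields are illustrative only.
public class Sense
{
    public string Term;
    public string Lexical;
    public string[] Words;
    public string Example;
}

public static class ThesaurusParser
{
    // Assumption: same structure as the pattern above, with "\." at the end and "\d+" for the number.
    private static readonly Regex Pattern = new Regex(
        @"(?<Term>.+?)\s((?<Lexical>(v|adj|adv|n|--v|--adj|--adv|--n|--prep)\.)?\s\d+\s(?<Words>[^:]+):\s(?<Example>[^\.]+)\.)+");

    public static List<Sense> Parse(string line)
    {
        var senses = new List<Sense>();
        Match match = Pattern.Match(line);
        if (!match.Success) return senses;

        CaptureCollection lex = match.Groups["Lexical"].Captures;
        CaptureCollection words = match.Groups["Words"].Captures;
        CaptureCollection examples = match.Groups["Example"].Captures;

        int lexIdx = 0;
        for (int i = 0; i < words.Count; i++)
        {
            // Move to the next lexical label once it starts before the current word list.
            while (lexIdx + 1 < lex.Count && lex[lexIdx + 1].Index < words[i].Index)
            {
                lexIdx++;
            }

            // Split the alternate words on commas and trim the surrounding spaces.
            var altWords = new List<string>();
            foreach (string w in words[i].Value.Split(','))
            {
                altWords.Add(w.Trim());
            }

            senses.Add(new Sense
            {
                Term = match.Groups["Term"].Value,
                Lexical = lex.Count > 0 ? lex[lexIdx].Value.TrimStart('-') : "",
                Words = altWords.ToArray(),
                Example = examples[i].Value   // the pattern leaves the closing period out
            });
        }
        return senses;
    }

    public static void Main()
    {
        string line = "abandon v. 1 give up, yield: I can see no reason. 2 depart from, leave: The order was given.--n. 3 recklessness: He behaved with wild abandon.";
        foreach (Sense s in Parse(line))
        {
            Console.WriteLine("{0} [{1}] {2} | {3}", s.Term, s.Lexical, string.Join("; ", s.Words), s.Example);
        }
    }
}
```

Word lists that mix commas and semicolons (such as "give up, renounce; discontinue") are split on commas only here, so that case would need its own handling.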


Author

Commented:
Wow, thanks very much for your help. I've learnt a lot, especially about matched groups.
Fernando SotoRetired
Distinguished Expert 2017

Commented:
Not a problem trevor1940, glad to help.
