montarch

VB.Net- using an array in a regular expression string

I have a task which involves extracting data from a large number of report files (.txt format). Thanks to the help here, I've been able to get the project rolling using regular expressions in VB.NET 2003. I've successfully extracted data from the files for the 3 fields I've tried so far, but several other fields are a bit more complex (from my uninitiated point of view), and I have questions about those.

Can an array be used in a regular expressions string?

The data that I'm trying to extract is regarding the dates that 'Fieldwork' was conducted. The actual fields I need to populate are 'fieldworkstart' and 'fieldworkend', however if I can just get the general 'fieldwork' data extracted, then I can coax the data further from there.

The dates aren't in a standard format, but they are somewhat consistent in the manner that they were entered. Here are a few examples of how the fieldwork dates occur in several of the files.

The fieldwork was performed between August 1 and 5, 2003
The fieldwork was performed on August 12 and 15, 2003
The fieldwork was conducted by John Doe and Jim Public on August 12th  and 13th, 2003
The fieldwork was performed during July and August, 2003
The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004
The fieldwork was conducted in September and October 2003
The fieldwork was performed on July 24, 2003
The fieldwork was conducted on July 28 through August 15, 2003.  
The fieldwork was performed on July 24 and 25, (NOTE: NO Year entered in this example)
The fieldwork was performed by John Doe on 07 April 2003
The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003

The reason I ask if I can use an array in the regular expression is this: Without knowing a better way to proceed, I had thought to create a 'months' array (january, february...december) to do a regular expression search that looked for the month names as part of the match criteria. From the examples above, here is the data that I want to extract to a file (.csv).

fieldwork August 1 and 5, 2003
fieldwork August 12 and 15, 2003
fieldwork September and October 2003

Which I will then (somehow) break down into the proper fields:
fieldworkstart: August 1
fieldworkend: August 5 2003

fieldworkstart: August 12
fieldworkend: August 15 2003

fieldworkstart: September
fieldworkend: October 2003
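
A minimal sketch of that breakdown, assuming the extracted text always has the shape "<Month> [day] and|through [Month] [day][,] [year]". The names FieldworkSplitDemo, SplitRange and rangePattern are hypothetical, and forms such as "March 16-29", "12th" or "07 April 2003" are not handled.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FieldworkSplitDemo
    ' Assumed shape: "<Month> [day] and|through [Month] [day][,] [year]".
    ' \b after each day keeps a day group from swallowing part of the year.
    Private ReadOnly rangePattern As Regex = New Regex( _
        "^(?<M1>[A-Za-z]+)(?:\s+(?<D1>\d{1,2})\b)?\s+(?:and|through)\s+" & _
        "(?<M2>[A-Za-z]+)?\s*(?:(?<D2>\d{1,2})\b)?,?\s*(?<Y>\d{4})?$", _
        RegexOptions.IgnoreCase)

    ' Returns a two-element array {start, end}, or Nothing when the text does not fit.
    Function SplitRange(ByVal dates As String) As String()
        Dim m As Match = rangePattern.Match(dates.Trim())
        If Not m.Success Then Return Nothing

        Dim startMonth As String = m.Groups("M1").Value
        ' When the end month is omitted ("August 1 and 5"), reuse the start month.
        Dim endMonth As String = startMonth
        If m.Groups("M2").Success Then endMonth = m.Groups("M2").Value

        ' Trim removes the separator left behind by an empty optional group.
        Dim startText As String = (startMonth & " " & m.Groups("D1").Value).Trim()
        Dim endText As String = (endMonth & " " & m.Groups("D2").Value).Trim()
        endText = (endText & " " & m.Groups("Y").Value).Trim()
        Return New String() {startText, endText}
    End Function

    Sub Main()
        Dim samples() As String = {"August 1 and 5, 2003", _
            "August 12 and 15, 2003", "September and October 2003"}
        For Each sample As String In samples
            Dim parts() As String = SplitRange(sample)
            If Not parts Is Nothing Then
                Console.WriteLine("fieldworkstart: " & parts(0))
                Console.WriteLine("fieldworkend: " & parts(1))
            End If
        Next
    End Sub
End Module
```

Returning Nothing for text that does not fit lets the caller log the file for manual review instead of writing a wrong date.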

If you need a copy of the working code that I'm using for the other fields (thanks ddrudik!) let me know and I'll post that.

Thanks again all!
Fernando Soto

Hi montarch;

This code snippet should do what you need.

Fernando

Imports System.Text.RegularExpressions
Imports System.Text
 
 
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
 
        Dim testData As String = "The fieldwork was performed between August 1 and 5, 2003" & vbCrLf & _
            "The fieldwork was performed on August 12 and 15, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public on August 12th  and 13th, 2003" & vbCrLf & _
            "The fieldwork was performed during July and August, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004" & vbCrLf & _
            "The fieldwork was conducted in September and October 2003" & vbCrLf & _
            "The fieldwork was performed on July 24, 2003" & vbCrLf & _
            "The fieldwork was conducted on July 28 through August 15, 2003" & vbCrLf & _
            "The fieldwork was performed on July 24 and 25," & vbCrLf & _
            "The fieldwork was performed by John Doe on 07 April 2003" & vbCrLf & _
            "The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003"
        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
            "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
        Dim sb As StringBuilder = New StringBuilder
        Dim mc As MatchCollection = Regex.Matches(testData, pattern)
 
        For Each m As Match In mc
            sb.Append(m.Groups("Dates").Value & vbCrLf)
        Next
 
        MessageBox.Show(sb.ToString())
 
    End Sub
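
On the original question of using an array: the regex syntax has no array construct, but the month alternation above can be assembled from an array with String.Join before the Regex is created. The following is a sketch only; BuildPattern and MonthPatternDemo are hypothetical names, and it assumes the array holds plain words (for entries containing regex metacharacters, each one would first need to go through Regex.Escape).

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MonthPatternDemo
    ' Builds the same fieldwork pattern, taking the month names from an array.
    Function BuildPattern(ByVal months() As String) As String
        ' String.Join turns the array into "January|February|...|December".
        Dim alternation As String = String.Join("|", months)
        Return "(?im)(?=.*?fieldwork\s)(?=.*?(?<Month>" & alternation & "))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
    End Function

    Sub Main()
        Dim months() As String = {"January", "February", "March", "April", _
            "May", "June", "July", "August", "September", "October", _
            "November", "December"}
        Dim sample As String = "The fieldwork was performed on July 24, 2003"
        Dim m As Match = Regex.Match(sample, BuildPattern(months))
        If m.Success Then
            ' Writes the text captured by the Dates group.
            Console.WriteLine(m.Groups("Dates").Value)
        End If
    End Sub
End Module
```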


montarch

ASKER

Sorry for not getting back sooner. I've been working on this project for days, and into the wee hours.
Excellent code sample you provided for me. I've been trying to integrate it into a console app.
The goal is to get the data from files in a directory, instead of the strings that you hard-coded under 'testData As String', and from there to write the data to a file.
It works well by itself; however, when I try to add in the other code (regex patterns and such; let me know if you'd like a copy of that code and the problematic code), I'm having a couple of problems.
The original code (that I'm trying to add your sample to) uses a For Each...Next loop to get to the file location and read the files in one at a time. I'm having a problem getting results when I place your code within that For Each...Next loop.

Here's your code, which I pushed into a console app. This works fine, by the way.

Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim testData As String = "The fieldwork was performed between August 1 and 5, 2003" & vbCrLf & _
            "The fieldwork was performed on August 12 and 15, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public on August 12th  and 13th, 2003" & vbCrLf & _
            "The fieldwork was performed during July and August, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004" & vbCrLf & _
            "The fieldwork was conducted in September and October 2003" & vbCrLf & _
            "The fieldwork was performed on July 24, 2003" & vbCrLf & _
            "The fieldwork was conducted on July 28 through August 15, 2003" & vbCrLf & _
            "The fieldwork was performed on July 24 and 25," & vbCrLf & _
            "The fieldwork was performed by John Doe on 07 April 2003" & vbCrLf & _
            "The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003"
        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
            "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
        Dim sb As StringBuilder = New StringBuilder
        Dim mc As MatchCollection = Regex.Matches(testData, pattern)
        Dim sw As StreamWriter = New StreamWriter("c:\1\out\projtest.txt", True)
        Dim fwdatestr As String
        For Each m As Match In mc
            sb.Append(m.Groups("Dates").Value & vbCrLf)
        Next
        fwdatestr = sb.ToString
        sw.WriteLine(fwdatestr)
        sw.Close()
    End Sub
End Module
......................................
And here's the code that I am trying to incorporate your sample into. The problem I'm having is with the For Each over the MatchCollection, i.e. sb.Append(m.Groups("Dates").Value & vbCrLf).

I can't get my head wrapped around how to set this up properly. Anyway, here's the other code, which is also working.
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim projfile As String = "C:\1\out\projects.csv"
        Try
            If File.Exists(projfile) Then
                File.Delete(projfile)
                Console.WriteLine("'projects.csv' file found, deleted file.")
            Else
                Console.WriteLine("'projects.csv' file NOT found, NO deleted file.")
            End If
            For Each datafile As String In Directory.GetFiles("c:\1\")
                Dim sr As StreamReader = New StreamReader(datafile)
                Dim filetext As String = sr.ReadToEnd()
                sr.Close()
                Console.WriteLine("Processing: " & datafile)
                Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
                Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
                Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
                Dim mpermitnumber As Match = repermitnumber.Match(filetext)
                Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
                Dim mreportnumber As Match = rereportnumber.Match(filetext)
                Dim sw As StreamWriter = New StreamWriter(projfile, True)
                Dim dataline As String = mpermitnumber.Groups(0).Value & "|" & mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value
                Console.WriteLine("   Writing: " & dataline)
                sw.WriteLine(dataline)
                sw.Close()
            Next
        Catch E As Exception
            Console.WriteLine("An error was encountered:")
            Console.WriteLine(E.Message)
        End Try
    End Sub
End Module


Thanks again for your help.
Hi montarch;

Below is your code, in which I have moved some lines around and added my solution. I have also placed some comments in the code.

Fernando

Sub Main()
	Dim projfile As String = "C:\1\out\projects.csv"
	Try
		If File.Exists(projfile) Then
			File.Delete(projfile)
			Console.WriteLine("'projects.csv' file found, deleted file.")
		Else
			Console.WriteLine("'projects.csv' file NOT found, NO deleted file.")
		End If
 
		' These Regex objects only need to be created once. Doing it this way improves performance
		Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
		Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
		Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
		Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
			"(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
			"(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
		Dim redatechanged As Regex = New Regex(pattern)
 
		' Initialize the StringBuilder
		Dim sb As StringBuilder = New StringBuilder
 
		For Each datafile As String In Directory.GetFiles("c:\1\")
			' Set the length of the string in the StringBuilder to 0; this
			' makes the string within the StringBuilder = String.Empty
			sb.Length = 0
			Dim sr As StreamReader = New StreamReader(datafile)
			Dim filetext As String = sr.ReadToEnd()
			sr.Close()
			Console.WriteLine("Processing: " & datafile)
			Dim mpermitnumber As Match = repermitnumber.Match(filetext)
			Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
			Dim mreportnumber As Match = rereportnumber.Match(filetext)
			Dim sw As StreamWriter = New StreamWriter(projfile, True)
			' Adding this to the StringBuilder object
			sb.Append(mpermitnumber.Groups(0).Value & "|" & _
				mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value & vbCrLf)
			
			Dim mc As MatchCollection = redatechanged.Matches(filetext)
			For Each m As Match In mc
				sb.Append(m.Groups("Dates").Value & vbCrLf)
			Next
			
			Console.WriteLine(" Writing: " & sb.ToString())
			sw.WriteLine(sb.ToString())
			sw.Close()
		Next
	Catch E As Exception
		Console.WriteLine("An error was encountered:")
		Console.WriteLine(E.Message)
	End Try
End Sub
