VB.Net- using an array in a regular expression string

Posted on 2008-11-01
Last Modified: 2013-11-26
I have a task which involves extracting data from a large number of report files (.txt format). Thanks to the help here, I've been able to get the project rolling using regular expressions in vb.net 2003. I've been able to successfully extract data from the files into 3 of the fields I've tried so far, but I have several other fields that are a bit more complex (from my uninitiated point of view) and had questions in those regards.

Can an array be used in a regular expressions string?

The data that I'm trying to extract is regarding the dates that 'Fieldwork' was conducted. The actual fields I need to populate are 'fieldworkstart' and 'fieldworkend', however if I can just get the general 'fieldwork' data extracted, then I can coax the data further from there.

The dates aren't in a standard format, but they are somewhat consistent in the way they were entered. Here are a few examples of how the fieldwork dates appear in several of the files.

The fieldwork was performed between August 1 and 5, 2003
The fieldwork was performed on August 12 and 15, 2003
The fieldwork was conducted by John Doe and Jim Public on August 12th  and 13th, 2003
The fieldwork was performed during July and August, 2003
The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004
The fieldwork was conducted in September and October 2003
The fieldwork was performed on July 24, 2003
The fieldwork was conducted on July 28 through August 15, 2003.  
The fieldwork was performed on July 24 and 25, (NOTE: NO Year entered in this example)
The fieldwork was performed by John Doe on 07 April 2003
The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003

The reason I ask if I can use an array in the regular expression is this: Without knowing a better way to proceed, I had thought to create a 'months' array (january, february...december) to do a regular expression search that looked for the month names as part of the match criteria. From the examples above, here is the data that I want to extract to a file (.csv).

fieldwork August 1 and 5, 2003
fieldwork August 12 and 15, 2003
fieldwork September and October 2003

Which I will then (somehow) break down into the proper fields:
fieldworkstart: August 1
fieldworkend: August 5 2003

fieldworkstart: August 12
fieldworkend: August 15 2003

fieldworkstart: September
fieldworkend: October 2003

If you need a copy of the working code that I'm using for the other fields (thanks ddrudik!) let me know and I'll post that.

Thanks again all!
Question by:montarch
Expert Comment

by:Fernando Soto
ID: 22857970
Hi montarch;

This code snippet should do what you need.

Fernando

Imports System.Text.RegularExpressions
Imports System.Text

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

        Dim testData As String = "The fieldwork was performed between August 1 and 5, 2003" & vbCrLf & _
            "The fieldwork was performed on August 12 and 15, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public on August 12th  and 13th, 2003" & vbCrLf & _
            "The fieldwork was performed during July and August, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004" & vbCrLf & _
            "The fieldwork was conducted in September and October 2003" & vbCrLf & _
            "The fieldwork was performed on July 24, 2003" & vbCrLf & _
            "The fieldwork was conducted on July 28 through August 15, 2003" & vbCrLf & _
            "The fieldwork was performed on July 24 and 25," & vbCrLf & _
            "The fieldwork was performed by John Doe on 07 April 2003" & vbCrLf & _
            "The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003"

        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
            "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"

        Dim sb As StringBuilder = New StringBuilder
        Dim mc As MatchCollection = Regex.Matches(testData, pattern)

        For Each m As Match In mc
            sb.Append(m.Groups("Dates").Value & vbCrLf)
        Next

        MessageBox.Show(sb.ToString())

    End Sub
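
On the original question of using an array: a regular expression pattern is just a string, so an array cannot be placed in it directly, but its elements can be joined with "|" to form the same alternation that is hard coded in the pattern above. The following is only a minimal sketch of that idea. The module name and the variable names months, escaped and monthAlternation are made up for illustration, and the loop style is chosen on the assumption that it needs to compile under VB.NET 2003.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MonthPatternSketch
    Sub Main()
        ' Month names kept in an ordinary array (illustrative name).
        Dim months() As String = {"January", "February", "March", "April", "May", "June", _
            "July", "August", "September", "October", "November", "December"}

        ' Escape each entry so the approach still works if an array
        ' ever contains regex metacharacters such as "." or "(".
        Dim escaped(months.Length - 1) As String
        For i As Integer = 0 To months.Length - 1
            escaped(i) = Regex.Escape(months(i))
        Next

        ' Join the entries with "|" to build the alternation.
        Dim monthAlternation As String = String.Join("|", escaped)

        ' Same pattern as above, with the month list supplied by the array.
        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?(?<Month>" & monthAlternation & "))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"

        Dim m As Match = Regex.Match("The fieldwork was performed on July 24, 2003", pattern)
        Console.WriteLine(m.Groups("Dates").Value)
    End Sub
End Module
```

Building the alternation this way keeps the month list in one place, which may help if abbreviations such as "Aug" or "Sept" turn up in other report files and need to be added.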

Accepted Solution

by:
Fernando Soto earned 500 total points
ID: 22868097
Did this answer your question?

Author Closing Comment

by:montarch
ID: 31512329
Sorry for not getting back sooner- I've been working on this project for days, and into the wee hours.
Excellent code sample you provided for me- I've been trying to integrate it into a console app.
The goal is to get the data from files in a directory, instead of the strings that you hard coded under 'testdata as string'. From there, to write the data to a file.
It works well by itself; however, when I try to add in the other code (regex patterns and such; let me know if you'd like a copy of that code and the problematic code), I'm having a couple of problems.
The original code (that I'm trying to add your sample to) uses a For Each...Next loop to get to the file location and read the files in one at a time. I'm having a problem getting results when I place your code within that For Each...Next loop.

Here's your code, which I pushed into a console app. This works fine, by the way.

Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim testData As String = "The fieldwork was performed between August 1 and 5, 2003" & vbCrLf & _
            "The fieldwork was performed on August 12 and 15, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public on August 12th  and 13th, 2003" & vbCrLf & _
            "The fieldwork was performed during July and August, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004" & vbCrLf & _
            "The fieldwork was conducted in September and October 2003" & vbCrLf & _
            "The fieldwork was performed on July 24, 2003" & vbCrLf & _
            "The fieldwork was conducted on July 28 through August 15, 2003" & vbCrLf & _
            "The fieldwork was performed on July 24 and 25," & vbCrLf & _
            "The fieldwork was performed by John Doe on 07 April 2003" & vbCrLf & _
            "The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003"
        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
            "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
        Dim sb As StringBuilder = New StringBuilder
        Dim mc As MatchCollection = Regex.Matches(testData, pattern)
        Dim sw As StreamWriter = New StreamWriter("c:\1\out\projtest.txt", True)
        Dim fwdatestr As String
        For Each m As Match In mc
            sb.Append(m.Groups("Dates").Value & vbCrLf)
        Next
        fwdatestr = sb.ToString
        sw.WriteLine(fwdatestr)
        sw.Close()
    End Sub
End Module
......................................
And here's the code that I am trying to incorporate your sample into. The problem I'm having is with the MatchCollection For Each loop, sb.Append(m.Groups("Dates").Value & vbCrLf).

I can't get my head wrapped around how to set this up properly. Anyway, here's the other code, which is also working.
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim projfile As String = "C:\1\out\projects.csv"
        Try
            If File.Exists(projfile) Then
                File.Delete(projfile)
                Console.WriteLine("'projects.csv' file found, deleted file.")
            Else
                Console.WriteLine("'projects.csv' file NOT found, NO deleted file.")
            End If
            For Each datafile As String In Directory.GetFiles("c:\1\")
                Dim sr As StreamReader = New StreamReader(datafile)
                Dim filetext As String = sr.ReadToEnd()
                sr.Close()
                Console.WriteLine("Processing: " & datafile)
                Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
                Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
                Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
                Dim mpermitnumber As Match = repermitnumber.Match(filetext)
                Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
                Dim mreportnumber As Match = rereportnumber.Match(filetext)
                Dim sw As StreamWriter = New StreamWriter(projfile, True)
                Dim dataline As String = mpermitnumber.Groups(0).Value & "|" & mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value
                Console.WriteLine("   Writing: " & dataline)
                sw.WriteLine(dataline)
                sw.Close()
            Next
        Catch E As Exception
            Console.WriteLine("An error was encountered:")
            Console.WriteLine(E.Message)
        End Try
    End Sub
End Module


Thanks again for your help.
Expert Comment

by:Fernando Soto
ID: 22904496
Hi montarch;

Below is your code which I have moved some lines around and have added my solution to it. I have also placed some comments in the code.

Fernando

Sub Main()
    Dim projfile As String = "C:\1\out\projects.csv"
    Try
        If File.Exists(projfile) Then
            File.Delete(projfile)
            Console.WriteLine("'projects.csv' file found, deleted file.")
        Else
            Console.WriteLine("'projects.csv' file NOT found, NO deleted file.")
        End If

        ' These Regex objects only need to be created once. Doing it this way improves performance.
        Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
        Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
        Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
            "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
        Dim redatechanged As Regex = New Regex(pattern)

        ' Initialize the StringBuilder
        Dim sb As StringBuilder = New StringBuilder

        For Each datafile As String In Directory.GetFiles("c:\1\")
            ' Set the length of the string in the StringBuilder to 0; this
            ' makes the string within the StringBuilder = String.Empty
            sb.Length = 0
            Dim sr As StreamReader = New StreamReader(datafile)
            Dim filetext As String = sr.ReadToEnd()
            sr.Close()
            Console.WriteLine("Processing: " & datafile)
            Dim mpermitnumber As Match = repermitnumber.Match(filetext)
            Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
            Dim mreportnumber As Match = rereportnumber.Match(filetext)
            Dim sw As StreamWriter = New StreamWriter(projfile, True)
            ' Add the permit number, county and report number to the StringBuilder object
            sb.Append(mpermitnumber.Groups(0).Value & "|" & _
                mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value & vbCrLf)

            ' Append every fieldwork date string found in this file
            Dim mc As MatchCollection = redatechanged.Matches(filetext)
            For Each m As Match In mc
                sb.Append(m.Groups("Dates").Value & vbCrLf)
            Next

            Console.WriteLine(" Writing: " & sb.ToString())
            sw.WriteLine(sb.ToString())
            sw.Close()
        Next
    Catch E As Exception
        Console.WriteLine("An error was encountered:")
        Console.WriteLine(E.Message)
    End Try
End Sub
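
For the later step mentioned in the question, breaking a "Dates" string into fieldworkstart and fieldworkend, one possible approach is sketched below. It is a heuristic, not part of the solution above: the helper name SplitFieldwork, the module name and the MonthNames constant are hypothetical, and it assumes a range is always joined by "and", "through" or a hyphen. A string with no such connector is treated as a single date and returned as both start and end.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FieldworkSplitSketch
    ' Whole-word month names, used to detect whether the end part names its own month.
    Private Const MonthNames As String = _
        "\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\b"

    ' Hypothetical helper: returns a two-element array {start, end}.
    Function SplitFieldwork(ByVal dates As String) As String()
        Dim result(1) As String
        Dim text As String = dates.Trim()

        ' Assumption: ranges are joined by "and", "through" or a hyphen.
        Dim parts() As String = Regex.Split(text, "\s+(?:and|through)\s+|\s*-\s*", RegexOptions.IgnoreCase)

        If parts.Length <> 2 Then
            ' No single connector found: treat the text as one date.
            result(0) = text
            result(1) = text
            Return result
        End If

        result(0) = parts(0)
        result(1) = parts(1)

        ' If the end part has no month of its own, borrow the month from the start part.
        Dim startMonth As Match = Regex.Match(parts(0), MonthNames, RegexOptions.IgnoreCase)
        If startMonth.Success AndAlso Not Regex.IsMatch(parts(1), MonthNames, RegexOptions.IgnoreCase) Then
            result(1) = startMonth.Value & " " & parts(1)
        End If

        Return result
    End Function

    Sub Main()
        Dim samples() As String = {"August 1 and 5, 2003", "September and October 2003", _
            "July 28 through August 15, 2003"}
        For Each sample As String In samples
            Dim fields() As String = SplitFieldwork(sample)
            Console.WriteLine("fieldworkstart: " & fields(0))
            Console.WriteLine("fieldworkend: " & fields(1))
        Next
    End Sub
End Module
```

Punctuation such as the comma before the year is left in place, and the start part never receives a year, which matches the breakdown shown in the question. Entries in other layouts (for example "07 April 2003") fall through to the single-date branch and would need their own handling.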
