montarch
asked on
VB.Net- using an array in a regular expression string
I have a task which involves extracting data from a large number of report files (.txt format). Thanks to the help here, I've been able to get the project rolling using regular expressions in vb.net 2003. I've been able to successfully extract data from the files into 3 of the fields I've tried so far, but I have several other fields that are a bit more complex (from my uninitiated point of view) and had questions in those regards.
Can an array be used in a regular expressions string?
The data that I'm trying to extract is regarding the dates that 'Fieldwork' was conducted. The actual fields I need to populate are 'fieldworkstart' and 'fieldworkend', however if I can just get the general 'fieldwork' data extracted, then I can coax the data further from there.
The dates aren't in a standard format, but they are somewhat consistent in the manner in which they were entered. Here are a few examples of the fieldwork and date occurrences from several of the files.
The fieldwork was performed between August 1 and 5, 2003
The fieldwork was performed on August 12 and 15, 2003
The fieldwork was conducted by John Doe and Jim Public on August 12th and 13th, 2003
The fieldwork was performed during July and August, 2003
The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004
The fieldwork was conducted in September and October 2003
The fieldwork was performed on July 24, 2003
The fieldwork was conducted on July 28 through August 15, 2003.
The fieldwork was performed on July 24 and 25, (NOTE: NO Year entered in this example)
The fieldwork was performed by John Doe on 07 April 2003
The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003
The reason I ask if I can use an array in the regular expression is this: Without knowing a better way to proceed, I had thought to create a 'months' array (january, february...december) to do a regular expression search that looked for the month names as part of the match criteria. From the examples above, here is the data that I want to extract to a file (.csv).
fieldwork August 1 and 5, 2003
fieldwork August 12 and 15, 2003
fieldwork September and October 2003
Which I will then (somehow) break down into the proper fields:
fieldworkstart: August 1
fieldworkend: August 5 2003
fieldworkstart: August 12
fieldworkend: August 15 2003
fieldworkstart: September
fieldworkend: October 2003
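The breakdown into 'fieldworkstart' and 'fieldworkend' could be done with a second regular expression applied to the extracted text. What follows is only a rough sketch: the module name FieldworkSplitSketch and the group names Month, Day1, Day2 and Year are made up for illustration, and the pattern only covers the single-month, two-day form (for example "August 1 and 5, 2003"), not the month-to-month examples.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FieldworkSplitSketch
    Sub Main()
        ' Hypothetical pattern: Month, first day, a joiner (and / through / -),
        ' second day, and an optional four-digit year.
        Dim re As Regex = New Regex( _
            "(?<Month>[A-Z][a-z]+)\s+(?<Day1>\d{1,2})(?:st|nd|rd|th)?" & _
            "\s*(?:and|through|-)\s*(?<Day2>\d{1,2})(?:st|nd|rd|th)?" & _
            "(?:,?\s*(?<Year>\d{4}))?")

        Dim m As Match = re.Match("August 1 and 5, 2003")
        If m.Success Then
            ' Both fields reuse the single month name that was captured.
            Dim startText As String = m.Groups("Month").Value & " " & m.Groups("Day1").Value
            Dim endText As String = m.Groups("Month").Value & " " & m.Groups("Day2").Value
            ' The year is optional because some reports leave it out.
            If m.Groups("Year").Success Then
                endText = endText & " " & m.Groups("Year").Value
            End If
            Console.WriteLine("fieldworkstart: " & startText)
            Console.WriteLine("fieldworkend: " & endText)
        End If
    End Sub
End Module
```

For the sample string this should print "fieldworkstart: August 1" and "fieldworkend: August 5 2003". Forms such as "July 28 through August 15, 2003" or "September and October 2003" would need their own patterns.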
If you need a copy of the working code that I'm using for the other fields (thanks ddrudik!) let me know and I'll post that.
Thanks again all!
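On the array question itself: a regular expression pattern is just a string, so one way to "use an array" is to join its elements with "|" and splice the result into the pattern. A minimal sketch, assuming a hypothetical module name MonthPatternSketch and a deliberately simplified pattern (a month name plus the rest of the line), which is not the pattern from the accepted solution:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MonthPatternSketch
    Sub Main()
        ' The 'months' array described in the question.
        Dim months() As String = {"January", "February", "March", "April", _
            "May", "June", "July", "August", "September", "October", _
            "November", "December"}

        ' Join the names with "|" so they form one alternation group.
        Dim monthGroup As String = "(?:" & String.Join("|", months) & ")"

        ' Simplified, hypothetical pattern: a month name and the rest of the line.
        Dim re As Regex = New Regex("\b" & monthGroup & "\b[^\r\n]*", RegexOptions.IgnoreCase)

        Dim sample As String = "The fieldwork was performed between August 1 and 5, 2003"
        Dim m As Match = re.Match(sample)
        If m.Success Then
            Console.WriteLine("fieldwork " & m.Value)
        End If
    End Sub
End Module
```

For the sample line this should print "fieldwork August 1 and 5, 2003". Month names contain no regex metacharacters, so no escaping is needed here; an array of arbitrary strings would have to be passed through Regex.Escape first. This simple form would also miss a day written before the month, as in "07 April 2003".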
ASKER CERTIFIED SOLUTION
ASKER
Sorry for not getting back sooner- I've been working on this project for days, and into the wee hours.
Excellent code sample you provided for me- I've been trying to integrate it into a console app.
The goal is to get the data from files in a directory, instead of the strings that you hard coded under 'testdata as string'. From there, to write the data to a file.
It works well by itself; however, when I try to add in the other code (regex patterns and such; let me know if you'd like a copy of that code and the problematic code), I'm having a couple of problems.
The original code (that I'm trying to add your sample to) uses a For Each...Next loop to get to the file location and read the files in one at a time. I'm having a problem getting results when I place your code within that For Each...Next loop.
Here's your code, which I pushed into a console app. This works fine, by the way.
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim testData As String = "The fieldwork was performed between August 1 and 5, 2003" & vbCrLf & _
            "The fieldwork was performed on August 12 and 15, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public on August 12th and 13th, 2003" & vbCrLf & _
            "The fieldwork was performed during July and August, 2003" & vbCrLf & _
            "The fieldwork was conducted by John Doe and Jim Public between March 16-29, 2004" & vbCrLf & _
            "The fieldwork was conducted in September and October 2003" & vbCrLf & _
            "The fieldwork was performed on July 24, 2003" & vbCrLf & _
            "The fieldwork was conducted on July 28 through August 15, 2003" & vbCrLf & _
            "The fieldwork was performed on July 24 and 25," & vbCrLf & _
            "The fieldwork was performed by John Doe on 07 April 2003" & vbCrLf & _
            "The fieldwork was supervised by John Doe and assisted by Jim Public from April 14 through 29, 2003"
        ' The named groups Month and Dates are what \k<Month> and m.Groups("Dates") refer to.
        Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
            "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
            "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
        Dim sb As StringBuilder = New StringBuilder
        Dim mc As MatchCollection = Regex.Matches(testData, pattern)
        Dim sw As StreamWriter = New StreamWriter("c:\1\out\projtest.txt", True)
        Dim fwdatestr As String
        ' Collect the captured date text from every match, one per line.
        For Each m As Match In mc
            sb.Append(m.Groups("Dates").Value & vbCrLf)
        Next
        fwdatestr = sb.ToString
        sw.WriteLine(fwdatestr)
        sw.Close()
    End Sub
End Module
..........................
And here's the code that I am trying to incorporate your sample into. The problem I'm having is with the For Each loop over the MatchCollection, sb.Append(m.Groups("Dates").Value & vbCrLf).
I can't get my head wrapped around how to get this set up properly. Anyway, here's the other code, which is also working.
Imports System.IO
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim projfile As String = "C:\1\out\projects.csv"
        Try
            ' Start from a clean output file on every run.
            If File.Exists(projfile) Then
                File.Delete(projfile)
                Console.WriteLine("'projects.csv' file found, deleted file.")
            Else
                Console.WriteLine("'projects.csv' file NOT found, NO deleted file.")
            End If
            For Each datafile As String In Directory.GetFiles("c:\1\")
                ' Read the whole report file into memory.
                Dim sr As StreamReader = New StreamReader(datafile)
                Dim filetext As String = sr.ReadToEnd()
                sr.Close()
                Console.WriteLine("Processing: " & datafile)
                Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
                Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
                Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
                Dim mpermitnumber As Match = repermitnumber.Match(filetext)
                Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
                Dim mreportnumber As Match = rereportnumber.Match(filetext)
                ' Append one pipe-delimited line per file to the output.
                Dim sw As StreamWriter = New StreamWriter(projfile, True)
                Dim dataline As String = mpermitnumber.Groups(0).Value & "|" & mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value
                Console.WriteLine(" Writing: " & dataline)
                sw.WriteLine(dataline)
                sw.Close()
            Next
        Catch E As Exception
            Console.WriteLine("An error was encountered:")
            Console.WriteLine(E.Message)
        End Try
    End Sub
End Module
Thanks again for your help.
Hi montarch;
Below is your code, in which I have moved some lines around and added my solution. I have also placed some comments in the code.
Fernando
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim projfile As String = "C:\1\out\projects.csv"
        Try
            If File.Exists(projfile) Then
                File.Delete(projfile)
                Console.WriteLine("'projects.csv' file found, deleted file.")
            Else
                Console.WriteLine("'projects.csv' file NOT found, NO deleted file.")
            End If
            ' These Regex objects only need to be created once. Doing it this way improves performance.
            Dim repermitnumber As Regex = New Regex("U-\d{2}-MQ-\d{1,4}[a-z]*")
            Dim reprojectcounty As Regex = New Regex("\w* ?\w+ ?county, *Utah", RegexOptions.IgnoreCase)
            Dim rereportnumber As Regex = New Regex("(?<=Report No\. ?)\d\d-\d\d\d?")
            Dim pattern As String = "(?im)(?=.*?fieldwork\s)(?=.*?" & _
                "(?<Month>January|February|March|April|May|June|July|August|September|October|November|December))" & _
                "(?:.*?(?<Dates>\d{1,2}\s+\k<Month>.*?)|.*?(?<Dates>\k<Month>.*?))(?:\r\n|$)"
            Dim redatechanged As Regex = New Regex(pattern)
            ' Initialize the StringBuilder
            Dim sb As StringBuilder = New StringBuilder
            For Each datafile As String In Directory.GetFiles("c:\1\")
                ' Set the length of the string in the StringBuilder to 0; this
                ' makes the string within the StringBuilder = String.Empty
                sb.Length = 0
                Dim sr As StreamReader = New StreamReader(datafile)
                Dim filetext As String = sr.ReadToEnd()
                sr.Close()
                Console.WriteLine("Processing: " & datafile)
                Dim mpermitnumber As Match = repermitnumber.Match(filetext)
                Dim mprojectcounty As Match = reprojectcounty.Match(filetext)
                Dim mreportnumber As Match = rereportnumber.Match(filetext)
                Dim sw As StreamWriter = New StreamWriter(projfile, True)
                ' Adding this to the StringBuilder object
                sb.Append(mpermitnumber.Groups(0).Value & "|" & _
                    mprojectcounty.Groups(0).Value & "|" & mreportnumber.Groups(0).Value & vbCrLf)
                ' Append every fieldwork date match found in this file, one per line
                Dim mc As MatchCollection = redatechanged.Matches(filetext)
                For Each m As Match In mc
                    sb.Append(m.Groups("Dates").Value & vbCrLf)
                Next
                Console.WriteLine(" Writing: " & sb.ToString())
                sw.WriteLine(sb.ToString())
                sw.Close()
            Next
        Catch E As Exception
            Console.WriteLine("An error was encountered:")
            Console.WriteLine(E.Message)
        End Try
    End Sub
End Module
This code snippet should do what you need.
Fernando