asc2010

Need Help With Regular Expressions & VB.NET

Greetings!  I am working on a vb.net application to extract and format text from a text file.  
The basic goal: remove certain text and create 4 separate text files from the remaining file.

I've not gotten very far and am probably in over my head, but I'll post the little bit of code I do have.

I am willing to learn, but I will need help.
'remove 1st column and its trailing white space

    Sub StepOne()
        Dim strCurrent As String = ""
        Dim strRoot As String = ""
        Try
            strCurrent = Directory.GetCurrentDirectory()
            'strRoot = Directory.GetDirectoryRoot(strCurrent)
            'strRoot = Directory.GetDirectoryRoot("\")
        Catch E As Exception
            'Console.WriteLine("Error determining root directory")
            MessageBox.Show(E.Message)
        End Try
        'read hotlist.txt
        Dim strfileName As String
        strfileName = strCurrent & "\" & "hotlist.txt"

        Replace(strfileName, "\n1 ", "\n")

    End Sub


    'GENERIC REGEX SEARCH & REPLACE.  Use Regular Expressions to search/replace within file
    Function Replace(ByRef file As String, ByRef searchFor As String, ByRef replaceWith As String) As Boolean
        'function used during Validation
        'use Regular Expressions to search for and replace user defined variables
        Try
            Dim reader As New StreamReader(file)            'get a StreamReader for reading the file
            Dim contents As String = reader.ReadToEnd()     'read the entire file at once
            reader.Close()
            reader.Dispose()                                'close up and dispose
            'use regular expressions to search and replace text
            contents = Regex.Replace(contents, searchFor, replaceWith, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            Dim writer As New StreamWriter(file)            'get a StreamWriter for writing the new text to the file
            writer.Write(contents)                          'write the contents
            writer.Close()
            writer.Dispose()                                'close up and dispose
            Return True                                     'return successful
        Catch generatedExceptionName As Exception
            MessageBox.Show(generatedExceptionName.Message)
            Return False
        End Try
    End Function


asc2010

ASKER

Ok, I've got the first part done...remove the first two characters of each line.  It seems a bit crude though:

Replace(strfileName, "^1 ", "")
Replace(strfileName, "\r|\n1 ", Chr(10))
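A possibly less crude way to strip the leading "1 " from every line is a single replace with RegexOptions.Multiline, which makes ^ match at the start of each line rather than only at the start of the text. This is a minimal sketch, not the poster's code: the module and the StripLeadingOne helper are hypothetical names, and it works on an in-memory string instead of the file.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripFirstColumnSketch
    ' Hypothetical helper: removes a leading "1 " from every line of the text.
    Function StripLeadingOne(ByVal contents As String) As String
        ' Multiline makes ^ match at the start of each line,
        ' so one pass handles the first line and all following lines.
        Return Regex.Replace(contents, "^1 ", "", RegexOptions.Multiline)
    End Function

    Sub Main()
        ' Made-up sample text; the real hotlist.txt layout is not shown here.
        Dim sample As String = "1 AAA OH" & vbCrLf & "1 BBB KY"
        Console.WriteLine(StripLeadingOne(sample))
    End Sub
End Module
```

Because the line breaks themselves are never matched, the original CR/LF sequences stay untouched.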


asc2010

ASKER

No takers here?  Well, I am continuing to work on this, here is the code I have so far:
Option Explicit On

Imports System
Imports System.IO
Imports System.IO.File
Imports System.Text.RegularExpressions
Imports System.Text.RegularExpressions.Match
Imports System.Text.RegularExpressions.MatchCollection
'
'requirements:
'invisible form (will be called via batch file)
'Remove the 1st column and its trailing white space
'Remove all rows that are not Ohio
'Remove columns 3, 4, and 9
'Create text files based on a letter-filter: V=Stolen.txt, W=Wanted.txt, P=LP.txt, M=MP.txt
'Create all text files as comma separated

Public Class DOJ_Parser_Hilliard

    Public strFileName As String = "hotlist.txt"
    Public strCurrent As String = Directory.GetCurrentDirectory()
    Public strGlobalFilePath As String = strCurrent & "\" & strFileName

    Private Sub DOJ_Parser_Hilliard_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load

        Label1.Text = "Loading . . ."

        If VerifyFileExists() = True Then
            StepOne()
            StepTwo()
        Else
            Me.Close()
        End If


    End Sub

    Function VerifyFileExists() As Boolean
        'this function runs on form load
        Try
            strCurrent = Directory.GetCurrentDirectory()
        Catch E As Exception
            MessageBox.Show(E.Message)
        End Try
        If Not Exists(strGlobalFilePath) Then
            MessageBox.Show("File Not Found.", "WARNING!")
            Return False
        Else
            Return True
        End If
    End Function

    Sub StepOne()
        'remove 1st column and its trailing white space
        'regex replace
        Replace(strGlobalFilePath, "^1 ", "")
        Replace(strGlobalFilePath, "\r|\n1 ", Chr(10))
    End Sub
    
    Sub StepTwo()
        'remove all rows that are not OH
        'regex match
        Dim strOhioCollection As String = ""
        Dim strRegexMatch As String = Nothing 'prepare a string for regex matching
        Dim srTextFile As New StreamReader(strGlobalFilePath)   'initialize streamreader
        Dim strTextFileContents As String = srTextFile.ReadToEnd()          'read in entire text file
        srTextFile.Close()                                                  'close stream reader
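        strRegexMatch = "^.*OH.*$"
        'RegexOptions.Multiline makes ^ and $ match at every line, not only at the start/end of the whole text
        Dim r As New Regex(strRegexMatch, RegexOptions.IgnoreCase Or RegexOptions.Multiline Or RegexOptions.Compiled)

        If r.IsMatch(strTextFileContents) Then           'if a match is found in the text file contents (not in the file path)
            'text of the first matching line, without a trailing carriage return
            Dim theMatchedVariable As String = r.Match(strTextFileContents).Value.TrimEnd(ControlChars.Cr)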
            'open a streamwriter
            Dim sw As StreamWriter = File.AppendText(strGlobalFilePath)
            sw.Write(vbCrLf & theMatchedVariable) 'write to file
            sw.Flush() 'update file 
            sw.Close()
            sw.Dispose()

        Else 'if file does not exist, display error notice
            MessageBox.Show("WARNING! ERROR!", "ERROR")
        End If

    End Sub


    Function StepThree(ByVal strMatchThis As String)
        're-write hotlist.txt with Ohio content
        'streamreader
        'streamwriter
        Dim theMessage As String = Nothing
        Dim theMessageTitle As String = Nothing

        Dim srTextFile As New StreamReader(strGlobalFilePath)   'initialize streamreader
        Dim strTextFileContents As String = srTextFile.ReadToEnd()          'read in entire text file
        srTextFile.Close()                                                  'close stream reader

        Dim strRegexMatch As String = Nothing                               'prepare a string for regex matching

        Select Case strMatchThis
            Case "deployment"                                               'if passed variable is deployment, do the following variable assignment
                strRegexMatch = "\$LOCAL::DEPLOYMENTS\$=(?<matchvariable>.+)"
                theMessage = "WARNING!   PlateScanDeploymentInstallationPath Not Found."
                theMessageTitle = "File Not Found"
            Case "enterprise"                                               'if passed variable is enterprise, do the following variable assignment
                strRegexMatch = "\$LOCAL::ENTERPRISE\$=\$LOCAL::DEPLOYMENTS\$(?<matchvariable>.+)"
                theMessage = "WARNING!   PlateScanEnterpriseInstallationPath Not Found."
                theMessageTitle = "File Not Found"
            Case "system"                                                   'if passed variable is system, do the following variable assignment
                strRegexMatch = "\$LOCAL::SYSTEM\$=(?<matchvariable>.+)"
                theMessage = "WARNING!   PlateScanSystemInstallationPath Not Found."
                theMessageTitle = "File Not Found"
        End Select

        If Regex.IsMatch(strTextFileContents, strRegexMatch) Then           'if a match is found in the text file, return it to the requestor method
            Dim theMatchedVariable = Nothing
            Dim r As New Regex(strRegexMatch, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            theMatchedVariable = r.Match(strTextFileContents).Result("${matchvariable}")
            Return theMatchedVariable
        Else                                                                'if file does not exist, display error notice
            MessageBox.Show(theMessage, theMessageTitle)
            Return False
        End If
    End Function

    'GENERIC REGEX SEARCH & REPLACE.  Use Regular Expressions to search/replace within file
    Function Replace(ByRef file As String, ByRef searchFor As String, ByRef replaceWith As String) As Boolean
        'function used during Validation
        'use Regular Expressions to search for and replace user defined variables
        Try
            Dim reader As New StreamReader(file)            'get a StreamReader for reading the file
            Dim contents As String = reader.ReadToEnd()     'read the entire file at once
            reader.Close()
            reader.Dispose()                                'close up and dispose
            'use regular expressions to search and replace text
            contents = Regex.Replace(contents, searchFor, replaceWith, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            Dim writer As New StreamWriter(file)            'get a StreamWriter for writing the new text to the file
            writer.Write(contents)                          'write the contents
            writer.Close()
            writer.Dispose()                                'close up and dispose
            Return True                                     'return successful
        Catch generatedExceptionName As Exception
            MessageBox.Show(generatedExceptionName.Message)
            Return False
        End Try
    End Function


Dirk Haest
>> The basic goal: remove certain text and create 4 separate text files from the remaining file.

You state a goal, but you never ask a question. What problem do you still have? What are you starting from, and what is the aim?
Ah, a taker :)

I think the question is "How to use RegEx in C# to manipulate text and write 4 files with the result"
>> How to use RegEx in C# to manipulate text and write 4 files with the result

I understand that, but nothing is mentioned about what the starting text looks like, or what information determines how the text is split into multiple files, ...
asc2010

ASKER

First, I apologize for offering monetary compensation.  My offer was not intended to offend anyone, only expedite my request for help.

@ Dhaest: My question should have been stated as mplungjan put it - "How to use RegEx in C# or VB.NET to manipulate text and write 4 files with the result"

@ mplungjan: Thank you for clarifying my request.

@ aikimark: I put this request into the C# zone because I can figure out how to work with C# if someone here can help.  I have experience with VB.NET so I started with that language.

Attached is an excerpt of the text file I am working with.  There are  11 columns in the file, however, some are blank and I need to account for them as well.  I only need the rows that contain the text "OH" in the 3rd column.  Out of those, I only need columns 2, 3, 6, 7, 8, 9, and 11.

Once I get the correct rows and columns, I need to comma separate them and write them to separate text files based on column 11.
If column 11 starts with "V", create a file called "Stolen.txt"
If column 11 starts with "W", create a file called "Wanted.txt"
If column 11 starts with "P", create a file called "LP.txt"
If column 11 starts with "M", create a file called "MP.txt"
example.txt
I'll take a look at it tomorrow ...
Another question: why are you so eager to use regular expressions? There are easier ways to handle this.
I'm thinking out loud: read the file into a list of objects (with 11 properties) and use LINQ to get the correct results to store in the files.

Is the file a fixed-length file (unlike the example you provided), or is there a fault in the 4th line?
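The LINQ idea above could look roughly like the following. It is only a sketch under assumptions: the sample rows are made up, columns are assumed to be separated by single spaces, and plain string arrays stand in for the list of objects with 11 properties.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module LinqFilterSketch
    Sub Main()
        ' Made-up sample rows; the real file layout is not shown in this thread.
        Dim lines As String() = {
            "1 ABC123 OH a b c d e f g V1",
            "1 XYZ789 KY a b c d e f g W2",
            "1 DEF456 OH a b c d e f g W3"}

        ' Keep rows that have 11 columns and "OH" in column 3,
        ' then group them by the first letter of column 11.
        Dim groups = From line In lines
                     Let cols = line.Split(" "c)
                     Where cols.Length >= 11 AndAlso cols(2) = "OH" AndAlso cols(10).Length > 0
                     Group cols By Letter = cols(10).Substring(0, 1) Into Rows = Group

        For Each g In groups
            Console.WriteLine(g.Letter & ": " & g.Rows.Count().ToString() & " row(s)")
        Next
    End Sub
End Module
```

Each group could then be written to the file that matches its letter.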
asc2010

ASKER

@ Dhaest: I'm going the regex route because I could not think of anything else...I am open to anything simpler.  This program will be placed on a shared computer and called from a batch file.  The user should never even know it exists, so as long as it gets the job done, I am open to suggestions!
Is the file format known? Does it have fixed lengths, a separator, ...?
asc2010

ASKER

I'm not sure what you mean.  The attached file is exactly what I'm working with...it sucks I know.  The end result - the 4 text files - will be comma separated for columns.
It looks fixed width but the 4th line doesn't line up properly...

We'd need to know EXACTLY how the file is formatted to be able to help you!
VariableFixedWidth.jpg
I'll create the queries tomorrow. In the meantime, here is an example of how to load the file into a list of objects:


    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        parseFile()
    End Sub

    Private Sub parseFile()
        Dim fileLines As List(Of FileLine) = New List(Of FileLine)()

        'Using closes and disposes the reader even if parsing throws
        Using objReader As New System.IO.StreamReader("c:\ee.txt")
            Do While objReader.Peek() <> -1
                Dim myLine As FileLine = ParseLine(objReader.ReadLine())
                If myLine IsNot Nothing Then
                    fileLines.Add(myLine)   'reuse the parsed line instead of parsing it twice
                End If
            Loop
        End Using

    End Sub

    'Returns the parsed line, or Nothing if it has fewer than 11 columns or column 3 does not contain "OH"
    Private Function ParseLine(ByVal textline As String) As FileLine
        Dim textSplit() As String = textline.Split(" "c)
        If textSplit.Length < 11 Then
            Return Nothing   'the FileLine constructor needs 11 values
        End If

        Dim line As FileLine = New FileLine(textSplit)
        If line.column3.Contains("OH") Then
            Return line
        End If
        Return Nothing
    End Function


Public Class FileLine
    Public column1 As String
    Public column2 As String
    Public column3 As String
    Public column4 As String
    Public column5 As String
    Public column6 As String
    Public column7 As String
    Public column8 As String
    Public column9 As String
    Public column10 As String
    Public column11 As String



    Public Sub New(ByVal params() As String)
        column1 = params(0)
        column2 = params(1)
        column3 = params(2)
        column4 = params(3)
        column5 = params(4)
        column6 = params(5)
        column7 = params(6)
        column8 = params(7)
        column9 = params(8)
        column10 = params(9)
        column11 = params(10)
    End Sub

End Class


asc2010

ASKER

Idle_Mind,
The file is very ugly and not fixed-width.  My approach was going to be finding the white space and try to format it based on that...sorry
We really can't help unless you can explain in plain English how the "records" are structured.

Reading lines and writing lines in a file is simple...but if we can't discern where the columns start/stop then we are of no use to you.

Yes...you'd need to explain in EXCRUCIATING detail how the file works...   =\
asc2010

ASKER

Idle_Mind,

Each white space starts/ends a new column.  The data originates from a website.  Someone extracts this data and it gets placed into the text file exactly as I have sent it.  Each record originally has 11 columns.  Some of those columns are empty and some are not.
Ah...gotcha!

...approximately how many lines in the actual file?

*We need to decide if the whole thing will be read into memory at once or if it should be processed only one line at a time.
asc2010

ASKER

Idle_Mind,

The number of lines in the file vary by day.  The example file I provided contained only the first few lines out of 1,249.  Unfortunately, this will be an unknown amount as it changes on a daily basis.
Ok...

1,000 lines is no problem.

10,000 lines would probably be fine too on most systems.

100,000 lines we might want to consider a different approach.

1,000,000 lines definitely needs something else.

Is there a reasonable upper bound for the max # of lines that might be in the daily file?
asc2010

ASKER

Idle_Mind,

lol I see your point there.  I do not believe there will be more than 5,000 lines on any given day.  If there are, then this agency has got way bigger issues than processing power!
Here's an inelegant solution that simply appends the lines to the existing data in the output files.

It is based on your rules:

    I only need the rows that contain the text "OH" in the 3rd column.
    Out of those, I only need columns 2, 3, 6, 7, 8, 9, and 11.

    Once I get the correct rows and columns, I need to comma separate them and write them to separate text files based on column 11.
    If column 11 starts with "V", create a file called "Stolen.txt"
    If column 11 starts with "W", create a file called "Wanted.txt"
    If column 11 starts with "P", create a file called "LP.txt"
    If column 11 starts with "M", create a file called "MP.txt"

*I didn't do any bounds checking!  If lines might exist with LESS than 11 columns or if there are completely blank lines then you'll need to do some extra checking:
Public Class Form1

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim Path As String = "C:\Users\Mike\Documents\Downloads"
        Dim DataFile As String = "Example.txt"
        Dim V_File As String = "Stolen.txt"
        Dim W_File As String = "Wanted.txt"
        Dim LP_File As String = "LP.txt"
        Dim MP_File As String = "MP.txt"

        Using sr As New System.IO.StreamReader(System.IO.Path.Combine(Path, DataFile))
            While Not sr.EndOfStream
                Dim values As New List(Of String)
                values.AddRange(sr.ReadLine.Split(" "))
                If values(2).ToUpper = "OH" Then ' process lines with "OH" in the 3rd column
                    ' remove columns 10,5,4,1
                    values.RemoveAt(9)
                    values.RemoveAt(5)
                    values.RemoveAt(4)
                    values.RemoveAt(0)
                    Dim output As String = String.Join(",", values.ToArray) & Environment.NewLine

                    ' output to file based on first character of value in last column
                    Select Case values(values.Count - 1).Substring(0, 1).ToUpper
                        Case "V"
                            My.Computer.FileSystem.WriteAllText(System.IO.Path.Combine(Path, V_File), output, True)

                        Case "W"
                            My.Computer.FileSystem.WriteAllText(System.IO.Path.Combine(Path, W_File), output, True)

                        Case "P"
                            My.Computer.FileSystem.WriteAllText(System.IO.Path.Combine(Path, LP_File), output, True)

                        Case "M"
                            My.Computer.FileSystem.WriteAllText(System.IO.Path.Combine(Path, MP_File), output, True)

                    End Select
                End If
            End While
        End Using

    End Sub

End Class
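The extra checking mentioned before the code above could be sketched as follows. This is one possible guard, not part of the posted solution: IsUsableLine is a hypothetical helper, and the single-space separator and the 11-column minimum are taken from the discussion in this thread.

```vbnet
Imports System
Imports System.Collections.Generic

Module BoundsCheckSketch
    ' Hypothetical helper: returns True only when the line is not blank,
    ' has at least 11 columns, has "OH" in column 3, and has a non-empty
    ' last column. The split values are handed back through "values".
    Function IsUsableLine(ByVal line As String, ByRef values As List(Of String)) As Boolean
        values = New List(Of String)()
        If line Is Nothing OrElse line.Trim().Length = 0 Then Return False

        values.AddRange(line.Split(" "c))
        If values.Count < 11 Then Return False
        If values(2).ToUpper() <> "OH" Then Return False

        ' Substring(0, 1) on the last column would throw if it were empty.
        Return values(values.Count - 1).Length > 0
    End Function

    Sub Main()
        Dim values As List(Of String) = Nothing
        ' Made-up sample lines: a complete row, a blank line, and a short row.
        Console.WriteLine(IsUsableLine("1 ABC123 OH a b c d e f g V1", values))
        Console.WriteLine(IsUsableLine("", values))
        Console.WriteLine(IsUsableLine("1 ABC123 OH", values))
    End Sub
End Module
```

Inside the read loop, a line for which the helper returns False would simply be skipped before any RemoveAt or Substring call runs.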


asc2010

ASKER

Idle_Mind,

WOW!  This works like a charm!  I can't believe how quickly you came up with this!!  I understand that WriteAllText() has a true/false parameter to append or overwrite.  But is there a way to overwrite any existing data in the files just prior to writing the new data to them?  I can throw in a quick function to erase any existing data, but am curious to see your opinion.
Quick fix...
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim Path As String = "C:\Users\Mike\Documents\Downloads"
        Dim DataFile As String = "Example.txt"
        Dim V_File As String = "Stolen.txt"
        Dim W_File As String = "Wanted.txt"
        Dim LP_File As String = "LP.txt"
        Dim MP_File As String = "MP.txt"

        Using sw_V As New System.IO.StreamWriter(System.IO.Path.Combine(Path, V_File), False)
            Using sw_W As New System.IO.StreamWriter(System.IO.Path.Combine(Path, W_File), False)
                Using sw_LP As New System.IO.StreamWriter(System.IO.Path.Combine(Path, LP_File), False)
                    Using sw_MP As New System.IO.StreamWriter(System.IO.Path.Combine(Path, MP_File), False)

                        Using sr As New System.IO.StreamReader(System.IO.Path.Combine(Path, DataFile))
                            While Not sr.EndOfStream
                                Dim values As New List(Of String)
                                values.AddRange(sr.ReadLine.Split(" "))
                                If values(2).ToUpper = "OH" Then ' process lines with "OH" in the 3rd column
                                    ' remove columns 10,5,4,1
                                    values.RemoveAt(9)
                                    values.RemoveAt(5)
                                    values.RemoveAt(4)
                                    values.RemoveAt(0)
                                    Dim output As String = String.Join(",", values.ToArray)

                                    ' output to file based on first character of value in last column
                                    Select Case values(values.Count - 1).Substring(0, 1).ToUpper
                                        Case "V"
                                            sw_V.WriteLine(output)

                                        Case "W"
                                            sw_W.WriteLine(output)

                                        Case "P"
                                            sw_LP.WriteLine(output)

                                        Case "M"
                                            sw_MP.WriteLine(output)

                                    End Select
                                End If
                            End While
                        End Using

                    End Using
                End Using
            End Using
        End Using
    End Sub
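Since the daily file is expected to stay under roughly 5,000 lines, another way to get the same overwrite behaviour without nested Using blocks is to collect the output lines in memory and write each file once. The sketch below is an alternative under assumptions, not the posted fix: WriteBuckets is a hypothetical helper, it expects lines that are already comma separated, and the demo writes into a made-up temp folder.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module DictionaryWriterSketch
    ' Hypothetical helper: buckets already-formatted output lines by the first
    ' letter of their last comma-separated field, then overwrites one file per letter.
    Sub WriteBuckets(ByVal folder As String, ByVal outputLines As IEnumerable(Of String))
        Dim fileNames As New Dictionary(Of String, String)
        fileNames.Add("V", "Stolen.txt")
        fileNames.Add("W", "Wanted.txt")
        fileNames.Add("P", "LP.txt")
        fileNames.Add("M", "MP.txt")

        Dim buckets As New Dictionary(Of String, List(Of String))
        For Each letter As String In fileNames.Keys
            buckets.Add(letter, New List(Of String))
        Next

        For Each outputLine As String In outputLines
            Dim lastField As String = outputLine.Substring(outputLine.LastIndexOf(","c) + 1)
            If lastField.Length = 0 Then Continue For
            Dim letter As String = lastField.Substring(0, 1).ToUpper()
            If buckets.ContainsKey(letter) Then buckets(letter).Add(outputLine)
        Next

        For Each pair As KeyValuePair(Of String, String) In fileNames
            ' WriteAllLines creates the file or replaces its existing content.
            File.WriteAllLines(Path.Combine(folder, pair.Value), buckets(pair.Key).ToArray())
        Next
    End Sub

    Sub Main()
        ' Made-up folder and sample lines, for demonstration only.
        Dim folder As String = Path.Combine(Path.GetTempPath(), "hotlist_sketch")
        Directory.CreateDirectory(folder)
        WriteBuckets(folder, New String() {"ABC123,OH,x,V1", "DEF456,OH,y,W3"})
        Console.WriteLine(File.ReadAllText(Path.Combine(folder, "Stolen.txt")))
    End Sub
End Module
```

Like the StreamWriter version, this leaves an empty file behind for any letter that has no matching rows.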


It might be possible to bring the data down from the web page.  What is the URL?
asc2010

ASKER

@ aikimark:  I do not have access to the website, nor will I be granted access to the website.  I am only allowed access to the data provided in the text file because all personal information has been removed from it.
ASKER CERTIFIED SOLUTION
Mike Tomlinson
This solution is only available to members.
asc2010

ASKER

Idle_Mind,

I did get that, thank you.  I have been testing it on a couple of older computers and cannot find fault with your code.  You sir, are a genius!!  I will accept your answer as the solution.

Thank you so much!!