JRockFL
asked on
Regex Help
This is a continuation from...
https://www.experts-exchange.com/questions/21788762/Parse-text-file-regex-or-something-else.html
I thought the sample data would remain the same, but it changes. The first one is for a check, the other is for a DIRECTDEBIT. As far as I can tell, these are the only two variations.
FernandoSoto has already pointed me in the right direction with this...
Dim pattern As String = "<STMTTRN>.*?<TRNTYPE>(?<TRNTYPE>.*?)\n" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\n.*?</STMTTRN>"
Then I extended it to get
Dim pattern As String = "<STMTTRN>.*?<TRNTYPE>(?<TRNTYPE>.*?)\n" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\n.*?<TRNAMT>(?<TRNAMT>.*?)\n.*?" & _
"<FITID>(?<FITID>.*?)\n.*?<NAME>(?<NAME>.*?)\n.*?" & _
"<MEMO>(?<MEMO>.*?)\n.*?</STMTTRN>"
It works fine as long as the TRNTYPE is not CHECK.
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010305
<CHECKNUM>373
<NAME>CHECK
</STMTTRN>
<STMTTRN>
<TRNTYPE>DIRECTDEBIT
<DTPOSTED>20060103170000
<TRNAMT>-0000000000058.05
<FITID>2006010306
<REFNUM>9500000000
<NAME>TECO PEOPLES GAS ONLINE PMT
<MEMO>ONLINE PMT
</STMTTRN>
Any suggestions for me?
ASKER
Fernando,
Thanks for your help again. I'm looking to insert both types into the database. Would I have to build two different regex expressions?
Hi JRockFL;
Regular expressions work something like this, using your test data below:
1. Get the first character from the input string.
2. Check whether it matches the pattern.
3. If it matches, get the next character from the input string.
If it does not match, check whether there are more characters in the input string.
If there are, start matching again from the beginning of the pattern string.
If there are not, the match has failed.
4. Continue this way until the end of the input string is reached, and then report any matches (or the lack of them) back to the caller.
5. If we reach the end of the pattern string, we have a match. If there are still characters left in the input string, start looking for the next match from the beginning of the pattern string.
The original pattern,
Dim pattern As String = "<STMTTRN>.*?<TRNTYPE>(?<TRNTYPE>.*?)\n" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\n.*?<TRNAMT>(?<TRNAMT>.*?)\n.*?" & _
"<FITID>(?<FITID>.*?)\n.*?<NAME>(?<NAME>.*?)\n.*?" & _
"<MEMO>(?<MEMO>.*?)\n.*?</STMTTRN>"
starts by looking for <STMTTRN>. When it finds it, it goes on to <TRNTYPE>, then <DTPOSTED>, and so on. The input data was in the expected order until it got to <CHECKNUM>, which is not in the pattern, so the match fails at that point. Because there is more input, matching starts again. This time the <REFNUM> field makes it fail again, and it returns only the captures that it did find.
The new pattern looks for the fields in no particular order, except that a record must start with <STMTTRN> and end with </STMTTRN>. Everything in between is now an Or operation, so the fields can be found in any order. The question then is what happens when we come to a field that is not in the pattern. To catch any fields we do not want, I added (?:[^\n]+) to the pattern. This is a non-capturing group: when all the Ored fields have been checked and none of them match, it consumes characters until it finds a newline character, and matching then continues with the next field. The regex metacharacter for Or is the | character.
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010305
<CHECKNUM>373
<NAME>CHECK
</STMTTRN>
<STMTTRN>
<TRNTYPE>DIRECTDEBIT
<DTPOSTED>20060103170000
<TRNAMT>-0000000000058.05
<FITID>2006010306
<REFNUM>9500000000
<NAME>TECO PEOPLES GAS ONLINE PMT
<MEMO>ONLINE PMT
</STMTTRN>
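A minimal sketch of the Or-the-fields idea, written in Python rather than VB.NET so it can be run on its own. Python spells named groups (?P<name>...) where .NET uses (?<name>...). The names FIELDS, record_re and parse_records are made up for this illustration, and the negative lookahead on the catch-all alternative is an addition of this sketch (it keeps the catch-all from swallowing the closing tag), not part of the pattern posted in this thread.

```python
import re

# Fields we want to capture; any other line (CHECKNUM, REFNUM, ...) is skipped.
FIELDS = ["TRNTYPE", "DTPOSTED", "TRNAMT", "FITID", "NAME", "MEMO"]

# One alternative per wanted field, joined with the regex Or operator "|".
field_alts = "|".join(rf"<{f}>(?P<{f}>[^\r\n]*)\r?\n" for f in FIELDS)

# The last alternative consumes one unwanted line; the lookahead stops it at </STMTTRN>.
record_re = re.compile(
    rf"<STMTTRN>\r?\n(?:{field_alts}|(?!</STMTTRN>)[^\n]+\n)+</STMTTRN>"
)


def parse_records(text):
    """Return one dict per <STMTTRN> record, holding only the fields that were present."""
    return [
        {name: value for name, value in m.groupdict().items() if value is not None}
        for m in record_re.finditer(text)
    ]


sample = (
    "<STMTTRN>\n<TRNTYPE>CHECK\n<DTPOSTED>20060103170000\n"
    "<TRNAMT>-0000000000100.00\n<FITID>2006010305\n<CHECKNUM>373\n"
    "<NAME>CHECK\n</STMTTRN>\n"
    "<STMTTRN>\n<TRNTYPE>DIRECTDEBIT\n<DTPOSTED>20060103170000\n"
    "<TRNAMT>-0000000000058.05\n<FITID>2006010306\n<REFNUM>9500000000\n"
    "<NAME>TECO PEOPLES GAS ONLINE PMT\n<MEMO>ONLINE PMT\n</STMTTRN>\n"
)

records = parse_records(sample)
for record in records:
    print(record)
```

With this sample both records come back, and the CHECK record simply has no MEMO entry.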
I hope that this made some sense.
Fernando
No, just use the new one I just posted.
ASKER
I'm trying to understand this! It may take some time to fully grasp it.
This first example works fine; they both get inserted into the db
<STMTTRN>
<TRNTYPE>DIRECTDEBIT
<DTPOSTED>20060103170000
<TRNAMT>-0000000000058.05
<FITID>2006010306
<REFNUM>9500000000
<NAME>TECO PEOPLES GAS ONLINE PMT
<MEMO>ONLINE PMT
</STMTTRN>
<STMTTRN>
<TRNTYPE>DIRECTDEBIT
<DTPOSTED>20060103170000
<TRNAMT>-0000000000058.05
<FITID>2006010307
<REFNUM>9500000000
<NAME>TECO PEOPLES GAS ONLINE PMT
<MEMO>ONLINE PMT
</STMTTRN>
==========================
In this example only the first gets inserted
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010305
<CHECKNUM>373
<NAME>CHECK
</STMTTRN>
<STMTTRN>
<TRNTYPE>DIRECTDEBIT
<DTPOSTED>20060103170000
<TRNAMT>-0000000000058.05
<FITID>2006010306
<REFNUM>9500000000
<NAME>TECO PEOPLES GAS ONLINE PMT
<MEMO>ONLINE PMT
</STMTTRN>
==========================
In this example, none get inserted
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010305
<CHECKNUM>373
<NAME>CHECK
</STMTTRN>
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010306
<CHECKNUM>373
<NAME>CHECK
</STMTTRN>
ASKER
Actually I may have it worked out... don't post anything yet... trying to figure this out on my own.
I will be looking at it too.
ASKER
Thanks so much!
ASKER
I don't know how to word this, but when I insert into the DB it adds a little box after the ID:
command.Parameters.Add("FITID", SqlDbType.VarChar).Value = FITID
But if I do
command.Parameters.Add("FITID", SqlDbType.VarChar).Value = "Test"
then there is no box.
Any ideas?
ASKER
If not, no big deal... it might be something with the SP.
Dim SQL As String = "INSERT INTO TRNS (FITID) VALUES (" & FITID & ")"
This works fine and inserts it without the little box... hmm, just one of those days!
Hi JRockFL;
The box you are seeing is most likely a control character, \r (carriage return), so try the following pattern and see if it works better.
Dim pattern As String = "<STMTTRN>\r\n(<TRNTYPE>(?<TRNTYPE>.*?)\r\n|" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\r\n|<TRNAMT>(?<TRNAMT>.*?)\r\n|" & _
"<FITID>(?<FITID>.*?)\r\n|<NAME>(?<NAME>.*?)\r\n|<MEMO>(?<MEMO>.*?)\r\n|" & _
"(?:[^\n]+\n))+</STMTTRN>"
Fernando
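To see why a stray \r ends up in the captured value, here is a small Python sketch (Python named groups are written (?P<name>...) instead of the .NET (?<name>...)). It assumes the file uses CRLF line endings, as suggested above; the variable names are made up for the example. In both engines the dot matches a carriage return by default, so a pattern that stops only at \n keeps the \r inside the capture.

```python
import re

# One line as it would appear in a file with CRLF (DOS/Windows) line endings.
line = "<FITID>2006010305\r\n"

# Stopping at \n only: the lazy group also swallows the carriage return.
lf_only = re.search(r"<FITID>(?P<FITID>.*?)\n", line).group("FITID")

# Stopping at \r\n: the carriage return stays outside the group.
crlf = re.search(r"<FITID>(?P<FITID>.*?)\r\n", line).group("FITID")

# \r?\n accepts either line ending, which helps if the file format is not certain.
either = re.search(r"<FITID>(?P<FITID>.*?)\r?\n", line).group("FITID")

print(repr(lf_only))
print(repr(crlf))
print(repr(either))
```

Stripping trailing whitespace from the captured value before inserting it would be another way to get rid of the box, at the cost of hiding the line-ending mismatch.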
ASKER
Fernando,
What are you using to get rid of the box? Here is the pattern I have been working with. I wanted to start with a simpler example and work my way up.
Dim pattern As String = "<TRNTYPE>(?<TRNTYPE>.*?)\n.*?" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\n.*?" & _
"<TRNAMT>(?<TRNAMT>.*?)\n.*?" & _
"<FITID>(?<FITID>.*?)\n.*?" & _
"<NAME>(?<NAME>.*?)\n.*?"
ASKER
I got it!
Dim pattern As String = "<TRNTYPE>(?<TRNTYPE>.*?)\r\n.*?" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\r\n.*?" & _
"<TRNAMT>(?<TRNAMT>.*?)\r\n.*?" & _
"<FITID>(?<FITID>.*?)\r\n.*?" & _
"<NAME>(?<NAME>.*?)\r\n.*?"
JRockFL;
In a file on DOS systems, each line is terminated by a carriage return character followed by a line feed character. The pattern I gave you only had the line feed characters, so the pattern which you just posted also has only the line feed characters in it. You will most likely need to change the pattern to this.
Dim pattern As String = "<TRNTYPE>(?<TRNTYPE>.*?)\r\n.*?" & _
"<DTPOSTED>(?<DTPOSTED>.*?)\r\n.*?" & _
"<TRNAMT>(?<TRNAMT>.*?)\r\n.*?" & _
"<FITID>(?<FITID>.*?)\r\n.*?" & _
"<NAME>(?<NAME>.*?)\r\n.*?"
Now remember that this pattern is not Oring the fields, so the input must be in the same order as the pattern; otherwise you will only see the matches up to the failure, and then it will start searching again.
In this one all the fields are found:
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010305
<NAME>CHECK
</STMTTRN>
In this one only NAME is not found, because CHECKNUM makes the pattern fail:
<STMTTRN>
<TRNTYPE>CHECK
<DTPOSTED>20060103170000
<TRNAMT>-0000000000100.00
<FITID>2006010305
<CHECKNUM>373
<NAME>CHECK
</STMTTRN>
Fernando
I modified the pattern to work with what you wanted. Give me a minute and I will post how it works.
Dim pattern As String = "<STMTTRN>\n(<TRNTYPE>(?<T
"<DTPOSTED>(?<DTPOSTED>.*?
"<FITID>(?<FITID>.*?)\n|<N
"(?:[^\n]+))+</STMTTRN>"
Fernando