Solved

Text File Delimited Problem - Comma

Posted on 2013-10-22
455 Views
Last Modified: 2013-11-14
I have a text file that is comma delimited, but there is a field that contains some "non-delimiter" commas in its text.  This causes the parsing to be incorrect when I import the file into Access, bringing about 100,000 records into the database table incorrectly.  Is there a solution to this problem?
Question by:tomfarrar
47 Comments
 
LVL 7

Author Comment

by:tomfarrar
Can I "save as" the current comma delimited file, and it will save with the new delimiter?  Then the comma will be replaced by the new delimiter?  I cannot download the original file before it was saved as comma delimited.
 
LVL 44

Expert Comment

by:Rainer Jeschor
Hi,
do you have any chance to get into the process before the comma-delimited file is created?
Normally you would have to encapsulate the fields containing the delimiter - this is called a text qualifier, and normally the quote character is used.

Sample:
ID,Text,AnotherText
1,This should work,As there is no delimiter
2,This will fail, as it contains,a delimiter in, both fields

would become
1,This should work,As there is no delimiter
2."This will fail, as it contains","a delimiert in, both field"

This would be your only chance - and it has to be done before you start importing the file, as only the tool/program which generates the file knows where this field starts and ends.

HTH
Rainer
 
LVL 65

Expert Comment

by:RobSampson
If you only have the final file available, and have no control over the export, there is only one other i
 
LVL 65

Expert Comment

by:RobSampson
.....option I can think of. Since you say there is "a" field that has these extra commas, can we assume that there is always a fixed number of known fields on either side of it?  So say you have 10 fields: if the first three never have extra commas, and the last six never have extra commas, then anything in between, no matter how many commas, would be the fourth field. We could write a script to reformat the file - a rough sketch follows....
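
For illustration, a minimal VBA sketch of that reformatting idea, assuming the hypothetical 10-field layout just described (3 clean leading fields, 6 clean trailing fields; the function name and counts are made up for the example):

Function FixLine(ByVal strLine As String) As String
    ' Assumes fields 1-3 and the last 6 fields never contain stray commas,
    ' so everything in between must belong to field 4.
    Dim parts() As String
    Dim head As String, middle As String, tail As String
    Dim i As Long, n As Long
    parts = Split(strLine, ",")
    n = UBound(parts)
    head = parts(0) & "," & parts(1) & "," & parts(2)   ' first 3 fields
    middle = parts(3)
    For i = 4 To n - 6                                  ' absorb any stray commas
        middle = middle & "," & parts(i)
    Next
    For i = n - 5 To n                                  ' last 6 fields
        tail = tail & "," & parts(i)
    Next
    FixLine = head & "," & Chr(34) & middle & Chr(34) & tail   ' re-quote field 4
End Function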

Rob.
 
LVL 74

Expert Comment

by:Jeffrey Coachman
Coming late to the party but...

You can simply do a find/replace on the source file and change the commas to another delimiter (the "|" is a common option).
Then the import should work.
Then, when the import is done, do a find/replace on the imported table to change the "|" back to a comma.
 
LVL 44

Expert Comment

by:Rainer Jeschor
@Rob: wonderful idea!

@boag2000: How would that work? If I replace every "," with a "|", I would not change anything - a line with originally 5 delimiters + 1 stray comma will then still have 5 + 1 delimiters.
 
LVL 74

Expert Comment

by:Jeffrey Coachman
RainerJ,

Oops, yes you are correct, sorry about that.

Despite reading your post, I was thinking that this was a space delimited file, and the comma was only in the address...

Again, sorry about that...
;-)

Jeff
 
LVL 30

Expert Comment

by:hnasr
Can you do that manually for a few records?
If yes, do you follow a logical process in performing it? If yes, then it can be done programmatically.

If you have to do mental processing, meaning "ah, this is a person's name", then it proves to be a hard task unless you can tell the process to expect a comma after such words.

List a few representative records of the file and see if a logical process exists.
Waiting for your feedback.

"I was thinking that this was a space delimited file"
This is why we keep asking for example records to be included in the body of the question - to save everyone from unnecessarily working through trivial misunderstandings.
 
LVL 22

Expert Comment

by:rspahitz
As mentioned, there are several options: going back to the source and having it saved as a proper CSV file (with "qualified" fields), using some basic "intelligence" to determine where the comma-riddled field is located (pull off the front and back pieces), or, if you can manually locate those problem items and fix them, that works too.

Beyond that you need to add more computer logic to the process, which can get complex.  For example, if you know that a comma-riddled field is surrounded by numeric fields, you could search for the numeric fields separated by commas and conclude that the piece in between is your problem field (although you have to watch for the problem field itself containing a comma followed by a number), or maybe the next field is a fixed length or a date or has some other way to distinguish where it starts.  Basically, the more complex the data, the more complex the logic needed to find the different parts.
 
LVL 7

Author Comment

by:tomfarrar
Hi All - Thanks for the comments.  A couple of facts you have pointed out seem to indicate that correctly parsing the file will most likely require some manual manipulation.

More facts: there are a half-million records, each record has 23 fields, and two of those fields appear to be problem areas.  One is an account description field that has data such as "Mileage, gas, and travel" (there are probably six or so descriptions like this in that field).  The second is a company name field with values like "My Company, Inc", and there are probably 50 different names that include one or more commas.  Because the file cannot be recreated, it appears the solution is to open it in Excel and manually adjust columns until the proper alignment is made for the 23 fields.  I have begun this process.

Thanks much for your thoughts, and if you have other thoughts, please share.
 
LVL 22

Expert Comment

by:rspahitz
>field that has data such as "Mileage, gas, and travel"

Can you post some examples?
If it's something like 12345,75,ontario then you may be able to look for characteristics of the data and "link" them together as a single field.  Even if it works only 80% of the time, that means you only have to manually update 20% of the records.
 
LVL 30

Expert Comment

by:hnasr
If you are happy with Excel, go ahead, but with a half-million records I doubt it is a good idea to do that manually.

At least you may isolate most of the fields into Access table fields, and the rest into one field.
Then use VBA to manipulate that one field to split it further.

As rspahitz suggested, you reduce the volume of data to be processed.  In addition, you may automate the process and avoid introducing new errors.
 
LVL 39

Expert Comment

by:als315
You can also read this file from VBA and save data from proper strings to one table, and wrong strings to another, then analyze those manually. If you can upload a sample with good and problem strings, maybe we can find a solution.
 
LVL 7

Author Comment

by:tomfarrar
Okay, I can give that a try.
 
LVL 45

Expert Comment

by:aikimark
It would be useful for us if you posted some of the problematic lines.
 
LVL 45

Expert Comment

by:aikimark
If you apply the following regular expression pattern to each line of your CSV file, you should be able to get the data out of the file correctly.
"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1},"{0,1}([^"]*?)"{0,1}$


three field example (not the 23 in the pattern above):
After reading the following line:
2,"This will fail, as it contains","a delimiert in, both field"
The Matches collection will contain the following:

2
This will fail, as it contains
a delimiter in, both fields
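
For reference, a minimal VBA sketch of running that three-field pattern through the VBScript regex engine (the sub name is made up; the pattern is the three-field one above, with quotes doubled for the VBA string literal):

Sub TestThreeFieldPattern()
    Dim oRE As Object, oMatches As Object, vItem As Variant
    Set oRE = CreateObject("vbscript.regexp")
    oRE.Pattern = """{0,1}([^""]*?)""{0,1},""{0,1}([^""]*?)""{0,1},""{0,1}([^""]*?)""{0,1}$"
    Set oMatches = oRE.Execute("2,""This will fail, as it contains"",""a delimiter in, both fields""")
    For Each vItem In oMatches(0).SubMatches
        Debug.Print vItem   ' prints the three field values shown above
    Next
End Sub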
 
LVL 44

Expert Comment

by:Rainer Jeschor
Hi,
just an idea:
It would be a multi-step process, and you (we) should be able to use PowerShell or VBScript.
1. Step 1: Filter correct lines
Open the original file and read it line by line. For each line do a String.Split using the comma as separator. If the resulting array length is 23 (one element per field), the line is OK and should be processable without further manipulation. Save this line into a new file. If the number is greater, redirect the line to a second file. (A small VBA sketch of this step appears after the list of steps.)

2. Step 2: Replace known issue values
Now open the second file and replace the account description field content, where you know there are only a limited number of variations. Replace the original string like
,Mileage, gas, and travel,

with something like
,"@@1@@",

and do this for each of your six possible strings.
Now run the split on comma again. If the resulting array length is now 23, you can post-process the line by replacing the temporary dummy value back from
,"@@1@@",

to
,"Mileage, gas, and travel",

(now with the quotes) and save it into a third file. Lines which still split into more than 23 elements have to be stored in a fourth file (without back-replacing the dummy values).

3. Step 3: Company name field processing
Now, in the remaining unprocessed lines, you have to find the position of the field's first character (based on the commas of the fields before it) and the comma positions of the following fields - getting positions from the left and from the right. Then you can use substring functions to build the left part + a quote + the middle + a quote + the right part. Then replace the dummy values and save the line into a fifth file.

4. Step 4: Combining the files
Now you can simply combine the contents of the first, third and fifth files, and you should be able to import the result without issues.

As other experts have already mentioned, if you could provide a sample of lines (3-5 correct ones, 3-5 with an account description containing a comma, 3-5 with a company name containing a comma, and 3-5 with commas in both), we would be happy to assist by creating this post-processing script.
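
To make Step 1 concrete, here is a minimal VBA sketch of the line filter (the file names are hypothetical; Steps 2 and 3 would hook in where the suspect lines are written):

Sub FilterCorrectLines()
    Dim intIn As Integer, intGood As Integer, intBad As Integer
    Dim strLine As String
    intIn = FreeFile
    Open "C:\data\original.csv" For Input As #intIn
    intGood = FreeFile
    Open "C:\data\good.csv" For Output As #intGood
    intBad = FreeFile
    Open "C:\data\suspect.csv" For Output As #intBad
    Do Until EOF(intIn)
        Line Input #intIn, strLine
        If UBound(Split(strLine, ",")) + 1 = 23 Then   ' exactly 22 commas: well formed
            Print #intGood, strLine
        Else
            Print #intBad, strLine                     ' stray commas: needs Step 2/3
        End If
    Loop
    Close #intIn: Close #intGood: Close #intBad
End Sub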

Just my 2ct
Rainer
 
LVL 7

Author Comment

by:tomfarrar
Here is an example of three records: the first two have problems in AccountDescription and ParentEntityDescription; the third record parses correctly.  The first line of the spreadsheet contains the field header names.
EE-Example.xlsx
 
LVL 45

Expert Comment

by:aikimark
From what I can see of your data, it is not well-formed CSV.  Whenever a field contains a comma, that field should be enclosed in quotes.  Are there quotes in the raw CSV text file?
 
LVL 45

Expert Comment

by:aikimark
If there are no quotes around these fields, then we have to use some of the data in the lines to help the regexp engine.  The first line is well-formed CSV.
First line pattern:
([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$


However, the malformed CSV lines seem to have some data fields with constant values.  We can use these constant values in our pattern.
Subsequent line patterns:
([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(USD),(.*?),(EXPENSE),([^,]*),([^,]*),(.*),([^,]*),(Expense or Deduction),(.*)$


Testing this pattern against one of your malformed lines:
2010,10,10001,D,523110,$39540.04,2010,67108875,16,17,492,,39540.04,Dec,USD,Meals, Other Exp Employees,EXPENSE,Local,522K,SLP-Pearland-Gem Division,Inc.,6260038,Expense or Deduction,4/15/2011 14:22:02
Produces the following parsing results:

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Meals, Other Exp Employees
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Pearland-Gem Division,Inc.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:02
 
LVL 7

Author Comment

by:tomfarrar
Here is the sample you requested.
EE-Example.xlsx
 
LVL 45

Expert Comment

by:aikimark
I tested all the non-header lines in one pass with the following pattern.  It is almost identical to the pattern I posted earlier, but it allows for multiple matches, delimited by a newline as well as the end of the file.  I think that reading the CSV file a line at a time and then parsing it with reliable regular expression pattern(s) is the best approach to solving the problem.
([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(USD),(.*?),(EXPENSE),([^,]*),([^,]*),(.*),([^,]*),(Expense or Deduction),(.*)(?:\r|\n|$)



Match 0

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Meals, Other Exp Employees
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Pearland-Gem Division.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:02
Match 1

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Mileage, Tolls, Parking, Taxis
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Orange-Open Division.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:03
Match 2

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Meals, Other Exp Employees
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Sugar-Closed Division.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:04
Match 3

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Payroll
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Pearland, Limited.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:02
Match 4

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Misc Expense
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Orange, Calif.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:03
Match 5

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Subscriptions
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Sugarland, Texas.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:04
Match 6

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10002
SubMatch 3: D
SubMatch 4: 523132
SubMatch 5: $150.00
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 494
SubMatch 11:
SubMatch 12: 150.00
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Mileage, Tolls, Parking, Taxis
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522W
SubMatch 19: LP-Dallas-Universal,Inc
SubMatch 20: 6260035
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:02
Match 7

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10002
SubMatch 3: D
SubMatch 4: 523132
SubMatch 5: $150.00
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 494
SubMatch 11:
SubMatch 12: 150.00
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Meals, Other Exp Employees
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522W
SubMatch 19: LP-Dallas-Memoco, LTD
SubMatch 20: 6260035
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:03
Match 8

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10002
SubMatch 3: D
SubMatch 4: 523132
SubMatch 5: $150.00
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 494
SubMatch 11:
SubMatch 12: 150.00
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Tolls, Parking, Taxis
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522W
SubMatch 19: LP-Houston-Moore, LSMT
SubMatch 20: 6260035
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:04
Match 9

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: E1022
SubMatch 3: D
SubMatch 4: 523990
SubMatch 5: $4252.16
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 498
SubMatch 11:
SubMatch 12: 4252.16
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Travel & Entertainment - Other
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: D522K
SubMatch 19: WU-Goodland-Engine Division
SubMatch 20: 6260040
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:02
Match 10

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: E1022
SubMatch 3: D
SubMatch 4: 523990
SubMatch 5: $4252.16
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 498
SubMatch 11:
SubMatch 12: 4252.16
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Payroll - Other
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: D522K
SubMatch 19: WU-Denver Division
SubMatch 20: 6260040
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:03
Match 11

SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: E1022
SubMatch 3: D
SubMatch 4: 523990
SubMatch 5: $4252.16
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 498
SubMatch 11:
SubMatch 12: 4252.16
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Misc Expense - Other
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: D522K
SubMatch 19: Philadelphia Division
SubMatch 20: 6260040
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:04
 
LVL 45

Expert Comment

by:aikimark
If you need/want to minimize the regex pattern's reliance on specific data values, we can probably reduce my earlier pattern to this, which only uses EXPENSE:
([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*),(EXPENSE),([^,]*),([^,]*),(.*),([^,]*),([^,]*),([^,]*)$


Note: this pattern is for line-by-line processing of your CSV file.

I had tried dividing the parsing operation into two steps.  The first step parses the leading 15 fields and the trailing three fields from the line, since they are all well-formed CSV fields.  However, parsing the two malformed fields, separated by three well-formed CSV fields, required the use of the EXPENSE literal.  If there are other values that go in that field, we can tweak the pattern to provide an alternation operation for that field.  If the number of values for that field is large, we might have to iterate through a list of patterns.

There are other potential patterns that might be used, but we are basing our recommendations on a very small sample.
 
LVL 30

Expert Comment

by:hnasr
Are the EXPENSE and "Expense or Deduction" values repeated in all records, as shown in the uploaded sample?
 
LVL 7

Author Comment

by:tomfarrar
The field that carries the EXPENSE literal has four values:

ASSET
EXPENSE
LIABILITY
REVENUE

I am afraid I will need to understand the "regex pattern" a little better.  This is unfamiliar to me.  Thanks.  - Tom
 
LVL 7

Author Comment

by:tomfarrar
The additional values, in addition to "Expense or Deduction", are:

Asset
Income
Liability
Liab. /Equity
 
LVL 7

Author Comment

by:tomfarrar
By the way, the pattern you showed did parse all the line items of the data correctly.
 
LVL 45

Accepted Solution

by:
aikimark earned 400 total points
That makes it easy.  You can use either of these patterns to parse each line of the CSV file.
([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*),(ASSET|EXPENSE|LIABILITY|REVENUE),([^,]*),([^,]*),(.*),([^,]*),([^,]*),([^,]*)$


([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*),([A-Z]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*),([^,]*)$



============
Example code:
Dim oRE As Object
Dim oMatches As Object
Dim intFN As Integer
Dim strLine As String
Dim vItem As Variant
Dim lngFieldNum As Long

Set oRE = CreateObject("vbscript.regexp")
'header line: 23 simple comma-delimited fields
oRE.Pattern = "([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$"
intFN = FreeFile
Open "your CSV file path and name" For Input As #intFN
Line Input #intFN, strLine
Set oMatches = oRE.Execute(strLine)
'* * * Add your processing code for the header line here * * *

'data lines: free-form 16th and 20th fields, upper-case literal 17th field
oRE.Pattern = "([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*),([A-Z]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*),([^,]*)$"

Do Until EOF(intFN)
  Line Input #intFN, strLine
  Set oMatches = oRE.Execute(strLine)
  '* * * Add your processing code for each data line here * * *

Loop
Close #intFN


Processing the fields can be done the way you would iterate any other collection, such as the Fields collection of a Recordset object.
Example 1:
For Each vItem In oMatches(0).Submatches
  Debug.Print vItem
Next


Example 2:
For lngFieldNum = 0 To oMatches(0).Submatches.Count - 1
  Debug.Print oMatches(0).Submatches(lngFieldNum)
Next


 
LVL 30

Expert Comment

by:hnasr
One way is to use VBA in Access.

This works by linking the Excel table as-is to Access.
Loop through the recordset to split the fields, taking the possible values mentioned into account, to isolate the problem fields.

Many comments are already here, and if you exhaust the other approaches suggested, I'll give my idea a try. It is only an idea at the moment, and it may turn out to be time consuming.
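
A minimal sketch of that approach, assuming the linked table is named tblLinked and the unsplit remainder sits in a field named Leftover (both names hypothetical):

Sub SplitLeftoverField()
    Dim rs As DAO.Recordset
    Dim parts() As String
    Set rs = CurrentDb.OpenRecordset("tblLinked", dbOpenDynaset)
    Do Until rs.EOF
        parts = Split(Nz(rs!Leftover, ""), ",")
        ' ...compare parts() against the known description/company values
        '    and write the pieces back to their proper columns here...
        rs.MoveNext
    Loop
    rs.Close
End Sub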
 
LVL 45

Expert Comment

by:aikimark
You have posted seemingly contradictory comments:
I am afraid I will need to understand the "regex pattern" a little better.  This is unfamiliar to me.
and
By the way, the pattern you showed did parse all line items of the data correctly.

Your first comment indicates (to me) that you do not know how to use regular expressions in your VBA code.  Your second comment indicates (to me) that you have successfully used the patterns to parse your data.  I'm not sure how to respond.
 
LVL 45

Expert Comment

by:aikimark
If you just want to better understand the regex patterns you'll be using...
([^,]*),
capture (as a submatch item) any sequence of zero-or-more characters that are not a comma, followed by a comma character
(.*),
capture (as a submatch item) any sequence of zero-or-more characters, followed by a comma character
([A-Z]*),
capture (as a submatch item) any sequence of zero-or-more upper case characters, followed by a comma character
$
end of string indicator.  This is not captured and is not a submatch item.

You can also get an explanation of regex patterns if you do your testing at http://www.myregextester.com and click the EXPLAIN checkbox.
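
As a quick worked example of those fragments (a sketch; the test string is made up):

Sub PatternFragmentDemo()
    Dim oRE As Object
    Set oRE = CreateObject("vbscript.regexp")
    oRE.Pattern = "([^,]*),(.*),([A-Z]*)$"
    With oRE.Execute("abc,d,e,XYZ")(0)
        Debug.Print .SubMatches(0)   ' abc - [^,]* stops at the first comma
        Debug.Print .SubMatches(1)   ' d,e - .* happily spans commas
        Debug.Print .SubMatches(2)   ' XYZ - [A-Z]* takes the upper-case word
    End With
End Sub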
 
LVL 44

Assisted Solution

by:Rainer Jeschor
Rainer Jeschor earned 100 total points
Hi,
and here is a small PowerShell script which processes all lines:
# Variables for file paths
$basePath = "F:\EE\EECSV"
$orgFileName = "$basePath\EESample3CSV.csv"
$fileNameExport = "$basePath\EECSVSample_Export.csv"

# Cleanup old files
Write-Host "Cleanup"
if (Test-Path($fileNameExport)) { Remove-Item $fileNameExport }
Write-Host "Cleanup done"
# Known values to be replaced in field Account Description
$replacementStrings =  @("Meals, Other Exp Employees","Mileage, Tolls, Parking, Taxis","Tolls, Parking, Taxis")

# Read original file (the header line is handled in the processing loop below)
Write-Progress "Read source file ..."
$originalContent = Get-Content $orgFileName

Write-Host "File read. Start processing"

$lineCounter = 0
$allLines = $originalContent.Length
# Check each line for additional commas
ForEach ($unprocessedLine in $originalContent) {
    Write-Progress "Processing line $($lineCounter + 1) of $allLines"
    if ($lineCounter -eq 0) {
        # Case 0: Header line
        Add-Content $fileNameExport $unprocessedLine
    } else {
        if ($unprocessedLine.Split(",").Length -eq 23)
        {
            # Case 1: Everything OK
            Add-Content $fileNameExport $unprocessedLine
        } else {
            For ($i=0; $i -lt $replacementStrings.Length; $i++)
            {
                $replaceString = ",$($replacementStrings[$i]),"
                $unprocessedLine = $unprocessedLine.Replace($replaceString,",@@$i@@,")
            }
    
            if ($unprocessedLine.Split(",").Length -ne 23) {
                # Case 3: Additional commas
                $workingArray = $unprocessedLine.Split(",")

                $leftPart = $workingArray[0..18]
                $rightPart = $workingArray[-3..-1]

                $unprocessedLine =  [string]::join(",", $leftPart) + ",""" + [string]::join(",", $workingArray[19..($workingArray.Length-4)])+""","+[string]::join(",", $rightPart)
            }

            # Now back-replacing case 2 and 3
            For ($k=0;$k -lt $replacementStrings.Length; $k++)
            {
                $backreplaceString2 = ",""$($replacementStrings[$k])"","
                $unprocessedLine = $unprocessedLine.Replace(",@@$k@@,", $backreplaceString2)
            }
            
            Add-Content $fileNameExport $unprocessedLine
        }
    }
    $lineCounter += 1
}
Write-Host "Done :-)"



HTH
Rainer
 
LVL 45

Expert Comment

by:aikimark
@Rainer

Where/how did you get a complete list of values for the AccountDescription field?
 
LVL 7

Author Comment

by:tomfarrar
I am not a VBA person at this point in my life.  I am working on that.  I did not understand the pattern, and I only said that the "data" was parsed correctly because you provided back to me the data samples I gave you, and the examples appeared to be parsed correctly.

Match 0


SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Meals, Other Exp Employees
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Pearland-Gem Division.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:02


Match 1


SubMatch 0: 2010
SubMatch 1: 10
SubMatch 2: 10001
SubMatch 3: D
SubMatch 4: 523110
SubMatch 5: $39540.04
SubMatch 6: 2010
SubMatch 7: 67108875
SubMatch 8: 16
SubMatch 9: 17
SubMatch 10: 492
SubMatch 11:
SubMatch 12: 39540.04
SubMatch 13: Dec
SubMatch 14: USD
SubMatch 15: Mileage, Tolls, Parking, Taxis
SubMatch 16: EXPENSE
SubMatch 17: Local
SubMatch 18: 522K
SubMatch 19: SLP-Orange-Open Division.
SubMatch 20: 6260038
SubMatch 21: Expense or Deduction
SubMatch 22: 4/15/2011 14:22:03

Etc....
 
LVL 45

Expert Comment

by:aikimark
@tomfarrar
I am not a VBA person at this point in my life
How are you going to integrate the VBA code I posted with the VBA code (?you already have?) that imports the CSV file data into your Access database?
Do you have any programming experience?  If so, what languages?
Do you understand the code I posted earlier?

If you do not have VBA code to do the importing, then that is an additional requirement, for which we will need information about your database (target table names, field names, your 'import scheme', etc.)

By 'import scheme', I mean the data flow you are using (a rough sketch of the first option follows the list):
* import CSV into a new table each time you import, append those rows into a production table
* use/reuse the same table for each import, deleting its rows before you start the import process, append those rows into a production table
* import directly into the production table, appending the imported rows
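
For what it's worth, here is a minimal sketch of the first scheme, assuming a staging table named tblStaging with 23 text fields named F1 through F23 (all names hypothetical) and the second data-line pattern I posted above:

Sub ImportParsedCsv()
    Dim rs As DAO.Recordset
    Dim oRE As Object, oMatches As Object
    Dim intFN As Integer, strLine As String
    Dim strPat As String, i As Long
    ' build the 23-group pattern: 15 simple fields, then the mixed tail
    For i = 1 To 15: strPat = strPat & "([^,]*),": Next
    strPat = strPat & "(.*),([A-Z]*),([^,]*),([^,]*),(.*),([^,]*),([^,]*),([^,]*)$"
    Set oRE = CreateObject("vbscript.regexp")
    oRE.Pattern = strPat
    Set rs = CurrentDb.OpenRecordset("tblStaging", dbOpenDynaset)
    intFN = FreeFile
    Open "C:\data\source.csv" For Input As #intFN
    Line Input #intFN, strLine                 ' skip the header line
    Do Until EOF(intFN)
        Line Input #intFN, strLine
        Set oMatches = oRE.Execute(strLine)
        If oMatches.Count > 0 Then
            rs.AddNew
            For i = 0 To oMatches(0).SubMatches.Count - 1
                rs.Fields("F" & (i + 1)).Value = oMatches(0).SubMatches(i)
            Next
            rs.Update
        End If
    Loop
    Close #intFN
    rs.Close
End Sub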
 
LVL 44

Expert Comment

by:Rainer Jeschor
@aikimark:
Sorry - I forgot to mention that the replacement strings array has to be adjusted/extended with all of the 6-8 known values.
 
LVL 7

Author Comment

by:tomfarrar
So I kind of understand the pattern: the first 15 fields can be parsed correctly because they are "well formed", and the 16th field parses up to the point where the line reads any of the following: ASSET, EXPENSE, LIABILITY, REVENUE, which are the 17th field's values (what if one of those words is included in the problem AccountDescription field?).  Then the 18th and 19th fields can be parsed on the comma because they are "well formed".  Then I guess you work from the right to parse the 23rd/22nd/21st fields on the comma, since they are well formed.  That leaves the remaining "not well formed" string as the 20th field.  Is this what is going on?
 
LVL 45

Expert Comment

by:aikimark
@tomfarrar

Page up and reread my comments, especially my explanations about the patterns.

The regexp object does the parsing, based on the pattern.  Since all of the values are upper case words, I simplified the pattern I supply to the regexp object variable.

Understanding the regexp pattern should be secondary to importing the data after it has been properly parsed by the regexp object.
 
LVL 45

Expert Comment

by:aikimark
@Rainer

I think your PS script would be simplified if you used either the -match operator or the .NET regular expression namespace that is available in PS environments.  Since .NET allows repeating capture groups, your patterns would be simpler/shorter.
This TechNet article would be a nice place to start familiarizing yourself with regular expressions in the PowerShell environment.
http://social.technet.microsoft.com/wiki/contents/articles/4310.powershell-working-with-regular-expressions-regex.aspx

I'm impressed with the number and quality of PS\regex examples on this page, which may help you in your assimilation of regular expressions into your PS scripts:
http://www.powershellcookbook.com/recipe/qAxK/appendix-b-regular-expression-reference
 
LVL 7

Author Comment

by:tomfarrar
@aikimark

Yes, I am fine with not having to understand the pattern (I was just trying to learn), but I need a little help understanding how to "adjust" the VBA code to accommodate getting the file into an Access database "parsed".  My skill set is not at the level of the others commenting on my original post.  Thanks.
 
LVL 45

Expert Comment

by:aikimark
@tomfarrar

What does your code currently look like?
Please answer the questions posed in my earlier comment.
 
LVL 7

Author Comment

by:tomfarrar
@aikimark

Your questions:

How are you going to integrate the VBA code I posted with the VBA code (?you already have?) that imports the CSV file data into your Access database?
Do you have any programming experience?  If so, what languages?
Do you understand the code I posted earlier?

My answers:
I have no VBA code other than what you have suggested.  So as comprehensive as your suggested solution may be, I will need to get someone to help me implement it.  I am not a programmer, so this is part of the gap in communication.  I have a vague understanding of the code you posted earlier.  Thanks.
 
LVL 45

Expert Comment

by:aikimark
Please post an empty database (.mdb format) containing only the table definition you wish to append to or replace.  Also, post a few thousand rows of the CSV file.  You may include both of these as a single attached zip file if it makes it easier for you.

If there is someone who can help you with this locally, we can wait until they have a chance to see this thread and decide if they have any questions.  Such import routines are pretty standard stuff.
 
LVL 7

Author Comment

by:tomfarrar
Thanks a lot for your ideas and solutions on this, aikimark.  The data I have is not really mine to share, and I would be remiss in doing so.  I will work with your ideas and my people to see if we can get the solution to the next level.  I wish I could include you, as you have been integral to improving my understanding of what can be done in these cases.  I will take care of the point awards later.

Thank you too, Rainer.  I imagine going down the road with you would have been beneficial too.

Also thanks to the others who shared their comments.  - Tom
 
LVL 45

Expert Comment

by:aikimark
@tomfarrar

How are your people doing with the code and regex patterns I posted?
 
LVL 7

Author Closing Comment

by:tomfarrar
Great thoughts on a complicated issue (as I see it anyway).  Thank you.
