Link to home
Start Free TrialLog in
Avatar of PremierK
PremierK

asked on

"Cleaning" a text file

I have a text file that is essentially the output from an application reporting system.  I need to take the contents of the text file, and erase the headers, and some other data, and then load the data into a Microsoft Access database.  The problem is that the data is not separated by commas, or anything, it's just tab delimited.  Also the report contains report, and page headers throughout the document that need to be removed before it's loaded into the Access database.  

I need to rip away the unneed text, and load the useful text into an access database.

Ideas on how I can do this??
Avatar of Mike McCracken
Mike McCracken

Basically you need to write a program to look at each line, word, etc and determine if you need it.

Is the text you need deleted unique in any way?

Could you post the file somewhere?

mlmcc
Hi,
Unfortunately you will have to parse the file.  That is, you will have to load it, and remove the header information in code.  Typically header blocks are delimited by some recognisable character codes, or character sequences, or quite frequently they are a fixed number of fields in length.
You will need detail on the schema, or structure of the text file.  You can hard code this schema into your cleaning routines, or you can load the structure from a database, or from an xml file.  Loading the schema is more difficult to implement, but it does allow you to adapt if the structure changes by changing your database or xml file without having to recode to reflect changes.
This is an increasingly common problem as systems become more and more integrated.
You don't mention which technologies, or operating systems you are employing.  However these problems exist for all types of systems and technologies and are normally quite easy to implement, the difficulty lies in nailing down the structure.  Use several different examples of different sizes, from different times of the year, etc. to ensure that you can define your requirements fully.  
Other things to consider are the byte-size and character set, i.e. single, double, multi, ascii, unicode, etc.
The fact that the output is from a reporting is a tremendous benefit in this case, as it will most likely generate reports according to very specific rules.  If the reporting system is an internal development you may even have access to source-code, or the development team responsible for its support.  If its a production or off-the-shelf system you can probably contact their tech support for more detailed info.
Also, the fact that output is from a reporting system may limit your need to dynamically load the schema, as the schema will not change unless the reporting system changes.

HTH,

Martin
ASKER CERTIFIED SOLUTION
Avatar of richardlee
richardlee

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of PremierK

ASKER

The specific data that I need in the text file will always start the line with a 14 or a 12, so I'm think that I can write some VB code to look at the first two characters in the line, if it's a "14" or "12", I will copy the entire line, and write it to a text file, then use the "clean" text file to do the import into MS Access.  Sound good??

richardlee, I think you may be awarded the points, we'll see.

Thanks!
I'm back at sqaure one.  I tried a couple of things, but they won't work.  The idea that I had to search each line for a "14" or "12" won't work, because the text file doesn't have a carriage return on every line to denote a new line, so when I use the VB.NET ReadLine method, it reads multiple lines of text, and not one line at a time, which would have let me check for the occurence of 12 or 14 as the first two characters in the line.

Second, I tried to search the text file for a "12" or "14", the select the number of characters needed after that, but if a 12 or 14 exist in the currency data, then that won't work, because it's no way to differentiate between the case 12 or 14, and currency data 12 or 14.

I'm just thinking the document doesn't have enough "structure" to programmatically code something to handle the conditions.

I'm out of ideas folks, HELP me if you can!!
Hi PremierK

You need to "tokenize" each character to decipher what it means in context of the entire string.

Remember the sample code which I posted earlier.  I will change it a bit.

Sample code that you can use:
-----------------------------

sub cleanup()
 dim szLine as string
 dim szData as string
 dim sztmp as string
 dim i as integer

 open "datafile.txt" for input as #1
 open "output.txt" for output as #2

 do while not eof(1)
   line input #1, szLine

   ' do some checking, cleanup here
   ' --- TIP:
   ' * you can use Mid to get portions of the string
   '   (eg. Mid(szLine,1,4))
   ' * you can check if string is numeric using
   '   IsNumeric() function

   '--- added new loop to tokenize ---
   for i = 1 to len(szLine)
      sztmp = Mid(szLine,i,1)  ' this will give you each character
      ' -- decide what to do with it
      ' eg. if the first char is "1", then look
      '     at the next char to see if it's "4" or "2".
      '     it will fulfil your criteria as "14" or "12"

      ' -- you got to do some fancy programming logic here
      '    to get the data that you need.

      ' -- better to get data with carriage return so the
      '    logic is easier.
   next i

   'szData = mid(szLine,1,4)

   print #2, szData  ' writes data to "output.txt"
 loop

 close #1
 close #2

end sub


--- if you still need help, send me a sample data file and explain what is required - email: ryflee@yahoo.com

No promises, but see how can I help.

-richard
PremierK:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
EXPERTS:
Post your closing recommendations!  No comment means you don't care.