?
Solved

"Cleaning" a text file

Posted on 2003-02-20
7
Medium Priority
?
267 Views
Last Modified: 2010-04-17
I have a text file that is essentially the output from an application reporting system.  I need to take the contents of the text file, and erase the headers, and some other data, and then load the data into a Microsoft Access database.  The problem is that the data is not separated by commas, or anything, it's just tab delimited.  Also the report contains report, and page headers throughout the document that need to be removed before it's loaded into the Access database.  

I need to rip away the unneed text, and load the useful text into an access database.

Ideas on how I can do this??
0
Comment
Question by:PremierK
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
7 Comments
 
LVL 101

Expert Comment

by:mlmcc
ID: 7990020
Basically you need to write a program to look at each line, word, etc and determine if you need it.

Is the text you need deleted unique in any way?

Could you post the file somewhere?

mlmcc
0
 
LVL 1

Expert Comment

by:kelly_mrt
ID: 7990081
Hi,
Unfortunately you will have to parse the file.  That is, you will have to load it, and remove the header information in code.  Typically header blocks are delimited by some recognisable character codes, or character sequences, or quite frequently they are a fixed number of fields in length.
You will need detail on the schema, or structure of the text file.  You can hard code this schema into your cleaning routines, or you can load the structure from a database, or from an xml file.  Loading the schema is more difficult to implement, but it does allow you to adapt if the structure changes by changing your database or xml file without having to recode to reflect changes.
This is an increasingly common problem as systems become more and more integrated.
You don't mention which technologies, or operating systems you are employing.  However these problems exist for all types of systems and technologies and are normally quite easy to implement, the difficulty lies in nailing down the structure.  Use several different examples of different sizes, from different times of the year, etc. to ensure that you can define your requirements fully.  
Other things to consider are the byte-size and character set, i.e. single, double, multi, ascii, unicode, etc.
The fact that the output is from a reporting is a tremendous benefit in this case, as it will most likely generate reports according to very specific rules.  If the reporting system is an internal development you may even have access to source-code, or the development team responsible for its support.  If its a production or off-the-shelf system you can probably contact their tech support for more detailed info.
Also, the fact that output is from a reporting system may limit your need to dynamically load the schema, as the schema will not change unless the reporting system changes.

HTH,

Martin
0
 

Accepted Solution

by:
richardlee earned 150 total points
ID: 7990837
Hi,

I'm doing something similar where I'm taking data from a report, clean it up and convert it to a format readable by Access Database.  For customized cleanup job, you need to write some Visual Basic Code in Access or in Excel.

The szData string below is basically the format acceptable by Access, and can be space or comma delimited.

See if this helps.

Sample code that you can use:
-----------------------------

sub cleanup()
  dim szLine as string
  dim szData as string

  open "datafile.txt" for input as #1
  open "output.txt" for output as #2

  do while not eof(1)
    line input #1, szLine

    ' do some checking, cleanup here
    ' --- TIP:
    ' * you can use Mid to get portions of the string
    '   (eg. Mid(szLine,1,4))
    ' * you can check if string is numeric using
    '   IsNumeric() function

    szData = mid(szLine,1,4)

    print #2, szData  ' writes data to "output.txt"
  loop

  close #1
  close #2

end sub

0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:PremierK
ID: 8010593
The specific data that I need in the text file will always start the line with a 14 or a 12, so I'm think that I can write some VB code to look at the first two characters in the line, if it's a "14" or "12", I will copy the entire line, and write it to a text file, then use the "clean" text file to do the import into MS Access.  Sound good??

richardlee, I think you may be awarded the points, we'll see.

Thanks!
0
 

Author Comment

by:PremierK
ID: 8020397
I'm back at sqaure one.  I tried a couple of things, but they won't work.  The idea that I had to search each line for a "14" or "12" won't work, because the text file doesn't have a carriage return on every line to denote a new line, so when I use the VB.NET ReadLine method, it reads multiple lines of text, and not one line at a time, which would have let me check for the occurence of 12 or 14 as the first two characters in the line.

Second, I tried to search the text file for a "12" or "14", the select the number of characters needed after that, but if a 12 or 14 exist in the currency data, then that won't work, because it's no way to differentiate between the case 12 or 14, and currency data 12 or 14.

I'm just thinking the document doesn't have enough "structure" to programmatically code something to handle the conditions.

I'm out of ideas folks, HELP me if you can!!
0
 

Expert Comment

by:richardlee
ID: 8021835
Hi PremierK

You need to "tokenize" each character to decipher what it means in context of the entire string.

Remember the sample code which I posted earlier.  I will change it a bit.

Sample code that you can use:
-----------------------------

sub cleanup()
 dim szLine as string
 dim szData as string
 dim sztmp as string
 dim i as integer

 open "datafile.txt" for input as #1
 open "output.txt" for output as #2

 do while not eof(1)
   line input #1, szLine

   ' do some checking, cleanup here
   ' --- TIP:
   ' * you can use Mid to get portions of the string
   '   (eg. Mid(szLine,1,4))
   ' * you can check if string is numeric using
   '   IsNumeric() function

   '--- added new loop to tokenize ---
   for i = 1 to len(szLine)
      sztmp = Mid(szLine,i,1)  ' this will give you each character
      ' -- decide what to do with it
      ' eg. if the first char is "1", then look
      '     at the next char to see if it's "4" or "2".
      '     it will fulfil your criteria as "14" or "12"

      ' -- you got to do some fancy programming logic here
      '    to get the data that you need.

      ' -- better to get data with carriage return so the
      '    logic is easier.
   next i

   'szData = mid(szLine,1,4)

   print #2, szData  ' writes data to "output.txt"
 loop

 close #1
 close #2

end sub


--- if you still need help, send me a sample data file and explain what is required - email: ryflee@yahoo.com

No promises, but see how can I help.

-richard
0
 

Expert Comment

by:CleanupPing
ID: 9447613
PremierK:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
EXPERTS:
Post your closing recommendations!  No comment means you don't care.
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Entering a date in Microsoft Access can be tricky. A typo can cause month and day to be shuffled, entering the day only causes an error, as does entering, say, day 31 in June. This article shows how an inputmask supported by code can help the user a…
The SignAloud Glove is capable of translating American Sign Language signs into text and audio.
An introduction to basic programming syntax in Java by creating a simple program. Viewers can follow the tutorial as they create their first class in Java. Definitions and explanations about each element are given to help prepare viewers for future …
In this fourth video of the Xpdf series, we discuss and demonstrate the PDFinfo utility, which retrieves the contents of a PDF's Info Dictionary, as well as some other information, including the page count. We show how to isolate the page count in a…

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question