Link to home
Create AccountLog in
Avatar of rpeters9889
rpeters9889

asked on

Split text file

Hello!
Kind of hard to put this into words, but here goes.

I have a comma delimited text file with records that that look like this........(file attached)

"81205;99134;66173","1;2;3","abcdefg;hijk;tuvwxyz12",,"7;8;9"

Each delimited 'section' contains one 'piece' of a complete record each separated by a semi-colon. These 'pieces' of information will always follow the same order.

What I need to end up with is this.............

"81205,1,abcdefg",,"7"
"99134,2,hijk",,"8"
"66173,3,tuvwxyz12",,"9"

Any and all help is appreciated!
TSTCOM.TXT
Avatar of SmartIntel
SmartIntel
Flag of United States of America image

Try this
$line='your line content';

@ll=split(/"(,)+"/,$line);
foreach $kk(@ll)
{
if ($kk!~/^,$/)
{
$kk=~s/"//g;
@mm=split(/[;,]/,$kk);
print "@mm\n";
}
}

This will split the line with , first then split the 'section' with , or ; to get the 'pieces'

$line='your line';
 
@ll=split(/"(,)+"/,$line);
foreach $kk(@ll)
{
if ($kk!~/^,$/)
{
$kk=~s/"//g;
@mm=split(/[;,]/,$kk);
print "@mm\n";
}
}

Open in new window

Avatar of rpeters9889
rpeters9889

ASKER

Thanks for the response SmartIntel...

I know this going to sound really stupid...but how/where do I run this routine? My knowledge is pretty much limited to Foxpro (which can't deal with this problem due to field length restrictions).

ps...this is a file (tstcom.txt) with approx 1.5 million records (lines).
You have to run this for each line of record. also u have to modify the code little.. print the @mm outside the if condition loop. By this u will get the required data for each line. Now u have parsed data of each line which u can use do whatevery u want to do..

open (FH,"tstcom.txt") ;
while(<FH>)
{
$line=$_;
chomp $line;
@ll=split(/"(,)+"/,$line);
foreach $kk(@ll)
{
if ($kk!~/^,$/)
{
$kk=~s/"//g;
@mm=split(/[;,]/,$kk);
}
print "@mm\n" ## u have the data stored in this array.. u can do for further manipulation or whatever u      have to do.

}

}

This is just to give u some idea to do..do u want me to write the whole code???? : )
Thanks again SmartIntel,

Well, I have to admit that I'm confused. Not only in regards to the coding itself, but also in regard to how I run it once the code is functional. Obviously, this is not going to run from a command line. What application am I going to need to make it work if I can find someone willing to write the code :)

Can you offer up further assistance?
Then you need to tell me your requirement more clearly.
upto my understanding, U have a datafile tstcomm.txt which has 1.5 records of same kind.(comma seperated but has comma in between values).
u need to parse the file and get all values of each record. If my understanding is correct, please let me know what u r going to do with the parsed data. whether u r loading it into db or display it in screen or writing it into file.

SmartIntel,

I actually have five of these ascii files, each with anywhere from 800,000 to 1.5 million records (lines) each. They are comma delimited with semi-colons separating the values (exactly like the sample that was attached).

I need the data parsed into separate records so that I can load them into a .dbf file for further manipulation & integration with several other datasets. I have these same files in dbf format and I have a routine that will separate the values & create new records, but unfortunately I'm limited to a 255 character field length & many of these records exceed that.  If I can get the ascii records parsed correctly, I can easily bring all of the ascii records into Fox. If it would be of any help, I can upload a larger sample of the data.

Hope that helps to explain what I'm doing & Thanks again!
Somehow I got your requirement. But I am not familiar with dbf files. but if u can send me some sample dbf record, i could make the code to write the record in that format.
tell me one thing.if i give u the parsed record in dbf file, u can load it into db, right?
if that is the case, with the dbf format sample, i can make the parsed data to be written in the dbf files.

OK....

I've attached a sample dbf file and the sample .txt file that needs to be parsed. You will need to rename the file 'sample.xls' to 'sample.dbf' as the dbf extension is not recognized for uploading to the site. You will see that the sample dbf file has the same data in the fields as the .txt file...but the data in the field labeled 'ref_num' has been truncated due to field length constraints.

Yes, if the ascii data can be parsed into additional ascii records or directly into dbf format I can readily work with it.

Thanks1
testcom.xls
TSTCOM.TXT
Avatar of mikelfritz
so the testcom.txt has 22 records with 33 fields each(sort of - some records don't have 33)?

To clarify - you have one extremely long line of data with no "newline" characters - the record separator is a comma and the field separator is a semicolon.  You need to basically turn it sideways - 22 records with 33 fields would become 33 records with 22 fields.  If that is the case then awk would work fine - I have a basic structure for it but I would like you to clarify that this is what is needed first - I hate chasing my tail.
mikelfritz,

You are correct. Once the data is parsed, I would end up with 33 records and (I believe) 21 fields.

Thanks!
Once u have seperate 33 records, how do you want to make it fit into 21 fields. we can write this 33 records into file with 21 fields.
For that you need to show one unparsed data and how u want to display the 33 fields in result file.
Hello SmartIntel,

Actually, all we need is to have the data parsed into separate ascii file records like the example in the original question. Once we have that, we can easily pull it into a dbf format.

Am I missing something? :)...probably

Thanks again!
What you have is an array of data, in the example it's 33 wide by 21 high.  It needs to be turned to 21 wide by 33 high.  I tried using awk and got close, but not all the way there.  I'll work on it a little more but...

For anyone else that wants to try - here it is:
original records are comma seperated and the fields are semicolon seperated

original:
a1 a2 a3 a4 a5 a6 ...
b1 b2 b3 b4 b5 b6 ...
c1 c2 c3 c4 c5 c6 ...
...

needs to be:
a1 b1 c1 ...
a2 b2 c2 ...
a3 b3 c3 ...
a4 b4 c4 ...
a5 b5 c5 ...
a6 b6 c6 ...
...
Is it what you need rpeters9889? Please let me know. I will try in Perl
Hi...

I've attached two new samples. The file 'tstcom.txt' is the raw or 'unparsed' file. The file'sample.txt' is what the file would look like after being parsed.

You could consider the first delimited 'group' of semi-colon separated values as the 'key' field. In other words, if there are 19 semi-colon separated values in the first delimited group, we would end up with 19 individual records.

If you have a delimited 'group' that contains only one value (no semi-colons) that number is common to all 19 'new' records (such as "0033" or "3110").

I hope this helps to clarify what I'm after.

Many thanks for all the help!!


TSTCOM.TXT
SAMPLE.TXT
All....

Please note in the above samples that only the first 19 records from 'tstcom.txt' are shown in 'sample.tx'. There would actually be 33 records total when 'tstcom.txt' is parsed.
Anyone...anyone?
I was quite busy wit my work.i will send it soon.
Thanks!
ASKER CERTIFIED SOLUTION
Avatar of mish33
mish33
Flag of United States of America image

Link to home
membership
Create an account to see this answer
Signing up is free. No credit card required.
Create Account

Mish33,
Thanks for the reply. I'm not familiar with Python, but I did download & install it. When I run the code from the command line, I get an error of:

File 'Parser.py' line 3
for i in len(r.split(';')):
attribute error: 'list' object has no attribute 'split'

Sorry, my bad.
BTW, the code assumes the first line has as many fields as you need output lines (33 in your example). All following lines may have 0, 1 or that many fields only.
import csv
r=csv.reader(open("tstcom.txt")).next()
for i in len(r[0].split(';')):
  f=[]
  for s in r:
    k=s.split(';')
    if len(k)>1:
      f.append(k[i])
    else:
      f.append(s)
  print ','.join('"%s"' % k for k in f)

Open in new window