Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people, just like you, are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
Solved

Help with perl script to convert UIEE file to csv

Posted on 2007-11-22
11
350 Views
Last Modified: 2010-04-21
Hi Experts,

I have a huge file (308mb) which I believe to be in the UIEE format. An example of one entry....

**START
IB 0702026174
BI Hardback
AU EVANS, WILLIAM CHARLES
BC GBC
CO SCOTLAND
EI 15 REV ED
IU 275 ILLS.
PD 20011213
NP 600
RP 68.99
RI 68.99
RE 68.99
PU ELSEVIER HEALTH SCIENCES
YP 2001
RC U
RS TERTIARY EDUCATION (US: COLLEGE)
TI TREASE AND EVANS PHARMACOGNOSY
DE Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and aromatherapy.
EA 9780702026171
RF R
WE 1780
SG 1
GC O00
I3 9780702026171
**

Obviously it's all book related :) All entries follow the same format, with the first 2 characters indicating the column header. Now my ideal is to convert this file to a csv with one set of column headings. I've been given a perl script by a colleague, and he believes it was designed to do what I want but has never got it to work, or even know if it does. The script is as follows. Would this script work? If not does it need to be tweaked or a do over from scratch?

#! /usr/bin/perl -w
 
@columns = ();
@values = ();
$query = "";
 
while( $line = <STDIN> ){
    $line =~ s/\n//g;
    if( $line =~ /BOOS\|/ ){
        @columns = ();
        @values = ();
        while( $line !~ /BOOE\|/ ){
            $line = <STDIN>;
            $line =~ s/\n//g;
            if( $line =~ /(.+)\|(.+)/ ){
                push( @columns, $1 );
                push( @values, $2 );
            }
        }
        $query = "INSERT INTO books\n(";
        foreach $column ( @columns ){
            $query .= " `$column`,\n";
        }
        chop( $query );
        chop( $query );
        $query .= " )\nVALUES\n(";
        foreach $value ( @values ){
            $value =~ s/\'/\\\'/g;
            $value =~ s/\r//g;
            $query .= " \'$value\',\n";
        }
        chop( $query );
        chop( $query );
        $query .= " );\n";
        print $query;
    }
}

Open in new window

0
Comment
Question by:fyreheart
  • 5
  • 5
11 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 20333734
If we change $line =~ /(.+)\|(.+)/
to $line =~ /(.+)\s+(.+)/
and add the line
BOOS|
to the start of the file, and
BOOE|
to the end of the file
then running the program produces

INSERT INTO books
( `IB`,
 `BI`,
 `AU EVANS, WILLIAM`,
 `BC`,
 `CO`,
 `EI 15 REV`,
 `IU 275`,
 `PD`,
 `NP`,
 `RP`,
 `RI`,
 `RE`,
 `PU ELSEVIER HEALTH`,
 `YP`,
 `RC`,
 `RS TERTIARY EDUCATION (US:`,
 `TI TREASE AND EVANS`,
 `DE Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and`,
 `EA`,
 `RF`,
 `WE`,
 `SG`,
 `GC`,
 `I3` )
VALUES
( '0702026174',
 'Hardback',
 'CHARLES',
 'GBC',
 'SCOTLAND',
 'ED',
 'ILLS.',
 '20011213',
 '600',
 '68.99',
 '68.99',
 '68.99',
 'SCIENCES',
 '2001',
 'U',
 'COLLEGE)',
 'PHARMACOGNOSY',
 'aromatherapy.',
 '9780702026171',
 'R',
 '1780',
 '1',
 'O00',
 '9780702026171' );

Is that what you wanted it to do?
0
 

Author Comment

by:fyreheart
ID: 20334783
Thankyou for the response as I have no idea with scripting! The output file is not quite want I had hoped for. I would like it to seperate out the first 2 letters and produce something more along the lines of....

INSERT INTO books
( `IB`,
 `BI`,
 `AU`,
 `BC`,
 `CO`,
 `EI`,
 `IU`,
 `PD`,
 `NP`,
 `RP`,
 `RI`,
 `RE`,
 `PU`,
 `YP`,
 `RC`,
 `RS`,
 `TI`,
 `DE`,
 `EA`,
 `RF`,
 `WE`,
 `SG`,
 `GC`,
 `I3` )
VALUES
( '0702026174',
 'Hardback',
 'CHARLES EVANS, WILLIAM',
 'GBC',
 'SCOTLAND',
 '15 REV ED',
 '275 ILLS.',
 '20011213',
 '600',
 '68.99',
 '68.99',
 '68.99',
 'ELSEVIER HEALTH SCIENCES',
 '2001',
 'U',
 'TERTIARY EDUCATION (US: COLLEGE)',
 'TREASE AND EVANS PHARMACOGNOSY',
 'Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and aromatherapy.',
 '9780702026171',
 'R',
 '1780',
 '1',
 'O00',
 '9780702026171' );

Is that do-able within the script?
0
 
LVL 84

Expert Comment

by:ozo
ID: 20337001
$line =~ /(.+?)\s+(.+)/
0
Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

 
LVL 6

Assisted Solution

by:nitinsawhney
nitinsawhney earned 200 total points
ID: 20337968
Hi,

It is not feasible to edit an 308 MB file and change the start and end notation of the records as suggested by ozo in the first comment.
The following code is all you would need to parse your file. It assumes that the INPUT file is named as input.txt.

Cheers
#! /usr/bin/perl -w
 
@columns = ();
@values = ();
$query = "";
 
open(INFILE,'< input.txt') || die "File not found";
while( $line = <INFILE> ){
    $line =~ s/\n//g;
    if( $line =~ /^\*\*START$/ ){
        @columns = ();
        @values = ();
        while( $line !~ /^\*\*$/ ){
            $line = <INFILE>;
            $line =~ s/\n//g;
            if( $line =~ /^(.{2})\s+?(.+)$/ ){
                push( @columns, $1 );
                push( @values, $2 );
            }
        }
        $query = "INSERT INTO books\n(";
        foreach $column ( @columns ){
            $query .= " `$column`,\n";
        }
        chop( $query );
        chop( $query );
        $query .= " )\nVALUES\n(";
        foreach $value ( @values ){
            $value =~ s/\'/\\\'/g;
            $value =~ s/\r//g;
            $query .= " \'$value\',\n";
        }
        chop( $query );
        chop( $query );
        $query .= " );\n";
        print $query;
    }
}
close INFILE;

Open in new window

0
 
LVL 84

Expert Comment

by:ozo
ID: 20338043
Assuming that
**START
in the file in meant to take the place of BOOS|  and
**
takes the place of BOOE|

perl -ne 'BEGIN{$/="**START\n"}print "INSERT INTO books\n(`",join("`,\n `",keys %h),"` )\nVALUES\n(`",join("`,\n `",values %h),"` );\n" if %h=/^(\w\w)\s+(.*)/gm;' input.txt
0
 

Author Comment

by:fyreheart
ID: 20338219
Ok so now for an entirely dense question, at the end of the program running will it then output to a csv? I see the insert command..but I just wanna check I don't need an already created file for it to output to or an open file of some sort?

Many thanks to both of you for your assistance so far!
0
 
LVL 84

Expert Comment

by:ozo
ID: 20338318
running the program will print

INSERT INTO books
( `IB`,
 `BI`,
 `AU`,
 `BC`,
 `CO`,
 `EI`,
 `IU`,
 `PD`,
 `NP`,
 `RP`,
 `RI`,
 `RE`,
 `PU`,
 `YP`,
 `RC`,
 `RS`,
 `TI`,
 `DE`,
 `EA`,
 `RF`,
 `WE`,
 `SG`,
 `GC`,
 `I3` )
VALUES
( '0702026174',
 'Hardback',
 'CHARLES EVANS, WILLIAM',
 'GBC',
 'SCOTLAND',
 '15 REV ED',
 '275 ILLS.',
 '20011213',
 '600',
 '68.99',
 '68.99',
 '68.99',
 'ELSEVIER HEALTH SCIENCES',
 '2001',
 'U',
 'TERTIARY EDUCATION (US: COLLEGE)',
 'TREASE AND EVANS PHARMACOGNOSY',
 'Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and aromatherapy.',
 '9780702026171',
 'R',
 '1780',
 '1',
 'O00',
 '9780702026171' );

what you do with that output is up to you.
0
 

Author Comment

by:fyreheart
ID: 20338397
Ok so now I'm entirely confused...not a difficult thing to do!

If I run the code nitinsawhney suggested, my dos screen scrolls endless through

INSERT INTO books )
VALUES );
INSERT INTO books )
VALUES );

You get the idea...but doesn't appear to output anything.

Ozo....where should your code (below) be put?

perl -ne 'BEGIN{$/="**START\n"}print "INSERT INTO books\n(`",join("`,\n `",keys %h),"` )\nVALUES\n(`",join("`,\n `",values %h),"` );\n" if %h=/^(\w\w)\s+(.*)/gm;' input.txt
0
 
LVL 84

Accepted Solution

by:
ozo earned 300 total points
ID: 20340220
> my dos screen scrolls endless through
Does input.txt contain endless
**START
**


> where should your code (below) be put?
type it (or cut and paste it) into the dos command shell
But in a dos command line you would have to use " instead of '
perl -ne "BEGIN{$/='**START'}print qq'INSERT INTO books\n(`',join(qq'`,\n `',keys %h),qq'` )\nVALUES\n(`',join(qq'`,\n `',values %h),qq'` );\n' if %h=/^(\w\w)\s+(.*)/gm;" input.txt
0
 

Author Closing Comment

by:fyreheart
ID: 31410541
Have had to abandon this script due to a problem discovered within the input data. Thanks for all your help.
0
 

Author Comment

by:fyreheart
ID: 20342840
Thanks for  your help, however have had to abandon this script as it was discovered there was a problem with the data.
0

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
perl match and sort unique result 2 141
Strange perl issue 6 129
remove duplicates from the csv file 13 113
Perl passing in variables to do substitution 6 75
I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

809 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question