Solved

Help with perl script to convert UIEE file to csv

Posted on 2007-11-22
11
346 Views
Last Modified: 2010-04-21
Hi Experts,

I have a huge file (308mb) which I believe to be in the UIEE format. An example of one entry....

**START
IB 0702026174
BI Hardback
AU EVANS, WILLIAM CHARLES
BC GBC
CO SCOTLAND
EI 15 REV ED
IU 275 ILLS.
PD 20011213
NP 600
RP 68.99
RI 68.99
RE 68.99
PU ELSEVIER HEALTH SCIENCES
YP 2001
RC U
RS TERTIARY EDUCATION (US: COLLEGE)
TI TREASE AND EVANS PHARMACOGNOSY
DE Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and aromatherapy.
EA 9780702026171
RF R
WE 1780
SG 1
GC O00
I3 9780702026171
**

Obviously it's all book related :) All entries follow the same format, with the first 2 characters indicating the column header. Now my ideal is to convert this file to a csv with one set of column headings. I've been given a perl script by a colleague, and he believes it was designed to do what I want but has never got it to work, or even know if it does. The script is as follows. Would this script work? If not does it need to be tweaked or a do over from scratch?

#! /usr/bin/perl -w
 

@columns = ();

@values = ();

$query = "";
 

while( $line = <STDIN> ){

    $line =~ s/\n//g;

    if( $line =~ /BOOS\|/ ){

        @columns = ();

        @values = ();

        while( $line !~ /BOOE\|/ ){

            $line = <STDIN>;

            $line =~ s/\n//g;

            if( $line =~ /(.+)\|(.+)/ ){

                push( @columns, $1 );

                push( @values, $2 );

            }

        }

        $query = "INSERT INTO books\n(";

        foreach $column ( @columns ){

            $query .= " `$column`,\n";

        }

        chop( $query );

        chop( $query );

        $query .= " )\nVALUES\n(";

        foreach $value ( @values ){

            $value =~ s/\'/\\\'/g;

            $value =~ s/\r//g;

            $query .= " \'$value\',\n";

        }

        chop( $query );

        chop( $query );

        $query .= " );\n";

        print $query;

    }

}

Open in new window

0
Comment
Question by:fyreheart
  • 5
  • 5
11 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 20333734
If we change $line =~ /(.+)\|(.+)/
to $line =~ /(.+)\s+(.+)/
and add the line
BOOS|
to the start of the file, and
BOOE|
to the end of the file
then running the program produces

INSERT INTO books
( `IB`,
 `BI`,
 `AU EVANS, WILLIAM`,
 `BC`,
 `CO`,
 `EI 15 REV`,
 `IU 275`,
 `PD`,
 `NP`,
 `RP`,
 `RI`,
 `RE`,
 `PU ELSEVIER HEALTH`,
 `YP`,
 `RC`,
 `RS TERTIARY EDUCATION (US:`,
 `TI TREASE AND EVANS`,
 `DE Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and`,
 `EA`,
 `RF`,
 `WE`,
 `SG`,
 `GC`,
 `I3` )
VALUES
( '0702026174',
 'Hardback',
 'CHARLES',
 'GBC',
 'SCOTLAND',
 'ED',
 'ILLS.',
 '20011213',
 '600',
 '68.99',
 '68.99',
 '68.99',
 'SCIENCES',
 '2001',
 'U',
 'COLLEGE)',
 'PHARMACOGNOSY',
 'aromatherapy.',
 '9780702026171',
 'R',
 '1780',
 '1',
 'O00',
 '9780702026171' );

Is that what you wanted it to do?
0
 

Author Comment

by:fyreheart
ID: 20334783
Thankyou for the response as I have no idea with scripting! The output file is not quite want I had hoped for. I would like it to seperate out the first 2 letters and produce something more along the lines of....

INSERT INTO books
( `IB`,
 `BI`,
 `AU`,
 `BC`,
 `CO`,
 `EI`,
 `IU`,
 `PD`,
 `NP`,
 `RP`,
 `RI`,
 `RE`,
 `PU`,
 `YP`,
 `RC`,
 `RS`,
 `TI`,
 `DE`,
 `EA`,
 `RF`,
 `WE`,
 `SG`,
 `GC`,
 `I3` )
VALUES
( '0702026174',
 'Hardback',
 'CHARLES EVANS, WILLIAM',
 'GBC',
 'SCOTLAND',
 '15 REV ED',
 '275 ILLS.',
 '20011213',
 '600',
 '68.99',
 '68.99',
 '68.99',
 'ELSEVIER HEALTH SCIENCES',
 '2001',
 'U',
 'TERTIARY EDUCATION (US: COLLEGE)',
 'TREASE AND EVANS PHARMACOGNOSY',
 'Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and aromatherapy.',
 '9780702026171',
 'R',
 '1780',
 '1',
 'O00',
 '9780702026171' );

Is that do-able within the script?
0
 
LVL 84

Expert Comment

by:ozo
ID: 20337001
$line =~ /(.+?)\s+(.+)/
0
 
LVL 6

Assisted Solution

by:nitinsawhney
nitinsawhney earned 200 total points
ID: 20337968
Hi,

It is not feasible to edit an 308 MB file and change the start and end notation of the records as suggested by ozo in the first comment.
The following code is all you would need to parse your file. It assumes that the INPUT file is named as input.txt.

Cheers
#! /usr/bin/perl -w
 

@columns = ();

@values = ();

$query = "";
 

open(INFILE,'< input.txt') || die "File not found";

while( $line = <INFILE> ){

    $line =~ s/\n//g;

    if( $line =~ /^\*\*START$/ ){

        @columns = ();

        @values = ();

        while( $line !~ /^\*\*$/ ){

            $line = <INFILE>;

            $line =~ s/\n//g;

            if( $line =~ /^(.{2})\s+?(.+)$/ ){

                push( @columns, $1 );

                push( @values, $2 );

            }

        }

        $query = "INSERT INTO books\n(";

        foreach $column ( @columns ){

            $query .= " `$column`,\n";

        }

        chop( $query );

        chop( $query );

        $query .= " )\nVALUES\n(";

        foreach $value ( @values ){

            $value =~ s/\'/\\\'/g;

            $value =~ s/\r//g;

            $query .= " \'$value\',\n";

        }

        chop( $query );

        chop( $query );

        $query .= " );\n";

        print $query;

    }

}

close INFILE;

Open in new window

0
 
LVL 84

Expert Comment

by:ozo
ID: 20338043
Assuming that
**START
in the file in meant to take the place of BOOS|  and
**
takes the place of BOOE|

perl -ne 'BEGIN{$/="**START\n"}print "INSERT INTO books\n(`",join("`,\n `",keys %h),"` )\nVALUES\n(`",join("`,\n `",values %h),"` );\n" if %h=/^(\w\w)\s+(.*)/gm;' input.txt
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 

Author Comment

by:fyreheart
ID: 20338219
Ok so now for an entirely dense question, at the end of the program running will it then output to a csv? I see the insert command..but I just wanna check I don't need an already created file for it to output to or an open file of some sort?

Many thanks to both of you for your assistance so far!
0
 
LVL 84

Expert Comment

by:ozo
ID: 20338318
running the program will print

INSERT INTO books
( `IB`,
 `BI`,
 `AU`,
 `BC`,
 `CO`,
 `EI`,
 `IU`,
 `PD`,
 `NP`,
 `RP`,
 `RI`,
 `RE`,
 `PU`,
 `YP`,
 `RC`,
 `RS`,
 `TI`,
 `DE`,
 `EA`,
 `RF`,
 `WE`,
 `SG`,
 `GC`,
 `I3` )
VALUES
( '0702026174',
 'Hardback',
 'CHARLES EVANS, WILLIAM',
 'GBC',
 'SCOTLAND',
 '15 REV ED',
 '275 ILLS.',
 '20011213',
 '600',
 '68.99',
 '68.99',
 '68.99',
 'ELSEVIER HEALTH SCIENCES',
 '2001',
 'U',
 'TERTIARY EDUCATION (US: COLLEGE)',
 'TREASE AND EVANS PHARMACOGNOSY',
 'Serves as the encyclopaedic reference work on pharmacognosy, the study of those natural substances, principally plants, that find a use in medicine. This book balances between classical and modern aspects of this branch of science, and covers the importance of complementary medicines, including herbal, homeopathic and aromatherapy.',
 '9780702026171',
 'R',
 '1780',
 '1',
 'O00',
 '9780702026171' );

what you do with that output is up to you.
0
 

Author Comment

by:fyreheart
ID: 20338397
Ok so now I'm entirely confused...not a difficult thing to do!

If I run the code nitinsawhney suggested, my dos screen scrolls endless through

INSERT INTO books )
VALUES );
INSERT INTO books )
VALUES );

You get the idea...but doesn't appear to output anything.

Ozo....where should your code (below) be put?

perl -ne 'BEGIN{$/="**START\n"}print "INSERT INTO books\n(`",join("`,\n `",keys %h),"` )\nVALUES\n(`",join("`,\n `",values %h),"` );\n" if %h=/^(\w\w)\s+(.*)/gm;' input.txt
0
 
LVL 84

Accepted Solution

by:
ozo earned 300 total points
ID: 20340220
> my dos screen scrolls endless through
Does input.txt contain endless
**START
**


> where should your code (below) be put?
type it (or cut and paste it) into the dos command shell
But in a dos command line you would have to use " instead of '
perl -ne "BEGIN{$/='**START'}print qq'INSERT INTO books\n(`',join(qq'`,\n `',keys %h),qq'` )\nVALUES\n(`',join(qq'`,\n `',values %h),qq'` );\n' if %h=/^(\w\w)\s+(.*)/gm;" input.txt
0
 

Author Closing Comment

by:fyreheart
ID: 31410541
Have had to abandon this script due to a problem discovered within the input data. Thanks for all your help.
0
 

Author Comment

by:fyreheart
ID: 20342840
Thanks for  your help, however have had to abandon this script as it was discovered there was a problem with the data.
0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

Many time we need to work with multiple files all together. If its windows system then we can use some GUI based editor to accomplish our task. But what if you are on putty or have only CLI(Command Line Interface) as an option to  edit your files. I…
A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded (http://www.mongodb.org/downloads),  installed and started the daemon, and had a console window open. After an hour or two of playing at the command …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
This video shows how to remove a single email address from the Outlook 2010 Auto Suggestion memory. NOTE: For Outlook 2016 and 2013 perform the exact same steps. Open a new email: Click the New email button in Outlook. Start typing the address: …

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

16 Experts available now in Live!

Get 1:1 Help Now