How to split files and use data from inside the data to name the file

A new question about a new file format I need broken up.
Last time I asked, the format was different, and I did not know how to reopen that question.

1) Replace any "&" in the file with "H"
2) Replace any "#" in the file with "I" (the letter I)
3) Break the file at each "$$$$$$$$" line.
4) Use the second entry on the "$$$$$$$$" line as part of the file name
5) Remove the "$$$$$$$$" line from the new file
6) Place each new file into a new directory

Parameters

$INDIR = /data/out/
$OUTDIR = /data/newout


Input file:

Name: sad.123.sad

$$$$$$$$|I231_0081788682|
HEADER|INV|20180224|20180224||0004165036|0004165036|0081788682|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788682|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788682|20|180182C02|20180118
&EADER|DELV|20180224|20180224||0004165036|0004165036|0081788682|||||||||||
#TEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180211C01|20180121
#TEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180503C02|20180219
#TEM|900001|20|87QS5930150|1381|9999|200|CS|0081788682|20|180181C01|20180118
#TEM|900002|20|87QS5930150|1381|9999|100|CS|0081788682|20|180182C02|20180118
$$$$$$$$|I231_0081788684|
HEADER|INV|20180224|20180224||0004165036|0004165036|0081788684|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788684|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788684|20|180182C02|20180118
&EADER|DELV|20180224|20180224||0004165036|0004165036|0081788684|||||||||||
#TEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180211C01|20180121
#TEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180503C02|20180219
#TEM|900001|20|87QS5930150|1381|9999|200|CS|0081788684|20|180181C01|20180118
#TEM|900002|20|87QS5930150|1381|9999|100|CS|0081788684|20|180182C02|20180118
$$$$$$$$|I266_0081788699|
HEADER|INV|20180224|20180224||0004165036|0004165036|0081788699|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788699|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788699|20|180182C02|20180118
&EADER|DELV|20180224|20180224||0004165036|0004165036|0081788699|||||||||||
#TEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180211C01|20180121
#TEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180503C02|20180219
#TEM|900001|20|87QS5930150|1381|9999|200|CS|0081788699|20|180181C01|20180118
#TEM|900002|20|87QS5930150|1381|9999|100|CS|0081788699|20|180182C02|20180118

Break into files (the sample above splits into 3):

1) Should be named:  I231_0081788682_02252018_051500.txt

HEADER|INV|20180224|20180224||0004165036|0004165036|0081788682|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788682|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788682|20|180182C02|20180118
HEADER|DELV|20180224|20180224||0004165036|0004165036|0081788682|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788682|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788682|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788682|20|180182C02|20180118

2) Should be named:  I231_0081788684_02252018_051500.txt
HEADER|INV|20180224|20180224||0004165036|0004165036|0081788684|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788684|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788684|20|180182C02|20180118
HEADER|DELV|20180224|20180224||0004165036|0004165036|0081788684|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788684|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788684|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788684|20|180182C02|20180118

3) Should be named:  I266_0081788699_02252018_051500.txt
HEADER|INV|20180224|20180224||0004165036|0004165036|0081788699|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788699|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|0081788699|20|180182C02|20180118
HEADER|DELV|20180224|20180224||0004165036|0004165036|0081788699|||||||||||
ITEM|900001|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180211C01|20180121
ITEM|900002|10|4L7S6101367|1381|9999|100|CS|0081788699|10|180503C02|20180219
ITEM|900001|20|87QS5930150|1381|9999|200|CS|0081788699|20|180181C01|20180118
ITEM|900002|20|87QS5930150|1381|9999|100|CS|008178869|20|180182C02|20180118
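For readers skimming, the six steps above can be sketched in a single awk pass. This is a sketch only: it uses relative demo paths and a tiny inline sample instead of the real $INDIR/$OUTDIR and input file, and `strftime` assumes GNU awk.

```shell
# Sketch: a "$$$$$$$$" marker line selects the output file name from its
# second pipe-delimited field; every other line gets "&"->"H" and "#"->"I"
# and is appended to the current output file. Demo paths, not /data ones.
INDIR=demo_in
OUTDIR=demo_out
mkdir -p "$INDIR" "$OUTDIR"
printf '%s\n' \
  '$$$$$$$$|I231_0081788682|' \
  'HEADER|INV|20180224' \
  '&EADER|DELV|20180224' \
  '#TEM|900001|10' > "$INDIR/sad.123.sad"
awk -F'|' -v destdir="$OUTDIR" '
  BEGIN { dt = strftime("%m%d%Y_%H%M%S") }        # GNU awk extension
  $1 == "$$$$$$$$" { out = destdir "/" $2 "_" dt ".txt"; next }  # marker: switch files, drop line
  { gsub(/&/, "H"); gsub(/#/, "I"); print > out } # translate and write
' "$INDIR/sad.123.sad"
ls "$OUTDIR"
```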
Asked by: wiestassoc

Bill Prew commented:
Well, sorry, sort of.  I did test it, and it works here, but my test actually looked like:

INDIR=/b/EE/EE29085981/in
OUTDIR=/b/EE/EE29085981/out

But then when I was ready to post here I copied the author's "commands" in without sanitizing them fully.


»bp
noci (Software Engineer) commented:
#!/bin/bash

INDIR=/data/out
OUTDIR=/data/newout
OUTFILE=""
cat $INDIR/sad.123.sad | sed 's/#/I/g' | while  IFS="|" read head data tail
do
        if [ "$HEAD" eq "$$$$$$$$" ]
        then
                OUTFILE=$OUTDIR/$data_$( date +%02m%02d%04Y_%02H%02M%02S ).txt
        else    
                echo >>$OUTFILE "$head|$data|$tail" 
        fi      
done    
~                                                                                                                                                                            


tel2 commented:
Hi wiestassoc,
Are you open to Perl solutions?
Bill Prew commented:
Here is an AWK-based approach, like the prior question...

BASH script to drive it for each input data file:

#!/bin/bash

INDIR = /data/out/
OUTDIR = /data/newout

for f in $(ls $INDIR); 
do 
    awk -f ee29085981.awk -v destdir=$OUTDIR $INDIR/$f; 
done


AWK script that is called:

BEGIN {
    FS = "|"
    currentDate = strftime("%Y%m%d_%H%M%S")
}

{
    if ($1 == "$$$$$$$$") {
        fileOut = destdir "/" $2 "_" currentDate ".txt"
        next
    }

    line = $0
    gsub("#", "I", line)
    gsub("&", "H", line)
    print line>>fileOut
}



»bp
tel2 commented:
Have you tested that, Bill?

The first things that stand out for me are these lines:
    INDIR = /data/out/
    OUTDIR = /data/newout
which aren't valid bash commands.  If you don't believe me, try running them.
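For anyone unsure why those lines fail: with spaces around `=`, the shell parses the variable name as a command. A small illustration (run in any POSIX shell):

```shell
# With spaces, "INDIR" is parsed as a command name with two arguments
# ("=" and "/data/out/"), not as an assignment, so the command fails
# and the variable stays unset.
if ! INDIR = /data/out/ 2>/dev/null; then
    echo "not an assignment: the shell looked for a command named INDIR"
fi
echo "INDIR is [${INDIR:-unset}]"   # prints: INDIR is [unset]

INDIR=/data/out/                    # no spaces: a real assignment
echo "INDIR is [$INDIR]"            # prints: INDIR is [/data/out/]
```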
noci (Software Engineer) commented:
#!/bin/bash

INDIR=/data/out
OUTDIR=/data/newout
OUTFILE=""
cat $INDIR/sad.123.sad | sed 's/#/I/g' | while  IFS="|" read head data tail
do
        if [ "$HEAD" eq "$$$$$$$$" ]
        then
                OUTFILE=$OUTDIR/$data_$( date +%02m%02d%04Y_%02H%02M%02S ).txt
                 rm -f $OUTFILE
        else    
                echo >>$OUTFILE "$head|$data|$tail" 
        fi      
done    



Small amendment: removing the file before use, just in case it exists,
and removal of the ~ (copy-paste issue from vim).
tel2 commented:
...not to mention removing the spaces in these lines:
    INDIR = /data/out/
    OUTDIR = /data/newout
and the trailing slash.
tel2 commented:
I'll accept that answer, Bill.   8)

Pls disregard my last post, noci & Bill.  I was thinking noci's post was from Bill.
tel2 commented:
Still waiting to hear whether you accept Perl solutions, wiestassoc.  Meanwhile, I'll give you one anyway:

#!/bin/bash

INDIR=/data/out
export OUTDIR=/data/newout
export DT=`date +%m%d%Y_%H%M%S`
perl -pe '$_="",open STDOUT, ">>$ENV{OUTDIR}/$1_$ENV{DT}.txt" if /^\${8}\|(.+?)\|/;tr /&#/HI/' $INDIR/sad.123.sad


No "export" required in the "INDIR=/data/out" command since it's not referenced from within Perl.

I notice that the last line of your input file is:
    #TEM|900002|20|87QS5930150|1381|9999|100|CS|0081788699|20|180182C02|20180118
But you've said the last line of your last output file (I266_0081788699_02252018_051500.txt) should be:
    ITEM|900002|20|87QS5930150|1381|9999|100|CS|008178869|20|180182C02|20180118
Was the latter an error?  Should the "008178869" have been "0081788699"?
tel2 commented:
...or if you prefer to pass the input file name (or file names) as arguments to the script, you could use this:
#!/bin/bash

INDIR=/data/out
export OUTDIR=/data/newout
export DT=`date +%m%d%Y_%H%M%S`
perl -pe '$_="",open STDOUT, ">>$ENV{OUTDIR}/$1_$ENV{DT}.txt" if /^\${8}\|(.+?)\|/;tr /&#/HI/' $INDIR/$*


Then run it like this:
    ./myscript.sh sad.123.sad
for one file, or this:
    ./myscript.sh sad.123.sad sad.234.sad sad.3*
for many.
skullnobrains commented:
i like this one

sed -e 's/^#/I/g ; /^\$\$\$\$\$\$\$\$/ { s/\$\$\$\$\$\$\$\$|\([^|]*\)|.*/"\1_02252018_051500.txt"/ ; h ; s/.*/truncate -s 0 &;/p ; x ; s/.*/ >> &/ ; h ; d } ; s/.*/echo "&"/ ; G ; s/\n// ' /PATH/TO/INPUT/FILE | sh -s



obviously replace /PATH/TO/INPUT/FILE with whatever fits
and check what it does first by removing the '| sh -s ' if you're unsure about executing this mess. it does work though ;)

i lazily assumed neither the output file name nor the input file would contain double quotes or backslashes or other stuff that would need a little extra effort to handle properly in a shell. it's easy, though not bullet-proof, to double all backslashes in the input file first.
tel2 commented:
Hi wiestassoc,

Is the "02252018_051500" part of your sample output filenames meant to be always literally that string (as skullnobrains seems to have assumed), or is it meant to be the current date & time (as the rest of us have)?
wiestassoc (author) commented:
02252018_051500 is only a sample date/time.  It should be the current date and time.
wiestassoc (author) commented:
thank you all.
Bill Prew commented:
Welcome.


»bp
tel2 commented:
Some (hopefully constructive) comments on the solutions supplied:

Bill's solution creates output filename datestamps of the format YYYYMMDD instead of DDMMYYYY.  The sample seems to require the latter (e.g. 02252018).  This can be easily fixed, of course.  And the 2 semicolons can be removed from the ends of lines 6 & 8 in the bash script.  It seems to work though, and processes all files in $INDIR, (as my 2nd solution can do if run like: ./myscript.sh "*").  The fact that it requires 2 scripts (bash and awk) is a bit messy, but maybe that could be resolved with a heredoc or similar, if required.

noci's solution gives me errors when I run it, has $HEAD and $head (2 different variables), and seems to make no attempt to replace "&" with "H" (requirement #1).

skullnobrains's solution puts the output files in the current directory instead of into $OUTDIR, assumes a fixed date/timestamp (due to unclear requirements), does not replace "&" with "H" (requirement #1), and hard-codes the input directory instead of referencing the supplied $INDIR.
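The date-format point above is a one-token change in the strftime/date format string (shown here with the shell date command; the sample filenames such as 02252018_051500 put the month first):

```shell
date +%Y%m%d_%H%M%S   # YYYYMMDD order, what the awk script currently produces
date +%m%d%Y_%H%M%S   # MMDDYYYY order, matching names like 02252018_051500
```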
skullnobrains commented:
still for the fun and not actually tested, this time. thkx

IN=/PATH/TO/INPUT/FILE
( cd "`dirname "$IN"`" && sed -e 's/^#/I/g ; s/^&/H/ ; /^\$\$\$\$\$\$\$\$/ { s/\$\$\$\$\$\$\$\$|\([^|]*\)|.*/"\1_'`date +%d%m%Y_%H%M%S`'.txt"/ ; h ; s/.*/truncate -s 0 &;/p ; d } ; s/.*/echo "&" >> / ; G ; s/\n// ' "$IN" | sh -s )


made it a little simpler as well.
btw i'm only replacing # and & at the beginning of lines, which may or may not be what is expected, and the fixed timestamp was out of laziness, rather ;)

if we remove the truncate stuff, and optimize a bit the ereg, we can build a size competition with the perl version. ;) ... rooting for sed but expecting to loose
tel2 commented:
> "if we remove the truncate stuff, and optimize a bit the ereg, we can build a size competition with the perl version. ;) ... rooting for sed but expecting to loose"
I accept the challenge, skullnobrains.
But to make it a level playing field, I suggest you start with something like this:
#!/bin/bash

INDIR=data/out
OUTDIR=data/newout
DT=`date +%m%d%Y_%H%M%S`


The $INDIR & $OUTDIR assignments were supplied by the asker, so they may well exist in a bigger script which will have our solution added to it; let's just set those ourselves.  (I've removed the leading "/" before "data" for our convenience.)
If you generate your date/timestamp from within your loop, then it will not only be slower, but also runs the risk of the timestamp changing when the second ticks over, resulting in your output being split into extra files.

Here are some free starter tips:
- Lose the "-e " switch.
- Change your 2 batches of:
    /^\$\$\$\$\$\$\$\$/
to:
    /^\$\{8\}/
- Remove extra spaces like:
    ; G ;
which could be:
    ; G;
or if you like:
    ;G;

But even if you can make it more concise than my Perl solution, it'll be hard to beat Perl's performance, especially if you continue to pipe each "echo" through "sh -s".  But we can ignore performance if you like.

If you can test your code before posting your next version, that would be good.
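The `\{8\}` interval tip above is standard BRE repetition; a tiny check with hypothetical sample data:

```shell
# BRE interval: \{8\} matches exactly eight of the preceding atom, so
# /^\$\{8\}/ is equivalent to spelling out eight "\$"s. Here we extract
# the second pipe-delimited field from a marker line.
printf '%s\n' '$$$$$$$$|I231_x|' 'HEADER|INV' |
  sed -n 's/^\$\{8\}|\([^|]*\)|.*/\1/p'   # prints: I231_x
```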
skullnobrains commented:
lol, i was kidding around... not really up for trying to make it another 10-15 characters shorter...

regarding performance, if the date command is not spawned multiple times as you rightfully suggested, you'd probably be surprised at how good they would actually be for a big input file. remember that both the sed and sh are only spawned once.

actually, it is likely possible to beat perl performance-wise. to achieve that, i'd probably change the script a little by opening the file descriptor with exec at the time the files are truncated and then changing the redirections to ">>&FD", or, much better and easier, generate a heredoc syntax so writing and truncating each file is done in a single tee/cat/sponge... command. and obviously use a small and fast shell and add a few flags so it does not load profiles and the like on startup.
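The exec-file-descriptor idea can be sketched like this (an illustration of the technique only, with a hypothetical demo file name):

```shell
# Open demo_fd_out.txt once on fd 3, write several lines through the
# already-open descriptor, then close it. This avoids re-opening the
# file for every write, which the per-line ">> file" form pays for.
rm -f demo_fd_out.txt
exec 3>>demo_fd_out.txt
for i in 1 2 3; do
    echo "line $i" >&3
done
exec 3>&-   # close fd 3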
tel2 commented:
OK, we can forget about size if you like.

Good point about "date" & "sh -s" being run only once.  I didn't realise that.

Regarding performance, I've made some adjustments to your script so it basically works, got rid of some unnecessary bits including "truncate...", added a "basename...", and here it is:
IN=data/out/testdata1
time (cd `dirname $IN` && sed 's/^#/I/g; s/^&/H/; /^\$\{8\}/ {s/\$\{8\}|\([^|]*\)|.*/"\1_'`date +%m%d%Y_%H%M%S`'.txt"/; h; d}; s/.*/echo "&" >>/; G; s/\n//' `basename $IN` | sh -s)


Currently when I run it on testdata1 (a 50,005 line file which splits into 5 files), it takes about 4 secs on the machine I use, while my Perl script is taking about 10% of that time using the same data.  So if you think you can make it faster than Perl, go ahead, make my day!  Personally, I think unless you get rid of those repeated shell commands (i.e. "echo..." in this case), it will be a bit (or even a byte) difficult.

That testdata1 file is attached, if you want it.  Unzip it and put it in data/out.  Its extracted size should be 3825090 bytes.
testdata1.zip
noci (Software Engineer) commented:
perl (or any scripting language based on micro-engines) should be faster than any shell.
But this problem matches perl very well because of the regexes needed.
tel2 commented:
True noci, although it looks to me as if sed's regexability is not the limiting issue in this case.  It's writing out the results into different files that seems to be the issue.  But I'm not very good at sed, so someone else may know better.
skullnobrains commented:
faster than any shell

this is debatable: shell scripts suffer from 2 main drawbacks:
- spawning external commands is slow (context switches, fork, execve...)
- reparsing the same command lines is slow

the first one is alleviated with the use of builtins
the second may be alleviated in some cases but not all

i've made quite a few comparative benchmarks and already seen quite a few cases in which the shell could compete with a reasonably efficient perl or php program.

in the above case, if the above-mentioned optimizations were made, the result is not that easily predictable:
- setting up the shell pipeline is lighter than spawning perl, loading modules, precompiling the perl code and the like,
- using the pipeline is heavier than what perl does (context switches)
- parsing the sed output with the shell is likely negligible (with optimizations)
- regular expressions in sed are lighter than in perl with a decent sed. not sure about gnu sed. but more expressions are required in the sed code
- writing to the disk is roughly the same as long as the shell command that performs the writing writes to a file handle opened by the shell, or writes blocks rather than lines. (shells have about the same kind of output-buffering optimizations as perl as long as readline is not involved)

feel free to test it in terms of number of CPU cycles required

most likely perl would win on large files but not by orders of magnitude


It's writing out the results into different files that seems to be the issue.

yes: that's the reason why we have to spawn an extra sh command. afaik it is not possible to instruct sed to write to specific files unless you know the file names beforehand.

--

btw, a proper nawk implementation should beat both perl and the shell by far while producing the most readable code
tel2 commented:
Beautiful theory about sed (with echo) competing with Perl's speed, skullnobrains.
But unless you can prove it by supplying tested code, it will remain theory.  The specs are in the original post.
I don't think we were ever talking just about CPU cycles, but overall speed.
sed (with (or without) echo) might beat Perl with small input files (I haven't tested that yet), but I doubt it with large.
Anyway, enough theory, let's see the proof.  The sample data I used is attached to my post 42485969, above.

BTW, correction to what I said above:
   "Bill's solution creates output filename datestamps of the format YYYYMMDD instead of DDMMYYYY."
I meant:
   "Bill's solution creates output filename datestamps of the format YYYYMMDD instead of MMDDYYYY."
skullnobrains commented:
I don't think we were ever talking just about CPU cycles, but overall speed.

if that is the concern, you can be sure that roughly anything will be equivalent if the input is big. the reason being that the script will use up little CPU whatever the interpreter and will write to disk quite heavily. so basically the program will be IO-bound, meaning you'll be benchmarking your hard drive. so the only differences might be opening and closing the file many times (hence the above-mentioned optimizations), writing asynchronously or not, and writing lines, small blocks or big blocks. other than that the interpreter simply won't matter.

i cannot compare a benchmark on my computer with yours.
feel free to adapt the sed so it produces an output like

cat <<EOF > FILE1
...
EOF
cat <<EOF > FILE2
...
EOF

and so on.

or even a little less efficient

exec > FILE1
echo "..."
echo "..."
...
exec > FILE2
echo "..."
echo "..."
...

the theoretical results with gnu's cat are a bit below perl's because gnu cat will write in a line oriented fashion ( AFAIK ) while perl has output buffering enabled and will write blocks of 4096 or 8192 bytes. but this won't make much of a difference unless you write synchronously to the disk.

basically if you ignore CPU cycles, the only thing that matters is how many times you open and close the same file, how big the writes, how many times you flush (if at all) but not the interpreter since they all will use the same system calls if they use the same algorithms.


a quick bench on my current laptop reading from /dev/zero and writing to a tmp file on a SATA drive produces between 108 and 116 MB/s for 5 GB with anything that flushes: tried dd, perl (grep-like usage), tee and pv

i cannot bench non-flushing commands since i have 16 GB of RAM, so i'd need to run tests with about 100-200 GB of data to make them meaningful, and i have neither the available disk space nor the time.
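The generated-script approach described earlier, where sed emits "exec > FILE" lines and echo lines that a second shell then executes, can be sketched end to end with hypothetical file names:

```shell
# The generator (printf here, sed in the real solution) emits a tiny
# shell program: "exec > FILE" switches the output file, and the echo
# lines write into it. Piping that into sh executes it, so each output
# file is opened exactly once.
printf '%s\n' \
  'exec > demo_part1.txt' 'echo a1' 'echo a2' \
  'exec > demo_part2.txt' 'echo b1' | sh
```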
tel2 commented:
> "i cannot compare a benchmark on my computer with yours."
No problem.  I didn't ask you to do that, and I don't expect you to.  I know both scripts need to be run in the same environment to compare timings.
I'm asking you to prove your theory by providing the sed (with echo) script which proves it.  I'm not planning to finish the script for you, because I don't see how it can be done, and I don't want to be going back and forth for more modifications.  (The changes I made so far were mainly to get a simple working version of your latest script which I could time.)  It's your claim, so I'll let you prove it with a script if you can.
If you want to compare it with my Perl script yourself, then you can run both my script and yours on your machine with my test data.
But if you post your script here, I can compare timings myself on my machine.

> "basically if you ignore CPU cycles..."
I hope you realise that I never suggested we ignore CPU cycles, either.  I said "I don't think we were ever talking just about CPU cycles, but overall speed."  Overall speed includes CPU.
skullnobrains commented:
IN=testdata1
DATE=`date +%Y%m%d`
DT="$DATE"

date +%s

for i in `seq 0 100`
do
eval "$(sed -e 's/^#/I/g ; s/^&/H/ ; /^\$\$\$\$\$\$\$\$/ { s/\$\$\$\$\$\$\$\$|\([^|]*\)|.*/"\1_'$DATE'.txt"/ ; s/.*/EOF\ndd b$
done

date +%s
for i in `seq 0 100`
do
perl -pe '$_="",open STDOUT, ">>$ENV{OUTDIR}/$1_$ENV{DT}.txt" if /^\${8}\|(.+?)\|/;tr /&#/HI/' $IN
done

date +%s



RESULTS

not beaten but equalized

$ dash /tmp/tst
1520171413
1520171420
1520171429
$ dash /tmp/tst
1520171456
1520171464
1520171472
$ dash /tmp/tst
1520171474
1520171482
1520171490

from here i changed the block size to 10 M ( not significant but slightly better )

$ dash /tmp/tst
1520171537
1520171545
1520171553
$ dash /tmp/tst
1520171556
1520171563
1520171572
$ dash /tmp/tst
1520171588
1520171595
1520171604
$ dash /tmp/tst
1520171607
1520171615
1520171624

i believe this demonstrates there is no significant difference other than the number of spawned commands, file opens/closes, and number of writes
skullnobrains commented:
and actually without eval, it gets better than the perl

IN=testdata1
DATE=`date +%Y%m%d`
DT="$DATE"

date +%s

for i in `seq 0 100`
do
sed -e 's/^#/I/g ; s/^&/H/ ; /^\$\$\$\$\$\$\$\$/ { s/\$\$\$\$\$\$\$\$|\([^|]*\)|.*/"\1_'$DATE'.txt"/ ; s/.*/EOF\ndd bs=10M if$
done

date +%s
for i in `seq 0 100`
do
perl -pe '$_="",open STDOUT, ">>$ENV{OUTDIR}/$1_$ENV{DT}.txt" if /^\${8}\|(.+?)\|/;tr /&#/HI/' $IN
done

date +%s



$ dash /tmp/tst
1520171797
1520171803
1520171812
$ dash /tmp/tst
1520171813
1520171819
1520171828
$ dash /tmp/tst
1520171828
1520171843
1520171852


... using bash, eval was more efficient than spawning an extra shell

in this variant, i'm under 6 seconds with the shell version while the perl version is around 8-9
tel2 commented:
Hi skullnobrains,

Thanks for your tests.

I'm not surprised that sed is faster than Perl for that kind of command.  I was never trying to say it wasn't.  I was talking about "sed (with echo)", which is what your previous solutions had.

Here are 4 places I think I made that pretty clear:
> 'But even if you can make it more concise than my Perl solution, it'll be hard to beat Perl's performance, especially if you continue to pipe each "echo" through "sh -s".'
> "Personally, I think unless you get rid of those repeated shell commands (i.e. "echo..." in this case), it will be a bit (or even a byte) difficult."
> "Beautiful theory about sed (with echo) competing with Perl's speed, skullnobrains."
> "I'm asking you to prove your theory by providing the sed (with echo) script which proves it."

And in case this statement of mine confused my point:
> "sed (with (or without) echo) might beat Perl with small input files (I haven't tested that yet), but I doubt it with large."
Then sorry for the confusion.  What I meant by that was:
    sed (with (or without) echo) might beat Perl with small input files, but I doubt sed (with echo) would beat Perl with large input files.

But I can't even run your scripts to see if they are meeting requirements.  I don't have the dash shell, but when I run them in bash, both of these lines fail:
    sed -e 's/^#/I/g ; s/^&/H/ ; /^\$\$\$\$\$\$\$\$/ { s/\$\$\$\$\$\$\$\$|\([^|]*\)|.*/"\1_'$DATE'.txt"/ ; s/.*/EOF\ndd bs=10M if$
    eval "$(sed -e 's/^#/I/g ; s/^&/H/ ; /^\$\$\$\$\$\$\$\$/ { s/\$\$\$\$\$\$\$\$|\([^|]*\)|.*/"\1_'$DATE'.txt"/ ; s/.*/EOF\ndd b$
Looks as if there's something wrong with the end of the commands.  Are they really meant to finish with "$"?  Looks as if there might be some missing things like "/", "}" and " ' ", but I'm not sure what you're trying to do so I can't tell.
Is your use of "dd" meant to be acting as a replacement for "echo"?

If your script is going to meet requirements, I suggest you start with something like this:
#!/bin/bash

INDIR=data/out            # Specified by the asker, but I've removed the leading "/"
export OUTDIR=data/newout # Specified by the asker.  "export" required for my Perl script
export DT=`date +%m%d%Y_%H%M%S`  # This is required for my Perl script


Please check your output files are correct, as per the requirements in the original post.  When you have a working solution, let me know.

Thanks.
skullnobrains commented:
this is getting boring

1)
quote from the very first post mentioning speed

actually, it is likely possible to beat perl performance-wise. to achieve that, i'd probably change the script a little by ... and then change the redirections to ">>&FD" or much better and easier generate a heredoc syntax so writing and truncating each file is done in a single tee/cat/sponge... command.

never went past that and suggested these optimizations the very first time i mentioned speed

not sure why you want to prove me wrong that badly.

i'm here to help folks, not to prove stuff i've tested and experimented with over and over.


2) large/small files: you may want to test that. actually you'd get the exact same results as long as the files are big enough to render the loading of the commands (perl, sed and sh) negligible. other than that, you're only benchmarking your disk drive. i'm leaving the proof of that point as an exercise to whoever is interested. i already did it many times before, and once more here because you insisted that much


3) the script did work and i actually did run the tests i posted. the copy-pastes came from nano and truncated the end of the longer lines.
sorry about that. i have not kept the script. the results speak for themselves, though.

i'm pretty sure you're eager enough to prove me wrong to bother hacking the code back together yourself (basically finish the dd; in the first one, you'd need to end the $(...), and the second sed is piped into dash, which is actually faster than eval and will work whatever the file size). i'm way too bored to do that once more, and am unmonitoring the thread.