modsiw

Using tar in a pipeline: extract a file from the tar and send that file's contents along in the pipe

I have a pipeline carrying a lot of tar files that are cat'ed together. I need a step in the pipeline that reads these concatenated tar files and writes the contents of a single file from each to stdout.


Context:
Each tar in the pipe contains several logs from one day (several tars, one per day, are cat'ed together). I need to extract only one log from each tar and send it along in the pipe.


Any thoughts on the best / easiest way to do this?
MikeOM_DBA


Which command(s) do you use to cat the tar files together?
woolmilkporc
Who would "cat" tar files together?

How would you ever manage to "de-cat" them?

wmp
modsiw

ASKER

The file I have, for example, is two_tars_cated_together.tar

I'd like some program to read raw.tsv out of each tar and pipe their contents to stdout



Real situation:
I have a file server. It has one tar.gz per day, going back years. Each tar.gz has many different logs in it.
I need a single log out of every one of those tar.gz files. These logs are processed on another server.
These things are huge. I want to avoid disk I/O.

On the file server:
cat *.tar.gz | nc

On the processing server
nc | gzip -d | {the thing I want} | {stuff to parse the log file}


tar -czvf raw1.tar.gz raw.tsv
tar -czvf raw2.tar.gz raw.tsv
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar


modsiw

ASKER

The below works for gzipped files. I basically want to do the same thing with tar in the mix.

two_raw_cated.tsv will be a file with twice the contents of raw.tsv
gzip -c raw.tsv > raw1.gz
gzip -c raw.tsv > raw2.gz
cat raw1.gz raw2.gz | gzip -dc > two_raw_cated.tsv


Where in your "real situation" is the cat'ed file containing several tarballs?
modsiw

ASKER

Correction. The line
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar
should have been:


cat raw1.tar.gz raw2.tar.gz | gzip -dc > two_tars_cated_together.tar


modsiw

ASKER

wmp asked: How would you ever manage to "de-cat" them?

A tar may very well not have a recognizable end-of-tar marker or a header marker. If this is the case, no dice. I don't know if this is the case or not.
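
For reference, a minimal sketch of how one could check for such markers on a small test archive. It assumes a POSIX-style tar (GNU tar or similar): in the ustar-family formats each 512-byte header carries the magic string "ustar" at byte offset 257, and the archive ends with at least two 512-byte blocks of zeros. The names marker_demo, sample.txt and one.tar are made up for illustration.

```shell
# Build a throwaway one-file archive in a scratch directory (hypothetical names).
mkdir -p marker_demo
echo 123 > marker_demo/sample.txt
( cd marker_demo && tar -cf one.tar sample.txt )

# Header marker: read 5 bytes at offset 257 of the first header block.
magic=$(dd if=marker_demo/one.tar bs=1 skip=257 count=5 2>/dev/null)

# End-of-archive marker: the last 1024 bytes should be all NUL bytes,
# so nothing should remain once the NULs are deleted.
trailer_nonzero=$(tail -c 1024 marker_demo/one.tar | tr -d '\0' | wc -c)

echo "magic=$magic trailer_nonzero_bytes=$trailer_nonzero"
```

If both checks come out as described, a reader of a concatenated stream could in principle tell where one archive ends and the next begins.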
modsiw

ASKER

The more complete real situation.

On the file server I would execute:
cat *.tar.gz | nc -l -p 1234

This would allow me to execute the following on the processing server:
nc 10.10.10.10 1234 | gzip -dc | {the thing I want} | java -jar LogCleaner.jar > mynamedpipe


Informatica PowerCenter is reading from mynamedpipe (made with mkfifo). It blocks until data is present, and then takes off doing various aggregations and statistics on the logs. PowerCenter's requirement (unless I go poke the PowerCenter programmers with a stick) is for all of the daily logs to appear as one big log.

-----------------------------------------------------------------

The file server is very weak. Without cat'ing the tar.gz this way, the only solution I see is to do this for each file on the file server:

tar -xzOf {for: each day's file} raw.tsv | gzip -c >> cated_raw.tsv.gz

then do a:
cat cated_raw.tsv.gz | nc -l -p 12345
and skip {the thing I want} on the processing side.

This introduces an extra zip/unzip cycle, and worse yet, that cycle is going to take place on a slow, slow machine.
Yep, and that's the problem.

This step would work (I'll omit the "nc" step, that's not the problem):

cat *.tar.gz | gzip -dc

But the next possible step:

cat *.tar.gz | gzip -dc | tar -xvf -

would extract only the contents of the first tar.gz file in the "cat" conglomerate, because extracting stops when the first tar trailer is encountered!

So the problem is indeed what I wrote above: by default, tar will not work on cat'ed tarballs except for the very first one.

You will have to process the tar.gz files one by one, without cat'ing them together.

wmp
Do you have ssh?

If so, you could try

for tgz in *.tgz
 do
  cat "$tgz" | ssh user@remotehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar > mynamedpipe'
 done

The problem is that my_desired_file must always be of the same name here, but we could also make it a variable.
The above uses GNU tar (the "-z" thing unzips "on the fly", and "-O" writes to stdout).
"2>/dev/null" is used to filter out the filename which is otherwise displayed in the first line when using the "-O" flag.

wmp
modsiw

ASKER

Wool,

I believe your code would open and close mynamedpipe for writing once per tarball. Each close would signal to the reader of mynamedpipe that it is done when more data actually remains.

Do you happen to see a way around this?
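
One common way around this, sketched here under assumptions rather than taken from the thread: put the redirection on the whole loop instead of on each command, so the named pipe is opened for writing once and closed only when the loop ends. The sketch assumes a tar that supports -O (GNU tar does). The names fifo_demo, demo_fifo, day1.tar.gz, day2.tar.gz and combined.tsv are hypothetical, and a plain cat stands in for the real reader.

```shell
# Scratch data: two daily tarballs, each holding a raw.tsv (hypothetical names).
mkdir -p fifo_demo && cd fifo_demo
echo "day one" > raw.tsv && tar -czf day1.tar.gz raw.tsv
echo "day two" > raw.tsv && tar -czf day2.tar.gz raw.tsv
rm -f raw.tsv demo_fifo
mkfifo demo_fifo

# Stand-in for the real reader: it sees EOF only when the last writer closes the pipe.
cat demo_fifo > combined.tsv &

# The redirection belongs to the loop, not to each tar, so demo_fifo is opened once.
for tgz in day*.tar.gz
do
  tar -xzOf "$tgz" raw.tsv
done > demo_fifo

wait    # let the background reader finish
cd ..
```

The same shape should carry over to the ssh variant: keep one long-lived consumer writing to mynamedpipe and feed it from a loop, rather than starting a new writer per tarball.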
ASKER CERTIFIED SOLUTION
woolmilkporc

This solution is only available to members.
SOLUTION
This solution is only available to members.

If the tar files are tar'ed UP together, then it's possible to extract one single file from each tar.

'cat' is only for ASCII files (or text files).
modsiw

ASKER

balasundaram,

They aren't tar'ed up together.

Suppose I make a TarExtracter.jar based on my class above, then `cat z.tar.gz | gzip -dc | java -jar TarExtracter.jar "a" "b"' will produce `123456'


Using my code below, can you extract the `123456' from z.tar.gz using only the shell, gzip, and tar?
echo 123 > a
echo 456 > b
tar -czvf c.tar.gz a
tar -czvf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz
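
A possible shell-only answer, offered as a sketch that assumes GNU tar: its -i (--ignore-zeros) option makes tar skip the zero blocks that normally mark the end of an archive, so it keeps reading into the next concatenated tar. The sketch repeats the setup above inside a scratch directory (iz_demo is a made-up name) and strips newlines only so the result can be compared with the `123456' mentioned earlier.

```shell
# Recreate the test case in a scratch directory (hypothetical name).
mkdir -p iz_demo && cd iz_demo
echo 123 > a
echo 456 > b
tar -czf c.tar.gz a
tar -czf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz

# gzip -dc decompresses both gzip members back to back.
# -i lets GNU tar continue past the end-of-archive blocks of the first tar;
# -O writes the named members to stdout instead of to disk.
result=$(gzip -dc z.tar.gz | tar -xiOf - a b | tr -d '\n')
echo "$result"
cd ..
```

In the pipeline described earlier, the same idea would take the place of {the thing I want}, along the lines of gzip -dc | tar -xiOf - raw.tsv, provided the tar on the processing server is GNU tar.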


modsiw

ASKER

A Perl solution instead of Java would probably work best. I'll give it a shot.