Solved

Using tar in a pipeline: extract a file from each tar and send that file's contents along the pipe

Posted on 2011-02-18
Last Modified: 2012-05-11
I have a pipeline in which many tar files are cat'ed together. I need a step in the pipeline that reads these concatenated tar files and writes the contents of a single file from each to stdout.


Context:
Each tar in the pipe contains several logs from one day (several tars, one per day, are cat'ed together). I need to extract only one log from each tar and send it along the pipe.


Any thoughts on the best / easiest way to do this?
Question by:modsiw
16 Comments
 
LVL 29

Expert Comment

by:MikeOM_DBA
ID: 34928356

Which command(s) do you use to cat the tar files together?
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 34928583
Who would "cat" tar files together?

How would you ever manage to "de-cat" them?

wmp
 
LVL 3

Author Comment

by:modsiw
ID: 34928697
The file I have, for example, is two_tars_cated_together.tar

I'd like some program to read raw.tsv out of each tar and pipe their contents to stdout



Real situation:
I have a file server. It has a tar.gz for each day, going back years. Each tar.gz has many different logs in it.
I need a single log out of every tar.gz. These logs are processed on another server.
These things are huge. I want to avoid disk I/O.

On the file server:
cat *.tar.gz | nc

On the processing server
nc | gzip -d | {the thing I want} | {stuff to parse the log file}


tar -czvf raw1.tar.gz raw.tsv
tar -czvf raw2.tar.gz raw.tsv
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar

 
LVL 3

Author Comment

by:modsiw
ID: 34928724
The below works for gzipped files. I basically want to do the same thing with tar in the mix.

two_raw_cated.tsv will be a file with twice the contents of raw.tsv
gzip -c raw.tsv > raw1.gz
gzip -c raw.tsv > raw2.gz
cat raw1.gz raw2.gz | gzip -dc > two_raw_cated.tsv

 
LVL 68

Expert Comment

by:woolmilkporc
ID: 34928737
Where in your "real situation" is the cat'ed file containing several tarballs?
 
LVL 3

Author Comment

by:modsiw
ID: 34928756
Correction: in my earlier post, the line
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar

should have been:
cat raw1.tar.gz raw2.tar.gz | gzip -dc > two_tars_cated_together.tar

 
LVL 3

Author Comment

by:modsiw
ID: 34928820
How would you ever manage to "de-cat" them?

A tar may very well not have a recognizable end-of-tar marker or a header marker. If this is the case, no dice. I don't know if this is the case or not.
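
For reference, the tar format does define an end-of-archive marker: two consecutive 512-byte blocks of zero bytes (GNU tar then pads the file out to its record size). A minimal sketch to check this, assuming a standard tar, tail, tr and wc; the scratch file names are made up for illustration:

```shell
# Hypothetical scratch files, for illustration only.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/raw.tsv"
tar -cf "$workdir/one.tar" -C "$workdir" raw.tsv

# Count the non-zero bytes in the last 1024 bytes (two 512-byte blocks).
nonzero=$(tail -c 1024 "$workdir/one.tar" | tr -d '\0' | wc -c)
echo "non-zero bytes in trailer: $nonzero"
```

A count of zero means the archive ends in the all-zero marker, which is what a reader can use to tell where one tar stops and the next begins.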
 
LVL 3

Author Comment

by:modsiw
ID: 34928997
The more complete real situation.

On the file server I would execute:
cat *.tar.gz | nc -l -p 1234

This would allow me to execute the following on the processing server:
nc 10.10.10.10 1234 | gzip -dc | {the thing I want} | java -jar LogCleaner.jar > mynamedpipe


Informatica PowerCenter is reading from mynamedpipe (made with mkfifo). It blocks until data is present, and then takes off doing various aggregations and statistics on the logs. PowerCenter's requirement (unless I go poke the PowerCenter programmers with a stick) is for all of the daily logs to appear as one big log.

-----------------------------------------------------------------

The file server is very weak. Without cat'ing the tar.gz files this way, the only solution I see is to do this for each file on the file server:

tar -xzOf {for: each day's file} raw.tsv | gzip -c >> cated_raw.tsv.gz

then do a:
cat cated_raw.tsv.gz | nc -l -p 12345
and skip {the thing I want} on the processing side.

This introduces an extra zip/unzip cycle, and worse yet, that cycle is going to take place on a very slow machine.
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 34929369
Yep, and that's the problem.

This step would work (I'll omit the "nc" step, that's not the problem):

cat *.tar.gz | gzip -dc

But the next possible step:

cat *.tar.gz | gzip -dc | tar -xvf -

would extract only the contents of the first tar.gz file in the "cat" conglomerate, because extracting stops when the first tar trailer is encountered!

So the problem is indeed what I wrote above: tar will not work on cat'ed tarballs except for the very first one.

You will have to process the tar.gz files one by one, without cat'ing them together.

wmp
 
LVL 68

Expert Comment

by:woolmilkporc
ID: 34929517
Do you have ssh?

If so, you could try

for tgz in *.tgz
 do
  cat "$tgz" | ssh user@remothehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar > mynamedpipe'
 done

The problem is that my_desired_file must always be of the same name here, but we could also make it a variable.
The above uses GNU tar (the "-z" thing unzips "on the fly", and "-O" writes to stdout).
"2>/dev/null" is used to filter out the filename which is otherwise displayed in the first line when using the "-O" flag.

wmp
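
A small local sketch of the same extraction, assuming GNU tar; the archive and its contents are made up for the example. With -O, GNU tar sends the -v listing to stderr, so dropping -v is an alternative to 2>/dev/null:

```shell
# Hypothetical scratch archive holding two files.
workdir=$(mktemp -d)
printf 'line one\n' > "$workdir/my_desired_file"
printf 'other\n' > "$workdir/other_file"
tar -czf "$workdir/day.tgz" -C "$workdir" my_desired_file other_file

# -z gunzips, -O writes member data to stdout, -f - reads the archive from stdin.
# Without -v nothing is listed, so no stderr redirection is needed.
extracted=$(cat "$workdir/day.tgz" | tar -xz -O -f - my_desired_file)
printf '%s\n' "$extracted"
```

Only the named member is written to stdout; the other file in the archive is skipped.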
 
LVL 3

Author Comment

by:modsiw
ID: 34929613
Wool,

I believe your code would open and close the write end of mynamedpipe once per file. That would signal to the reader of mynamedpipe that it is done when more data actually remains.

Do you happen to see a way around this?
 
LVL 68

Accepted Solution

by:
woolmilkporc earned 500 total points
ID: 34930239
No doubt, there will be an open/close with each file.

Yes, there is a way, but it's uuunelegant! We'll need a temporary file at the remote side and moreover, we will have to clean it up first, to be on the safe side

ssh user@remothehost 'rm -f /tmp/tgztemp 2>/dev/null'
for tgz in *.tgz
 do
  cat "$tgz" | ssh user@remothehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar >> /tmp/tgztemp'
 done
ssh user@remothehost 'cat /tmp/tgztemp > mynamedpipe && rm -f /tmp/tgztemp 2>/dev/null'

Not my usual programming style, but I don't see a better way at the moment.
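
One possible alternative that avoids the temporary file, offered as a sketch rather than a tested fix for this setup: put the redirection on the whole loop instead of on each command, so the FIFO is opened once and closed once. The example below runs locally and uses cat as a stand-in for the reader; all names are hypothetical:

```shell
workdir=$(mktemp -d)
mkfifo "$workdir/mynamedpipe"

# Reader: stands in for the consumer, which stops at the first EOF it sees.
cat "$workdir/mynamedpipe" > "$workdir/received" &
reader=$!

# Writer: the redirection is attached to the loop, not to each command,
# so the write end stays open until the last iteration has finished.
for part in day1 day2 day3; do
  printf '%s\n' "$part"
done > "$workdir/mynamedpipe"

wait "$reader"
cat "$workdir/received"
```

In the remote case the same idea would mean one long-lived writer on the processing server rather than one ssh session per file.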

wmp
 
LVL 3

Assisted Solution

by:modsiw
modsiw earned 0 total points
ID: 34930299
I found a way to read the cat'd-together tar.gz files.

A tar file is a series of blocks. Each entry has a header block and, when it has content, a body.

The end of a tar file is a series of end-sentinel blocks. You can simply pick up reading from the cat'd-together tar stream where the previous untar left off.

import org.apache.tools.tar.TarEntry;
import org.apache.tools.tar.TarInputStream;
import java.io.*;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TarExtract {
    public static void main(String[] args) throws IOException {
        // Names of the entries to copy to stdout, taken from the command line.
        Set<String> files = new HashSet<String>(Arrays.asList(args));
        OutputStream os = new BufferedOutputStream(System.out, 16 * 1024 * 1024);
        InputStream is = new BufferedInputStream(System.in, 16 * 1024 * 1024);
        // Wrap the same underlying stream in a fresh TarInputStream for each
        // archive; a pass that yields no entries means the input is exhausted.
        boolean sawEntry = true;
        while (sawEntry) {
            sawEntry = false;
            TarInputStream tis = new TarInputStream(is);
            for (TarEntry entry = tis.getNextEntry(); entry != null; entry = tis.getNextEntry()) {
                sawEntry = true;
                if (files.contains(entry.getName()))
                    tis.copyEntryContents(os);
            }
        }
        os.flush();
    }
}
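
The block layout described above can be inspected from the shell. A rough sketch, assuming the ustar header layout (entry name in bytes 0-99, size as octal text in bytes 124-135) and a scratch archive made up for the example:

```shell
# Hypothetical scratch archive with a single 4-byte file.
workdir=$(mktemp -d)
printf '123\n' > "$workdir/a"
tar -cf "$workdir/a.tar" -C "$workdir" a

# First header block: name at offset 0 (100 bytes),
# size at offset 124 (12 bytes of octal text).
name=$(dd if="$workdir/a.tar" bs=1 count=100 2>/dev/null | tr -d '\0')
size_octal=$(dd if="$workdir/a.tar" bs=1 skip=124 count=12 2>/dev/null | tr -d '\0 ')
# A leading 0 makes printf read the digits as octal.
size=$(printf '%d' "0$size_octal")

# The body is padded to whole 512-byte blocks; the next header follows it.
body_blocks=$(( (size + 511) / 512 ))
next_header=$(( 512 + body_blocks * 512 ))
echo "name=$name size=$size next header at byte $next_header"
```

Walking from header to header this way is what lets a reader skip unwanted entries and notice the all-zero blocks that end each archive.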

 
LVL 5

Expert Comment

by:balasundaram_s
ID: 34930784
If the tar files are tar'ed UP together, then it's possible to extract one single file from each tar.

'cat' is only for the ASCII files ( or text files ).
 
LVL 3

Author Comment

by:modsiw
ID: 34930860
balasundaram,

They aren't tar'ed up together.

Suppose I make a TarExtracter.jar based on my class above, then `cat z.tar.gz | gzip -dc | java -jar TarExtracter.jar "a" "b"' will produce `123456'


Using my code below, can you extract the `123456' from z.tar.gz using only the shell, gzip, and tar?
echo 123 > a
echo 456 > b
tar -czvf c.tar.gz a
tar -czvf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz
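
For what it's worth, GNU tar has an --ignore-zeros (-i) option that skips the zero blocks marking end-of-archive, so it keeps reading into the next concatenated archive. A sketch that rebuilds the files above in a scratch directory, assuming GNU tar (the scratch directory is an addition for the example):

```shell
workdir=$(mktemp -d)
echo 123 > "$workdir/a"
echo 456 > "$workdir/b"
tar -czf "$workdir/c.tar.gz" -C "$workdir" a
tar -czf "$workdir/d.tar.gz" -C "$workdir" b
cat "$workdir/c.tar.gz" "$workdir/d.tar.gz" > "$workdir/z.tar.gz"

# gzip decodes the concatenated members into one stream holding two tars;
# --ignore-zeros makes tar read past the first end-of-archive marker,
# and -O writes the selected members to stdout instead of to disk.
result=$(gzip -dc "$workdir/z.tar.gz" | tar -x --ignore-zeros -O -f - a b)
printf '%s\n' "$result"
```

In the pipeline from the question, the tar command would presumably take the place of {the thing I want}, with raw.tsv as the member name.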

 
LVL 3

Author Closing Comment

by:modsiw
ID: 34959263
A Perl solution instead of Java would probably work best. I'll give it a shot.