Solved

Using tar in a pipeline: extract a file from the tar and send that file's contents along the pipe

Posted on 2011-02-18
Last Modified: 2012-05-11
I have a pipeline that carries many tar files cat'ed together. I need a step in the pipeline that reads these concatenated tar files and writes the contents of a single file from each to stdout.


Context:
Each tar in the pipe contains several logs from one day (several tars, one per day, are cat'ed together). I need to extract only one log from each tar and send it along the pipe.


Any thoughts on the best / easiest way to do this?
Question by:modsiw

Expert Comment

by:MikeOM_DBA
ID: 34928356

Which command(s) do you use to cat the tar files together?

Expert Comment

by:woolmilkporc
ID: 34928583
Who would "cat" tar files together?

How would you ever manage to "de-cat" them?

wmp

Author Comment

by:modsiw
ID: 34928697
The file I have, for example, is two_tars_cated_together.tar

I'd like some program to read raw.tsv out of each tar and pipe their contents to stdout



Real situation:
I have a file server. It has a tar.gz for each day, going back years. Each tar.gz has many different logs in it.
I need a single log out of every tar.gz. These logs are processed on another server.
These things are huge, so I want to avoid disk I/O.

On the file server:
cat *.tar.gz | nc

On the processing server
nc | gzip -d | {the thing I want} | {stuff to parse the log file}


tar -czvf raw1.tar.gz raw.tsv
tar -czvf raw2.tar.gz raw.tsv
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar


Author Comment

by:modsiw
ID: 34928724
The below works for gzipped files. I basically want to do the same thing with tar in the mix.

two_raw_cated.tsv will be a file with twice the contents of raw.tsv:
gzip -c raw.tsv > raw1.gz
gzip -c raw.tsv > raw2.gz
cat raw1.gz raw2.gz | gzip -dc > two_raw_cated.tsv


Expert Comment

by:woolmilkporc
ID: 34928737
Where in your "real situation" is the cat'ed file containing several tarballs?

Author Comment

by:modsiw
ID: 34928756
correction:
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar


cat raw1.tar.gz raw2.tar.gz | gzip -dc > two_tars_cated_together.tar


Author Comment

by:modsiw
ID: 34928820
Re: "How would you ever manage to "de-cat" them?"

A tar may very well not have a recognizable end-of-tar marker or a header marker. If this is the case, no dice. I don't know if this is the case or not.
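
For reference, the tar format does define both kinds of marker: each header block carries the magic string "ustar" at byte offset 257, and an archive ends with at least two 512-byte blocks of zeros. The following is only a rough sketch for checking that on a throwaway archive; the file names sample.txt and one.tar are made up for the demo, and it assumes a tar that writes ustar, pax, or GNU-format archives by default.

```sh
#!/bin/sh
# Sketch with made-up file names (sample.txt, one.tar); needs tar, dd, tail, od, tr.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo hello > sample.txt
tar -cf one.tar sample.txt

# Header marker: bytes 257-261 of a ustar/GNU header block hold the magic "ustar".
magic=$(dd if=one.tar bs=1 skip=257 count=5 2>/dev/null)

# End marker: the archive ends with at least two 512-byte blocks of zeros.
# Dump the last 1024 bytes as hex and delete every "0", space and newline;
# nothing is left over if all of those bytes are zero.
nonzero=$(tail -c 1024 one.tar | od -An -v -tx1 | tr -d ' \n0')

echo "magic: $magic"
echo "non-zero bytes in trailer: ${nonzero:-none}"
```

Whether a given tool keeps reading past that end marker is a separate question; the check above only shows that the marker is there to be found.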

Author Comment

by:modsiw
ID: 34928997
The more complete real situation.

On the file server I would execute:
cat *.tar.gz | nc -l -p 1234

This would allow me to execute the following on the processing server:
nc 10.10.10.10 1234 | gzip -dc | {the thing I want} | java -jar LogCleaner.jar > mynamedpipe


Informatica PowerCenter is reading from mynamedpipe (made with mkfifo). It blocks until data is present, and then takes off doing various aggregations and statistics on the logs. PowerCenter's requirement (unless I go poke the PowerCenter programmers with a stick) is for all of the daily logs to appear as one big log.

-----------------------------------------------------------------

The file server is very weak. Without cat'ing the tar.gz files this way, the only solution I see is to do this for each file on the file server:

tar -xzOf {for: each day's file} raw.tsv | gzip -c >> cated_raw.tsv.gz

then do a:
cat cated_raw.tsv.gz | nc -l -p 12345
and skip {the thing I want} on the processing side.

This introduces an extra zip/unzip cycle and, worse yet, that cycle is going to take place on a very slow machine.

Expert Comment

by:woolmilkporc
ID: 34929369
Yep, and that's the problem.

This step would work (I'll omit the "nc" step, that's not the problem):

cat *.tar.gz | gzip -dc

But the next possible step:

cat *.tar.gz | gzip -dc | tar -xvf -

would extract only the contents of the first tar.gz file in the "cat" conglomerate, because extracting stops when the first tar trailer is encountered!

So the problem is indeed what I wrote above: tar will not work on cat'ed tarballs except for the very first one.

You will have to process the tar.gz files one by one, without cat'ing them together.

wmp

Expert Comment

by:woolmilkporc
ID: 34929517
Do you have ssh?

If so, you could try

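# Let the shell expand the glob directly; parsing "ls" output breaks on unusual file names.
for tgz in *.tgz
 do
  # Stream one tarball to the remote host, which unpacks the wanted member to stdout.
  cat "$tgz" | ssh user@remotehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar > mynamedpipe'
 done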

The limitation is that my_desired_file must have the same name in every tarball here, but we could also make it a variable.
The above uses GNU tar (the "-z" flag unzips on the fly, and "-O" writes to stdout).
"2>/dev/null" is used to discard the file name that "-v" would otherwise print (on stderr, because "-O" is in use).

wmp

Author Comment

by:modsiw
ID: 34929613
Wool,

I believe your code would open and close mynamedpipe for writing once per file. This would signal to the reader of mynamedpipe that it is done when more data actually remains.

Do you happen to see a way around this?
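
One general shell technique for this open/close concern, sketched here purely as a local illustration: put the redirection on the loop as a whole rather than on each command, so the FIFO is opened for writing once and closed once. The names demo.fifo, day1.txt, day2.txt and received.txt are invented for the demo, and a plain cat stands in for the real reader; how this would map onto the ssh setup in this thread is not shown.

```sh
#!/bin/sh
# Hypothetical local demo; demo.fifo, day1.txt, day2.txt, received.txt are made-up names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'log line from day 1\n' > day1.txt
printf 'log line from day 2\n' > day2.txt
mkfifo demo.fifo

# Reader: stands in for the real consumer; it stops at the first end-of-file it sees.
cat demo.fifo > received.txt &
reader=$!

# Writer: the redirection applies to the whole loop, so the FIFO is opened
# once before the first iteration and closed once after the last one.
for f in day1.txt day2.txt
do
  cat "$f"
done > demo.fifo

wait "$reader"
cat received.txt
```

Had the redirection been written inside the loop (cat "$f" > demo.fifo), the reader could see end-of-file after the first file, which is the behaviour described above.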

Accepted Solution

by:woolmilkporc
ID: 34930239
No doubt, there will be an open/close with each file.

Yes, there is a way, but it's inelegant! We'll need a temporary file on the remote side and, moreover, we will have to clean it up first, to be on the safe side.

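# Start from a clean slate on the remote side.
ssh user@remotehost 'rm -f /tmp/tgztemp 2>/dev/null'
for tgz in *.tgz
 do
  # Append each cleaned log to the temporary file instead of writing to the pipe.
  cat "$tgz" | ssh user@remotehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar >> /tmp/tgztemp'
 done
# Feed the pipe once, so its reader sees a single open/close.
ssh user@remotehost 'cat /tmp/tgztemp > mynamedpipe && rm -f /tmp/tgztemp 2>/dev/null'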

Not my usual programming style, but I don't see a better way at the moment.

wmp

Assisted Solution

by:modsiw
ID: 34930299
I found a way to read the cat'd-together tar.gz files.

A tar file is a series of blocks. Each entry has a header block and an optional body.

The end of a tar file is marked by a series of end-sentinel blocks. You can simply pick up reading from the cat'd-together tar stream where the previous untar left off.

import org.apache.tools.tar.TarEntry;
import org.apache.tools.tar.TarInputStream;
import java.io.*;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TarExtract {
    public static void main(String[] args) throws IOException {
        // Names of the entries to extract, given as command-line arguments.
        Set<String> files = new HashSet<String>(Arrays.asList(args));
        OutputStream os = new BufferedOutputStream(System.out, 16 * 1024 * 1024);
        InputStream is = new BufferedInputStream(System.in, 16 * 1024 * 1024);
        // Keep opening a new tar reader on the same stream; each one picks up
        // where the previous archive ended. Stop once a reader finds no entries.
        boolean foundEntry = true;
        while (foundEntry) {
            foundEntry = false;
            TarInputStream tis = new TarInputStream(is);
            for (TarEntry entry = tis.getNextEntry(); entry != null; entry = tis.getNextEntry()) {
                foundEntry = true;
                if (files.contains(entry.getName()))
                    tis.copyEntryContents(os);
            }
        }
        os.flush();
    }
}


Expert Comment

by:balasundaram_s
ID: 34930784
If the tar files are tar'ed UP together, then it's possible to extract one single file from each tar.

'cat' is only for ASCII files (or text files).

Author Comment

by:modsiw
ID: 34930860
balasundaram,

They aren't tar'ed up together.

Suppose I make a TarExtracter.jar based on my class above, then `cat z.tar.gz | gzip -dc | java -jar TarExtracter.jar "a" "b"' will produce `123456'


Using my code below, can you extract the `123456' from z.tar.gz using only the shell, gzip, and tar?
echo 123 > a
echo 456 > b
tar -czvf c.tar.gz a
tar -czvf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz
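
One possible shell-only answer, offered as a sketch rather than something tested in this thread: GNU tar has an --ignore-zeros (-i) option that makes it read on past the end-of-archive zero blocks, which is what reading concatenated archives needs. The example assumes GNU tar (or another tar that accepts --ignore-zeros) and rebuilds the same files in a temporary directory, with -v dropped to keep the output quiet.

```sh
#!/bin/sh
# Sketch only: assumes GNU tar, or another tar that accepts --ignore-zeros.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same setup as in the post above.
echo 123 > a
echo 456 > b
tar -czf c.tar.gz a
tar -czf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz

# Default behaviour: tar stops at the first end-of-archive marker,
# so only the member of the first archive is listed.
first_only=$(gzip -dc z.tar.gz | tar -tf - 2>/dev/null)

# With --ignore-zeros, tar continues past the zero blocks into the next archive.
# -O writes the extracted members to stdout instead of to disk.
both=$(gzip -dc z.tar.gz | tar --ignore-zeros -xOf - a b)

echo "listed without --ignore-zeros: $first_only"
echo "$both"
```

Because echo appends a newline to each file, the extracted contents come out as 123 and 456 on separate lines rather than as a literal 123456. For the real case, the same pipeline with raw.tsv as the member name should emit that file from every archive in the stream, since tar keeps scanning unless --occurrence is given.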


Author Closing Comment

by:modsiw
ID: 34959263
A Perl solution instead of Java would probably work best. I'll give it a shot.