  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 633
  • Last Modified:

Using tar in a pipeline: extract a file from the tar and send that file's contents along in the pipe

I have a pipeline carrying many tar files that have been cat'ed together. I need a step in the pipeline that reads these concatenated tar files and writes the contents of a single file from each to stdout.


Context:
Each tar in the pipe contains several logs from one day (several tars, one per day, are cat'ed together). I need to extract only one log from each tar and send it along in the pipe.


Any thoughts on the best / easiest way to do this?
0
modsiw
Asked:
modsiw
2 Solutions
 
MikeOM_DBACommented:

Which command(s) do you use to create the cat'ed-together tar files?
0
 
woolmilkporcCommented:
Who would "cat" tar files together?

How would you ever manage to "de-cat" them?

wmp
0
 
modsiwAuthor Commented:
The file I have, for example, is two_tars_cated_together.tar

I'd like some program to read raw.tsv out of each tar and pipe their contents to stdout



Real situation:
I have a file server. It has a tar.gz for each day, going back years. Each tar.gz has many different logs in it.
I need a single log out of every tar.gz. These logs are processed on another server.
These things are huge. I want to avoid disk I/O.

On the file server:
cat *.tar.gz | nc

On the processing server
nc | gzip -d | {the thing I want} | {stuff to parse the log file}


tar -czvf raw1.tar.gz raw.tsv
tar -czvf raw2.tar.gz raw.tsv
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar

0
 
modsiwAuthor Commented:
The commands below work for gzipped files. I basically want to do the same thing with tar in the mix.

two_raw_cated.tsv will be a file with twice the contents of raw.tsv:
gzip -c raw.tsv > raw1.gz
gzip -c raw.tsv > raw2.gz
cat raw1.gz raw2.gz | gzip -dc > two_raw_cated.tsv

0
 
woolmilkporcCommented:
Where in your "real situation" is the cat'ed file containing several tarballs?
0
 
modsiwAuthor Commented:
Correction. This line:
cat raw1.tar.gz raw2.gz | gzip -dc > two_tars_cated_together.tar

should have been:
cat raw1.tar.gz raw2.tar.gz | gzip -dc > two_tars_cated_together.tar

0
 
modsiwAuthor Commented:
How would you ever manage to "de-cat" them?

A tar may very well not have a recognizable end-of-tar marker or a header marker. If this is the case, no dice. I don't know if this is the case or not.
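
One way to check this is to look at a small archive directly. The sketch below assumes GNU tar and the usual coreutils; sample.tar, raw.tsv and the variable names are hypothetical. It reads the magic field at byte offset 257 of the first header block and checks whether the last 1024 bytes of the archive are all zero bytes, which would be the two 512-byte end-of-archive blocks.

```shell
# Assumption: GNU tar and coreutils. sample.tar and raw.tsv are made-up names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\tb\n' > raw.tsv
tar -cf sample.tar raw.tsv

# Header marker: in ustar, pax and GNU format archives the magic field
# at byte offset 257 of each 512-byte header block starts with "ustar".
magic=$(dd if=sample.tar bs=1 skip=257 count=5 2>/dev/null)

# End marker: strip NUL bytes from the last 1024 bytes and count what is left.
# A count of 0 means the archive ends in two all-zero 512-byte blocks.
trailer_nonzero=$(tail -c 1024 sample.tar | tr -d '\0' | wc -c | tr -d ' ')

# Every tar archive is a whole number of 512-byte blocks.
size=$(wc -c < sample.tar | tr -d ' ')
echo "magic=$magic trailer_nonzero=$trailer_nonzero size_mod_512=$((size % 512))"
```

If both markers show up as expected, a reader can in principle tell where one archive stops and the next begins.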
0
 
modsiwAuthor Commented:
The more complete real situation.

On the file server I would execute:
cat *.tar.gz | nc -l -p 1234

This would allow me to execute the following on the processing server:
nc 10.10.10.10 1234 | gzip -dc | {the thing I want} | java -jar LogCleaner.jar > mynamedpipe


Informatica PowerCenter is reading from mynamedpipe (made with mkfifo). It blocks until data is present, and then takes off doing various aggregations and statistics on the logs. PowerCenter's requirement (unless I go poke the PowerCenter programmers with a stick) is for all of the daily logs to appear as one big log.

-----------------------------------------------------------------

The file server is very weak. Without cat'ing the tar.gz this way, the only solution I see is to do this for each file on the file server:

tar -xzOf {for: each day's file} raw.tsv | gzip -c >> cated_raw.tsv.gz

then do a:
cat cated_raw.tsv.gz | nc -l -p 12345
and skip {the thing I want} on the processing side.

This introduces an extra zip/unzip cycle, and worse yet, that cycle is going to take place on a slow slow machine
0
 
woolmilkporcCommented:
Yep, and that's the problem.

This step would work (I'll omit the "nc" step, that's not the problem):

cat *.tar.gz | gzip -dc

But the next possible step:

cat *.tar.gz | gzip -dc | tar -xvf -

would extract only the contents of the first tar.gz file in the "cat" conglomerate, because extracting stops when the first tar trailer is encountered!

So the problem is indeed what I wrote above: tar will not work on cat'ed tarballs except for the very first one.

You will have to process the tar.gz files one by one, without cat'ing them together.

wmp
0
 
woolmilkporcCommented:
Do you have ssh?

If so, you could try

for tgz in *.tgz
 do
  cat "$tgz" | ssh user@remothehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar > mynamedpipe'
 done

The problem is that my_desired_file must always be of the same name here, but we could also make it a variable.
The above uses GNU tar (the "-z" thing unzips "on the fly", and "-O" writes to stdout).
"2>/dev/null" is used to filter out the filename which is otherwise displayed in the first line when using the "-O" flag.

wmp
0
 
modsiwAuthor Commented:
Wool,

I believe your code would open and close the write end of mynamedpipe once per file. That would signal to the reader of mynamedpipe that the data is finished when more actually remains.

Do you happen to see a way around this?
0
 
woolmilkporcCommented:
No doubt, there will be an open/close with each file.

Yes, there is a way, but it's uuunelegant! We'll need a temporary file at the remote side and moreover, we will have to clean it up first, to be on the safe side

ssh user@remothehost 'rm -f /tmp/tgztemp 2>/dev/null'
for tgz in *.tgz
 do
  cat "$tgz" | ssh user@remothehost 'tar -zxv -O -f - my_desired_file 2>/dev/null | java -jar LogCleaner.jar >> /tmp/tgztemp'
 done
ssh user@remothehost 'cat /tmp/tgztemp > mynamedpipe && rm -f /tmp/tgztemp 2>/dev/null'

Not my usual programming style, but I don't see a better way at the moment.
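
A possible alternative that avoids the temporary file is to put the redirection on the whole loop rather than on each command, so the FIFO is opened for writing once and the reader sees end-of-file only after the last iteration. The sketch below is a local illustration of that behaviour only; demo.fifo, part1.txt, part2.txt and reader.out are hypothetical names, and a background cat stands in for the real reader.

```shell
# Hypothetical names throughout; this only demonstrates the FIFO behaviour.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkfifo demo.fifo
printf 'day1\n' > part1.txt
printf 'day2\n' > part2.txt

# Reader: gets end-of-file only when the last writer closes the FIFO.
cat demo.fifo > reader.out &

# Writer: the redirection belongs to the loop, not to each command,
# so the FIFO is opened once and stays open across all iterations.
for f in part1.txt part2.txt
do
  cat "$f"
done > demo.fifo

wait
cat reader.out
```

Applied to the situation here, that would presumably mean running the loop on the processing side and piping its combined output through a single LogCleaner process into mynamedpipe, assuming the processing server can ssh to the file server.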

wmp
0
 
modsiwAuthor Commented:
I found a way to read the cat'd-together tar.gz files.

A tar file is a series of blocks. Each entry has a header block, followed by body blocks if the entry has any content.

The end of a tar file is a series of end sentinel blocks. You can simply pick up reading from the cat'd-together tar stream where the previous untar left off.

import org.apache.tools.tar.TarEntry;
import org.apache.tools.tar.TarInputStream;
import java.io.*;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TarExtract {
    public static void main(String[] args) throws IOException {
        // Names of the entries to copy to stdout, taken from the command line.
        Set<String> files = new HashSet<String>(Arrays.asList(args));
        OutputStream os = new BufferedOutputStream(System.out, 16 * 1024 * 1024);
        InputStream is = new BufferedInputStream(System.in, 16 * 1024 * 1024);
        // Keep opening a new tar reader on the same stream until one
        // pass finds no entries at all, which means the input is exhausted.
        boolean foundEntry = true;
        while (foundEntry) {
            foundEntry = false;
            TarInputStream tis = new TarInputStream(is);
            for (TarEntry entry = tis.getNextEntry(); entry != null; entry = tis.getNextEntry()) {
                foundEntry = true;
                if (files.contains(entry.getName()))
                    tis.copyEntryContents(os);
            }
        }
        os.flush();
    }
}

0
 
balasundaram_sCommented:
If the tar files are tar'ed up together, then it's possible to extract one single file from each tar.

'cat' is only for the ASCII files ( or text files ).
0
 
modsiwAuthor Commented:
balasundaram,

They aren't tar'ed up together.

Suppose I make a TarExtracter.jar based on my class above. Then `cat z.tar.gz | gzip -dc | java -jar TarExtracter.jar "a" "b"` will produce `123456`.


Using the setup below, can you extract the `123456` from z.tar.gz using only the shell, gzip, and tar?
echo 123 > a
echo 456 > b
tar -czvf c.tar.gz a
tar -czvf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz
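
For what it's worth, GNU tar has an --ignore-zeros (-i) option that skips end-of-archive zero blocks instead of stopping at them, which is the situation in this challenge. A minimal sketch, assuming GNU tar is installed; the temporary directory and the `result` variable are only there for illustration. Since echo appends a newline to each file, the two bodies should come out on separate lines rather than as one run of digits.

```shell
# Assumption: GNU tar. --ignore-zeros (-i) is a GNU extension.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo 123 > a
echo 456 > b
tar -czf c.tar.gz a
tar -czf d.tar.gz b
cat c.tar.gz d.tar.gz > z.tar.gz

# gzip -dc decodes both gzip members into one stream of two tar archives.
# tar -x -O writes the named members to stdout; --ignore-zeros makes it
# read past the zero blocks that end the first archive.
result=$(gzip -dc z.tar.gz | tar -xOf - --ignore-zeros a b)
printf '%s\n' "$result"
```

The same tar invocation could stand in for {the thing I want} in the nc pipeline, with raw.tsv as the member name, provided the tar on the processing server is GNU tar.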

0
 
modsiwAuthor Commented:
A Perl solution instead of Java would probably work best. I'll give it a shot.
0
