Solved

bash: /bin/cat: Argument list too long

Posted on 2010-11-11
3,478 Views
Last Modified: 2012-05-10
Hi,

I have a folder with 180,000 documents, I was trying to open them all and write to a single file but received the "Argument list too long". Any way to get around this limitation with "cat"?

Essentially, I want to combine all files and unique all records.

Thank you.
Question by:faithless1
17 Comments
 
LVL 11

Expert Comment

by:tel2
Hi faithless1,

Does this work for you?

ls | xargs cat >/tmp/allfiles.out

Feel free to ignore any "Is a directory" errors if the folder happens to contain subdirectories.

Then you can run sort -u (or sort | uniq) on the output (or pipe the above through that before redirecting to allfiles.out).
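For example, to combine and de-duplicate in one go, something like this should do it (just a sketch, assuming one record per line and filenames without spaces):

ls | xargs cat | sort -u >/tmp/allfiles.out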
 
LVL 11

Assisted Solution

by:tel2
tel2 earned 100 total points
Hi again faithless1,

This one allows you to specify wildcards for the types of files you want to process:

find . -name "*.txt" -exec cat '{}' ';' >allfiles.out
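
If your find supports the '+' terminator (standard on modern systems), it batches the filenames instead of running cat once per file, which should be much faster - a sketch:

find . -name "*.txt" -exec cat '{}' + >allfiles.out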
 
LVL 1

Expert Comment

by:mifergie
Here's one way to run a series of checks on the contents of the files:

$ for file in `/usr/bin/ls *.sh`; do grep searchtermhere $file; if [ $? -eq 1 ]; then cat $file>>txt.out; fi; done;
 
LVL 11

Expert Comment

by:tel2
Hi mifergie,

How is that going to get around faithless1's problem of having to process 180,000 documents, which gives an "Argument list too long" error when you do things like:
    cat * >txt.out    # Which he was probably trying to do
or
    ls * | ...              # Which you are essentially doing
?

Your solution fails with the same error.
 
LVL 1

Expert Comment

by:mifergie
My post has been deleted for some reason, so I don't know exactly what I said.  It doesn't seem to be in my bash history either...

I don't claim that cat *>txt.out would be successful.  I claim that you can just do this in a for-loop and get around the large argument list.
 
LVL 11

Expert Comment

by:tel2
Hi mifergie,

I can still see your post (even after a refresh).  Here it is for your reference:
    Here you can allow a series of checks on the contents of files:
    $ for file in `/usr/bin/ls *.sh`; do grep searchtermhere $file; if [ $? -eq 1 ]; then cat $file>>txt.out; fi; done;


The idea of a for loop is OK, but what I'm trying to say is that your "ls *..." will fail just like "cat *..." does, because in both cases the shell expands the wildcard into the full list of filenames on the command line, which blows the limit.
If you don't believe me, run a for loop to "touch" 180,000 files with names like:
    abcdefghijklmnopqrstuvwxyz_0.txt
    abcdefghijklmnopqrstuvwxyz_1.txt
    abcdefghijklmnopqrstuvwxyz_2.txt
    ...etc...
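A loop like this should do it (just a sketch, assuming seq is available):
    for i in `seq 0 179999`; do touch abcdefghijklmnopqrstuvwxyz_$i.txt; done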
Next, run this:
    cat *.txt >txt1.out
If the cat works, create more files (or files with longer names), until the cat fails.
Next, run your solution (making sure you change your "*.sh" to "*.txt").
Are you with me?
 
LVL 1

Expert Comment

by:mifergie
Gotcha.  Hmmm...
Well, okay, how about dropping the *...

for file in `/usr/bin/ls`; do grep searchtermhere $file; if [ $? -eq 1 ]; then cat $file>>txt.out; fi; done;

If that works on the 180k files, one could easily build a test for the file name by running grep on the filename itself.  If I get a chance in the next few minutes I'll provide an example.

 
LVL 1

Expert Comment

by:mifergie
Here's something that doesn't require any long input list:

for file in `/usr/bin/ls`; do if [ `echo $file | grep '\.sh'` ]; then echo $file; cat $file>>txt.out; fi; done;

So as long as ls can operate on 180k files, this should work.
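
A variant that skips the per-file grep by matching the name with a shell pattern instead (just a sketch, untested):

for file in `ls`; do case "$file" in *.sh) cat "$file">>txt.out;; esac; done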
 
LVL 11

Expert Comment

by:tel2
Now you're talking, mifergie!

A few notes:
- '/usr/bin/ls' will fail on any system which has 'ls' somewhere else (like the GNU/Linux webhost I'm using, which has it in /bin).
- Running the for loop 180,000 times, and running all those commands inside it, will be a lot slower than the solutions I've given.  Yes, I tested it.
- Your final ';' is unnecessary.


Hi faithless1,

I've just realised that my first solution could be simplified (and sped up) as follows:
    ls | cat >/tmp/allfiles.out
Of course, like my first solution, this doesn't filter by filename (one could use my 'find...' for that), but you haven't said filtering is a requirement.
 
LVL 1

Expert Comment

by:mifergie
Yup, I was just copying and pasting from cygwin.  On my system ls is aliased to something that appends an asterisk to the names of some files (executables, I think), so I have to specify the path exactly.  I also know that it will be slower - but it gives a great amount of flexibility for picking certain files based upon user-supplied criteria.

I kind of wondered why you had the xargs in your first solution...  
 

Author Comment

by:faithless1
Hi,

Thanks for all the responses, very much appreciated. Still having the same issue when running these commands:

ls | xargs cat * >/tmp/allfiles.out
bash: /usr/bin/xargs: Argument list too long

ls | cat * >/tmp/allfiles.out
bash: /bin/cat: Argument list too long

Filtering isn't a requirement, but helpful to know.

If it isn't possible to do this with standard commands, perhaps I can cat the first 50K files to output.txt, then append the next 50K with >>, and so on?

Thanks again.
 
LVL 10

Expert Comment

by:TRW-Consulting
Remove the * in your xargs command and make it:

  ls | xargs cat  >/tmp/allfiles.out
 
LVL 10

Expert Comment

by:TRW-Consulting
Oh, and don't give me the points if that 'xargs' solution works. They should go to the very first poster, tel2.

Now if that doesn't work, then an alternative is:

ls |
  while read filename
  do
    cat $filename
  done >/tmp/allfiles.out
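
(If any of the filenames contain spaces, quoting the variable - cat "$filename" - should keep them intact.)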
 
LVL 11

Expert Comment

by:tel2
Thanks for that, TRW.

Hi faithless1,

My solutions work for me - I have tested them with enough files to simulate your situation.

As TRW has implied, the '*'s you've put in the commands you ran were not in my solutions.

As stated in my last post, the "xargs" is not required (it just slows things down in this case).

So, my recommended solutions are:
    ls | cat >/tmp/allfiles.out   # From my last post
    find . -name "*.txt" -exec cat '{}' ';' >allfiles.out  # From my 2nd post

Enjoy.
 

Author Comment

by:faithless1
Hi,

Thanks again, and apologies for including the *. I think this partly solves the problem. I ran both commands; here are the results:

ls |
  while read filename
  do
    cat $filename
  done >/tmp/allfiles.out

TRW, in this case there are 180K unique files, so I'm not sure how I would execute this on the command line. I tried replacing "filename" with allfiles.out but wasn't successful - I'm pretty sure I'm doing this incorrectly.

Tel2,
I was able to pipe everything to 'allfiles.out', which now lists all 180K files. Is there a way to create 'allfiles.out' so that it contains the contents of each of the 180K files, rather than just the list of 180K filenames?

Thanks again.
 
LVL 11

Expert Comment

by:tel2
Hi faithless1,

Questions:
1. Are you saying that allfiles.out now contains a list of file NAMES, rather than the contents of the files?
2. Please post the exact command you ran to generate allfiles.out.
3. How big is allfiles.out, in bytes and lines (try: wc allfiles.out)?
4. Are the 180,000 files text files, or something else?

Thanks.
 
LVL 10

Accepted Solution

by:TRW-Consulting
TRW-Consulting earned 400 total points
If you need to do it all on a single command line, use this:

  ls |  while read filename;  do cat $filename;  done >/tmp/allfiles.out

Just copy and paste the line above.
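
If you also want the "unique all records" part of the original question, piping the combined output through sort -u should do it - a sketch:

  ls |  while read filename;  do cat "$filename";  done | sort -u >/tmp/allfiles.out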
