enthuguy

How to recursively search for files with a specific extension and zip them in bash

Hi Experts,

With your help, I would like to achieve the following:
1. Recursively search for files with a specific extension followed by a numeric suffix. We cannot predict the number, but it is always digits only, 3-5 digits long.
      e.g. file2.ext456, file3.ext789, file4.ext111, file2.WSDL123 etc.
2. Do nothing to files without a numeric suffix,
e.g. file1.ext, file2.ext, file3.ext, file2.WSDL etc.
3. Rename each file (basically to remove the random numeric suffix) to something common,
      e.g. file2.ext.new, file3.ext.new, file4.ext.new, file2.WSDL.new etc.
4. Zip each renamed file and copy the zip file to a common location we specify.
5. Preferably have it as a function that can be called multiple times, once for each extension pattern.
6. As of now, I have around 4000 files to process :)

Just a bit about the context.
I'm trying to compare Oracle Service Bus artifacts (projects) between two exports. When OSB exports those projects out of the box (OOTB), it gives certain files a random suffix, which tricks my comparison utility into identifying them as new even though the file contents are identical. One way to resolve this is to rename those files to the same filename in both exports and then compare.

Sample directory structure and files:
path1/path2/file1.ext
path1/path2/file2.ext456
path1/path2/file2.ext
path1/path2/file3.ext789
path1/path2/file3.ext
path1/path2/path3/path4/file4.ext111
path1/path2/path3/path4/file4.ext
path1/path2/file5.ext222
path1/path2/file5.ext
path1/path2/path3/path4/file6.ext333
path1/path2/path3/path4/file6.ext
path1/path2/file7.ext123
path1/path2/file7.ext
path1/path2/file1.WSDL123
path1/path2/file2.WSDL123
path1/path2/file2.WSDL




Thanks in advance.
arnold

Is Perl an option alongside shell script?

There are many ways to do this.
In Perl:
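$startingdirectory='/path1'   # NOTE: no trailing ";" here as posted (diagnosed in the replies below)

open START, "-|", "/bin/ls", $startingdirectory or die "Unable to list directory \"$startingdirectory\": $!\n";
while (<START>) {
    chomp;
    # NOTE: "$_" is only the entry name, so this test is relative to the current directory
    if ( -d "$_" ) {
        print "Directory: $_\n";
        #here you would call the function with the $startingdirectory/$_.
    }
    else {
        print "File $_\n";
    }
}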


Perl has built-in pattern matching, so you can use /^.*\.[a-z]+[0-9]+$/i to require that the extension end with digits.
.......
enthuguy (ASKER)

Thanks arnold, since I have Perl installed (built in), we can make use of it as well.

I created a file searchFile.pl, pasted in the lines above, and tried to execute it, but I'm getting the error below:
syntax error at searchFile.pl line 3, near "open "

Do we have to import any module?

Also, what do you mean by "#here you would call the function with the $startingdirectory/$_."?
The syntax error looks like a missing ; on the previous line.
Another of the many ways would be to use File::Find:

#!/bin/perl
use File::Find;
# Walk "." recursively; for names ending in .ext plus 3-5 digits, replace the digits with ".new".
find(sub{ ($f=$_)=~s/(?<=\.ext)\d{3,5}$/.new/ && (rename $_,$f or warn "$_,$f $!") },".");
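For comparison, a shell-only take on the same rename might look like the sketch below. The function name rename_numbered and its two parameters (starting directory, extension) are made up for illustration. It assumes file names contain no newlines, and it skips a file when the target name already exists instead of overwriting it.

```shell
# Hypothetical helper: rename_numbered DIR EXT
# Renames name.EXT<3-5 digits> to name.EXT.new anywhere below DIR.
# Files without such a numeric suffix are left alone.
rename_numbered() {
    dir=$1 ext=$2
    find "$dir" -type f -name "*.${ext}[0-9][0-9][0-9]*" |
    while IFS= read -r path; do
        digits=${path##*."$ext"}        # text after the last ".EXT"
        case $digits in
            *[!0-9]*|"") continue ;;    # the suffix must consist of digits only
        esac
        [ "${#digits}" -ge 3 ] && [ "${#digits}" -le 5 ] || continue
        target="${path%"$digits"}.new"
        if [ -e "$target" ]; then
            echo "skip $path: $target already exists" >&2
        else
            mv -- "$path" "$target"
        fi
    done
}
```

It could then be called once per extension, e.g. rename_numbered path1 ext followed by rename_numbered path1 WSDL.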
Hi ozo, where can I specify the directory path, please?
Hi arnold, it gives me a list of directories

e.g.
File <parent dir name1>
File <parent dir name2>
File <parent dir name3>
File <parent dir name4>
Thx ozo, I believe the "." defines the directory location.
Let me try :)
Thx ozo, that worked for me.

Slightly challenging now :)
After I further analyzed the source, I see two files with a random numeric suffix:

path1/path2/file1.WSDL123
path1/path2/file1.WSDL456

What would be your suggestion to handle this, please?

I'm wondering whether it is possible to rename
path1/path2/file1.WSDL123 > path1/path2/file1.WSDL.new1
path1/path2/file1.WSDL456 > path1/path2/file1.WSDL.new2

Please advise.
Line 1 is missing a semicolon, as ozo pointed out.
The test
if ( -d "$startingdirectory/$_" ) {
is the correct test.
In your case you were in the location/path you were searching.

The example could serve as the sub/function that is called. Note that the example defines the START filehandle globally; in a recursive version, it has to be declared as a local variable within the function.......

Some modules are already included/installed.
www.cpan.org is a repository for the rest:
perl -MCPAN -e 'install bundle::;' if needed.

Your question can be interpreted in two ways:
1) you have a list (text file) whose contents you need to compare, or
2) as in the initial reply, you need to search through the file system.

2) can be turned into a pipeline, for example:
find /path1 -type f -name "*[0-9]" | perl script
so that the Perl script only needs to deal with the files you want it to process. Note that -name takes a shell glob, not a regular expression, so no trailing $ is needed; matching with a regular expression (-regex) is an extension offered by GNU find, not part of POSIX find.
Note that you can add a file to an archive rather than copying/duplicating it, though depending on the archive tool, the added entry will include the path from which the file comes.....
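As a rough sketch of the find-and-archive idea, the function below feeds find output into a loop and writes one zip per file into a common directory. The name zip_each, the "*.new" filter and the flattened archive names are assumptions for illustration, not something from the thread. It relies on the Info-ZIP zip command being installed (-j stores the file without its directory path, -q keeps it quiet) and assumes file names contain no newlines.

```shell
# Hypothetical helper: zip_each SRC_DIR DEST_DIR
# Writes one archive per *.new file found below SRC_DIR into DEST_DIR.
# DEST_DIR should live outside SRC_DIR.
zip_each() {
    src=$1 dest=$2
    mkdir -p "$dest" || return 1
    find "$src" -type f -name '*.new' |
    while IFS= read -r path; do
        rel=${path#"$src"/}
        # Flatten the relative path into the archive name so that files with
        # the same name in different directories do not end up in one archive.
        name=$(printf '%s' "$rel" | tr '/' '_')
        zip -q -j "$dest/$name.zip" "$path"
    done
}
```

For example, zip_each path1 /tmp/osb_zips would create /tmp/osb_zips/path2_file2.ext.new.zip for path1/path2/file2.ext.new.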
Thanks arnold.
Any suggestions on the new challenge, please?

After I further analyzed the source files, I see two or more files with a random numeric suffix:
path1/path2/file1.WSDL123
path1/path2/file1.WSDL456

I'm wondering whether it is possible to rename the files like below:
path1/path2/file1.WSDL123 > path1/path2/file1.WSDL.new1
path1/path2/file1.WSDL456 > path1/path2/file1.WSDL.new2

Please advise.
You can rename or do anything else you want, but first you must define the basis on which the processing logic will work.

Are you only concerned about file1.wsdl within the same path?
I.e. /path1/path2/file1.wsdl123
/path1/path3/file1.wsdl123
Would you treat them the same, i.e. compare them and, if they are identical (cksum/md5sum), do X; if not, do Y?
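A minimal sketch of such an "identical or not" check, using the POSIX cksum utility, could look like this. The function name same_content is made up for illustration. A checksum match is only strong evidence of equality; cmp -s would give a byte-exact answer.

```shell
# Hypothetical helper: same_content FILE_A FILE_B
# Exit status 0 when both files have the same CRC and size,
# 1 when they differ, 2 when a file cannot be read.
same_content() {
    sum_a=$(cksum < "$1") || return 2
    sum_b=$(cksum < "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}
```

Usage: if same_content /path1/path2/file1.wsdl123 /path1/path3/file1.wsdl123; then echo identical; else echo different; fi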
Depending on the number of files and constraints, one option is a hash keyed on the ending numbers.
In a pattern match, wrapping the digits in parentheses, ([0-9]+)$, captures the matched digits in a variable. With a single capture group,
/^.*\.[a-zA-Z]+([0-9]+)$/ puts the digits in $1.
If you need different behavior when the extension is different, capture it as well:
/^.*\.([a-zA-Z]+)([0-9]+)$/
In this case, for the first example (.wsdl123), wsdl will be in $1 and 123 in $2; for the second, $2 will be 456.
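The same two-group pattern can be tried from the shell with sed -E, where \1 and \2 play the role of $1 and $2. This is only a sketch; the function name split_suffix is made up for illustration.

```shell
# Hypothetical helper: split_suffix NAME
# Prints "<extension> <digits>" when NAME ends in .<letters><digits>;
# prints nothing and returns non-zero otherwise.
split_suffix() {
    out=$(printf '%s\n' "$1" | sed -nE 's/^.*\.([A-Za-z]+)([0-9]+)$/\1 \2/p')
    [ -n "$out" ] && printf '%s\n' "$out"
}

split_suffix file1.WSDL123    # prints: WSDL 123
```

A name without trailing digits, such as file2.WSDL, produces no output.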
Thanks arnold, ozo.
Is it possible to incorporate the renaming logic into ozo's script above?

If the file name path1/path2/file1.WSDL.new1 already exists, then the next file in the same location with a different numeric suffix should be incremented to path1/path2/file1.WSDL.new2.

Or is there a better way?

Please help.
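The "use .new1 unless it exists, then .new2" rule can be sketched in shell as below. The function name rename_to_next_new is hypothetical. Which file ends up as .new1 depends purely on the order in which the files are passed in, so the caller would need to sort them first if the order matters.

```shell
# Hypothetical helper: rename_to_next_new PATH
# Renames e.g. dir/file1.WSDL123 to dir/file1.WSDL.new1, or to .new2 when
# .new1 is already taken, and so on. Returns non-zero (and does nothing)
# when PATH has no 3-5 digit suffix.
rename_to_next_new() {
    path=$1
    # Strip a trailing run of exactly 3-5 digits that follows a non-digit.
    base=$(printf '%s\n' "$path" | sed -E 's/([^0-9])[0-9]{3,5}$/\1/')
    [ "$base" != "$path" ] || return 1
    n=1
    while [ -e "$base.new$n" ]; do      # find the first unused counter
        n=$((n + 1))
    done
    mv -- "$path" "$base.new$n"
}
```

One possible way to drive it: find . -type f -name '*.WSDL[0-9]*' | sort | while IFS= read -r f; do rename_to_next_new "$f"; done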
sed and awk can have logic built in for checking new1, but what about new2?
The examples can be tailored to your needs.
It is best to define what your need is and then implement the logic to achieve it.
Do the numbers always follow a specific order, or do you need to rely on the modification date of the file to know which is newer?

Could you have a situation where file1.wsdl123 is newer than file1.wsdl345?

Let's say you have this process in place: what happens to the files after they are added to the ZIP? Do they remain in place, are they deleted, or are they moved to yet another location?
Sorry for the delay.

Hi arnold, all your questions are valid points.

Update on why the product adds the suffix out of the box:
1. It is the product's way of renaming files.
2. If the first 40 characters of the filenames are the same, it truncates the filename to 40 characters and then adds a number suffix.

So in the above scenario,
these files will become...
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_retrieveService.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_reference_parent.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx_reference_child.log

this:
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx456.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx789.log

Since I'm only going to compare file contents in a temporary area (no impact on the source files), I would like to try renaming the files based on their order,

e.g. rename the files below
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx123.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx456.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx789.log

to
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx2.log
filename_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3.log

If we can do this, then I will rename the files in the other build the same way, perform the compare, and identify the differences. I think this should work, or at least it's worth a try :)

Please help me script this renaming logic in bash.

Thanks again.
My preference in this case would be to use Perl.
The idea is to use hashes:
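#!/usr/bin/perl

my $directory='.';
my %hash;   # base name -> numeric part -> { Filename_suffix => extension }
open(DIR, "-|", "/bin/ls", $directory)
    or die "Unable to list $directory contents: $!\n";
while (<DIR>) {
    chomp;
    print "$_\n";   # echo every entry seen
    # base name (non-greedy, so it leaves all trailing digits to $2), digits, extension
    if ( /^([0-9a-z_.-]+?)(\d+)\.([a-z]+)$/i ) {
        $hash{$1}->{$2}->{'Filename_suffix'} = $3;
    }
}
close DIR;
foreach my $key (keys %hash) {
    print "Filename $key\n";
    # numeric sort, so that e.g. 99 comes before 123
    foreach my $key2 (sort { $a <=> $b } keys %{$hash{$key}}) {
        print "$key $key2 $hash{$key}->{$key2}->{'Filename_suffix'}\n";
    }
}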


See if the above generates output that separates/orders the items as you want.
The rename/copy/move goes inside the last loop, where a counter starting from 1 can be added ..........
Hi arnold, sorry for the delay.
I have attached the actual output.

What do you suggest for the points below, please?

1. We should filter out files which don't have a numeric suffix.
2. Then, for files which share the same filename (string part), rename the ### suffix to a sequence number: 1, 2, etc.
3. I could have a sh script find each directory and pass its path to this script, since this script processes the files of one given directory. Or would it be easier to manage inside the same Perl script? In that case we pass the root directory, and the script should parse the files in each directory at every level and rename/move them.

e.g.
unappliedtransaction.service.loan.app.na.XMLSchema671
unappliedtransaction.service.loan.app.na.XMLSchema841

to
unappliedtransaction.service.loan.app.na.XMLSchema1
unappliedtransaction.service.loan.app.na.XMLSchema2

Thanks in advance.
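One way the three points might fit together in plain shell is sketched below: find walks every directory level, sed keeps only names ending in a 3-5 digit suffix and splits them into stem and digits, and a loop renumbers each group of identical stems starting from 1 in numeric order of the original suffix. The function name renumber_suffixes is made up for illustration, and the sketch assumes file names contain neither tabs nor newlines.

```shell
# Hypothetical helper: renumber_suffixes ROOT
# Below ROOT, renames stem671, stem841, ... to stem1, stem2, ... per stem.
# Files without a 3-5 digit suffix are ignored; existing targets are skipped.
renumber_suffixes() {
    root=$1
    tab=$(printf '\t')
    find "$root" -type f -name '*[0-9][0-9][0-9]' |
    sed -nE "s#^(.*[^0-9/])([0-9]{3,5})\$#\\1${tab}\\2#p" |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n |
    {
        prev= n=0
        while IFS="$tab" read -r stem digits; do
            # Restart the counter whenever a new stem begins.
            if [ "$stem" = "$prev" ]; then n=$((n + 1)); else n=1 prev=$stem; fi
            if [ -e "$stem$n" ]; then
                echo "skip $stem$digits: $stem$n already exists" >&2
            else
                mv -- "$stem$digits" "$stem$n"
            fi
        done
    }
}
```

Running the same function over both exports (on temporary copies) should give matching names wherever the two exports contain the same number of suffixed files per stem.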
script_output.log
ASKER CERTIFIED SOLUTION
arnold

This solution is only available to members.
Slightly better, but it is still listing regular files.

I tried to attach the actual file, but EE blocked it, saying BusinessService is not an allowed extension :)

Is there a way I can provide you the file? Please let me know.
Can you copy and paste a sample of the info?

Line 9 outputs every item seen in the directory.....

Note that for every file, the script first outputs the filename.extension (or directory) without the number as a reference, followed by the enumerated files, sorted.
I.e.
Filename.extension123
Filename.extension345


Filename.extension
Filename ...123 extension

Right now it separates the filename, extension and numeric part....
...
Thanks. Every minute I'm learning :)