Get total size of multiple files for a date range

I want to get the total file size in HDFS from June 01, 2013 until today. For example, if I have 4 files within this date range (Jun through Nov), each 100KB, I want the output to be 400KB. My approach at this point is to run hadoop fs -ls and get each file's modification datetime and size, then exclude all the files that lie outside this range and sum up the individual file sizes. Please suggest a 1-2 line approach; I want to avoid multiple steps here.
Thank You
simon3270 commented:
An example output from "hadoop fs -ls" would have been useful (I don't have hadoop installed, but this is a scripting exercise rather than a hadoop one).

I believe that it looks like:
drwxr--r--   1 user1 user1          0 2013-06-25 16:45 /user/user1
-rw-r--r--   1 user1 user1       1845 2013-05-25 16:45 /user/user1/file1.lst
-rw-r--r--   1 user1 user1       1322 2012-06-25 16:45 /user/user1/file2.old
-rw-r--r--   1 user1 user1       2241 2013-06-25 16:45 /user/user1/file3.new

with a leading "-" for regular files and "d" for directories.  In this case, file1.lst and file2.old are too old (before June this year, and last year respectively), and the third file (shown here as file3.new) is new enough (June or later this year).

The following awk script selects only regular files, discards any with a year earlier than 2013 or a month earlier than June, then adds up the sizes of the files that are left.  It relies on "hadoop fs -ls" returning file sizes in bytes; if you tried using the human-readable version (hadoop fs -ls -h) to get sizes such as 1.4k, the problem would be *much* harder to solve.
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{print s}'
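
With the sample listing above, only the 2241-byte file passes both the /^-/ test and the date test (file1.lst is from May 2013, file2.old is from 2012), so this should print:
2241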


If you wanted the output to be in, say, kbytes, you could just change the print statement at the end (this version gives kbytes with one decimal place):
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{printf "%.1fk\n", s/1024}'


or megabytes with 3 decimal places:
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{printf "%.3fM\n", s/1048576}'
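
If the files sit in subdirectories rather than one directory, the same filter should also work on a recursive listing. This is an untested sketch, assuming your Hadoop version has a recursive listing ("hadoop fs -lsr", or "hadoop fs -ls -R" on newer releases) with the same column layout:
hadoop fs -lsr /user/user1 |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{print s}'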


Daniel McAllister (President, IT4SOHO, LLC) commented:
This looks like overkill to me:

touch -d "2013-06-01" /tmp/starttime   # start of the range from the question
touch -d "now" /tmp/stoptime           # end of the range (today)

find "$DIR" -type f -newer /tmp/starttime -a ! -newer /tmp/stoptime |
  while read PICKED ; do
    THISSIZE=`stat -c "%s" "$PICKED"`
    TOTAL=`expr ${TOTAL:-0} + $THISSIZE`
    echo "SIZE is $TOTAL"   # running total; the last line printed is the grand total
  done
exit 0


PS: No debugging that... just banged it out... probably got some details off...
I think that you need to use the "hadoop fs -ls" command to read the file system, otherwise a "find"-based system would be quite good (if a little more longwinded than a couple of awk statements).
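
For files on a local filesystem (find and stat can't read HDFS directly), a sketch of that find-based approach, assuming GNU find with -newermt and -printf, could be as short as:
# sum the sizes (in bytes) of regular files modified on or after 1 June 2013
find "$DIR" -type f -newermt "2013-06-01" -printf "%s\n" | awk '{s += $1} END {print s}'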