Solved

Get total file size for a date range in HDFS

Posted on 2013-11-11
Medium Priority
550 Views
Last Modified: 2013-11-19
Hello,
I want to get the total size of files in HDFS modified from June 01, 2013 until today. For example, if I have 4 files within this date range (Jun through Nov), each 100KB, I want the output to be 400KB. My approach so far is to run hadoop fs -ls to get each file's modification datetime and size, then exclude all files that lie outside the range, and finally sum the individual file sizes. Please suggest a 1-2 line approach here; I want to avoid multiple steps.
Thank You
Question by: Nova17
3 Comments
 

Accepted Solution

by: simon3270 (earned 1600 total points)
ID: 39641190
An example output from "hadoop fs -ls" would have been useful (I don't have hadoop installed, but this is a scripting exercise rather than a hadoop one).

I believe that it looks like:
drwxr-xr-x   1 user1 user1       0 2013-06-25 16:45 /user/user1
-rw-r--r--   1 user1 user1    1845 2013-05-25 16:45 /user/user1/file1.lst
-rw-r--r--   1 user1 user1    1322 2012-06-25 16:45 /user/user1/file2.old
-rw-r--r--   1 user1 user1    2241 2013-06-25 16:45 /user/user1/file3.new

with a leading "-" for regular files and "d" for directories.  In this case, file1.lst and file2.old are too old (before June of this year, and last year, respectively), and file3.new is new enough (June or later this year).

The following awk script selects only regular files, discards any with a year earlier than 2013 or a month earlier than June, then adds up the sizes of the remaining files.  It relies on "hadoop fs -ls" reporting file sizes in bytes; if you used the human-readable version ("hadoop fs -ls -h") to get sizes such as 1.4k, the problem would become *much* harder to solve.
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{print s}'


If you wanted the output to be in, say, kbytes, you could just change the print statement at the end (this version gives kbytes with one decimal place):
hadoop fs -ls | awk '/^-/{split($6,a,"-");if ( a[1] < 2013 || a[2] < 6){next};s=s+$5}END{printf "%.1fk\n", s/1024}'


or megabytes with 3 decimal places:
hadoop fs -ls | awk '/^-/{split($6,a,"-");if ( a[1] < 2013 || a[2] < 6){next};s=s+$5}END{printf "%.3fM\n", s/1048576}'

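Since the dates in column 6 are in ISO format, they sort lexicographically, so (as a sketch, assuming the listing format shown above) you could also compare the whole date string against a cutoff, which avoids the year/month arithmetic and keeps working for years after 2013:

hadoop fs -ls | awk '/^-/ && $6 >= "2013-06-01" {s += $5} END {print s}'

Run against the sample listing above, this prints 2241, since only file3.new falls in the range.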

 

Expert Comment

by: Daniel McAllister
ID: 39643206
This looks like overkill to me:


#!/bin/bash
# DIR is the directory tree to scan
touch -d "starting date" /tmp/starttime
touch -d "stop date" /tmp/stoptime
SIZEOF=0

# Feed the loop from process substitution rather than a pipe, so that
# SIZEOF survives the loop (a pipe would run it in a subshell)
while read PICKED ; do
   THISSIZE=`stat -c "%s" "$PICKED"`
   SIZEOF=`expr $SIZEOF + $THISSIZE`
done < <(find "$DIR" -type f -newer /tmp/starttime -a ! -newer /tmp/stoptime)

echo "SIZE is $SIZEOF"
exit 0


Dan
IT4SOHO

PS: No debugging that... just banged it out... probably got some details off...
 

Expert Comment

by: simon3270
ID: 39643232
I think you need to use the "hadoop fs -ls" command to read the HDFS file system; if this were an ordinary local file system, a "find"-based approach would be quite good (if a little more long-winded than a couple of awk statements).
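For example, on a local file system a minimal sketch (GNU find assumed; $DIR is the hypothetical directory to scan) could be:

find "$DIR" -type f -newermt "2013-06-01" -printf '%s\n' | awk '{s += $1} END {print s}'

Here -newermt compares each file's modification time against a date string, and -printf '%s\n' prints each size in bytes, so the awk just totals them.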
