Solved

Get total file size for a date range in HDFS

Posted on 2013-11-11
493 Views
Last Modified: 2013-11-19
Hello,
I want to get the total file size in HDFS from June 01, 2013 until today. For example, if I have 4 files within this date range (Jun through Nov), each 100KB, I want the output to be 400KB. My approach at this point is to run hadoop fs -ls and get each file's modification datetime and size, then exclude all the files that lie outside this range and sum up the individual file sizes. Please suggest a 1-2 liner approach; I want to avoid multiple steps.
Thank You
Question by:Nova17
3 Comments
 
LVL 19

Accepted Solution

by:
simon3270 earned 400 total points
ID: 39641190
An example output from "hadoop fs -ls" would have been useful (I don't have hadoop installed, but this is a scripting exercise rather than a hadoop one).

I believe that it looks like:
drwxr-xr-x   1 user1 user1          0 2013-06-25 16:45 /user/user1
-rw-r--r--   1 user1 user1       1845 2013-05-25 16:45 /user/user1/file1.lst
-rw-r--r--   1 user1 user1       1322 2012-06-25 16:45 /user/user1/file2.old
-rw-r--r--   1 user1 user1       2241 2013-06-25 16:45 /user/user1/file3.new

with a leading "-" for regular files and a leading "d" for directories. In this case, file1.lst and file2.old are too old (before June this year, and last year, respectively), and file3.new is new enough (June or later this year).

The following awk script selects only regular files, discards any with a year earlier than 2013 or a month earlier than June, and then adds up the sizes of the files that remain. It relies on "hadoop fs -ls" returning file sizes in bytes; if you used the human-readable version (hadoop fs -ls -h) to get sizes such as 1.4k, the problem would become *much* harder to solve.
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{print s}'

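For readability, the same filter can also be written as a standalone awk file (a sketch of the identical logic, nothing hadoop-specific; the file name sum-recent.awk is just for illustration):

# sum-recent.awk -- usage: hadoop fs -ls | awk -f sum-recent.awk
/^-/ {                              # regular files only (lines starting with "-")
    split($6, d, "-")               # $6 is the date column, e.g. 2013-06-25
    if (d[1] < 2013 || d[2] < 6)    # skip anything before June 2013
        next                        # (assumes "today" is still within 2013)
    total += $5                     # $5 is the size in bytes
}
END { print total }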

If you wanted the output to be in, say, kbytes, you could just change the print statement at the end (this version gives kbytes to one decimal place):
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{printf "%.1fk\n" s/1024}'


or megabytes with 3 decimal places:
hadoop fs -ls |  awk '/^-/{split($6,a,"-");if ( a[1]< 2013 || a[2] < 6){next};s=s+$5}END{printf "%.3fM\n" s/1048576}'

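If you ever need both ends of the range (rather than "June onwards this year"), ISO dates in that column compare correctly as plain strings, so a sketch along these lines should work (the -v dates here are just example values):

hadoop fs -ls | awk -v from=2013-06-01 -v to=2013-11-11 '/^-/ && $6 >= from && $6 <= to {s += $5} END {print s+0}'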

 
LVL 20

Expert Comment

by:Daniel McAllister
ID: 39643206
This looks like overkill to me:


touch -d "starting date" /tmp/startdate
touch -d "stop date" /tmp/stoptime
SIZEOF=0

find $DIR -newer /tmp/starttime -a ! -newer /tmp/stoptime |
  while read PICKED ; do
   THISSIZE=`stat -c "%s" $PICKED`
   SIZEOF=`expr $SIZEOF + $THISSIZE`
  done

echo "SIZE is $SIZEOF"
exit 0


Dan
IT4SOHO

PS: No debugging that... just banged it out... probably got some details off...
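For local files there is also a shorter variant that sidesteps the subshell issue altogether (a sketch assuming GNU find with -printf support; it would not help with HDFS, of course):

find "$DIR" -type f -newer /tmp/starttime ! -newer /tmp/stoptime -printf '%s\n' | awk '{s += $1} END {print s+0}'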
 
LVL 19

Expert Comment

by:simon3270
ID: 39643232
I think you need to use the "hadoop fs -ls" command because the files are in HDFS rather than a local file system; otherwise a "find"-based approach would be quite good (if a little more long-winded than a couple of awk statements).
