Reading the response from rsync with python and popen3

Posted on 2010-01-04
Last Modified: 2012-06-22
I asked a similar question involving wget before and so have made this a related question.

I am using rsync over ssh to sync a local folder from a server.  My rsync command is fairly basic except for adding the --stats and --progress options to give me more information.

my statement is something along the lines of:

rsync -avz --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/

I am using ssh certificates on the server and the client, so there is no need to enter a password or for any other user interaction.

What I now want to do is call my rsync command and parse the results as in the wget example. My problem is that I am completely baffled as to how to put together the regular expressions (re) in Python that will let me read the response and tell when the process is complete.

Question by:Susurrus

Accepted Solution

by:
cminear earned 1200 total points
ID: 26173010
From my own quick sample run, I see the following last lines for the command:

Number of files: 89
Number of files transferred: 71
Total file size: 1373767 bytes
Total transferred file size: 1373767 bytes
Literal data: 1373767 bytes
Matched data: 0 bytes
File list size: 2103
File list generation time: 0.144 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 783708
Total bytes received: 1690

sent 783708 bytes  received 1690 bytes  174532.89 bytes/sec
total size is 1373767  speedup is 1.75

If you are looking for the "beginning of the end" --- in other words, where the transfer of files has stopped, and it's starting to output total statistics --- the 'start_of_end_re' should be the regular expression you need.  If you are looking for the end of the end, then the 'end_of_end_re' would be the one you want.
import re

start_of_end_re = re.compile('^Number of files:\s(\d+)\s*$')
end_of_end_re = re.compile('^total size is (\d+)\s+speedup is (\d+\.\d+)\s*$')
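
As a quick check, the two patterns can be run against the sample statistics shown above. This is only a minimal sketch, written in Python 3 syntax (the thread itself targets Python 2.5). The `sample` list is a hypothetical stand-in holding a few lines copied from the output above, and the patterns are written as raw strings.

```python
import re

# Same patterns as above, written as raw strings.
start_of_end_re = re.compile(r'^Number of files:\s(\d+)\s*$')
end_of_end_re = re.compile(r'^total size is (\d+)\s+speedup is (\d+\.\d+)\s*$')

# A few lines taken from the sample run above (stand-in for real rsync output).
sample = [
    "Number of files: 89\n",
    "Number of files transferred: 71\n",
    "sent 783708 bytes  received 1690 bytes  174532.89 bytes/sec\n",
    "total size is 1373767  speedup is 1.75\n",
]

file_count = None
total_size = None
speedup = None
for line in sample:
    m = start_of_end_re.match(line)
    if m:
        file_count = int(m.group(1))   # "beginning of the end"
        continue
    m = end_of_end_re.match(line)
    if m:
        total_size = int(m.group(1))   # "end of the end"
        speedup = float(m.group(2))

print(file_count, total_size, speedup)
```

Note that "Number of files transferred: 71" does not match the first pattern, because the pattern requires the colon directly after "files".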


Expert Comment

by:pepr
ID: 26189150
Basically, you get a file-like object that can be read line by line (if it behaves as a text file) or byte by byte (i.e. as single-character strings, if it behaves as if opened in binary mode).  I have not checked which is the case when using popen3: text or binary mode.  It is also not clear which popen3 (from which module) you are talking about.

Warning: you should use so-called r'raw strings' for regular expression patterns, or you have to double the backslashes.

If the response text is reasonably small, you can also use .read() to get everything into one multiline string and then split it into a list of lines if needed.

Anyway, you want to apply the regular expression either to the individual lines or to the whole content.

The above comment compiles regular expressions that are suitable for separate lines only (they explicitly mark the beginning and the end of the string). The compiled regular expressions from above are suitable for the .match() method (http://docs.python.org/library/re.html#re.match, http://docs.python.org/library/re.html#re.RegexObject.match).  However, you may often want the .search() method (http://docs.python.org/library/re.html#re.search, http://docs.python.org/library/re.html#re.RegexObject.search). The compiled expression is an object that has the pattern compiled inside, so you simply leave out the first argument (the pattern) mentioned in the documentation of the module-level functions.

The result of .search() or .match() is the match object or None when nothing found.  Because of this you probably want to process a line like this:

    rex = re.compile(r'your pattern (\d+) goes here')  # a single group is defined
    ...
    for line in fileLikeObjectInTextMode:
        m = rex.match(line)        # or m = rex.search(line) if appropriate
        if m:                      # the same as "if m is not None:"
            num = int(m.group(1))  # process the number extracted by the pattern

Depending on your needs you may also be interested in the methods .findall() or .finditer().  Attach the sample of your response text here, and tell what should be extracted.
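
To illustrate the difference between the methods mentioned in this comment, here is a small sketch in Python 3 syntax. The pattern and the sample strings are assumptions chosen for illustration (they mimic lines that appear elsewhere in this thread), not something the poster supplied.

```python
import re

rex = re.compile(r'(\d+) files to consider')

line = "53 files to consider\n"
indented = "   53 files to consider\n"

# .match() only succeeds when the pattern matches at the start of the string.
m_start = rex.match(line)
m_indented = rex.match(indented)      # None: the string starts with spaces

# .search() scans the whole string for the first match.
s_indented = rex.search(indented)

# .findall() returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r'\d+', "xfer#1, to-check=50/53")

print(int(m_start.group(1)), m_indented, int(s_indented.group(1)), all_numbers)
```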

Author Comment

by:Susurrus
ID: 26200810
Hi cminear, thanks for your quick response; as far as it went, it was exactly what I needed.  I am very sorry and embarrassed that I did not reply sooner, but I hope you are a patient enough bunch to continue helping me with this.

When I run rsync with --progress and --stats, before I receive the final stats (in the form you show above) I first get a progress report as the download is happening.  This looks like:


receiving file list ...
53 files to consider
pic1.jpg
      123433 100%    1.15MB/s    0:00:00 (xfer#1, to-check=50/53)
pic2.jpg
      123433 100%  415.66kB/s    0:00:00 (xfer#2, to-check=49/53)
pic4.jpg
      123433 100%  308.29kB/s    0:00:00 (xfer#3, to-check=48/53)
pic4.jpg
      123433 100%  219.16kB/s    0:00:00 (xfer#4, to-check=47/53)

(...continues.....)


I want to be able to capture these lines in Python so I can keep a running total of how many files have been downloaded.  Any idea how to write the regular expression for this?

Expert Comment

by:cminear
ID: 26201510
The easiest thing to do would be to look for the filenames:
  file_re = re.compile(r'\A(\S+)\s*\Z')
Of course, when you are running and see a filename, that only means that the download of that file is in progress, not necessarily that it is done.  You can get around this by saving the name, and when you see the next filename, then report the previous one as completed.
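
The "report the previous file when the next name appears" idea could be sketched roughly as follows, in Python 3 syntax. The generator name `completed_files` and the `sample` list are hypothetical; the sample lines mimic the progress output posted earlier. One caveat of this pattern: a file name containing spaces would not match `file_re`.

```python
import re

file_re = re.compile(r'\A(\S+)\s*\Z')

def completed_files(lines):
    """Yield a file name once the next file name (or end of input) is seen."""
    previous = None
    for line in lines:
        m = file_re.match(line)
        if m:
            if previous is not None:
                yield previous        # a new name means the previous file is done
            previous = m.group(1)
    if previous is not None:
        yield previous                # input ended, so the last file is done too

# Stand-in for lines read from rsync's stdout.
sample = [
    "receiving file list ...\n",
    "53 files to consider\n",
    "pic1.jpg\n",
    "      123433 100%    1.15MB/s    0:00:00 (xfer#1, to-check=50/53)\n",
    "pic2.jpg\n",
    "      123433 100%  415.66kB/s    0:00:00 (xfer#2, to-check=49/53)\n",
]

done = list(completed_files(sample))
print(done)
```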

However, I'm going to make a guess that you really want to also parse the statistics.  This will be more difficult.  To see why, run this command (when you know some updates will occur):
  rsync -avz --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/ > run-output
Then look at the 'run-output' file.  You should see some text like this:

       32768   0%    0.00kB/s    0:00:00^M     5406720   2%    5.12MB/s    0:00:42^M    11206656   4%    5.32MB/s    0:00:39^M  ...    229179392 100%    6.09MB/s    0:00:35  (1, 50.0% of 10)

When rsync outputs the statistics, it sends a carriage return to return the cursor to the beginning of the line, which it then overwrites with the next update.  It looks very nice when running interactively, but it's not as easy to deal with when just reading from stdout.  Plus, you may have buffering issues: you may not get any of the stats until you get all of them.  And if you are getting some of them, you have to be careful that you are dealing with them appropriately.  (The easiest way is probably to split the received string on carriage returns and then parse each piece.)

So if you are still interested in dealing with the individual file stats, ask particular questions.  As a start, here would be the regex for a single update, and for the final update:
  base_re_str = r'\A\s+(\d+)\s+(\d+%)\s+([\d\.]+)[GMk]B/s\s+(\d+:\d{2}:\d{2})'
  norm_update_re = re.compile(base_re_str + r'\Z')
  final_update_re = re.compile(base_re_str + r'\s+\(.*\)\s*\Z')
(Note that my final update output looks different than your example; the final_update_re works for either case.)
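
Splitting on carriage returns and applying these patterns could look roughly like this. It is a sketch in Python 3 syntax; the `raw` string is a shortened stand-in based on the 'run-output' sample above, and the tuple layout in `updates` is an arbitrary choice for illustration.

```python
import re

base_re_str = r'\A\s+(\d+)\s+(\d+%)\s+([\d\.]+)[GMk]B/s\s+(\d+:\d{2}:\d{2})'
norm_update_re = re.compile(base_re_str + r'\Z')
final_update_re = re.compile(base_re_str + r'\s+\(.*\)\s*\Z')

# Stand-in for one raw "line" from the pipe: updates separated by '\r'.
raw = ("       32768   0%    0.00kB/s    0:00:00\r"
       "     5406720   2%    5.12MB/s    0:00:42\r"
       "   229179392 100%    6.09MB/s    0:00:35  (1, 50.0% of 10)\n")

updates = []
for piece in raw.split('\r'):
    m = final_update_re.match(piece)
    is_final = m is not None
    if m is None:
        m = norm_update_re.match(piece)
    if m is None:
        continue  # empty piece, or not a progress update at all
    # (bytes so far, percentage, whether this is the last update for the file)
    updates.append((int(m.group(1)), m.group(2), is_final))

print(updates)
```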

Author Comment

by:Susurrus
ID: 26202883
Currently I think that simply being able to note when a file download is in progress may well be enough.  This will give me at least some basic information on where rsync has got to.  It would be nice to have more details than this (as you guessed), such as the percentages, but this looks like it may prove to be more trouble than it is worth for the moment.  I will have a go with what you have suggested so far and report back.

thanks again

Author Comment

by:Susurrus
ID: 26208494
I am still struggling to get even the simplest parts working (i.e. detecting when rsync has finished), let alone detecting when a file is being downloaded.  Below is the simplest sample I have of my efforts. It starts OK, and rsync downloads the files, but it does not detect when rsync finishes:
import os
import re

start_of_the_end = re.compile('^Number of files:\s(\d+)\s*$')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_check_end.readline()
		
	if not ending:
		ending_mo = start_of_the_end.match(line)
		if ending_mo:
			ending = True
		continue
	
		print "Rsync Completed"

		rsync_check_end()
		rsync_out.close()

		break


Expert Comment

by:cminear
ID: 26210837
Change your script to use "rsync_out" rather than "rsync_check_end".  The 'rsync_out' is the STDOUT from the rsync process, and this is where the rsync process would be sending the output; "rsync_check_end" would be the STDERR, and it wouldn't have the line you are looking for.

However, beyond that, you have some problems with your program flow.  I think you are missing an 'else'.  After the "continue", you have the print and the break.  Well, if you continue, you skip those actions.  And if you fix the problem above and you set "ending" to True, then you would never get to that break, because it is behind the "not ending" check.

(Another alternative would be to just move the print, closes and break within the "if ending_mo" block; you know it's ending, take care of it immediately and get out of there.)

Note that I'm guessing that you really meant "rsync_check_end.close()", and not "rsync_check_end()".

One final comment: if you wanted to be persnickety, you maybe would want to be checking for an EOF on the reads after you saw that the process was "ending".  That would be a better indication that rsync really was done and you wouldn't be abandoning it before it finished outputting its information (not that you care about it).  In this case, it probably doesn't matter, but doing that may be a good example for the next time you do something similar, and it does matter.
import os
import re

start_of_the_end = re.compile('^Number of files:\s(\d+)\s*$')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_check_end.readline()
		
	if not ending:
		ending_mo = start_of_the_end.match(line)
		if ending_mo:
			ending = True
		continue
	else:
		print "Rsync Completed"

		rsync_check_end.close()
		rsync_out.close()

		break
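
The EOF check mentioned in the final comment above could be sketched like this, in Python 3 syntax. `readline()` returns an empty string only at end of file, so the loop below keeps reading until the stream is exhausted instead of stopping at the first statistics line. The helper name `wait_for_end` is hypothetical, and `io.StringIO` merely stands in for the rsync stdout pipe so that the sketch runs without rsync.

```python
import io
import re

start_of_the_end = re.compile(r'^Number of files:\s(\d+)\s*$')

def wait_for_end(stream):
    """Read lines until EOF; return the file count if the stats line was seen."""
    file_count = None
    while True:
        line = stream.readline()
        if line == '':            # readline() returns '' only at end of file
            break
        m = start_of_the_end.match(line)
        if m:
            file_count = int(m.group(1))
    return file_count

# io.StringIO stands in for the rsync stdout pipe in this sketch.
fake_stdout = io.StringIO(
    "pic1.jpg\nNumber of files: 53\nTotal file size: 1373767 bytes\n")
result = wait_for_end(fake_stdout)
print(result)
```

A side benefit of testing for the empty string is that the loop also terminates if the process exits without ever printing the statistics line.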


Author Comment

by:Susurrus
ID: 26239083
The process is still hanging, I am going cross eyed looking at it.  I checked the expression through an on-line checker just to be sure, and I tried putting in print statements at various points in the code to try and work out where it is getting stuck.  As far as I can tell it is getting stuck in a loop at the match statement:

ending_mo = start_of_the_end.match(line)

and never gets any further than there, even though I know there is a matching line in the rsync output
import os
import re

start_of_the_end = re.compile('^Number of files:\s(\d+)\s*$')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_check_end.readline()

	if not ending:
		print "processing...."
		ending_mo = start_of_the_end.match(line)
		if ending_mo:
			ending = True
		continue
        
		print "Rsync Completed"
		rsync_check_end()
		rsync_out.close()
		
		break


Expert Comment

by:pepr
ID: 26282411
os.popen3() returns stdin, stdout, and stderr file-like objects. I am not that familiar with rsync. Are you sure that the parsed lines should be read from stderr?  It could be that nothing was sent to stderr (your rsync_check_end), so .readline() blocks until some line is obtained.

In my opinion, you should also not call rsync_in.close().

Say, the result is returned via rsync_out.  Your loop should look like:

for line in rsync_out:
    print 'processing line:', line
    m = start_of_the_end.match(line)
    if m:
        print 'extracted info:', m.group(1)
        break                  # break the loop

Or put your rsync_check_end to your for loop if rsync displays the message via stderr.

If you have a newer Python, consider the subprocess module instead of os.popen3().  It is more flexible with respect to synchronization with another process. See http://docs.python.org/library/subprocess.html#subprocess-replacements
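
A subprocess-based version of the same loop might look like the sketch below, in Python 3 syntax. To keep it runnable without rsync or a remote host, the command is a stand-in (an assumption for this sketch): a child Python process that prints two rsync-like lines. In a real script the list would hold the rsync command and its arguments instead.

```python
import re
import subprocess
import sys

start_of_the_end = re.compile(r'^Number of files:\s(\d+)\s*$')

# Stand-in command: a child Python process that prints rsync-like lines.
cmd = [sys.executable, '-c',
       "print('pic1.jpg'); print('Number of files: 53')"]

# universal_newlines=True makes the pipe deliver text lines.
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)

file_count = None
for line in proc.stdout:
    m = start_of_the_end.match(line)
    if m:
        file_count = int(m.group(1))

proc.stdout.close()
return_code = proc.wait()     # reap the child and collect its exit status
print(file_count, return_code)
```

Passing the command as a list avoids going through a shell, and `proc.wait()` gives an exit status that os.popen3() does not expose directly.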

Author Comment

by:Susurrus
ID: 26282513
Unfortunately, because of other factors in this project I am limited to Python 2.5. Is subprocess available in versions under 2.6?

Your version of the loop does seem to work though, and I am getting the correct response now.

Author Comment

by:Susurrus
ID: 26282645
Now my code completes correctly, but unfortunately I have not got quite the response I was looking for.  I am getting all the lines output at the end of the process, where I really wanted to have the file name lines output as rsync is processing (as when I run rsync from the command line), so that I have some feedback as to what rsync is doing.  My output looks like this:

Starting File Download...
processing line: receiving file list ...

processing line: 53 files to consider

processing line:  1.jpg

      123433 100%  704.91kB/s    0:00:00 (xfer#1, to-check=50/53)

processing line: processing line:  2.jpg

      123433 100%  461.84kB/s    0:00:00 (xfer#2, to-check=49/53)

processing line: processing line:  3.jpg

      123433 100%  289.06kB/s    0:00:00 (xfer#3, to-check=48/53)

processing line: processing line:  4.jpg

      123433 100%  209.63kB/s    0:00:00 (xfer#4, to-check=47/53)

processing line: processing line:  5.jpg

     8158529 100%  773.08kB/s    0:00:10 (xfer#5, to-check=46/53)

processing line: processing line:  6.jpg

      132793 100%  414.32kB/s    0:00:00 (xfer#6, to-check=44/53)

processing line: processing line:  7.jpg

      132793 100%  275.92kB/s    0:00:00 (xfer#7, to-check=43/53)

processing line: processing line:  8.jpg

      132793 100%  202.31kB/s    0:00:00 (xfer#8, to-check=42/53)


(.....continues......)


processing line: Number of files: 53

extracted info: 53


The information coming back is correct, but it is all coming at once, not line by line as rsync completes each file download

Expert Comment

by:pepr
ID: 26282654
The subprocess module was introduced as a standard module in Python 2.4.  os.popen3() is deprecated since Python 2.6; however, the usual approach when deprecating a Python module is to introduce the replacement earlier.  The truth is that os.popen3() (and os.system() and that kind of functionality) may look easier to use.  I personally also thought so.  However, it is only a matter of getting used to something with a slightly different interface.  I personally found the subprocess replacements to be clear and easy to use.

On the other hand, there probably is no need to change the existing code when it works.  Anyway, you can try to write the similar code with subprocess module and see if it also works.  

Expert Comment

by:pepr
ID: 26282942
Some more notes: You may consider also os.popen4() that merges stdout and stderr output together.

I tried some things with subprocess (but on Windows).  It seems that it returns characters instead of lines (as if the file-like object was opened in binary mode instead of text mode).  On the other hand, it may work more correctly with buffering of the info.

Also, if I am not wrong, at least earlier versions of Windows implemented pipes as files if the pipe was written as a command interpreted by command.com.  This would cause the behaviour you described.  I do not know what OS you use.

Anyway, it is likely that newer Windows versions implement "the pipes" between utilities better.  It is also likely that the subprocess module is better with respect to finer buffering of the piped information.  On the other hand, the binary mode (if my observations are valid also for your case) requires an extra step to put the characters back into lines.


Assisted Solution

by:pepr
pepr earned 800 total points
ID: 26283160
Sorry for my bad information.  When you use something like

fout = subprocess.Popen(myCmd, stdout=subprocess.PIPE, universal_newlines=True).stdout

then you get the fout file-like object opened in text mode. I was using subprocess.Popen(...).communicate() when I observed the "binary-like" mode.

For immediate synchronisation, I do not have experience with that.  But we can try later when everything else works.
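
One reading of the correction above is that .communicate() returns the whole output as a single string, so looping over that result yields characters rather than lines. The sketch below, in Python 3 syntax, shows that effect and the .splitlines() step that recovers the lines. The child command is a stand-in (an assumption) so the example runs without rsync.

```python
import subprocess
import sys

# Stand-in command: a child Python process that prints two file names.
cmd = [sys.executable, '-c', "print('pic1.jpg'); print('pic2.jpg')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
out, _ = proc.communicate()   # waits for the child; returns all output at once

# Iterating over the returned string gives single characters...
first_chars = list(out)[:3]

# ...while splitlines() puts them back into lines.
lines = out.splitlines()

print(first_chars, lines)
```

Because .communicate() waits for the process to finish, it is not suitable for showing progress while the transfer is still running; reading from the pipe directly is needed for that.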

Author Comment

by:Susurrus
ID: 26283650
just to confirm my setup:

I am running on ubuntu linux 9.10, using python 2.5

Expert Comment

by:cminear
ID: 26284059
This may be a step backward, but when I run the script below (with the 'rsync' command modified for my environment), I am getting a line outputted immediately for each file.  (I have to apologize: I suggested that you use 'rsync_out' over 'rsync_check_end', but then I didn't modify that part in the script.  Sorry about that.)

My running host is python 2.5.2 on FreeBSD 7.2; a different OS, but operationally it should not be much different as they are both Unix-ish.
import os
import re

start_of_the_end = re.compile(r'^Number of files:\s(\d+)\s*$')
file_re = re.compile(r'^(\S+)')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_out.readline()
		
	if not ending:
		ending_mo = start_of_the_end.match(line)
		file_mo = file_re.match(line)
		if ending_mo:
			ending = True
		elif file_mo:
			print "Got %s" % file_mo.groups()[0]
	else:
		print "Rsync Completed"

		rsync_check_end.close()
		rsync_out.close()

		break
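
A possible explanation for why this script reports each file immediately while the earlier "for line in rsync_out" loop delivered everything at once (this is not confirmed anywhere in the thread): in Python 2, iterating over a file object uses an internal read-ahead buffer, whereas readline() returns each line as soon as it is available. The sketch below, in Python 3 syntax, keeps the readline() behaviour but wraps it in iter() with an empty-string sentinel, so the loop also stops cleanly at EOF. The child command is a stand-in (an assumption) for rsync.

```python
import subprocess
import sys

# Stand-in for rsync: a child that prints a file name and flushes each time.
child = ("import sys\n"
         "for name in ('pic1.jpg', 'pic2.jpg'):\n"
         "    print(name)\n"
         "    sys.stdout.flush()\n")
cmd = [sys.executable, '-c', child]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)

seen = []
# iter(f.readline, '') calls readline() repeatedly and stops when it returns ''.
for line in iter(proc.stdout.readline, ''):
    seen.append(line.rstrip('\n'))
    print("Got %s" % seen[-1])

proc.stdout.close()
proc.wait()
```

The same `iter(stream.readline, '')` idiom also works on the file objects returned by os.popen3() in Python 2.5.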


Author Comment

by:Susurrus
ID: 26310657
Thanks cminear, that got me 70% of the way to where I wanted to go, and I have learned a lot in the process.

I have had so much good help on this, it's hard to see how best to split the points. So if I get it wrong, please forgive me :)