Reading the response from rsync with python and popen3


Basically, you get a file-like object that can be read line by line (if it behaves like a text file) or byte by byte (i.e. as single-character strings, if it behaves as if opened in binary mode). I have not checked whether popen3 gives you text or binary mode. It is also not clear which popen3 (from which module) you are talking about.

Warning: you should use so-called raw strings (r'...') for regular expression patterns; otherwise you have to double the backslashes.

If the response text is reasonably small, you can also use .read() to get everything as one multiline string and then split it into a list of lines if needed.
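As a small sketch of that approach (the string below is only a stand-in for what .read() might return, not real captured output):

```python
# 'text' stands in for the result of fileLikeObject.read() (assumed sample).
text = "receiving file list ...\n53 files to consider\npic1.jpg\n"

# splitlines() drops the line terminators and treats \n, \r\n and \r alike.
lines = text.splitlines()
print(lines)
```

Because splitlines() also breaks on a bare carriage return, it may be handy for output that redraws a line in place.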

Either way, you want to apply the regular expression either to the individual lines or to the whole content.

The comment above compiles regular expressions that are suitable for separate lines only (they explicitly mark the beginning and the end of the string). Those compiled regular expressions are suitable for the .match() method (http://docs.python.org/library/re.html#re.match, http://docs.python.org/library/re.html#re.RegexObject.match). However, you may often want the .search() method instead (http://docs.python.org/library/re.html#re.search, http://docs.python.org/library/re.html#re.RegexObject.search). A compiled expression is an object that holds the pattern, so when you call its methods you simply leave out the first (pattern) argument mentioned in the documentation.
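To make the difference between the two methods concrete, here is a minimal sketch. The sample line and the name number_re are made up for illustration:

```python
import re

line = "Number of files: 53\n"          # assumed sample line
number_re = re.compile(r'(\d+)')

# .match() only succeeds if the pattern matches at the very start of the string.
at_start = number_re.match(line)         # None, because the line starts with "Number"

# .search() scans forward and returns the first place where the pattern matches.
anywhere = number_re.search(line)
print(at_start)
print(anywhere.group(1))
```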

The result of .search() or .match() is a match object, or None when nothing was found. Because of this you probably want to process a line like this:

rex = re.compile(r'your pattern (\d+) goes here')   # the single group defined
...
for line in fileLikeObjectInTextMode:
    m = rex.match(line)         # or you can use m = rex.search(line) if appropriate
    if m:                       # the same as "if m is not None:"
        num = int(m.group(1))   # processing of the number extracted by the pattern

Depending on your needs you may also be interested in the methods .findall() or .finditer(). Attach a sample of your response text here, and say what should be extracted.
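In case it helps, this is roughly how those two methods behave on a whole block of text. The text and the pattern here are invented for the sketch and have nothing to do with rsync's real output:

```python
import re

# Invented sample text; replace it with the real response.
text = "copied 12 files\nskipped 3 files\nfailed 0 files\n"
count_re = re.compile(r'(\w+) (\d+) files')

# With two groups in the pattern, findall() returns a list of tuples.
pairs = count_re.findall(text)

# finditer() yields match objects, so groups can be converted one by one.
totals = {}
for m in count_re.finditer(text):
    totals[m.group(1)] = int(m.group(2))

print(pairs)
print(totals)
```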

ASKER

Hi cminear, thanks for your quick response; as far as it went, it was exactly what I needed. I am very sorry and embarrassed that I did not reply sooner, but I hope you are a patient enough bunch to continue helping me with this.

When I run rsync with --progress and --stats, before I receive the final stats (in the form you show above) I first get a progress report as the download is happening. This looks like:

receiving file list ...
53 files to consider
pic1.jpg
123433 100% 1.15MB/s 0:00:00 (xfer#1, to-check=50/53)
pic2.jpg
123433 100% 415.66kB/s 0:00:00 (xfer#2, to-check=49/53)
pic4.jpg
123433 100% 308.29kB/s 0:00:00 (xfer#3, to-check=48/53)
pic4.jpg
123433 100% 219.16kB/s 0:00:00 (xfer#4, to-check=47/53)

(...continues.....)

I want to be able to capture these lines in Python so I can keep a running total of how many files have been downloaded. Any idea how to write the regular expression for this?

The easiest thing to do would be to look for the filenames:
file_re = re.compile(r'\A(\S+)\s*\Z')
Of course, when you are running and see a filename, that only means that the download of that file is in progress, not necessarily that it is done. You can get around this by saving the name, and when you see the next filename, then report the previous one as completed.
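A rough sketch of that "report the previous name" idea follows. The generator name completed_files and the sample lines are assumptions for illustration; note that file_re will not match filenames containing spaces:

```python
import re

file_re = re.compile(r'\A(\S+)\s*\Z')

def completed_files(lines):
    """Yield a filename once a later filename line (or the end of input) is seen."""
    previous = None
    for line in lines:
        m = file_re.match(line)
        if m:
            if previous is not None:
                yield previous          # a new name appeared, so the old one is done
            previous = m.group(1)
    if previous is not None:
        yield previous                  # the last file is done when the output ends

# Sample lines modelled on the progress output quoted above.
sample = [
    "pic1.jpg\n",
    "      123433 100%    1.15MB/s    0:00:00 (xfer#1, to-check=50/53)\n",
    "pic2.jpg\n",
    "      123433 100%  415.66kB/s    0:00:00 (xfer#2, to-check=49/53)\n",
]
done = list(completed_files(sample))
print(done)
```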

However, I'm going to make a guess that you really want to also parse the statistics. This will be more difficult. To see why, run this command (when you know some updates will occur):
rsync -avz --progress --stats -e ssh remoteuser@remotehost:/remote/dir /local/dir/ > run-output
Then look at the 'run-output' file. You should see some text like this:

32768 0% 0.00kB/s 0:00:00^M 5406720 2% 5.12MB/s 0:00:42^M 11206656 4% 5.32MB/s 0:00:39^M ... 229179392 100% 6.09MB/s 0:00:35 (1, 50.0% of 10)

When rsync outputs the statistics, it sends a carriage return to move the cursor back to the beginning of the line, which it then overwrites with the next update. It looks very nice when running interactively, but it's not as easy to deal with when you are just reading from stdout. Plus, you may have buffering issues: you may not get any of the stats until you get all of them. And if you are getting some of them, you have to be careful that you are dealing with them appropriately. (The easiest way is probably to split the received string on carriage returns and then parse each piece.)
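Splitting on carriage returns could look something like this. The chunk is a hand-built sample in the shape of the run-output text above, with the leading spaces that rsync normally prints restored as an assumption:

```python
# Hand-built sample: three updates for one file, separated by carriage returns.
chunk = ("       32768   0%    0.00kB/s    0:00:00\r"
         "     5406720   2%    5.12MB/s    0:00:42\r"
         "   229179392 100%    6.09MB/s    0:00:35 (1, 50.0% of 10)\n")

# Treat \r and \n the same way, and drop the empty pieces the split leaves behind.
# Leading whitespace is kept on purpose, in case later patterns rely on it.
updates = [part for part in chunk.replace('\n', '\r').split('\r') if part.strip()]

for update in updates:
    fields = update.split()
    print(fields[1])    # the percentage column
```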

So if you are still interested in dealing with the individual file stats, ask particular questions. As a start, here would be the regex for a single update, and for the final update:
base_re_str = r'\A\s+(\d+)\s+(\d+%)\s+([\d\.]+)[GMk]B/s\s+(\d+:\d{2}:\d{2})'
norm_update_re = re.compile(base_re_str + r'\Z')
final_update_re = re.compile(base_re_str + r'\s+\(.*\)\s*\Z')
(Note that my final update output looks different than your example; the final_update_re works for either case.)
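For what it's worth, here is a sketch of how the two patterns could be applied to individual updates. The classify helper is a made-up name, and the sample strings assume the leading whitespace that rsync normally prints (the patterns require it):

```python
import re

base_re_str = r'\A\s+(\d+)\s+(\d+%)\s+([\d\.]+)[GMk]B/s\s+(\d+:\d{2}:\d{2})'
norm_update_re = re.compile(base_re_str + r'\Z')
final_update_re = re.compile(base_re_str + r'\s+\(.*\)\s*\Z')

def classify(update):
    """Return ('final' or 'partial', groups) for a progress update, else None."""
    m = final_update_re.match(update)
    if m:
        return 'final', m.groups()
    m = norm_update_re.match(update)
    if m:
        return 'partial', m.groups()
    return None

partial = classify("     5406720   2%    5.12MB/s    0:00:42")
final = classify("      123433 100%    1.15MB/s    0:00:00 (xfer#1, to-check=50/53)\n")
other = classify("pic1.jpg\n")
print(partial)
print(final)
print(other)
```

Since norm_update_re ends in \Z with nothing allowed before it, a partial update needs its trailing carriage return or newline removed before matching.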

ASKER

Currently I think that simply being able to note when a file download is in progress may well be enough. This will give me at least some basic information on where rsync has got to. It would be nice to have more details than this (as you guessed), such as the percentages, but this looks like it may prove to be more trouble than it is worth for the moment. I will have a go with what you have suggested so far and report back.

thanks again

ASKER

I am still struggling to get even the simplest parts working (i.e. detecting when rsync has finished), let alone detecting when a file is being downloaded. Below is the simplest sample I have of my efforts. It starts OK, and rsync downloads the files, but it does not detect when rsync finishes:

import os
import re

start_of_the_end = re.compile('^Number of files:\s(\d+)\s*$')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_check_end.readline()
		
	if not ending:
		ending_mo = start_of_the_end.match(line)
		if ending_mo:
			ending = True
		continue
	
		print "Rsync Completed"

		rsync_check_end()
		rsync_out.close()

		break

Change your script to use "rsync_out" rather than "rsync_check_end". The 'rsync_out' is the STDOUT from the rsync process, and this is where the rsync process would be sending the output; "rsync_check_end" would be the STDERR, and it wouldn't have the line you are looking for.

However, beyond that, you have some problems with your program flow. I think you are missing an 'else'. After the "continue", you have the print and the break. Well, if you continue, you skip those actions. And if you fix the problem above and you set "ending" to True, then you would never get to that break, because it is behind the "not ending" check.

(Another alternative would be to just move the print, closes and break within the "if ending_mo" block; you know it's ending, take care of it immediately and get out of there.)

Note that I'm guessing that you really meant "rsync_check_end.close()", and not "rsync_check_end()".

One final comment: if you wanted to be persnickety, you might want to check for an EOF on the reads after you have seen that the process is "ending". That would be a better indication that rsync really was done, and you wouldn't be abandoning it before it finished outputting its information (not that you care about it). In this case it probably doesn't matter, but doing it now would be good practice for the next time you do something similar and it does matter.
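A minimal sketch of the EOF check, assuming a file-like object in text mode. The function name wait_for_end is made up, and io.StringIO stands in for the pipe so the snippet can run without rsync (it uses function-call print, unlike the Python 2 scripts in this thread):

```python
import io
import re

start_of_the_end = re.compile(r'^Number of files:\s(\d+)\s*$')

def wait_for_end(stream):
    """Read the stream until EOF; return the file count if the stats line was seen."""
    count = None
    while True:
        line = stream.readline()
        if not line:            # readline() returns an empty string only at EOF
            break
        m = start_of_the_end.match(line)
        if m:
            count = int(m.group(1))
    return count

# Stand-in for the pipe returned by os.popen3() (assumed sample output).
fake_out = io.StringIO(u"pic1.jpg\nNumber of files: 53\nTotal file size: 1 bytes\n")
result = wait_for_end(fake_out)
print(result)
```

The loop only stops when the other side closes its end of the pipe, so the process has finished writing by the time the function returns.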

import os
import re

start_of_the_end = re.compile('^Number of files:\s(\d+)\s*$')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_check_end.readline()
		
	if not ending:
		ending_mo = start_of_the_end.match(line)
		if ending_mo:
			ending = True
		continue
	else:
		print "Rsync Completed"

		rsync_check_end.close()
		rsync_out.close()

		break

ASKER

The process is still hanging, and I am going cross-eyed looking at it. I checked the expression with an online checker just to be sure, and I tried putting in print statements at various points in the code to work out where it is getting stuck. As far as I can tell it is getting stuck in a loop at the match statement:

ending_mo = start_of_the_end.match(line)

and never gets any further than that, even though I know there is a matching line in the rsync output.

import os
import re

start_of_the_end = re.compile('^Number of files:\s(\d+)\s*$')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_check_end.readline()

	if not ending:
		print "processing...."
		ending_mo = start_of_the_end.match(line)
		if ending_mo:
			ending = True
		continue
        
		print "Rsync Completed"
		rsync_check_end()
		rsync_out.close()
		
		break

os.popen3() returns stdin, stdout, and stderr file-like objects. I am not that familiar with rsync. Are you sure that the parsed lines should be read from stderr? It may be that nothing was sent to stderr (your rsync_check_end), in which case .readline() blocks until a line arrives.

In my opinion, you also should not call rsync_in.close().

Say, the result is returned via rsync_out. Your loop should look like:

for line in rsync_out:
    print 'processing line:', line
    m = start_of_the_end.match(line)
    if m:
        print 'extracted info:', m.group(1)
        break   # break the loop

Or use rsync_check_end in the for loop instead if rsync writes the message to stderr.

If you have a newer Python, consider the subprocess module instead of os.popen3(). It is more flexible with respect to synchronization with another process. See http://docs.python.org/library/subprocess.html#subprocess-replacements
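A hedged sketch of what the subprocess version might look like. The child command here is only a stand-in (a tiny Python one-liner that prints two lines) so that the snippet runs without rsync; in the real script cmd would be the rsync argument list:

```python
import subprocess
import sys

# Stand-in command (an assumption for this sketch), not the real rsync call.
cmd = [sys.executable, '-c', 'print("pic1.jpg"); print("Number of files: 53")']

proc = subprocess.Popen(cmd,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT,    # merge stderr into stdout
                        universal_newlines=True)     # text mode with \n line ends

lines = []
# readline() is called explicitly; in Python 2, iterating over the file object
# itself goes through a read-ahead buffer, which can delay the lines.
for line in iter(proc.stdout.readline, ''):
    lines.append(line.rstrip('\n'))

proc.stdout.close()
returncode = proc.wait()
print(lines)
```

Passing the command as a list avoids going through the shell. Whether lines arrive promptly still depends on how the child process buffers its own output.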

ASKER

Unfortunately, because of other factors in this project I am limited to Python 2.5. Is subprocess available in versions under 2.6?

Your version of the loop does seem to work though, and I am getting the correct response now.

ASKER

My code now completes correctly, but unfortunately I have not got quite the response I was looking for. I am getting all the lines output at the end of the process, where I really wanted to have the file name lines output as rsync is processing (as when I run rsync from the command line), so that I have some feedback as to what rsync is doing. My output looks like this:

Starting File Download...
processing line: receiving file list ...

processing line: 53 files to consider

processing line: 1.jpg

123433 100% 704.91kB/s 0:00:00 (xfer#1, to-check=50/53)

processing line: processing line: 2.jpg

123433 100% 461.84kB/s 0:00:00 (xfer#2, to-check=49/53)

processing line: processing line: 3.jpg

123433 100% 289.06kB/s 0:00:00 (xfer#3, to-check=48/53)

processing line: processing line: 4.jpg

123433 100% 209.63kB/s 0:00:00 (xfer#4, to-check=47/53)

processing line: processing line: 5.jpg

8158529 100% 773.08kB/s 0:00:10 (xfer#5, to-check=46/53)

processing line: processing line: 6.jpg

132793 100% 414.32kB/s 0:00:00 (xfer#6, to-check=44/53)

processing line: processing line: 7.jpg

132793 100% 275.92kB/s 0:00:00 (xfer#7, to-check=43/53)

processing line: processing line: 8.jpg

132793 100% 202.31kB/s 0:00:00 (xfer#8, to-check=42/53)

(.....continues......)

processing line: Number of files: 53

extracted info: 53

The information coming back is correct, but it is all coming at once, not line by line as rsync completes each file download.

The subprocess module has been part of the standard library since Python 2.4, so it is available to you. os.popen3() has been deprecated since Python 2.6; the usual approach when deprecating something in Python is to introduce the replacement earlier. It is true that os.popen3() (and os.system() and that kind of functionality) may look easier to use. I thought so too. However, it is only a matter of getting used to a slightly different interface. I found the subprocess replacements clear and easy to use.

On the other hand, there is probably no need to change existing code when it works. Still, you can try writing similar code with the subprocess module and see if it also works.

Some more notes: You may consider also os.popen4() that merges stdout and stderr output together.

I tried some things with subprocess (but on Windows). It seems that it returns characters instead of lines (as if the file-like object was opened in binary mode instead of text mode). On the other hand, it may handle buffering of the information more correctly.

Also, if I am not wrong, at least earlier versions of Windows implemented pipes as files when the pipe was written in a command interpreted by command.com. This would cause the behaviour you described. I do not know what OS you use.

Anyway, it is likely, that the newer Windows implement "the pipes" between utilities better. It is also likely that the subprocess module may be better with respect to finer buffering of the piped information. On the other hand, the binary mode (if my observations are valid also for your case) requires some extra step to put the characters back to the lines.
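The extra step of putting characters back into lines could be sketched like this, assuming a text-mode stream read one character at a time. The name iter_updates is made up, and io.StringIO stands in for the pipe; carriage returns are treated as line ends too, so in-place progress updates come out as separate pieces:

```python
import io

def iter_updates(stream):
    """Yield pieces of text from the stream, ending each at a \\r or \\n."""
    buf = []
    while True:
        ch = stream.read(1)
        if not ch:                  # an empty result means EOF
            break
        if ch in '\r\n':
            if buf:                 # skip empty pieces (e.g. the \n after a \r)
                yield ''.join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:
        yield ''.join(buf)          # whatever was left without a terminator

# Stand-in for the pipe (assumed sample data).
demo = io.StringIO(u"pic1.jpg\n"
                   u"       32768   0%    0.00kB/s    0:00:00\r"
                   u"      123433 100%    1.15MB/s    0:00:00 (xfer#1, to-check=50/53)\n")
pieces = list(iter_updates(demo))
print(pieces)
```

Reading one character per call is slow for large outputs, but it returns each piece as soon as its terminator arrives, which is the point when showing progress.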


ASKER

Just to confirm my setup:

I am running on Ubuntu Linux 9.10, using Python 2.5.

This may be a step backward, but when I run the script below (with the 'rsync' command modified for my environment), I am getting a line outputted immediately for each file. (I have to apologize: I suggested that you use 'rsync_out' over 'rsync_check_end', but then I didn't modify that part in the script. Sorry about that.)

My running host is python 2.5.2 on FreeBSD 7.2; a different OS, but operationally it should not be much different as they are both Unix-ish.

import os
import re

start_of_the_end = re.compile(r'^Number of files:\s(\d+)\s*$')
file_re = re.compile(r'^(\S+)')

cmd = 'rsync -r -v --progress --stats -e ssh remoteuser@remotehost:/remote/dir  /local/dir/'

print "Starting File Download..."

ending = False

rsync_in, rsync_out, rsync_check_end = os.popen3(cmd)

rsync_in.close()

while 1:
	line = rsync_out.readline()
		
	if not ending:
		ending_mo = start_of_the_end.match(line)
		file_mo = file_re.match(line)
		if ending_mo:
			ending = True
		elif file_mo:
			print "Got %s" % file_mo.groups()[0]
	else:
		print "Rsync Completed"

		rsync_check_end.close()
		rsync_out.close()

		break