phoffric

How to find the largest value in a string with Python 3.7

I have a text string like this:
"stuff<offset>1234</offset><length>78</length>stuff <offset>1000134</offset><length>5678</length>stuff...<offset>11234</offset><length>5678</length>stuff"

My goal is to find the largest offset value and the corresponding length value.

I know I can write a loop that searches for each "<offset>" and extracts the value. I was wondering whether Python 3.7 offers a non-loop approach. (I could use existing XML parsing code, but this seems simple enough to handle with just the text string.)

Thanks,
Paul
aikimark

import re  #regular expression library

#sample text
a = r"stuff<offset>1234</offset><length>78</length>stuff <offset>1000134</offset><length>5678</length>stuff...<offset>11234</offset><length>5678</length>stuff"

#instantiate a regex object with the necessary pattern
r = re.compile(r"<offset>([^<]+)</offset><length>([^<]+)</length>")

matches = r.findall(a)   #instantiate our matches variable

print(max(matches, key = lambda x: x[0]))  # print result using max() function

produces the following output:
('1234', '78')
You can also write the print() line like this and skip the matches instantiation:
print(max(r.findall(a), key = lambda x: x[0]))

phoffric

ASKER

I guess I should have specified that the offset value represents an integer, so a string-to-int conversion has to occur first.
That makes the largest offset 1000134.

matches      [('1234', '78'), ('1000134', '5678'), ('11234', '5678')]      list
    0            ('1234', '78')           tuple
    1            ('1000134', '5678')      tuple
    2            ('11234', '5678')        tuple
    __len__      3                        int
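
A minimal sketch of the numeric comparison described above, assuming the same sample string and pattern as the earlier reply. Converting the captured offset with int() inside the key function makes max() rank the tuples by number rather than by text. The names text, pattern, and largest are illustrative only and are not taken from the thread.

```python
import re

# Sample text from the question
text = ("stuff<offset>1234</offset><length>78</length>stuff "
        "<offset>1000134</offset><length>5678</length>stuff..."
        "<offset>11234</offset><length>5678</length>stuff")

pattern = re.compile(r"<offset>([^<]+)</offset><length>([^<]+)</length>")

# findall() returns a list of (offset, length) tuples of strings
matches = pattern.findall(text)

# Rank the tuples by the integer value of the offset, not by its text
largest = max(matches, key=lambda pair: int(pair[0]))
print(largest)  # ('1000134', '5678')
```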
ASKER CERTIFIED SOLUTION
aikimark
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Great! Thanks. I am now looking up references for the "([^<]+)" and your lambda usage. I understand what the regex is doing; I just want to better understand the syntax/semantics. I am familiar with lambdas in C++, but I do not understand your max expression.

If you can explain those two points or provide good links, I would appreciate that. I'll be looking now.
Thanks again!
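
For readers with the same question about max() and the lambda, here is a small sketch (not the hidden member-only answer). It assumes the list of tuples shown in the debugger output earlier in the thread. The key argument is a function that max() calls on each item to get the value to rank by; a lambda is simply an unnamed function, so the named helper offset_as_int (a hypothetical name) does the same job.

```python
pairs = [('1234', '78'), ('1000134', '5678'), ('11234', '5678')]

def offset_as_int(pair):
    """Return the value that max() should rank this tuple by."""
    return int(pair[0])

# These two calls are equivalent; the lambda is an inline, unnamed function.
by_named = max(pairs, key=offset_as_int)
by_lambda = max(pairs, key=lambda pair: int(pair[0]))

# Without a key, the tuples are compared element by element, and the
# strings inside them are compared lexicographically, so '1234' ranks
# above '1000134' because '2' > '0' at the second character.
by_text = max(pairs)

print(by_named)   # ('1000134', '5678')
print(by_lambda)  # ('1000134', '5678')
print(by_text)    # ('1234', '78')
```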
Thanks for the very nice python solution.
SOLUTION
SOLUTION
Thank you for those explanations. Very helpful.

I believe that your ()'s identify the groupings that you return from the regex search. But I have been searching in various places for a breakdown of this regex: ([^<]+). I cannot seem to find good info on the <.
Most information is here:
https://docs.python.org/3/library/re.html

That link gives me info for +, ^, []:
+
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

[]
Used to indicate a set of characters. In a set:
...
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5'
Can you please provide me with some documentation that explains the < and its usage in this construct?
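
A short sketch that may help with the [^<]+ question, using only the standard re module. The "<" is not a regex metacharacter; inside the brackets it is an ordinary character, so [^<]+ reads as "one or more characters, none of which is '<'". The sample strings below are made up for illustration.

```python
import re

# The capture group stops at the "<" that begins the closing tag.
m = re.search(r"<offset>([^<]+)</offset>", "stuff<offset>1234</offset>stuff")
captured = m.group(1)
print(captured)  # 1234

# Used on its own, the same character class splits the text at every "<".
pieces = re.findall(r"[^<]+", "1234</offset>")
print(pieces)  # ['1234', '/offset>']
```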
SOLUTION
Thanks for the lucid explanation! And here I thought the < was another regex symbol. Thanks again.
Since the max offset and the length added together gives me the file length, I did something like this:

size = sum(map(int, max(re.findall(pat1, string), key=lambda m: int(m[0]))))
I could get rid of a bunch of code with this sum/map combo. Thanks.
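
The sum/map idea can be broken into steps, sketched here under the assumption that the pattern is the two-group pattern from earlier in the thread. Because findall() returns (offset, length) tuples for that pattern, the largest tuple has to be picked first; map(int, ...) then converts both strings and sum() adds them. The variable names are illustrative.

```python
import re

text = ("stuff<offset>1234</offset><length>78</length>stuff "
        "<offset>1000134</offset><length>5678</length>stuff..."
        "<offset>11234</offset><length>5678</length>stuff")

pattern = re.compile(r"<offset>([^<]+)</offset><length>([^<]+)</length>")

# Step 1: pick the tuple with the numerically largest offset
largest = max(pattern.findall(text), key=lambda pair: int(pair[0]))

# Step 2: convert both strings to int and add them (offset + length)
size = sum(map(int, largest))
print(size)  # 1005812
```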

Then I decided to constrain my search to within <toc> and </toc>. So, pat1 = "<toc>.*</toc>", and pat2 was the original pattern. I then did something like this:
size = sum(map(int, max(re.findall(pat2, re.findall(pat1, the_string)[0]), key=lambda m: int(m[0]))))

Anyway, that is my recollection of what I did, and it worked on edge cases. So, thanks again.
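
A hedged sketch of the <toc> restriction, not the code actually used in the project. It assumes a single <toc>...</toc> section, and the sample document doc is invented for illustration. Two choices differ from the recollection above and are assumptions: re.search() with a non-greedy .*? stops at the first </toc>, and re.DOTALL lets "." match newlines in case the section spans several lines.

```python
import re

# Hypothetical document: one offset sits outside the <toc> section
doc = ("<offset>9999999</offset><length>1</length>"
       "<toc><offset>1234</offset><length>78</length>"
       "<offset>1000134</offset><length>5678</length></toc>trailer")

toc_re = re.compile(r"<toc>(.*?)</toc>", re.DOTALL)
pair_re = re.compile(r"<offset>([^<]+)</offset><length>([^<]+)</length>")

toc = toc_re.search(doc)
if toc is None:
    raise ValueError("no <toc> section found")

# Only the text between <toc> and </toc> is searched for offset/length pairs
pairs = pair_re.findall(toc.group(1))
largest = max(pairs, key=lambda pair: int(pair[0]))
size = sum(map(int, largest))
print(size)  # 1005812
```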
I'll keep at it. I will have lots of time soon since my contract is ending in two weeks.

If this approach of having two findall calls in one expression can be significantly improved, let me know, and I can ask another question about optimization. My intuition (at least if this were C++) is that we are not really creating strings, just creating metadata that describes the string section. But I could be very wrong since I am new to Python.
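
On the question of copies: in CPython, findall() and Match.group() return new str objects rather than views into the original string, so the nested findall approach does build an intermediate substring. One possible way to avoid that, sketched below under the assumption of a single <toc> section and an invented sample document, is to pass the start and end positions of the section to a compiled pattern's finditer(), which accepts optional pos and endpos arguments.

```python
import re

# Hypothetical document: one offset sits outside the <toc> section
doc = ("<offset>9999999</offset><length>1</length>"
       "<toc><offset>1234</offset><length>78</length>"
       "<offset>1000134</offset><length>5678</length></toc>trailer")

toc_re = re.compile(r"<toc>(.*?)</toc>", re.DOTALL)
pair_re = re.compile(r"<offset>([^<]+)</offset><length>([^<]+)</length>")

toc = toc_re.search(doc)
if toc is None:
    raise ValueError("no <toc> section found")

# span(1) gives the start/end indexes of the section body, so the second
# search can run on the original string without slicing out a substring.
start, end = toc.span(1)
pairs = ((int(m.group(1)), int(m.group(2)))
         for m in pair_re.finditer(doc, start, end))

# Tuples of ints compare by their first element (the offset) first
offset, length = max(pairs)
size = offset + length
print(size)  # 1005812
```

Whether this is measurably faster depends on the size of the input; for small strings the difference is unlikely to matter.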