phoffric
asked on
How to find the largest value in a string with Python3.7
I have a text string like this:
"stuff<offset>1234</offset><length>78</length>stuff <offset>1000134</offset><length>5678</length>stuff...<offset>11234</offset><length>5678</length>stuff"
My goal is to find the largest offset value and the corresponding length value. I know I can write a loop searching for each "<offset>" and extract the value. I was wondering if in Python 3.7 there is a non-loop approach. (I can use existing XML parsing code, but this seems simple enough to just use the text string.)
Thanks,
Paul
You can also write the print() line like this and skip the matches instantiation:
print(max(r.findall(a), key = lambda x: x[0]))
ASKER
I guess I should have specified that the offset value represents an integer, so a string-to-int conversion has to occur first.
That makes the largest offset 1000134.
- matches [('1234', '78'), ('1000134', '5678'), ('11234', '5678')] list
+ 0 ('1234', '78') tuple
+ 1 ('1000134', '5678') tuple
+ 2 ('11234', '5678') tuple
__len__ 3 int
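A minimal sketch of the integer comparison described above. The pattern is an assumption (the original answer is not visible here); it simply captures the text of each offset/length pair. The names text, pair_pattern and largest are made up for illustration.

```python
import re

text = ("stuff<offset>1234</offset><length>78</length>stuff "
        "<offset>1000134</offset><length>5678</length>stuff..."
        "<offset>11234</offset><length>5678</length>stuff")

# Assumed pattern: one capture group per tag, so findall() returns
# a list of (offset, length) tuples of strings.
pair_pattern = re.compile(r"<offset>([^<]+)</offset><length>([^<]+)</length>")
matches = pair_pattern.findall(text)

# key=... tells max() what to compare for each tuple. Comparing x[0] as a
# string is lexicographic, so '1234' beats '1000134'.
largest_as_text = max(matches, key=lambda x: x[0])

# Converting the offset to int inside the key gives the numeric maximum.
offset, length = max(matches, key=lambda x: int(x[0]))
largest = (int(offset), int(length))
print(largest_as_text, largest)
```

The lambda here is just a small unnamed function that max() calls once per tuple; max() returns the original tuple whose key value is largest, not the key itself.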
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Great! Thanks. I am now looking up references for the "([^<]+)" and your lambda usage. I understand what the regex is doing; I just want to better understand the syntax/semantics. I am familiar with lambdas in C++, but I do not understand your max expression.
If you can explain those two points or provide good links, I would appreciate that. I'll be looking now.
Thanks again!
ASKER
Thanks for the very nice python solution.
SOLUTION
SOLUTION
ASKER
Thank you for those explanations. Very helpful.
I believe that your ()'s identify the groupings that you return from the regex search. But I have been searching in various places for a breakdown of this regex: ([^<]+). I cannot seem to find good info on the <.
Most information is here:
https://docs.python.org/3/library/re.html
That link gives me info for +, ^, []:
+
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
[]
Used to indicate a set of characters. In a set:
...
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5'.
Can you please provide me with some documentation to explain the < and its usage in the construct?
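A small sketch of how ([^<]+) reads, assuming the tags around it are matched as literal text. In this construct the < is an ordinary character: inside [^...] it is simply the one character being excluded. The sample string and variable names are made up for illustration.

```python
import re

sample = "<offset>1234</offset><length>78</length>"

# [^<]+ means "one or more characters that are not '<'",
# so the match stops right before the next tag opens.
run = re.search(r"[^<]+", "1234</offset>").group()

# With the literal tags around it, the parentheses capture the tag's text.
value = re.search(r"<offset>([^<]+)</offset>", sample).group(1)
print(run, value)
```

Both run and value come out as the digits between the tags, because the set stops consuming at the first '<' it meets.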
SOLUTION
ASKER
Thanks for the lucid explanation! And here I thought the < was another regex symbol. Thanks again.
ASKER
Since the max offset and the length added together gives me the file length, I did something like this:
size = sum( map(int, re.findall(pat1, string) ) )
I could get rid of a bunch of code with this sum/map combo. Thanks.
Then I decided to constrain my search to within <toc> and </toc>. So pat1 = "<toc>.*</toc>", and pat2 was the original pattern. I then did something like this:
size = sum( map(int, re.findall(pat2, re.findall(pat1, the_string)[0] ) ) )
Anyway, that is my recollection of what I did, and it worked on edge cases. So, thanks again. I'll keep at it. I will have lots of time soon since my contract is ending in two weeks.
If this approach of having two findall's in one expression can be significantly improved, let me know, and I can ask another question about optimization. My intuition (at least if this were C++) is that we are not really creating strings, just creating metadata that describes the string section. But I could be very wrong since I am new to Python.
produces the following output: ('1234', '78')
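For the <toc>-constrained search described above, here is one possible sketch, not necessarily what was actually used. It assumes the values are plain digits (hence \d+), that exactly one <toc>...</toc> section exists, and that the file size is the largest offset plus its length. The sample string and all names are made up for illustration.

```python
import re

the_string = ("header<offset>999999999</offset><length>1</length>"
              "<toc><offset>1234</offset><length>78</length>"
              "<offset>1000134</offset><length>5678</length>"
              "<offset>11234</offset><length>5678</length></toc>trailer")

pair_pattern = re.compile(r"<offset>(\d+)</offset><length>(\d+)</length>")

# Find the section once. DOTALL lets "." cross newlines, and the
# non-greedy .*? stops at the first </toc>.
toc = re.search(r"<toc>(.*?)</toc>", the_string, re.DOTALL)
start, end = toc.span(1)

# A compiled pattern's finditer() accepts pos/endpos, which limits the
# scan to the section without slicing out a copy of it first.
pairs = [(int(m.group(1)), int(m.group(2)))
         for m in pair_pattern.finditer(the_string, start, end)]

# Tuples compare element by element, so max() picks the largest offset.
offset, length = max(pairs)
size = offset + length
print(size)
```

On the intuition about metadata: in CPython, findall() and slicing return new string objects rather than views into the original, so the nested findall version does copy the <toc> section. Passing start and end positions as above avoids that one copy, although for small inputs the difference is unlikely to matter.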