aaron900
asked on
Regex - HTML input with unescaped HTML in it - match the true value
Hi, ran across something I'm not quite sure how to capture...
Consider the following input:
<input value="Bad" type=hidden name=form_hf><INPUT value="Good" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>
I can capture the "Good" in form_act_message's value with:
<INPUT value="([^<]+)" type=hidden name=form_act_message>
- works fine.
But then I have some that contain unescaped line breaks in the value of the field - consider:
<input value="Bad" type=hidden name=form_hf><INPUT value="Good<br>Bad" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>
I don't believe it is truly valid, but unfortunately I don't have any control over it. I'm still new to using regular expressions for anything beyond their simplest uses, and I haven't been able to write one pattern that captures both the plain "Good" value and the "Good<br>Bad" value without capturing more than I need (I only need the value of form_act_message's value attribute).
Any help would be greatly appreciated! Thanks!
ASKER
Thanks, wdosanjos - I guess I should have clarified - this is actually a string of probably one hundred HTML input tags - even in RegexBuddy, which has pretty good performance, it takes probably 10-15 seconds to parse through the actual document.
Plus, it seems to grab some of the previous input tag (the one before the one I really need) into a match :-(
Can you provide the string as a file attachment to this question?
Here is an alternative approach without Regex:
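// Sample input: the value we want contains an unescaped <br>.
string v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";
// Index of the last character of the value (the character just before its closing quote).
int end = v.IndexOf("\" type=hidden name=form_act_message>") - 1;
// Search backwards from there for the opening quote; the value starts right after it.
int start = v.LastIndexOf('"', end, end - 1) + 1;
string value = v.Substring(start, end - start + 1);
// value = "Good<br>Bad"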
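The same index-based idea could be wrapped in a small helper that also copes with the field being absent. This is only a sketch: the names `FieldExtractor` and `ExtractValueBefore` are made up for illustration, and it assumes the value itself never contains a double quote.

```csharp
using System;

public static class FieldExtractor
{
    // Hypothetical helper: returns the quoted value that sits immediately
    // before `marker`, or null when the marker (or an opening quote) is missing.
    public static string ExtractValueBefore(string html, string marker)
    {
        // The marker begins with the closing quote of the value we want.
        int closeQuote = html.IndexOf(marker, StringComparison.Ordinal);
        if (closeQuote <= 0) return null;

        // Look backwards, starting just before the closing quote, for the opening quote.
        int openQuote = html.LastIndexOf('"', closeQuote - 1);
        if (openQuote < 0) return null;

        // Everything strictly between the two quotes is the value.
        return html.Substring(openQuote + 1, closeQuote - openQuote - 1);
    }

    public static void Main()
    {
        string v = "<input value=\"Bad\" type=hidden name=form_hf>"
                 + "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>"
                 + "<input value=\"Indifferent\" type=hidden name=form_uf>";

        // Prints: Good<br>Bad
        Console.WriteLine(ExtractValueBefore(v, "\" type=hidden name=form_act_message>"));
    }
}
```

Because it only calls `IndexOf` and `LastIndexOf`, this avoids regex backtracking altogether, which may matter on a string of a hundred or so input tags.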
ASKER
See attached to this comment. Thanks!
tmpBF-FormFieldMockup.txt
ASKER
Wait, sorry - that was one version - the version that I'm getting passed is different... let me post that.
Just my 2 cents. This captures all the text after all the value attributes, if that's what you're after. It includes the " characters as well - not sure if that's what you want. The text extracted is found in the capturing group.
INPUT.*value=("[^"]*"|[^>\s]*)
This next one captures all value tags that have <br> in them and are surrounded by "
INPUT.*value=("[^"]*<br>[^"]*")
which in your test file is the only one that you've marked as needing to be captured.
Hope that helps.
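Since `.*` is greedy and can run across tag boundaries, a pattern built on it may match text belonging to neighbouring tags. A tighter alternative is sketched below. It is not the accepted solution from this thread, just an illustration, and it assumes (as in the sample input) that the value is always double-quoted, never contains a double quote itself, and is immediately followed by ` type=hidden name=form_act_message>`. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ActMessageRegex
{
    // [^"]* cannot step over a double quote, so the capture group stays inside
    // a single value attribute even when that value contains <br> or other tags.
    private static readonly Regex Pattern = new Regex(
        @"<input value=""([^""]*)"" type=hidden name=form_act_message>",
        RegexOptions.IgnoreCase);

    // Returns the captured value, or null when the field is not present.
    public static string Capture(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string plain = "<input value=\"Bad\" type=hidden name=form_hf>"
                     + "<INPUT value=\"Good\" type=hidden name=form_act_message>"
                     + "<input value=\"Indifferent\" type=hidden name=form_uf>";
        string withBr = "<input value=\"Bad\" type=hidden name=form_hf>"
                      + "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>"
                      + "<input value=\"Indifferent\" type=hidden name=form_uf>";

        Console.WriteLine(Capture(plain));   // Good
        Console.WriteLine(Capture(withBr));  // Good<br>Bad
    }
}
```

Anchoring on the literal text after the closing quote is what keeps the match from drifting into the previous or next input tag, and the negated character class gives the engine very little to backtrack over.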
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
As always, your thorough understanding of regular expressions blows me away, and teaches me a new lesson each time. Thanks!!!
NP. Let me know if I need to break it down. Looking back at it, I can see how it could appear a bit daunting = )