aaron900

Regex - HTML input with unescaped HTML in it - match the true value

Hi, ran across something I'm not quite sure how to capture...

Consider the following input:
<input value="Bad" type=hidden name=form_hf><INPUT value="Good" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>


I can capture the "Good" in form_act_message's value with:
<INPUT value="([^<]+)" type=hidden name=form_act_message>

- works fine.

But then I have some that contain unescaped line breaks in the value of the field - consider:
<input value="Bad" type=hidden name=form_hf><INPUT value="Good<br>Bad" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>


I don't believe it is truly valid HTML, but unfortunately I don't have any control over it. I'm still new to using regular expressions beyond their simplest uses, and I can't find a pattern that captures the plain "Good" value and also the "Good<br>Bad" value without capturing more than I need (I only need the contents of form_act_message's value attribute).

Any help would be greatly appreciated! Thanks!
wdosanjos

Please try the following.  It returns "Good<br>Bad".

var v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";
// The lookbehind anchors the match just after <INPUT value=" and the lookahead ends it before the attributes that follow the value.
var value = Regex.Match(v, "(?<=\\<INPUT value=\").+(?=\" type=hidden name=form_act_message\\>)").Value;

// value = "Good<br>Bad"

aaron900

ASKER

Thanks, wdosanjos - I guess I should have clarified - this is actually a string of probably one hundred HTML input tags - even in RegexBuddy, which has pretty good performance, it takes probably 10-15 seconds to parse through the actual document.

Plus, it seems to grab some of the previous input tag (the one before the one I really need) into a match :-(
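A likely cause of the over-capture is the greedy `.+`: if an earlier tag also starts with `<INPUT value="`, the match can begin there and run all the way to the form_act_message tag. A minimal sketch of a tighter pattern follows. It assumes the value never contains a literal double quote (true of the samples in this thread, unverified for the real document) and that the attributes always appear in this exact order. The variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input from the question (the real document is assumed to have the same shape).
string html = "<input value=\"Bad\" type=hidden name=form_hf>"
            + "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>"
            + "<input value=\"Indifferent\" type=hidden name=form_uf>";

// [^"]* stops at the value's closing quote, so the match cannot spill into a neighbouring tag.
// IgnoreCase is an assumption that covers both <input ...> and <INPUT ...>.
var actMessage = new Regex(
    "<input value=\"([^\"]*)\" type=hidden name=form_act_message>",
    RegexOptions.IgnoreCase);

Match m = actMessage.Match(html);
string value = m.Success ? m.Groups[1].Value : "";

Console.WriteLine(value); // Good<br>Bad
```

Because the negated character class can only consume characters inside one quoted value, it may also cut down on backtracking compared with `.+`, though that would need measuring against the real document.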

Can you provide the string as a file attachment to this question?

Here is an alternative approach without Regex:

string v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";

// Locate the attributes that follow the target value; the value's last character sits just before that closing quote.
int end = v.IndexOf("\" type=hidden name=form_act_message>") - 1;
// Search backward from there for the value's opening quote.
int start = v.LastIndexOf('"', end) + 1;
string value = v.Substring(start, end - start + 1);

// value = "Good<br>Bad"
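The index-based approach throws if the marker text is missing, because `IndexOf` returns -1 in that case. A hedged sketch of a guarded variant is below. `TryExtractValueBefore` is a hypothetical helper name, and it still assumes the value itself contains no double quote.

```csharp
using System;

string html = "<input value=\"Bad\" type=hidden name=form_hf>"
            + "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>"
            + "<input value=\"Indifferent\" type=hidden name=form_uf>";
const string marker = "\" type=hidden name=form_act_message>";

// Hypothetical helper: returns false instead of throwing when the marker or quote is absent.
bool TryExtractValueBefore(string source, string tail, out string result)
{
    result = "";
    int closingQuote = source.IndexOf(tail, StringComparison.Ordinal);
    if (closingQuote < 1) return false;              // marker not found (or nothing precedes it)
    int openingQuote = source.LastIndexOf('"', closingQuote - 1);
    if (openingQuote < 0) return false;              // no opening quote before the value
    result = source.Substring(openingQuote + 1, closingQuote - openingQuote - 1);
    return true;
}

if (TryExtractValueBefore(html, marker, out string found))
{
    Console.WriteLine(found); // Good<br>Bad
}
```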

See attached to this comment. Thanks!
tmpBF-FormFieldMockup.txt
Wait, sorry - that was one version - the version that I'm getting passed is different... let me post that.

Just my 2 cents. This captures the text that follows each value= attribute, if that's what you're after. It includes the " characters as well - not sure if that's what you want. The extracted text is in the capturing group.

INPUT.*value=("[^"]*"|[^>\s]*)

This next one captures all value attributes that contain <br> and are surrounded by double quotes:

INPUT.*value=("[^"]*<br>[^"]*")

which in your test file is the only one that you've marked as needing to be captured.

Hope that helps.
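One caveat with the patterns above: `.*` is greedy, so on a single-line input like the sample, `INPUT.*value=` skips ahead to the last value= on the line rather than stopping at each tag. A minimal sketch of a per-tag alternative is below. It walks every input tag once and collects name/value pairs, assuming quoted values without embedded double quotes and unquoted names, as in the samples. The variable names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string html = "<input value=\"Bad\" type=hidden name=form_hf>"
            + "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>"
            + "<input value=\"Indifferent\" type=hidden name=form_uf>";

// One match per tag: group 1 is the quoted value, group 2 is the unquoted name.
// [^"]* and [^>]*? keep each match inside a single tag.
var tag = new Regex(
    "<input\\s+value=\"([^\"]*)\"[^>]*?\\sname=([^\\s>]+)",
    RegexOptions.IgnoreCase);

var fields = new Dictionary<string, string>();
foreach (Match m in tag.Matches(html))
{
    fields[m.Groups[2].Value] = m.Groups[1].Value;
}

Console.WriteLine(fields["form_act_message"]); // Good<br>Bad
```

Collecting every field in one pass means a document with a hundred tags is scanned once, and any field can then be looked up by name.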
ASKER CERTIFIED SOLUTION
kaufmed

This solution is only available to members.
As always, your thorough understanding of regular expressions blows me away, and teaches me a new lesson each time. Thanks!!!
NP. Let me know if I need to break it down. Looking back at it, I can see how it could appear a bit daunting  = )