Regex - HTML input with unescaped HTML in it - match the true value

Hi, ran across something I'm not quite sure how to capture...

Consider the following input:

<input value="Bad" type=hidden name=form_hf><INPUT value="Good" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>

I can capture the "Good" in form_act_message's value with:

<INPUT value="([^<]+)" type=hidden name=form_act_message>

- works fine.

But then I have some that contain unescaped line breaks in the value of the field - consider:

<input value="Bad" type=hidden name=form_hf><INPUT value="Good<br>Bad" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>

I don't believe it is truly valid HTML, but unfortunately I don't have any control over it. I'm still new to using regular expressions beyond their simplest uses, and I can't find a pattern that captures both the plain "Good" value and the "Good<br>Bad" value without capturing more than I need (I only need the contents of form_act_message's value attribute).

Any help would be greatly appreciated! Thanks!

wdosanjos

Please try the following. It returns "Good<br>Bad".

var v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";
var value = Regex.Match(v, "(?<=\\<INPUT value=\").+(?=\" type=hidden name=form_act_message\\>)").Value;

// value = "Good<br>Bad"

aaron900

ASKER

Thanks, wdosanjos. I guess I should have clarified: the real input is a single string of probably one hundred HTML input tags. Even in RegexBuddy, which has pretty good performance, it takes probably 10-15 seconds to parse through the actual document.

Plus, it seems to grab some of the previous input tag (the one before the one I really need) into a match :-(
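
A note on the likely cause: the greedy `.+` is anchored only by the lookbehind, so when an earlier tag also starts with `<INPUT value="`, the match can begin there and run forward to the form_act_message tag. The repeated run-to-the-end-and-backtrack may also contribute to the slow parse on a long string. Below is a minimal sketch of a tighter pattern. It assumes that values never contain a double quote and that the attributes always appear in the order value, type, name, as in the sample. The variable names and the upper-cased first tag are illustrative, not taken from the real document.

```csharp
using System;
using System.Text.RegularExpressions;

// The question's sample, but with the first tag upper-cased so that more than
// one tag starts with "<INPUT value=" (an assumption about the real document).
string html =
    "<INPUT value=\"Bad\" type=hidden name=form_hf>" +
    "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>" +
    "<input value=\"Indifferent\" type=hidden name=form_uf>";

// Greedy version: .+ may start inside an earlier tag and run forward to the target.
string greedy = Regex.Match(
    html,
    "(?<=<INPUT value=\").+(?=\" type=hidden name=form_act_message>)").Value;

// Tighter version: [^"]* cannot cross a double quote, so the capture stays
// inside a single value attribute. Assumes values never contain a double quote.
Match m = Regex.Match(
    html,
    "<input value=\"([^\"]*)\" type=hidden name=form_act_message>",
    RegexOptions.IgnoreCase);
string value = m.Success ? m.Groups[1].Value : string.Empty;

Console.WriteLine(greedy); // also contains the end of the form_hf tag
Console.WriteLine(value);  // Good<br>Bad
```

Because `[^"]*` allows `<` and `>`, the unescaped `<br>` inside the value is captured without any special handling.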

wdosanjos

Can you provide the string as a file attachment to this question?

wdosanjos

Here is an alternative approach without Regex:

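string v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";

// end: index of the last character of the value (just before its closing quote).
// Note: IndexOf returns -1 if the tag is missing, which is not handled here.
int end = v.IndexOf("\" type=hidden name=form_act_message>") - 1;
// start: index just after the opening quote, found by searching backward from end.
int start = v.LastIndexOf('"', end) + 1;
string value = v.Substring(start, end - start + 1);

// value = "Good<br>Bad"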

aaron900

ASKER

See the file attached to this comment. Thanks!
tmpBF-FormFieldMockup.txt

aaron900

ASKER

Wait, sorry - that was one version - the version that I'm getting passed is different... let me post that.

aaron900

ASKER

tmpBF-FormFieldMockup.txt

crysallus

Just my 2 cents. This captures the text that follows the value= attributes, if that's what you're after. The captured text includes the surrounding " characters as well - not sure if that's what you want. The extracted text is in the capturing group.

INPUT.*value=("[^"]*"|[^>\s]*)

This next one captures all value attributes that contain <br> and are surrounded by double quotes:

INPUT.*value=("[^"]*<br>[^"]*")

which in your test file is the only one that you've marked as needing to be captured.

Hope that helps.
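
One caveat with the patterns above: `.*` is greedy, so if all the tags sit on one line, `INPUT.*value=` runs to the last `value=` on that line and only that value is captured. If the goal is to walk every field, a rough sketch along the following lines may help. It assumes the attribute order value, type, name from the sample, quoted values without embedded double quotes, and unquoted names. The `fields` dictionary and `inputTag` name are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string html =
    "<input value=\"Bad\" type=hidden name=form_hf>" +
    "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>" +
    "<input value=\"Indifferent\" type=hidden name=form_uf>";

// Group 1: the quoted value (anything except a double quote).
// Group 2: the unquoted name (anything up to whitespace or '>').
var inputTag = new Regex(
    "<input value=\"([^\"]*)\" type=hidden name=([^\\s>]+)>",
    RegexOptions.IgnoreCase);

// Map each field name to its value; a repeated name keeps the last value seen.
var fields = new Dictionary<string, string>();
foreach (Match m in inputTag.Matches(html))
{
    fields[m.Groups[2].Value] = m.Groups[1].Value;
}

Console.WriteLine(fields["form_act_message"]); // Good<br>Bad
```

Collecting everything in one pass means the string is scanned once, and any single field can then be looked up by name.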

ASKER CERTIFIED SOLUTION

kaufmed

This solution is only available to members.

aaron900

ASKER

As always, your thorough understanding of regular expressions blows me away, and teaches me a new lesson each time. Thanks!!!

kaufmed

NP. Let me know if I need to break it down. Looking back at it, I can see how it could appear a bit daunting = )