  • Status: Solved
  • Priority: Medium
  • Security: Public

Regex - HTML input with unescaped HTML in it - match the true value

Hi, ran across something I'm not quite sure how to capture...

Consider the following input:
<input value="Bad" type=hidden name=form_hf><INPUT value="Good" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>


I can capture the "Good" in form_act_message's value with:
<INPUT value="([^<]+)" type=hidden name=form_act_message>

That works fine.

But some inputs contain unescaped HTML line breaks (<br>) in the value of the field. Consider:
<input value="Bad" type=hidden name=form_hf><INPUT value="Good<br>Bad" type=hidden name=form_act_message><input value="Indifferent" type=hidden name=form_uf>


I don't believe this markup is truly valid, but unfortunately I have no control over it. I'm still new to using regular expressions beyond their simplest uses, and I can't find a pattern that captures both the plain "Good" value and the "Good<br>Bad" value without capturing more than I need (I only need the contents of form_act_message's value attribute).

Any help would be greatly appreciated! Thanks!
Asked by: aaron900
1 Solution
 
wdosanjos Commented:
Please try the following.  It returns "Good<br>Bad".

var v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";
var value = Regex.Match(v, "(?<=\\<INPUT value=\").+(?=\" type=hidden name=form_act_message\\>)").Value;

// value == "Good<br>Bad"

 
aaron900 (Author) Commented:
Thanks, wdosanjos - I guess I should have clarified - this is actually a string of probably one hundred HTML input tags - even in RegexBuddy, which has pretty good performance, it takes probably 10-15 seconds to parse through the actual document.

Plus, it seems to grab some of the previous input tag (the one before the one I really need) into a match :-(
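
The over-matching can be reproduced with a small sketch. It assumes the real document has more than one upper-case <INPUT tag, which the posted sample does not show; the class name GreedyDemo and the Sample string are made up for illustration. Greedy uses the pattern suggested above unchanged, and Tight is one possible variant that replaces ".+" with "[^\"]+" so the match cannot run past a double quote.

```csharp
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical input: every tag is upper-case, unlike the posted sample.
    public const string Sample =
        "<INPUT value=\"Bad\" type=hidden name=form_hf>" +
        "<INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message>" +
        "<INPUT value=\"Indifferent\" type=hidden name=form_uf>";

    // Pattern from the suggestion above: ".+" is greedy, so a match can begin
    // at an earlier <INPUT value=" and run all the way to the target tag.
    public static string Greedy(string html) =>
        Regex.Match(html,
            "(?<=\\<INPUT value=\").+(?=\" type=hidden name=form_act_message\\>)").Value;

    // Possible variant: the value may not contain a double quote, so the
    // match cannot cross from one attribute into the next tag.
    public static string Tight(string html) =>
        Regex.Match(html,
            "(?<=\\<INPUT value=\")[^\"]+(?=\" type=hidden name=form_act_message\\>)").Value;
}
```

With this sample, GreedyDemo.Greedy(GreedyDemo.Sample) starts at the first tag's value and returns Bad" type=hidden name=form_hf><INPUT value="Good<br>Bad, while GreedyDemo.Tight(GreedyDemo.Sample) returns only Good<br>Bad. The tighter character class also gives the engine much less text to backtrack over, which may help with the slow matching, though that has not been measured against the real document.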
 
wdosanjos Commented:
Can you provide the string as a file attachment to this question?
 
wdosanjos Commented:
Here is an alternative approach without Regex:

string v = "<input value=\"Bad\" type=hidden name=form_hf><INPUT value=\"Good<br>Bad\" type=hidden name=form_act_message><input value=\"Indifferent\" type=hidden name=form_uf>";

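// Find the closing quote of the value that precedes the target name.
int end = v.IndexOf("\" type=hidden name=form_act_message>");
string value = null;
if (end > 0)
{
    // Search backward from there for the opening quote of the same value.
    int start = v.LastIndexOf('"', end - 1) + 1;
    value = v.Substring(start, end - start);
}

// value == "Good<br>Bad" (stays null if the tag is not found)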

 
aaron900 (Author) Commented:
See attached to this comment. Thanks!
tmpBF-FormFieldMockup.txt
 
aaron900 (Author) Commented:
Wait, sorry - that was one version - the version that I'm getting passed is different... let me post that.
 
aaron900 (Author) Commented:
 
crysallus Commented:
Just my 2 cents. This captures the text that follows value= in the input tags, if that's what you're after. It includes the surrounding " characters as well - not sure if that's what you want. The extracted text is in the capturing group.

INPUT.*value=("[^"]*"|[^>\s]*)

This next one captures only value attributes that contain <br> and are surrounded by " characters:

INPUT.*value=("[^"]*<br>[^"]*")

which in your test file is the only one that you've marked as needing to be captured.

Hope that helps.
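
For reference, a minimal sketch of the second pattern in use from C#. The class and method names (BrValueDemo, BrValue) are made up for illustration. Two things to keep in mind: INPUT is matched case-sensitively unless RegexOptions.IgnoreCase is added, and because ".*" is greedy, a line holding several quoted values with <br> yields the last of them.

```csharp
using System.Text.RegularExpressions;

public static class BrValueDemo
{
    // Returns the quoted value containing "<br>" that the pattern captures,
    // with the surrounding quotes trimmed, or an empty string if none matches.
    public static string BrValue(string line)
    {
        Match m = Regex.Match(line, "INPUT.*value=(\"[^\"]*<br>[^\"]*\")");
        return m.Success ? m.Groups[1].Value.Trim('"') : string.Empty;
    }
}
```

Called on the second sample line from the question, BrValueDemo.BrValue returns Good<br>Bad; on a line with no <br> inside a quoted value it returns an empty string.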
 
käµfm³d 👽 Commented:
Try this out. It will only work if your "value" attribute is surrounded by quotes and there is not an internal quote within the value.

(?i)<input (?=(?:[^>]|(?<=<br)>)*name=\"?form_act_message)(?:[^>]|(?<=<br)>)*value=\"([^\"]*)\"


Example Usage
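using System;
using System.IO;
using System.Text.RegularExpressions;

// The lookahead confirms the tag's name before the value is captured.
string pattern = "(?i)<input (?=(?:[^>]|(?<=<br)>)*name=\"?form_act_message)(?:[^>]|(?<=<br)>)*value=\"([^\"]*)\"";

using (StreamReader reader = new StreamReader("input.txt"))
{
    while (!reader.EndOfStream)
    {
        string line = reader.ReadLine();

        // Group 1 holds the text between the quotes of the value attribute.
        foreach (Match m in Regex.Matches(line, pattern))
        {
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}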

 
aaron900 (Author) Commented:
As always, your thorough understanding of regular expressions blows me away, and teaches me a new lesson each time. Thanks!!!
 
käµfm³d 👽 Commented:
NP. Let me know if I need to break it down. Looking back at it, I can see how it could appear a bit daunting  = )
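
In case the breakdown is useful, here is a sketch of the same pattern written out with RegexOptions.IgnorePatternWhitespace so each piece can carry a comment. The class and method names (ActMessageValue, Extract) are made up for illustration, and the literal space after <input is written as [ ] because bare whitespace is ignored in this mode. The same limits apply: the value must be wrapped in double quotes and must not contain one.

```csharp
using System.Text.RegularExpressions;

public static class ActMessageValue
{
    private static readonly Regex Pattern = new Regex(@"
        <input[ ]                     # start of an input tag, followed by one space
        (?=                           # look ahead without consuming anything:
            (?: [^>] | (?<=<br)> )*   #   any char except >, or a > that directly follows <br
            name=""?form_act_message  #   then the target name, with an optional opening quote
        )
        (?: [^>] | (?<=<br)> )*       # skip whatever precedes value= inside the same tag
        value=""([^""]*)""            # group 1: the text between the quotes
        ", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns the value of the form_act_message input, or an empty string if not found.
    public static string Extract(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

The lookahead is what ties the value to the right tag: it scans forward inside the current tag only (a > ends the scan unless it belongs to <br>) and succeeds only if name=form_act_message appears before the tag closes. Calling ActMessageValue.Extract on the two sample lines from the question gives Good and Good<br>Bad respectively.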