How can I find all input names and the corresponding values in a text file?

I need to parse field names and values from an HTML form to add to my db. I know I can do a find for
 "input name='", then start another find for the closing "'" and get the data via the Mid function, then do the same
 for the value via a find for "value='".
 I was wondering if there is an easier way to loop through the doc and extract all input names and their associated values?

 Below is a sample of what the page I need to parse looks like

<input name='a_glare'
                        value='B'
                        class='inputbox-highlighted-false'
                        size='1'
                        maxlength='1'>  
        </td>



                 <td align="center">


                    <input name='a_testani'
                        value=''
                        class='inputbox-highlighted-false'
                        size='1'
                        maxlength='1'>  

                 </td>

                 <td align="center">

                    <input name='a_tksig'
                        value='EC'
                        class='inputbox-highlighted-false'
                        size='2'
                        maxlength='2'>  


                 </td>

                 <td align="center">

                    <input name='a_sacnon'
                        value=''
                        class='inputbox-highlighted-false'
                        size='1'
                        maxlength='1'>  

                 </td>

                 <td align="center">

                    <input name='a_ot'
                        value=''
                        class='inputbox-highlighted-false'
                        size='1'
                        maxlength='1'>  

                 </td>


                 <td align="center">

                    <input name='a_ovlp'
                        value=''
                        class='inputbox-highlighted-false'
                        size='1'
                        maxlength='1'>  
Asked by AlexPonnath

Ian (Statistician) commented:
Hi there AlexPonnath,

The easiest way is under program control, using a language that supports Perl Compatible Regular Expressions (PCRE).

I assume you are not familiar with any languages where you can use PCRE (or you wouldn't have asked this question).

However, you can use a text editor to get the same results.

My suggestion for Windows is Notepad++, available from http://notepad-plus-plus.org/. You can't use Microsoft Notepad! Many other editors that do PCRE exist for OS X and *nix environments.

If using Notepad++, open the file containing the web page you want to analyse.

(Make sure the cursor is on the first line, before the first character.)

Then press Ctrl+H (or choose Search -> Replace from the menus).

In the "Find what" field enter the following line (EXACTLY as shown, no extra blanks!)
name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$

In the "replace with" field enter the following line
\n#$2\t$4

Click on "Regular expression" at the bottom of the dialog box.

Click on "Replace All"

It will find all the   name -> value  pairs and put them on a line by themselves starting with a # character.

Now you have to get rid of all the remaining rubbish.


In the "Find what" field enter the following line
^[^#].*$\R

Clear the  "replace with"  field.

(Make sure the cursor is on the first line before the first character).
Click on "Replace all"

(removes all the lines that don't start with #)


In the "Find what" field enter the following line
\R\R

In the  "replace with"  field enter the following line
\n

(Make sure the cursor is on the first line before the first character).
Click on "Replace all". (You will need to do this a few times, until it tells you there were no replacements.)

(This removes all the blank lines)

You then have a file with one line for each name/value pair, as below
# <name> <tab char> <value>

What you do from there is up to your processing requirements.

(for example you can remove the # characters, you can add in text surrounding the name and values, etc.)

This assumes that both the name and the value are "words", that is, composed of A-Z, a-z, 0-9 and underscore. No blanks, no extras like   +-!@$%&   characters. The pattern for the match would need adjusting if you want a different rule for the value and/or the name.
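As a rough illustration of such an adjustment, here is a minimal Python sketch (Python is used here only for illustration; it is not the asker's language). It assumes values may contain anything except the closing quote, so the value group `(\w*)` is swapped for a lazy `(.*?)`. The names `RELAXED` and `extract_relaxed` are hypothetical, and the trailing `.*$` is dropped because it is not needed to pick out the pair.

```python
import re

# Hypothetical variant of the thread's pattern: the value group (.*?) accepts
# any characters up to the matching closing quote, not only word characters.
RELAXED = re.compile(r"""name=('|")(\w*)\1\s*value=('|")(.*?)\3""")


def extract_relaxed(html):
    """Return (name, value) tuples, allowing blanks and symbols in values."""
    return [(m.group(2), m.group(4)) for m in RELAXED.finditer(html)]


# Made-up input: one value with a blank and a plus sign, one with an apostrophe.
html = "<input name='a_note'\n   value='E C+1'>\n<input name=\"a_q\" value=\"it's\">"
print(extract_relaxed(html))
```

Because `.` does not match a line break by default, a value that itself spans several lines would still be missed by this sketch.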

Ian



Explanation of the regular expression:
Note that blanks are significant in the real pattern; I have spaced it out here only to explain.

name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$
=>
name=    ('|")    (\w*)   \1    \s*   value=   ('|")    (\w*)    \3     .*    $
name=       - find these characters exactly
('|")             - next you MUST find either a single or a double quote. Remember that as match number 1
(\w*)           - next match successive word characters only, and keep matching while there are
                      word characters to match (zero characters is allowed).  Remember that as match 2
\1                - next find exactly the match 1 character (either single or double quote)
\s*             -  next match as many white space chars as possible (blanks, tabs, CR, LF)
value=       - exactly match these characters
('|")            - again match a single or double quote. Remember as match 3
(\w*)          - again match word characters. Remember as match 4.
\3               - match the single/double quote found in match 3
.*               -  match any number (zero or more) of characters on the rest of the line.
$                - stop at the end of line (but don't gobble it up)

For the replacement
\n#$2\t$4
=>
\n   #   $2   \t   $4
\n             - start a new line
#              - put in a hatch character (this could be another marker if you wanted)
$2            - put in the second match (the name part).
\t             -  put in a tab character  (you could have a comma if you wanted)
$4            - put in the 4th match (the value part)

Note in the replacement part you need to use  $2 unlike  \2 that would be used in the find pattern.

=======

Also
^[^#].*$\R
=>
^  [^#]  .*   $   \R
^          - start at the beginning of the line
[^#]     - find one character that is not a hatch character
.*         - find as many as possible "any" characters on the same line
$          - match up to the end of line
\R        - gobble up the end of line (\R matches any line break: CR, LF, or CR+LF)

====
Ian (Statistician) commented:
Hi there AlexPonnath,

If you can get regular expressions going under program control, then you would just need to iterate over the HTML page, feeding in the first pattern (name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$), and select match 2 and match 4 from each result. Note that, depending on the routines, a match number 0 is usually also returned, which is the whole matched text (in addition to match 1, match 2, match 3 and match 4).

Ian
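A minimal sketch of that iteration, written in Python purely for illustration (the asker's own language will have different function names). It uses the standard `re` module with the pattern from this thread. The names `PATTERN`, `extract_pairs` and `SAMPLE` are hypothetical, and `re.MULTILINE` is an assumption needed in Python so that `$` matches at the end of every line, not only at the end of the whole text.

```python
import re

# The pattern from the thread. re.MULTILINE lets $ match at each line end.
PATTERN = re.compile(r"""name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$""", re.MULTILINE)


def extract_pairs(html):
    """Iterate over every match and keep sub-match 2 (name) and 4 (value)."""
    return [(m.group(2), m.group(4)) for m in PATTERN.finditer(html)]


# A shortened copy of the sample page from the question.
SAMPLE = """<input name='a_glare'
                        value='B'
                        class='inputbox-highlighted-false'
                        size='1'
                        maxlength='1'>
        </td>
                    <input name='a_testani'
                        value=''
                        class='inputbox-highlighted-false'>
                    <input name='a_tksig'
                        value='EC'
                        size='2'>
"""

for name, value in extract_pairs(SAMPLE):
    print(name, repr(value))
```

The `\s*` between the two halves of the pattern is what lets the match cross the line break between the name and value attributes.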

AlexPonnath (Author) commented:
Thanks, I have ultraedit which supports the regular expressions and it works as advertised. Great job, I ended up with
exactly what you said after running the 3 passes over the file. Now I just have to figure out how I can do this in my code.

I am a bit confused on your last comment on match 2 and 4 , I assume match1 is the pattern match (name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$) but not sure about 2 and 4 and how I would access them

Thanks
Ian (Statistician) commented:
Sorry, I didn't explain the numbering scheme very well.

Each set of parens, from ( up to its matching ), is a kept match, numbered 1, 2, ...

The numbering follows the order of the left parens, so that there is a way of uniquely numbering the matches even when they are nested.

So in
name=    ('|")    (\w*)   \1    \s*   value=   ('|")    (\w*)    \3     .*    $
----------    ===    ====   ---    ----    ---------   ===    ====    ---     ---    --
                   1         2                                       3          4

The bits underlined with === are kept in numbered sequence; the bits underlined with ---- are not kept (except that the whole string that is matched is available as number 0).

For better doco, you will need to search the web.  There is loads of doco about regular expressions.  Just be warned that the complicated bits can vary between implementations.  All the basic stuff is pretty much the same the world over!
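To make the numbering concrete, here is a small sketch in Python (used for illustration only). It assumes Python's standard `re` module and drops the trailing `.*$` so that group 0 is easy to read. The second pattern, with nested parens, is a made-up example and not part of the original solution.

```python
import re

pattern = re.compile(r"""name=('|")(\w*)\1\s*value=('|")(\w*)\3""")
m = pattern.search("<input name='a_tksig' value='EC' size='2'>")

# group(0) is the whole matched text; groups 1 to 4 are the kept sub-matches.
groups = [m.group(i) for i in range(5)]
print(groups)

# Made-up nested example: numbering follows the order of the LEFT parens,
# so the outer group is 1 and the two inner groups are 2 and 3.
nested = re.search(r"((\w+)_(\w+))", "a_tksig")
print(nested.group(1), nested.group(2), nested.group(3))
```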
Ian (Statistician) commented:
If running under program control, I would just do the match

/name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$/

and retrieve sub-match 2 and sub-match 4, iterating over the whole source HTML document.

[[[ Often the program functions want the string enclosed in  /  and  / as I have done here.  Read the notes on the PCRE functions you will use to see what it wants. ]]]


The replacement

\n#$2\t$4
and successive matches
^[^#].*$\R
and
\R\R

were there because with an editor you don't have other storage to put found stuff. Under program control you can just pick the matches off and store in an array or whatever.
Ian (Statistician) commented:
I'm not sure what form the result of the PCRE functions you use would take.

Maybe it will return an array of strings.

X[0]  ->  string which matches the whole name= ..... value= ...'   bit
X[1] ->   single/double quote
X[2]  ->  <name>
X[3]  ->  single/double quote
X[4]  ->  <value>

So for each iterated match, just get the returned X, save X[2] and X[4], and throw away the rest.
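The X layout above can be mimicked in Python as a rough sketch (illustration only; the routines in your own language may return something shaped differently). Python's match object is not itself an array, so the list `x` is built by hand here. The names `collect_fields` and `page` are hypothetical.

```python
import re

PATTERN = re.compile(r"""name=('|")(\w*)\1\s*value=('|")(\w*)\3.*$""", re.MULTILINE)


def collect_fields(html):
    """Build a {name: value} dict, keeping only x[2] and x[4] of each match."""
    fields = {}
    for m in PATTERN.finditer(html):
        # x[0] is the whole match, x[1]..x[4] are the four sub-matches.
        x = [m.group(0), *m.groups()]
        fields[x[2]] = x[4]
    return fields


page = "<input name='a_glare'\n value='B'\n size='1'>\n<input name='a_ot'\n value=''\n size='1'>"
print(collect_fields(page))
```

A dict is used here on the assumption that each input name appears only once on the page; a list of pairs would be safer if names can repeat.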

Visual Basic.NET

From novice to tech pro — start learning today.