  • Status: Solved
  • Priority: Medium
  • Security: Public

Glossary of terms, matching keywords without including html tags

Hello,

I have the following code which is close to working, but I recently ran into a problem I can't seem to resolve.  Items 1 through 3 in the list below are what I need to resolve.  Items 4 and 5 are ideal, but I can probably wait and implement them once I understand jQuery better. I would appreciate any help someone can provide.

Thank you in advance.

1) When the second parameter of highlightGlossaryTerms() is set to 1, the function should highlight only the first occurrence of each term and then move on to the next term. I don't always want every occurrence highlighted; otherwise the page might be full of glossary links.
2) If a term is within the ALT, VALUE or other attributes of an html tag, do not convert it to a glossary link. The example code shows how this breaks; just look for the <IMG> tag. Term matching should ignore all html tags, but use the text between the tags.
3) Optimize the code so that performance does not suffer too much as the number of glossary terms grows.
4) Convert Javascript code to use jQuery?
5) Convert terms array to JSON?
6) WORKING - Do not convert already existing <A> tags
7) WORKING - Grab "text" within the tag designated by id
8) WORKING - Must match terms within "text" using Javascript array
9) WORKING - Term matching should be case insensitive

NOTE: I don't need help with the displaying of term and definition.  This is already completed, but not represented in this condensed code.

==== CODE ====

<html>
<head>
<script type="text/javascript">
var istGlossaryTerms = new Array();
istGlossaryTerms[0] = ['velit','This is the description for the Velit term','1'];
istGlossaryTerms[1] = ['Lorem','This is the description for the Lorem term','2'];
istGlossaryTerms[2] = ['dignissim','This is the description for the Dignissim term','3'];

function highlightGlossaryTerms(obj, limitFirstTerm) {
    /*
          limitFirstTerm = set to number of same terms you want turned into glossary links.
    */
    /* var temp = $(obj).html(); */ /* <= jQuery code */
      var temp = document.getElementById(obj).innerHTML;
    // segregate the anchors
      var tmpBody = new Array();
      // Skip <a...>...</a> tags
      var cnt=0;
      var inA=false;
      var lookIn='';
      var subBody;
      var tmpStr;

      // Parse page content
      while (temp.length > 0) {

            // Find any links within the page and remove them before
            // applying mouse over links to glossary terms
            if ((temp.toLowerCase().indexOf('</A>'.toLowerCase())>-1)||(temp.toLowerCase().indexOf('<A'.toLowerCase())>-1)) {
                  
                  if (inA) {
                        tmpBody[cnt]=temp.substr(0,temp.toLowerCase().indexOf('</A>'.toLowerCase())+4);
                        temp=temp.substr(temp.toLowerCase().indexOf('</A>'.toLowerCase())+4);
                        inA=false;
                        cnt+=1;
                  } else {
                        tmpBody[cnt]=temp.substr(0,temp.toLowerCase().indexOf('<A'.toLowerCase())-1);
                        temp=temp.substr(temp.toLowerCase().indexOf('<A'.toLowerCase())-1);
                        inA=true;
                        lookIn+=''+cnt+',';
                        cnt+=1;
                  }

            } else {
                  tmpBody[cnt]=temp;
                  temp='';
                  lookIn+=''+cnt+',';
            }      
      }
      lookIn=lookIn.substr(0,lookIn.length-1);

      for(var idx = 0; idx < istGlossaryTerms.length; idx++) {
            for(var hdx = 0; hdx < tmpBody.length; hdx++) {
                  if(lookIn.indexOf(hdx)>-1) {
                        subBody=tmpBody[hdx].split(istGlossaryTerms[idx][0]);
                        tmpStr=subBody[0];
                        for(var jdx = 1; jdx < subBody.length; jdx++) {
                              
                              tmpStr+='<a href="#" rel="/glossary/class.glossary.php?id='+istGlossaryTerms[idx][2]+'" title="Glossary Term" class="ist-glossary-term">'+istGlossaryTerms[idx][0]+'<\/a>'+subBody[jdx];
                        }
                        tmpBody[hdx] = tmpStr;
                  }
            }
      }
      tmpStr='';
      for (var idx = 0; idx < tmpBody.length; idx++) {
            tmpStr+=tmpBody[idx];
      }
      /* $(obj).html(tmpStr); */ /* <= jQuery code */
      document.getElementById(obj).innerHTML=tmpStr;
}
</script>
<style type="text/css">
.ist-glossary-term {
      border-bottom: 1px dashed #FF0000;
      color: #FF0000;
}

a.ist-glossary-term:link, a.ist-glossary-term:visited {
      text-decoration: none;
}

.ist-glossary-term:hover {
      text-decoration: none;
}
</style>
</head>
<body>

<div id="istPageBody">
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Sed accumsan nibh et turpis. In dictum leo sed lectus. Nam fermentum erat id nulla. Nullam ac ligula. Mauris ante metus, interdum sit amet, porta non, fermentum eu, ante. Duis id est. Sed molestie felis sed urna. Praesent laoreet tincidunt nunc. Praesent lorem libero, sagittis a, rhoncus nec, auctor sit amet, nunc. Aliquam purus. Nulla facilisi. Nunc ut erat sit amet est condimentum bibendum. Proin lobortis massa viverra dui. Maecenas ac quam eget pede iaculis consectetuer. Curabitur consequat. Maecenas massa est, blandit et, fringilla ullamcorper, convallis vitae, mi. Ut sollicitudin convallis dolor. Nulla cursus dolor nec velit. Pellentesque quis urna id leo commodo scelerisque.<br /><br />

Sed nec metus condimentum elit consequat dignissim. Aenean purus elit, venenatis in, tempus sed, scelerisque ac, magna. Proin erat diam, adipiscing aliquam, feugiat fermentum, <img src="" border="1" width="10" height="10" alt="Testing terms within a tag, dignissim, shouldn't mess up the code"> dignissim vel, tellus. Nulla facilisi. Nulla tristique est et quam. In nec purus vitae nisl suscipit facilisis. Vivamus cursus. Etiam bibendum, arcu vitae aliquam dignissim, mi <a href="http://example.com" target="_blank">lorem</a> ornare libero, et porta metus tortor et erat. Pellentesque orci tortor, tempus vitae, aliquam posuere, elementum condimentum, ligula. Curabitur quam. Nunc nec augue nonummy quam fringilla sodales. Mauris rhoncus, sapien at imperdiet sodales, enim metus lobortis lorem, sit amet luctus lectus massa nec ipsum. Phasellus ac ipsum. Integer nec nulla. Duis commodo malesuada sem. Donec suscipit ligula vel odio. Sed in dui. Curabitur pede. Sed bibendum, pede at volutpat imperdiet, metus lorem suscipit turpis, et aliquam nibh urna et ipsum. Sed erat.<br /><br />

Nunc dui felis, vulputate sit amet, congue molestie, consectetuer non, metus. Sed nec orci. Sed dignissim. Etiam bibendum. Nam id magna nec neque venenatis imperdiet. Ut ac urna. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos. Vestibulum a risus. Ut nec sapien. Nulla nisi.<br /><br />

Phasellus velit <A href="http://example.com" target="_blank">velit</a>, convallis vitae, eleifend et, molestie dictum, enim. Sed cursus dapibus dolor. Sed non eros. Vestibulum adipiscing adipiscing tortor. Praesent elementum eleifend lectus. Suspendisse hendrerit orci id sapien. Ut ac mauris vel massa tristique sodales. Cras condimentum est ut velit. Vivamus pellentesque augue at felis. Vestibulum pellentesque. Sed vel libero a sem consequat cursus. Nunc vitae leo.
</div>

<br />

<div>
<strong>Do not match terms within this DIV tag.</strong><br /><br />
Nunc dui felis, vulputate sit amet, congue molestie, consectetuer non, metus. Sed nec orci. Sed dignissim. Etiam bibendum. Nam id magna nec neque venenatis imperdiet. Ut ac urna. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos. Vestibulum a risus. Ut nec sapien. Nulla nisi.<br /><br />

Phasellus velit <A href="http://example.com" target="_blank">velit</a>, convallis vitae, eleifend et, molestie dictum, enim. Sed cursus dapibus dolor. Sed non eros. Vestibulum adipiscing adipiscing tortor. Praesent elementum eleifend lectus. Suspendisse hendrerit orci id sapien. Ut ac mauris vel massa tristique sodales. Cras condimentum est ut velit. Vivamus pellentesque augue at felis. Vestibulum pellentesque. Sed vel libero a sem consequat cursus. Nunc vitae leo.
</div>


<script type="text/javascript">
      highlightGlossaryTerms('istPageBody', 1); /* The second parameter when set to 1 should only highlight first occurrence of each term found */
</script>

</body>
</html>
Asked by: expertis

b0lsc0tt Commented:
If you are going to "convert" this to jQuery eventually it might be best to just do it that way from the start.  Although it is still based on Javascript, jQuery might really change the way this should be done.  I am not a jQuery expert though so I can't help with that.  In fact, I haven't seen anything that makes me want to jump on board jQuery yet although it seems to be getting more popular.

If this was just for Javascript then I would really suggest rethinking the way you do this.  Have you tried a regular expression and replace()?  A  regular expression can be used to make sure the match isn't in a tag (i.e. attribute, value, etc) and the replace can be done just once or globally.

It would seem to be a simple way to still do what you need.  However I don't know that jquery supports an expression or replace.  I guess, before I spend time working on this, I was wondering if you wanted general ideas to use in jquery or your current script or if you were open to sticking to Javascript and a major rework of how you are doing this.

Let me know if you have a question or need more info.

bol

Badotz Commented:
Here is one possibility for a JSON object representing your current array:

var istGlossary = {'terms':[
            {'term':      'velit',      'desc': 'This is the description for the Velit term',            'valu':'1'},
            {'term':      'Lorem',      'desc': 'This is the description for the Lorem term',            'valu':'2'},
            {'term':      'dignissim',      'desc': 'This is the description for the Dignissim term',      'valu':'3'}
      ]
};


You can access the elements like this:

var counter = istGlossary.terms.length; // 3
var item_01 = istGlossary.terms[0].desc; // 'This is the description for the Velit term'
var valu_12 = istGlossary.terms[1].valu; // '2'
var term_20 = istGlossary.terms[2].term; // 'dignissim'

I'm not sure how this will help you solve your other problems, though.
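
Strictly speaking, the literal above is a JavaScript object rather than JSON text, since JSON requires double-quoted keys and strings. As a minimal sketch, assuming the [term, description, id] row layout from the question, the existing array could be converted and serialized roughly like this. The helper name glossaryToJson is hypothetical, and JSON.stringify / JSON.parse are assumed to be available (they are built into current browsers):

```javascript
// Existing array format from the question: [term, description, id]
var istGlossaryTerms = [
    ['velit', 'This is the description for the Velit term', '1'],
    ['Lorem', 'This is the description for the Lorem term', '2'],
    ['dignissim', 'This is the description for the Dignissim term', '3']
];

// Hypothetical helper: turn each row into an object with named
// properties and serialize the whole list as JSON text.
function glossaryToJson(rows) {
    var terms = [];
    for (var i = 0; i < rows.length; i++) {
        terms.push({ term: rows[i][0], desc: rows[i][1], valu: rows[i][2] });
    }
    return JSON.stringify({ terms: terms });
}

var jsonText = glossaryToJson(istGlossaryTerms); // double-quoted JSON text
var istGlossary = JSON.parse(jsonText);          // back to a plain object

// The ids stay strings, exactly as they were in the array.
var secondId = istGlossary.terms[1].valu;
```

The same JSON text could just as well be produced by the database backend, in which case only the JSON.parse step would be needed on the page.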

basicinstinct Commented:
You probably won't like this because I suggest you rewrite the whole thing from scratch.
IMHO the way you are doing it is wrong.  You should not be using innerHTML at all.  Doing so causes you to end up writing your own HTML parsing code (yuk) like this kind of thing:

(temp.toLowerCase().indexOf('</A>'.toLowerCase())>-1)

This is really the wrong way to do it.  

I'm not keen to write a whole glossary matching script for you, but to get you started:

- use DOM methods instead of innerHTML
- check tagName to see if the node is an <a> tag
- use nodeType == 3 to check if the node is a text node (visible text displayed on the page)
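
The three points above can be sketched roughly as follows. This is only an outline under some assumptions, not a finished glossary script: the names splitByTerm, linkFirstTerm and makeGlossaryLink are made up for illustration, only the first occurrence of one term is handled, and the case-insensitive search assumes that lower-casing does not change the string length (true for plain ASCII terms). Because attributes such as ALT are never text nodes, they are skipped automatically.

```javascript
// Pure helper: find `term` in `text` (case-insensitive) and return the
// pieces around it, or null when the term does not occur.
function splitByTerm(text, term) {
    var idx = text.toLowerCase().indexOf(term.toLowerCase());
    if (idx === -1) { return null; }
    return {
        before: text.substring(0, idx),
        match: text.substring(idx, idx + term.length), // original capitalization
        after: text.substring(idx + term.length)
    };
}

// Hypothetical factory for the glossary anchor used in the question.
function makeGlossaryLink(id) {
    return function (text) {
        var a = document.createElement('a');
        a.setAttribute('href', '#');
        a.setAttribute('rel', '/glossary/class.glossary.php?id=' + id);
        a.setAttribute('title', 'Glossary Term');
        a.className = 'ist-glossary-term';
        a.appendChild(document.createTextNode(text));
        return a;
    };
}

// Walk the children of `node`, skip existing <a> elements, and wrap the
// first occurrence of `term` found in a text node. Returns true once done.
function linkFirstTerm(node, term, makeLink) {
    for (var child = node.firstChild; child; child = child.nextSibling) {
        if (child.nodeType === 3) { // text node
            var parts = splitByTerm(child.nodeValue, term);
            if (parts) {
                var parent = child.parentNode;
                parent.insertBefore(document.createTextNode(parts.before), child);
                parent.insertBefore(makeLink(parts.match), child);
                child.nodeValue = parts.after;
                return true;
            }
        } else if (child.nodeType === 1 && child.tagName.toLowerCase() !== 'a') {
            if (linkFirstTerm(child, term, makeLink)) { return true; }
        }
    }
    return false;
}

// Only runs in a browser, and only if the element from the question exists.
if (typeof document !== 'undefined') {
    var body = document.getElementById('istPageBody');
    if (body) { linkFirstTerm(body, 'velit', makeGlossaryLink('1')); }
}
```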

expertis (Author) Commented:
Thank you both for the response. Right now I really need to get this resolved so it will need to be Javascript -- not jQuery.  I mentioned jQuery in case it might be easier to accomplish using a library.

The JSON help is appreciated, but it definitely won't help with the 2 problems I'm having, the array is doing just fine for now. The database backend that generates that array is already completed and working fine.  However, I'm trying to learn JSON so your example really helps.

Thanks.

Badotz Commented:
Visit www.json.org for more details on JSON.

expertis (Author) Commented:
b0lsc0tt,
Regular expressions and replace seem to be the best solution, especially for performance reasons?  I want to make sure pages with a lot of text and markup will not be sluggish or experience other problems.  Would you agree? Going forward I want to stick with Javascript.  

b0lsc0tt Commented:
Javascript's regex engine has some weaknesses, but expressions are a great way to search and manipulate text, especially a lot of text.  If there is a lot then IE can have issues, especially with an inefficient or complex expression.  However the expression you need shouldn't be too bad and will work in Javascript.
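
On the performance point, one possible direction as the term list grows is to combine all terms into a single expression and make one pass over the text, instead of one replace() per term. The sketch below is only an illustration under several assumptions: it works on plain text (tag handling is left out), it matches whole words only via \b, and the names escapeRegExp, buildMatcher and linkTermsInText are hypothetical. Whether it is actually faster for a given page would need to be measured.

```javascript
var istGlossaryTerms = [
    ['velit', 'This is the description for the Velit term', '1'],
    ['Lorem', 'This is the description for the Lorem term', '2'],
    ['dignissim', 'This is the description for the Dignissim term', '3']
];

// Escape regex metacharacters so a term is always matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one alternation regex plus a lower-cased term -> id lookup.
function buildMatcher(terms) {
    var parts = [];
    var ids = Object.create(null);
    for (var i = 0; i < terms.length; i++) {
        parts.push(escapeRegExp(terms[i][0]));
        ids[terms[i][0].toLowerCase()] = terms[i][2];
    }
    return { regex: new RegExp('\\b(' + parts.join('|') + ')\\b', 'gi'), ids: ids };
}

// Single pass over plain text. `limit` caps the number of links per term
// (0 means no cap). Text produced by the replacer is never rescanned.
function linkTermsInText(text, matcher, limit) {
    var seen = Object.create(null);
    return text.replace(matcher.regex, function (found) {
        var key = found.toLowerCase();
        seen[key] = (seen[key] || 0) + 1;
        if (limit && seen[key] > limit) { return found; }
        return '<a href="#" rel="/glossary/class.glossary.php?id=' + matcher.ids[key] +
               '" class="ist-glossary-term">' + found + '</a>';
    });
}

var matcher = buildMatcher(istGlossaryTerms);
var firstOnly = linkTermsInText('Lorem ipsum velit, lorem and VELIT.', matcher, 1);
var allTerms = linkTermsInText('Lorem ipsum velit, lorem and VELIT.', matcher, 0);
```

With a limit of 1, only "Lorem" and the first "velit" become links, and the later "lorem" and "VELIT" are left as plain text.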

I worked on a function to do the replace.  I don't know if the changes I make are what you need since that wasn't real clear from your code or question.  However the expression and script won't match/replace a term in a tag (e.g. an attribute).

function highlightTerms(obj, limitFirstTerm) {
      // limitFirstTerm is accepted but not used yet in this version
      var ele = document.getElementById(obj);
      var txt = ele.innerHTML;
      for (var i=0; i<istGlossaryTerms.length; i++) {
            // The negative lookahead skips a term when a '>' follows it before any '<', i.e. when it sits inside a tag
            var regObj = new RegExp(istGlossaryTerms[i][0] + '(?![^<]*>)','gi');
            txt = txt.replace(regObj, '<a href="#" rel="/glossary/class.glossary.php?id='+istGlossaryTerms[i][2]+'" title="Glossary Term" class="ist-glossary-term">'+istGlossaryTerms[i][0]+'<\/a>');
      }
      ele.innerHTML = txt;
}

Let me know what you think of that or if you have a question.  This function uses innerHTML and that will work well in current browsers and is an easy, fast method to actually "change" the html.

Basicinstinct's post does bring up some great points though.  I disagree that using innerHTML is wrong ("shouldn't use") but there are arguments against using it.  Of course there are arguments for it too.  Script to do what he suggests is not difficult but more complicated than the script above; replacing the text for an anchor is the complicated part that innerHTML makes sooo easy.  It probably is the "correct" way to change the html, especially using W3 specs, but is often slower in browsers than innerHTML, which is still well and universally supported.  I think it was a great suggestion though and I was really interested in it (I don't use DOM as much as I probably should and choose the easy (lazy) way).  Since it works and is fast it is hard to change just because it isn't "proper."

Let me know if there are any questions about this.  I can help you modify the code above to work in your page if I get some details from you.  That function will replace the majority of your script in the code above and fixes issue 2.
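
To make the (?![^<]*>) lookahead concrete, here is a small sketch run against a shortened version of the sample paragraph. The helper name outsideTagRegex and the [$&] marker are only for illustration; they stand in for the real glossary link so the result is easy to read:

```javascript
// Matches `term` unless a '>' follows it before any '<', which is the
// situation when the term sits inside an open tag such as <img alt="...">.
function outsideTagRegex(term, flags) {
    return new RegExp(term + '(?![^<]*>)', flags);
}

var html = 'consequat dignissim. ' +
           '<img src="" alt="terms in a tag, dignissim, here"> ' +
           'dignissim vel';

// '$&' inserts the matched text, so each hit is wrapped in brackets.
var marked = html.replace(outsideTagRegex('dignissim', 'gi'), '[$&]');

// The occurrence inside the alt attribute is left alone.
var expected = 'consequat [dignissim]. ' +
               '<img src="" alt="terms in a tag, dignissim, here"> ' +
               '[dignissim] vel';
```

Note that the term is inserted into the expression as-is, so a glossary term containing characters such as "." or "(" would need to be escaped first.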

bol

expertis (Author) Commented:
b0lsc0tt,
Thanks so much... I have the initial requirements working, but introduced one more problem after some tweaking of the regular expression.  The current function looks like this;

function highlightGlossaryTerms(obj, limitFirstTerm) {
      var ele = document.getElementById(obj);
      var txt = ele.innerHTML;
      // Without the 'g' flag only the first occurrence of each term is replaced
      var regFlags = (limitFirstTerm == 1) ? 'i' : 'gi';
      for (var i = 0; i < istGlossaryTerms.length; i++) {
            var regObj = new RegExp('('+istGlossaryTerms[i][0] + ')(?![^<]*>)(?!<a*[^>]*>)', regFlags);
            txt = txt.replace(regObj, '<a href="#" rel="/glossary/class.glossary.php?id='+istGlossaryTerms[i][2]+'" title="Glossary Term" class="ist-glossary-term">'+"$1"+'<\/a>');
      }
      ele.innerHTML = txt;
}

Your regular expression had a few issues; it didn't exclude links and also replaced the term with the one from the array.  This was a problem because the term doesn't always match the capitalization of the term found in the actual text, so now I can use $1 in replace instead of the array term.

I also needed to be able to set a flag that replaced either all occurrences of the term on the page or just the first occurrence.  This was done by adding or removing the 'g' from the regular expression's flags.
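
A stripped-down illustration of those two changes, using plain text and a <b> tag in place of the glossary link (the function name wrapTerm is made up for this example):

```javascript
// Wrap occurrences of `term` in <b> tags. The capture group plus '$1'
// re-inserts the text exactly as it appeared, so capitalization is kept.
// Without the 'g' flag, replace() stops after the first match.
function wrapTerm(text, term, firstOnly) {
    var flags = firstOnly ? 'i' : 'gi';
    var re = new RegExp('(' + term + ')', flags);
    return text.replace(re, '<b>$1</b>');
}

var sample = 'Lorem ipsum, metus lorem suscipit, LOREM.';

var first = wrapTerm(sample, 'lorem', true);
// '<b>Lorem</b> ipsum, metus lorem suscipit, LOREM.'

var all = wrapTerm(sample, 'lorem', false);
// '<b>Lorem</b> ipsum, metus <b>lorem</b> suscipit, <b>LOREM</b>.'
```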

It looks like I have one other problem... With my addition to the regular expression it now will not replace the term if it occurs inside a tag like <p>term</p>.

Could you verify my changes to the regular expression haven't caused any further problems?  Thank you again.

b0lsc0tt Commented:
I added the global argument and used the expression like that so it would be easy to implement your limitFirstTerm feature.  Sorry I forgot to mention that.  I'm glad you caught it.

I wasn't real sure about what you wanted replaced, etc., but I'm glad to hear you quickly figured out the change to get exactly what you want.

There is a problem with the change you made.  Adding another negative lookahead is not the way to fix it.  If you want to not replace matches that are in an anchor tag try ...

            var regObj = new RegExp('('+istGlossaryTerms[i][0] + ')(?![^<]*>|</a>)', regFlags);

That will look for an anchor closing tag right after the term.  Other tags (e.g. <p>) won't be affected.  However if you may have the term like ...

<a href="">This is a sentence with a lorem term</a>

Then that will be matched and replaced.  Let me know if that will be an issue.
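
A quick way to see both behaviours side by side is a sketch like the following, where [$1] is just a readable stand-in for the glossary link and the sample markup is invented for the test:

```javascript
// Skip a term that is inside a tag, or that is immediately followed by </a>.
var re = new RegExp('(velit)(?![^<]*>|</a>)', 'gi');

var html = '<p>velit</p> ' +
           '<a href="#">velit</a> ' +
           '<a href="#">this is velit the term</a>';

var out = html.replace(re, '[$1]');

// The term in <p> is replaced and the term directly before </a> is
// skipped, but the term in the middle of the last link is still replaced.
var expected = '<p>[velit]</p> ' +
               '<a href="#">velit</a> ' +
               '<a href="#">this is [velit] the term</a>';
```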

Let me know how this works or if you have a question.

bol

expertis (Author) Commented:
Unfortunately this regular expression has the issue you pointed out where if a term is inside a link tag along with other words, the link is converted to a "term" link and closed prematurely. For example;

Code Before:
Phasellus <p>velit</p> <A href="http://example.com" target="_blank">this is velit the term</a>

Code After:
Phasellus  <p><a class="ist-glossary-term" title="Glossary Term" rel="/glossary/class.glossary.php?id=1" href="#">velit</a></p> <a target="_blank" href="http://example.com">this is </a> <a class="ist-glossary-term" title="Glossary Term" rel="/glossary/class.glossary.php?id=1" href="#">velit</a> the term

In the above example it should only replace the term velit between the <p> tag. As you can see the term that was in between the <a> tags along with the other words is converted and the final two words ("the term") are removed from the link. This can't happen.  As you can see my regular expression skills are lacking, do you have any other recommendation on how to fine tune it?

Thanks,

Spencer

b0lsc0tt Commented:
Javascript's regex "engine's" limitations make this a little harder than in other cases but I did think of a modification that should handle even this new case.  Right now I can't think of an issue with it and it passed the tests I made.

            var regObj = new RegExp('('+istGlossaryTerms[i][0] + ')(?![^<]*>|[^<]*</a>)', regFlags);

Notice I added [^<]* before the </a> so the expression now allows characters between the term and the closing tag.  I think that final adjustment will have this set but please test it (well). :)
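
As a starting point for that testing, here is a small sketch that runs the expression against the "Code Before" example from the previous post, again with [$1] standing in for the glossary link. The second case is an extra, invented input showing one situation the expression does not appear to cover: a nested tag between the term and the closing </a>, which [^<]* cannot cross.

```javascript
var re = new RegExp('(velit)(?![^<]*>|[^<]*</a>)', 'gi');

// Case 1: the example from the thread. Only the term in <p> is replaced.
var before = 'Phasellus <p>velit</p> ' +
             '<A href="http://example.com" target="_blank">this is velit the term</a>';
var after = before.replace(re, '[$1]');
var expectedAfter = 'Phasellus <p>[velit]</p> ' +
                    '<A href="http://example.com" target="_blank">this is velit the term</a>';

// Case 2 (invented input): a nested <b> sits between the term and </a>,
// so the lookahead does not see the closing tag and the term is replaced.
var nested = '<a href="#">velit <b>bold</b></a>';
var nestedAfter = nested.replace(re, '[$1]');
var expectedNested = '<a href="#">[velit] <b>bold</b></a>';
```

If links on the page can contain other markup, the DOM-based approach suggested earlier in the thread may be the safer route.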

Let me know if you have a question or how it works.

bol

expertis (Author) Commented:
Thanks for the help, sorry for the delay in awarding the points.

b0lsc0tt Commented:
You're welcome!  Thanks for coming back to close this yourself.

I'm glad I could help.  Thanks for the grade, the points and the fun question.

bol