JavaScript RegExp and UTF-8

Howdy all,
  I have a JavaScript function that has been working fine forever, but now it needs to support UTF-8.  Here is a snippet of the code:

var keywords = Array ('some word', 'Ă ă Ş ş Ţ ţ', 'another word', 'Š Ť Ž Ľ Č Ě Ď Ň Ř Ů Ĺ');

for (var counter = 0; counter < keywords.length; counter++) {
    var re = new RegExp ("\\b(" + keywords[counter] + ")\\b", 'i');
    var testit = re.test (oNode.nodeValue);
}

This is running through the DOM nodes in a document, and finding occurrences of the keywords in the keywords array.  Works fine with standard characters, always has.  But it seems to ignore the extended characters.

Any thoughts?
Asked by: headzoo

ljo8877 commented:
Could it be as simple as changing '&#258; &#259; &#350; &#351; &#354; &#355;' to
'&#258;&#259;&#350;&#351;&#354;&#355;'? I suspect the spaces between the letters prevent a match, because \\bword\\b matches a whole word (which normally has no spaces between its letters) or a single letter.

headzoo (Author) commented:
Ignore what you see above.  I didn't put in text like "&#259; &#350;", I put in the actual characters that I got from this page: http://www.slovo.info/testuni.htm . The Experts Exchange website seems to have messed things up a bit.

- Sean

ljo8877 commented:
Well, this is interesting,

javascript: var x = "ŠČŽĆĐ"; /ŠČŽĆĐ/.test(x);

Or, without Unicode, for clarity: var x = "SCZCD"; /SCZCD/.test(x);

returns true, but adding \b or \\b around the Unicode version returns false. It seems that extended Unicode characters and the word boundary character are incompatible.
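A likely explanation, offered as a sketch rather than a definitive diagnosis: in JavaScript, \b is defined in terms of \w, and \w covers only [A-Za-z0-9_]. A letter such as Š therefore counts as a non-word character, so there is no word/non-word transition at the edge of a keyword made up entirely of such letters. The variable names below are only illustrative.

```javascript
// \b needs a transition between a \w character and a non-\w character.
// \w is ASCII-only here, so "Š" is treated like punctuation.
const plainAscii = /\bSCZCD\b/.test("SCZCD");        // ASCII letters on both edges
const allExtended = /\bŠČŽĆĐ\b/.test("ŠČŽĆĐ");      // no \w character at either edge
const asciiEdges = /\baŠČŽĆĐb\b/.test("aŠČŽĆĐb");   // ASCII letters wrap the extended ones

console.log(plainAscii, allExtended, asciiEdges); // true false true
```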

But, I have a work around.

javascript: var x = "ŠČŽĆĐ";  /^ŠČŽĆĐ$|^ŠČŽĆĐ | ŠČŽĆĐ$| ŠČŽĆĐ /.test(x);

Again not in unicode for clarity.

javascript: var x ="SCZCD";  /^SCZCD$|^SCZCD | SCZCD$| SCZCD /.test(x);

This expression returned the correct results. It tests for the string being the whole line, at the start of the line followed by a space, at the end of the line preceded by a space, or with a space at each end.

ljo8877 commented:
Oh, the javascript: prefix is there because I ran the tests in the browser's location field, where javascript: is the protocol. It is not required in the code.

headzoo (Author) commented:
Hmm.. Let me give this a shot.  I have two concerns though: 1) The characters may or may not be Unicode.  The script needs to be able to handle both character sets. 2) I've used a space as a delimiter before, and that causes problems if, say, a keyword has a period or a comma after it.  The \b ended up working well for that.

Is there a function I can use to test a string for Unicode?

- Sean

ljo8877 commented:
For your second question, the space can be replaced with a character class such as [ .!?;:], which matches any one character from the set. Using \b would certainly be ideal, but it doesn't seem to work, and the various things I've tried, like putting it in () or [], didn't help. I'll keep looking.
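As a rough sketch of that idea, the delimiter set can be wrapped around the keyword together with the start and end anchors. The function name and the default delimiter set are assumptions for illustration. The default adds a comma, which the set above leaves out, and the keyword is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: build a pattern that accepts the keyword only when it is
// preceded by the start of the string or a delimiter, and followed by a
// delimiter or the end of the string.
function delimitedPattern(keyword, delimiters) {
  const set = delimiters || "[ .,!?;:]"; // assumed default, includes a comma
  return new RegExp("(?:^|" + set + ")" + keyword + "(?:" + set + "|$)", "i");
}

console.log(delimitedPattern("ŠČŽĆĐ").test("Try ŠČŽĆĐ, now")); // true
console.log(delimitedPattern("ŠČŽĆĐ").test("xŠČŽĆĐ"));         // false
```

Any character missing from the set (quotes or parentheses, for example) would still block a match, so the set may need extending for real text.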

As for testing for extended characters, I'm guessing that string.charCodeAt(0) will always return a value higher than 255. Those that I have tested did. What I haven't tried is mixing standard ASCII characters with the extended set, particularly with standard characters surrounding extended characters.
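A quick way to check that guess is to print a few code values. This is only a sketch, and the sample characters are chosen for illustration. It suggests that a threshold of 255 holds for the characters in this thread, but that accented Latin-1 letters such as é sit below 256 while still being non-word characters as far as \b is concerned.

```javascript
// Sample characters written as escapes: A, é, Š, К (Cyrillic).
const samples = ["A", "\u00e9", "\u0160", "\u041a"];
const codes = samples.map(function (ch) { return ch.charCodeAt(0); });
console.log(codes); // [ 65, 233, 352, 1050 ]

// é is below 256, yet \b still does not treat it as a word character.
const eAcuteBounded = new RegExp("\\b\u00e9\\b").test("\u00e9");
console.log(eAcuteBounded); // false
```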

headzoo (Author) commented:
Well, if the 0 in charCodeAt() is the index, or letter number, then I can create a function to run through each character in a string, and if any of them are above 255, then the string must have extended characters in it.  If that is the case, then that problem is solved.  Let me try some tests and I'll get back to you.

- Sean

headzoo (Author) commented:
Okay, this function seems to work to test for extended characters in a string:

function unitest(string)
{
      var unicode = false;
      for (var i = 0; i < string.length; i++)
      {
            if (string.charCodeAt(i) > 255)
            {
                  unicode = true;
            }
      }
      return unicode;
}

Now I just have to test out the RegExp.

- Sean

headzoo (Author) commented:
Okay great, everything seems to be working.   Here is the test code I used:

var keyword = 'КЛМ';
var phrase = 'Use my great program КЛМ.';

if (unitest(keyword))
{
      var re = new RegExp('^' + keyword + '$|^' + keyword + '[ .!?;:]|[ .!?;:]' + keyword + '$|[ .!?;:]' + keyword + '[ .!?;:]', 'i');
      var testit = re.test(phrase);
      alert(testit);
}
else
{
      var re = new RegExp("\\b(" + keyword + ")\\b", 'i');
      var testit = re.test(phrase);
      alert(testit);
}


function unitest(string)
{
      var unicode = false;
      for (var i = 0; i < string.length; i++)
      {
            if (string.charCodeAt(i) > 255)
            {
                  unicode = true;
            }
      }
      return unicode;
}

Thanks!

ljo8877 commented:
Sorry, I've been away from my desk. Went to a job fair; thought I might find a job.

I like the finished code. But I just tested aŠ Č Ž Ć Đb (aS C Z C Db), with a regular character on either side of the extended characters, and in that case the \b seems to work. So all you need to test is the first and last character. It may not be any easier to code or quicker to run, but for what it's worth.
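That first-and-last-character check might look like the sketch below. The function name is hypothetical. It relies on \w matching only [A-Za-z0-9_], which is the same set \b uses to decide where a word starts and ends.

```javascript
// Hypothetical helper: \b can only anchor a keyword whose first and last
// characters are \w characters (ASCII letters, digits, or underscore).
function canUseWordBoundary(keyword) {
  return /^\w/.test(keyword) && /\w$/.test(keyword);
}

console.log(canUseWordBoundary("some word"));   // true
console.log(canUseWordBoundary("ŠČŽĆĐ"));       // false
console.log(canUseWordBoundary("aŠ Č Ž Ć Đb")); // true
```

One caveat: when \b is used this way, an extended letter directly next to the keyword in the text still counts as a boundary, so a keyword embedded in a longer non-ASCII word could produce a false match.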

On second thought, perhaps you could concatenate all keywords with a regular ASCII character at each end, then it looks like you could use the \b.


Just an option.

Thanks for the grade. I'm glad we could find you a working solution.

Lawrence

ljo8877 commented:
I just reread that and it makes no sense, since adding letters changes the word. I need to engage the mind before typing.