Accent Insensitive Search in Access

Introduction

When typing a search string, one normally expect the engine to disregard the case and the accentuation. For example, looking for "geneve" on Wikipedia will redirect automatically to Genève (in French), or Geneva (in English). This common mistake is anticipated, because the name "Geneve" -- formally incorrect -- is widely used. When searching for "sesamoide" in French, no automatic redirection occurs, but pages containing the word "sésamoïde" are presented. The same occurs in Internet search engines (along with more elaborate search tips).

The Jet Engine already performs case insensitive matching of textual data against a search criteria. 'A' equals 'a', 'Z' is greater than 'a', 'Spring' is like 's*ing', 'C' is between 'a' and 'b', etc. Note that this behaviour isn't configurable; how to perform case sensitive searches would be the topic of another article.

More often than not, a case insensitive search should also be accent insensitive. In many database engines, this is configurable, but not in Access. As a matter of fact, neither the Jet Engine nor Visual Basic expose any method or function specific to accented characters, although they are handled correctly as far as sorting is concerned.

This article briefly introduces how Latin characters are handled in Access, and presents methods to create accent insensitive search expressions. This is particularly useful when searching for names -- people's names or place names -- in an international context.

Accented Latin Characters

The topic of character encoding, code pages, and Unicode is vast and complex. The following is a very superficial introduction, with a practical aim: searching for names using Microsoft Access.

Like everything else on a computer, characters are represented by numbers, in turn represented by bits -- ones and zeroes. Eight bits form one byte, storing the numbers 0--255. In practice, each character has a corresponding byte, for example 65 for the letter 'A'. Even today, many texts stored digitally, including the present article, are encoded as bytes and are thus limited to a little over two hundred different characters.

Most readers of the present article probably use the Windows-1252 "ANSI" code page, also known as "Windows Western". This 8-bit encoding provides about 220 printable characters. The particular selection of "ANSI" characters is destined to please most American and Western European users and is appropriate for several Latin based languages, including Spanish, German, and French. All Windows applications can use the characters of the current 8-bit code page. Most notably for this article, the Visual Basic editor uses it.

The most frequent 8-bit encoding used on web pages, ISO-8859-1, is very similar but lacks some characters in one narrow range. This explains why text copied from basic Windows applications to Experts-Exchange sometimes doesn't print correctly (e.g. when it contains the Euro symbol, bullets, daggers, typographical quotes, apostrophes, etc.).

Eight-bit encoding is slowly being replaced by 16-bit encoding -- over 65 000 code points (or characters, to simplify) -- which includes many other alphabets and thus allows mixing languages freely. It is naturally essential for ideographic languages, which need much more than a couple hundred glyphs. This encoding is globally called Unicode -- again a simplification, but perfectly suited for our needs.

The Jet Engine and the Access interface support Unicode. In practice, a name field can contain names in Cyrillic, Arabic, and Chinese, among many other alphabets, and will be rendered correctly provided a compatible font has been installed. This article focuses on Latin characters only, which are included in the standard fonts.

Accented characters are found on the ANSI code page in the range 128--255, and in Unicode in the range 192--591. Most characters present in both also have the same number, but there are a few exceptions. The ANSI range 128--191 is used to encode characters found above 255 in Unicode. Because of these exceptions, the character codes should not be used interchangeably (even if it works most of the time).

For Access users, one of the problems in managing Unicode is that Visual Basic doesn't fully support it. Although the String variable type stores characters as integers, and supports Unicode, the source code does not; it supports only the current Windows code page, most probably Windows-1252 (ANSI). You can name your tables and fields using any alphabet, and you can use them as expected in queries and forms, but you will have difficulties manipulating the same objects from Visual Basic.

Jet and Extended Latin

It is easy to see that text isn't sorted according to the numeric code values of the characters. For one thing, Jet considers upper and lower case characters to be equal in indexes and when sorting. You cannot use, say, both "A12" and "a12" in the same key column of a table. When looking at sorted names, it is obvious that Jet also knows about accented characters and ligatures, and sorts them correctly.

Technically, this means that there is another code, one that assigns a sort value to a string, based on the characters and sometimes even on the character combinations. This code is selected once for the life of the database, in the Tools dialogue, before creating it. For example, in German 'ü' is considered to be a typographical variant of 'ue' -- it is historically a ligature -- and is sorted accordingly. In French, the diaeresis is treated just like an accent, and is ignored when sorting, so 'ü' sorts like 'u'.

The default "General" database sort order treats all accented characters almost like their unaccented variants. This means that 'é' comes after 'e', but only if it's the last character considered. When other characters follow it, 'é' is equivalent with 'e': "été" sorts between "est" and "ex".

Finally, the equivalence between upper and lower case characters works for all Unicode Latin characters. There are a few issues, due to confusion between similar looking characters, but none involving upper versus lower case.

First Solution

The search criteria "geneve" will not match "Genève", but the database sort order can be used to find it by using:

Between 'geneve' And 'geneveZ'

If the three strings "geneve", "Genève" and "geneveZ" were sorted by Jet, "Genève" would indeed fall in the range, and it is thus found by the criteria. Likewise, if we want to see all names in a database starting with the letter 'z', regardless of accents, we can use:

Between 'z' And 'zZ'

If there is a chance that a name actually starts with two 'z's, the second string should probably be 'zzZ' or the like. To be absolutely certain, the terminal 'Z' can also be replaced by a character that is sorted after 'z'. There aren't many choices, one of them is Unicode 446 (hexadecimal 1BE), which isn't used in any current alphabet (it's an obsolete symbol of the International Phonetic Alphabet).

The method isn't perfect, in that it can return unwanted matches. For example, the following criteria can be used to find the name "Zaja" ('Z' having an inverted circumflex). The result is similar to an accent insensitive "Like 'zaja*'".

Between 'zaja' And 'zajaZ'

Due to the sorting rules, the name will be returned whether the 'Z' has the circumflex or not. However, it would also select the name "Zajac", and possibly others. It is possible to exclude the false positives by checking the length

Between 'zaja' And 'zajaZ' And Like '????'

This isn't as bad as it looks. Jet will normally first use the index to select possible matches (of different lengths) and then eliminate those that are too long; it will not start by selecting all four-letter names. Other solutions exist to check for the length of a string, but they involve an additional Visual Basic function call. So this is the first viable solution.

To search for the name "abc" in the field "name", use the following criteria:

Name Between 'abc' And 'abcZ' And Name Like '???'

This can be constructed through code (strSearch is the search string):

    strCriteria _
                              = "(Name Between '" & strSearch & "' And '" & strSearch & "Z'" _
                              & " And Name Like '" & String(Len(strSearch), "?") & "')"

Open in new window

Again, the upper case 'Z' can be replaced by ChrW(446) for additional robustness.

Using the Like operator

A more natural approach would be to construct an appropriate search template for the "Like" operator. Note that other database engines include a full regular expression (RegEx) search, implemented in such a way that is uses the index when possible. In Access, the closed feature -- using available indexes -- is "Like".

To find the name "Nunez", using either the American spelling or the original Spanish spelling "Núñez", the following criteria can be used:

Like 'n[uú][nñ]ez'

Generalising the idea, each letter in a search string can be replaced by the set of all equivalent letters: the base letter and all accented variants. Basically, this is simply a series of calls to Replace(), for example replacing every 'a' with "[aàáâãäå]".

I would be tempted to recommend that method only. However, it has one important drawback: the "Like" operator doesn't understand ligatures. The result is rather surprising. Using the German sort order, the expression " 'ü' = 'ue' " is true. The two variants are actually equivalent, just as " 'U' = 'u' ". Using the default "General" sort order, we still have " 'æ' = 'ae' " and " 'ß' = 'ss' ", for example. The variants are equal, but not like one another. Look at the expression below:

'Weiß' = 'weiss'
'Weiß' Like 'weiss'

The first returns True, and the second False! There are some technical reasons why the "Like" operator doesn't perform ligature expansion, but the result is troubling. And there is no way to build ligature expansion back into the expression. To do so requires a syntax closer to regular expressions, allowing to specify alternative, e.g. "wei(ß|ss)"

When searching simply for "weiss", "Weiß" is returned. When searching for "weiss*", it isn't, and neither is "Weißmüller". If that is acceptable, we can continue building the second solution.

Visual Basic and Unicode

The Visual Basic functions to convert between Windows ANSI codes and characters are Chr() and Asc() -- named after ASCII for historical reasons. In order to support Unicode, the functions ChrW() and AscW() have been added ('W' for "wide"). For example Asc("<euro symbol>") returns 128 -- the ANSI encoding -- but AscW() of the same symbol returns 8364 -- its position in Unicode. (Since that symbol isn't part of ISO-8859-1, it can't be reproduced on this page.)

Note: Comparisons performed in a module behave differently depending on the "Option Compare" setting. Similarly, InStr() has an argument to specify how the comparison is performed. This topic isn't covered in the present article, because a similar option doesn't exist for Jet. The code below doesn't rely on that option either. If in doubt, let's assume that "Option Compare Database" is active -- the Jet compatibility mode.

ANSI characters can be used directly in the source code, but non-ANSI characters need to be entered using the ChrW() function, or read from another source like a table or an external text file.

Why is this important? When looking for all variants of the letter 'o', the following pattern for Like can be used: "[oòóôõöø]". If all needed variants exist in ANSI (in Windows-1252), this search pattern is complete and can easily be manipulated through code. However, most variation of the letter 'c', for example, are not included. The search pattern will be much less readable:

"[cç" & ChrW(263) & ChrW(265) & ChrW(267) & ChrW(269) & "]"

This lists variants of 'c' (with acute accent, with inverted circumflex, etc.) from the Unicode Latin extensions. Similar lists are needed for almost all letters of the alphabet.

One way to obtain them would be to document all Unicode Latin characters, which can become quite tedious. It is better to use what Visual Basic already "knows" about these characters. The function Asc() returns the byte encoding for all ANSI characters, but also for Unicode characters. In most cases, the code 63 (for '?') is returned, but when a similar character exists, it is substituted. For example, the convoluted expression Chr(Asc(ChrW(263))) returns 'c'. Character 263 is 'c' with acute accent. But Asc() returns 99, the code for 'c'. In other words, we can build a list of character variants above 255 automatically, by capturing the substitutions performed by Asc() -- formally Unicode to ANSI substitutions.

In practice, only very few characters absent from Windows-1252 are likely to appear in the list of European names: 'c' acute, 'c', 's', and 'z' with inverted circumflex, and 'd'-stroke, for example. Knowing the list of used special characters, it becomes tempting to handle only those, but that would be a poor choice. A robust function should handle the entire Latin Extended-A block, hexadecimal 0100--017F, although the next block, Latin Extended-B, range 0180--024F, can be omitted for most databases.

Note: Although Visual Basic can strip the inverted circumflex from the letter 'c', it can't remove the cedilla, or any accent found in the ANSI code page. This is the part we will need to document and program from scratch.

Building the Second Solution

We are going to build two features: one to remove accents from any characters in our selected range, hexadecimal 0080-017F, and one to expand a search to include all accented variants of a letter.

When listed in a single string, the characters in the ANSI code page look (somewhat) like this:

Const WIN_1252 _
                          = "????????????????????????????????" _
                          & " !""#$%&'()*+,-./0123456789:;<=>?" _
                          & "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_" _
                          & "`abcdefghijklmnopqrstuvwxyz{|}~?" _
                          & "? ??????????? ?  ???????????? ??" _
                          & " ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿" _
                          & "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß" _
                          & "àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"

Open in new window

The first 32 characters are control codes (e.g. tab and carriage return). The second range with question marks (on line 6) is where Windows-1252 and ISO-8859-1 disagree; the characters are visible in Visual Basic, but can't be reproduced here. However, we can't just neglect the range, because it contains the 's' and 'z' with inverted circumflex, which we need to handle. The same constant with unaccented characters (letters only) is:

Public Const UNACCENT _
                          = "                                " _
                          & "                                " _
                          & " ABCDEFGHIJKLMNOPQRSTUVWXYZ     " _
                          & " abcdefghijklmnopqrstuvwxyz     " _
                          & "          S ? Z           s ? zY" _
                          & "                                " _
                          & "AAAAAA?CEEEEIIIIENOOOOO OUUUUYT?" _
                          & "aaaaaa?ceeeeiiiienooooo ouuuuyty"

Open in new window

This time, question marks replace ligatures, and all non-letter characters have been replaced by a space. Stripping an accent from a letter is thus a simple call to Mid():

c = Asc(Mid(UNACCENT, Asc(c) + 1, 1))

As explained above, Asc will handle non-ANSI Unicode characters, and the constant is used to substitute all remaining accented characters with the un-accented character. For example, the Unicode number of the lower case 'z' with inverted circumflex is 382, but Asc() will return the ANSI code 152. At position 152(+1), we find 'z'.

This is all we need to create also the list of accented characters. Each accented character can be listed along with the unaccented base letter.

The following function collects them

Const LowerA = 97
                      Const LowerZ = 122
                       
                      Dim Accents(LowerA To LowerZ) As String
                       
                      Sub InitAccents()
                          
                          Dim i As Integer, c As Integer
                          
                          If Len(Accents(LowerA)) Then Exit Sub
                          
                          ' for all Unicode Latin-1 Supplement and Extended-A code points:
                          For i = &H80 To &H17F
                              ' strip accent:
                              c = Asc(Mid(UNACCENT, Asc(ChrW(i)) + 1, 1))
                              If LowerA <= c And c <= LowerZ Then
                                  If Len(Accents(c)) = 0 Then Accents(c) = Chr(c)
                                  Accents(c) = Accents(c) & ChrW(i)
                              End If
                          Next i
                          
                      End Sub

Open in new window

The function should run only once. The second time, variants for the letter 'a' already exist, and the function exits (line 10). If not, all Unicode characters in the selected blocks are parsed, and converted to ASCII as explained above. If the result is a lower case character, the accented character is added to the list of variants. Note that we only need the lists for either upper or lower case, not for both.

For example, the list for the letter 'a' is 'aàáâãäå×××' (Extended-A variants shown as '×'), and the list for 'b' is empty (there are no variants).

When the user types a search string, all letters that have variants can be replaced by a set of these characters. The criteria is returned by this function:

Function AccentCards(ByVal psName As String)
                          
                          Dim i As Integer
                          
                          InitAccents
                          For i = LowerA To LowerZ
                              If Len(Accents(i)) Then
                                  psName = Replace(psName, Chr(i), "[" & Accents(i) & "]")
                              End If
                          Next i
                          psName = Replace(psName, "'", "''")
                          AccentCards = "'" & psName & "'"
                          
                      End Function

Open in new window

The name of the function is inspired from "accent wildcards", like writing a tiny star above every letter to signify "any accent here, or none".

The returned values from this function aren't very readable when shown here, containing many '×' place-holders. For the sake of illustration, let's limit the lists of accented characters to those available on this page (included in ISO-8859-1, or in Latin-1 Supplement) -- with much less variants.

When searching for "Diana Nunez", the following criteria can be built:

Surname Like '[nñ][uùúûü][nñ][eèéêë]z' and GivenName Like '[dð][iìíîï][aàáâãäå][nñ][aàáâãäå]'

It will return the record even if the surname is spellt "Núñez". So we have achieved accent and case insensitive search, and the good news is that the index on the names is still being used, even when the pattern starts with a set.

As said before, using "Like" removes smart handling of ligatures. On the other hand, the function integrates nicely with actual search strings entered by the user. If "n*ez" is entered, the resulting string '[nñ]*[eèéêë]z' is valid for the "Like" operator. There is no way to express that search string using the first solution alone.

The text file below is a module containing the code for the "second solution", with more comments. It isn't presented in a snippet because of the ANSI encoding. Likewise, the functions cannot be tested accurately in the immediate pane, but should be examined from Jet or the Access interface, which both support Unicode.
accents-module.txt
Appropriate test data can be found in the companion article Searching for Names (in preparation).

Conclusion

When Unicode support was added to Access, I was thrilled. I had worked on an Access 2.0 project involving multiple alphabets, and the ability to mix alphabets in a single field would of course have been the perfect solution. Instead, several columns had to be created, each displayed with a different font. (It was a simple list of guest to a conference in an International Organisation, and the language support was limited to the official United Nations languages.)

What I hadn't realise then was what Unicode support really means, or even simple alphabet support. Access only implements a "database sort order", but without exposing the mechanism; Visual basic only implements ANSI substitution for backward compatibility and for it's own internal source code rendering. This falls short.

I stopped saying that Access supports Unicode. It can store Unicode text, yes, but too much of the support of any language related feature is in fact left to the developer. This is one case where trying to keep things simple for beginners makes everything more complicated in the long run.

I hope the ideas presented here will save you some time if you need to implement accent insensitive searches, or at least that the Latin alphabet and its encoding have become a little less mysterious.

Markus G Fischer
(°v°)

about unrestricted access

Comments (4)

DanRollins

CERTIFIED EXPERT

Author of the Year 2009

Commented: 2010-01-01

Very thorough with some interesting and useful programming techniques. Got my Yes vote!

Bardobrave

Commented: 2010-04-22

Accurate and interesting

Dennis Johnson

IT technician

Commented: 2010-11-17

I was looking for something like this.
I'll try to implement it in my project for the Greek characters (Windows-1253).

Markus Fischer

Author

Commented: 2011-07-19

Addendum 2011-07-19

The suggested syntax

Between 'z' And 'zZ'

relies on the fact that Z is the last letter of the alphabet. As I recently found out, this is not the case for all collating orders. In Estonian, the last letter is the Y, in Icelandic the Ø, and in Norwegian and Danish the Å... This makes the suggestion to use ChrW(446) instead of Z even more meaningful.

Many thanks to zorvek and krishnakrkc for a very interesting discussion about this topic!

(°v°)