Link to home
Start Free TrialLog in
Avatar of Robert Berke
Robert BerkeFlag for United States of America

asked on

turn non-breaking hyphens, smart quotes etc into equivalent ascii characters


My routine saves a document to a windows file system.  The name of file comes from the cells(1,1)

I use the attached function to clean cell(1,1) by dropping colons, slashes and other invalid characters.  

sometimes the cell contains smartquotes  like a non-breaking hyphen. My routine is dropping those characters, when I would prefer to treat them as their ascii equivalent of '-', single quote etc

I know how to do it with a series of vba lines like

str = substitute(str, chr$(150),"-")
str = substitute(str, chr$(151),"-")
etc repeat for non-breaking space, left leaning single quote, right leaning single quote etc

But, I was wondering if there is a way of doing the same thing in regular expressions??

Function func(cell As Range)
  ' set reference to microsoft vbscript regular expressions 5.5
 
   
    Dim reg As New RegExp
    Dim test As String
    Dim matches As MatchCollection
    Dim m As Match
   
    test = cell.Value
   
    With reg
        .Pattern = "[^\w~!@#$%^&()_+{}`\-=[\];',.]+"
        .Global = True
        Dim ans As String
        func = reg.Replace(test, " ")
    End With

End Function
Avatar of Anil
Anil
Flag of United Kingdom of Great Britain and Northern Ireland image

I tend to use a formula in my spreadsheet.

=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,"/","-"),"\","-"),":","-"),"*","-"),"?","-"),"<","-"),">","-"),CHAR(34),"-")

This takes care of all forbidden filename characters in windows.

This works in Exce l2007 and not in 2003 because of the 7 leve lnested limit.

Hope this helps.

A.
Why not use this?

.Pattern = "[^A-Za-z0-9 ]"

Sid
Something like this?

Function func(cell As Range)
    Dim reg As New RegExp
    Dim test As String
    Dim matches As MatchCollection
    Dim m As Match
   
    test = cell.Value
   
    With reg
        .Pattern = "[^A-Za-z0-9 ]"
        .Global = True
        func = reg.Replace(test, "-")
    End With
End Function

Open in new window


So for example if the cell value is

dadfd#$@%VFSSFDGF%$^%$^

then the output will be

dadfd----VFSSFDGF------

Is this what you want?

Sid
@SiddharthRout

The \w in the OP pattern covers A-Za-z0-9.

You can certainly use regex to do the replace, but I think if you're planning on using VBA for the replacements, you would be just as well off to use the Replace function. For example:
Function func(cell As Range)
    cell.Value = Replace(cell.Value, chr$(150), "-")
    cell.Value = Replace(cell.Value, chr$(151), "'")    
End Function

Open in new window

Yes you are right so I can use this

.Pattern = "[\W ]"

Which give me the same result as earlier...

Quick question.. Why are you replacing chr$(151) with a single quote?

Sid
@kaufmed:

Also regex is much better than using say using 50 replace()....

I believe what I could understand from the user's query is he just want to keep A-Z,a-z,Space,0-9 and remove rest of the characters...

And if I am correct then he can use regex rather than replace...

Function func(cell As Range)
    Dim reg As New RegExp
    Dim test As String
    Dim matches As MatchCollection
    Dim m As Match
   
    test = cell.Value
   
    With reg
        .Pattern ="[\W ]"
        .Global = True
        func = reg.Replace(test, "-")
    End With
End Function

Open in new window

@SiddharthRout

Sorry, I didn't format very well. The top sentence was for you. The bottom paragraph was meant for the author. As far as the chr$(151), I was only intending on demonstrating code structure--I admin that the replacement itself could be nonsensical. Since the author said he was replacing non-breaking space, left leaning single quote, etc. I wanted to show that you could use a line for each. However, looking back now, I see that the "substitute" call is doing the same thing. I think if the intent is to have one call to perform similar replacements, then something like the following would be suitable (with correct values!):
Function func(cell As Range)
  ' set reference to microsoft vbscript regular expressions 5.5 
   
    Dim reg As New RegExp
    Dim test As String
   
    test = cell.Value
   
    With reg
        Dim ans As String

        .Global = True

        .Pattern = "[" & Chr$(150) & Chr$(151) & ]"
        test = .Replace(test, "-")

        .Pattern = "[" & Chr$(147) & Chr$(148) & ]"
        test = .Replace(test, """")

        .Pattern = "[" & Chr$(145) & Chr$(146) & ]"
        test = .Replace(test, "'")
    End With

    cell.Value = test
End Function

Open in new window

@kaufmed: I still feel that the user wants to replace 'All' special Characters. I could be wrong though...

Sid
Avatar of Robert Berke

ASKER

Sorry I was unclear.

#1 There are 23 special characters that are allowed in a windows file name, and those characters are being retained.

#2 With a few exceptions all other special characters are invalid and should be replaced with blank.  (Or, more precisely, 1 or more invalids are accumlated and replaced with a single blank.)

#3 The exceptions are these special characters which should be replaced with their "Ascii Equivalent". This is the REAL subject of my question.


            Gets
             Replaced                                      Gets
Code    with           Character                    replace with
30      chr(45)    nonbreak hyphen    real hyphen
31      chr(45)    optional hyphen      real hyphen
96      chr(39)    Left Apostrophe      single quote
145    chr(39)    Left Quote             single quote
146    chr(39)    Right Quote           single quote
147    chr(34)    Left Double Quote  double quote  (this is for future use, it will actually be discarded by the current expression)
148    chr(34)    Right Double Quote  double quote (this is for future use, it will actually be discarded by the current expression)

150    chr(45)    long hyphen          real hyphen
151    chr(45)    longer hyphen       real hyphen
160    chr(32)    non break space    real space
173    chr(45)    Short hyphen        real hyphen




#1 and #2 were both accomplished with my code
 The expression is actually very simple once you realize it is just a character set
by the way                     [^\w~!@#$%^&()_+{}`\-=[\];',.]+
is equivalent to   [^a-zA-Z0-9~!@#$%^&()_+{}`\-=[\];',.]+

That code was working properly so I put it in my original post. But I now realize that it was a mistake to show all those characters, they just confuse the heck out of people.








 No, I do not want to replace ALL special characters.  
By the way, it is my guess that regular expressions do not have any good way of doing this, so I will probably just do it with code like this

for each code in array(30,31,150,151,173)
str = replace(str,chr(code),chr(45))
next
for each code in array(96,145,146,147,148)
str = replace(str,chr(code),chr(39))
next
str = replace(str,chr(160),chr(32))
See if this helps... (Untested though)

Function Func(cell As Range)
    Dim Strchr As String
    Strchr = cell.Value
    For i = 1 To Len(Strchr)
        Select Case Mid(Strchr, i, 1)
        Case Chr(30), Chr(31), Chr(150), Chr(151), Chr(173)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(45))
        Case Chr(96), Chr(145), Chr(146)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(39))
        Case Chr(147), Chr(148)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(34))
        Case Chr(160)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(45))
        End Select
    Next i
    Func = Strchr
End Function

Open in new window


Sid
Or rather this...

Function Func(cell As Range)
    Dim Strchr As String
    Strchr = cell.Value
    For i = 1 To Len(Strchr)
        Select Case Mid(Strchr, i, 1)
        Case Chr(30), Chr(31), Chr(150), Chr(151), Chr(173), Chr(160)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(45))
        Case Chr(96), Chr(145), Chr(146)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(39))
        Case Chr(147), Chr(148)
            Strchr = Replace(Strchr, Mid(Strchr, i, 1), Chr(34))
        End Select
    Next i
    Func = Strchr
End Function

Open in new window


Sid
ASKER CERTIFIED SOLUTION
Avatar of kaufmed
kaufmed
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I'm about ready to shoot myself.

Your code is nearly identical to something I tried myself.  They both turn non-breaking hyphens into spaces, not dashes.

See the attached example.  Also see https://www.experts-exchange.com/questions/26664174/excel-Code-function-is-weird-for-non-breaking-hyphen.html

rberke

P.S. line 29 should be func = test, but I'm sure everybody already knew that.

and I applaud your great simplification at line 25, It is much more readable.  (of course it will eventually need to be changed to my original monster expression with the 23 legit characters.
>>  P.S. line 29 should be func = test, but I'm sure everybody already knew that.

Does it? If cell is being passed ByRef, then you shouldn't have to return a value from func, but I haven't programmed in VBA for a while, so I could be mistaken.

>>  They both turn non-breaking hyphens into spaces, not dashes.

I didn't confirm the code values--I just went by your specification--but are you sure 30 and 31 are hyphens? I understood everything from 1 - 31 to be control characters (i.e. non-printables). Assuming the codes are incorrect, that is the reason the non-breaking hyphens get converted--they are missed by the first replace due to the incorrect 30 & 31 and the last replace picks them up as being outside of the valid character range.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
My problem came because there were more UNICODE values than I knew about.  Since Unicode has virtually unlimited weirdness, I decided to first convert the string to Ascii and then back to Unicode (which is Vba's native character set.)


It seems to work pretty well.

Finally your line 29 code executed fine, it just didn't do what I wanted.  I did not want to replace the cell's contents, I just wanted a function that returned a valid file name.
Function func(cell As Range)
  ' set reference to microsoft vbscript regular expressions 5.5
   
' func takes whatever text is in cell, and converts it to a string
' containing ONLY valid Windows File name characters.
' The routine converts smart quotes and smart double quotes to dumb quotes.
' various forms of non-breaking hyphen all become real hyphens (also know as minus sign)
' and various forms of space all become real spaces.
' Invalid characters are dropped.
' After all drops and replacments, repeating blanks get turned into a single blank.
'
          Gets
'         Replaced                      Gets
' Code    with           Character      replace with
' 30      chr(45)    nonbreak hyphen    real hyphen
' 31      chr(45)    optional hyphen    real hyphen
' 96      chr(39)    Left Apostrophe    single quote
' 145    chr(39)    Left Quote           single quote
' 146    chr(39)    Right Quote         single quote
' 147    chr(34)    Left Double Quote  double quote  (this is for future use, it will actually be discarded by the current expression)
' 148    chr(34)    Right Double Quote  double quote (this is for future use, it will actually be discarded by the current expression)
'
' 150    chr(45)    long hyphen          real hyphen
' 151    chr(45)    longer hyphen       real hyphen
' 160    chr(32)    non break space    real space
' 173    chr(45)    Short hyphen        real hyphen
'

    Dim reg As New RegExp
    Dim test As String
    ' there are some unicode characters that do not have an 8 bit Ascii equivalent
    ' I believe the following code does a fairly good job of replacing them with something reasonable.
    ' For instance 8209 is a non-breaking hyphen.  This replaces it with chr(45) which is a dash.
    test = StrConv(StrConv(cell, vbFromUnicode), vbUnicode)

   
    With reg
        .Global = True

        ' Replace "hyphens"
        .Pattern = "[" & Chr$(30) & Chr$(31) & Chr$(150) & Chr$(151) & Chr$(171) & "]"
        test = .Replace(test, "-")

        ' Replace double "quotes" with single quotes BECAUSE DOUBLE QUOTES ARE NOT VALID IN FILENAME
        .Pattern = "[" & Chr$(147) & Chr$(148) & "]"
        test = .Replace(test, "'")

        ' Replace single "quotes"
        .Pattern = "[" & Chr$(96) & Chr$(145) & Chr$(146) & "]"
        test = .Replace(test, "'")
        
        ' Replace all other invalid characters
        .Pattern = "[^a-zA-Z0-9~!@#$%^&()_+{}`\-=;',.]+"
        test = .Replace(test, " ")
    End With

    func = test
End Function

Open in new window

Kaufmed's used of regular expression instead of substitute/replace.

This is more consistent with the "flavor" of the subroutine.

SiddharthRout routine used For i = 1 to len(string). That was a poor approach which would causing a 100 character string to do 300 replaces. Using a different for loop would require 11 replaces.

For each ascValue in array(30, ...)
  replace  ...
next

The points for SiddharthRout are just a reward for effort
A few bugs fixed and a few comments added.

'
Function TextToWindowsFilename(cellOrString As Variant)
Dim cell As String
Select Case TypeName(cellOrString)
    Case "Range": cell = cellOrString.Value
    Case "String": cell = cellOrString
    Case Else: Error 0
End Select


  ' set reference to microsoft vbscript regular expressions 5.5
   
   
' see https://www.experts-exchange.com/questions/26662970/turn-non-breaking-hyphens-smart-quotes-etc-into-equivalent-ascii-characters.html
'
' func takes whatever text is in cell, and converts it to a string
' containing ONLY valid Windows File name characters.
' The routine converts smart quotes and smart double quotes to dumb quotes.
' various forms of non-breaking hyphen all become real hyphens (also know as minus sign)
' and various forms of space all become real spaces.
' Invalid characters are dropped.
' After all drops and replacments, repeating blanks get turned into a single blank.
'
 '           Gets
'              Replaced                                      Gets
' Code    with           Character                    replace with
' 30      chr(45)    nonbreak hyphen    real hyphen
' 31      chr(45)    optional hyphen      real hyphen
' 96      chr(39)    Left Apostrophe      single quote
' 145    chr(39)    Left Quote             single quote
' 146    chr(39)    Right Quote           single quote
' 147    chr(34)    Left Double Quote  double quote  (this is for future use, it will actually be discarded by the current expression)
' 148    chr(34)    Right Double Quote  double quote (this is for future use, it will actually be discarded by the current expression)
'
' 150    chr(45)    long hyphen          real hyphen
' 151    chr(45)    longer hyphen       real hyphen
' 160    chr(32)    non break space    real space
' 173    chr(45)    Short hyphen        real hyphen
'

    Dim reg As New RegExp
    Dim TempStr As String
    ' there are some unicode characters that do not have an 8 bit Ascii equivalent
    ' I believe the following code does a fairly good job of replacing them with something reasonable.
    ' For instance 8209 is a non-breaking hyphen.  This replaces it with chr(45) which is a dash.
    TempStr = StrConv(StrConv(cell, vbFromUnicode), vbUnicode)
   
    With reg
        .Global = True

        ' Replace with "hyphens"
        .Pattern = "[" & Chr$(30) & Chr$(31) & Chr$(150) & Chr$(151) & Chr$(171) & "]"
        TempStr = .Replace(TempStr, "-")
       

        ' Replace  "slanty double quotes" with "upright single quotes" BECAUSE DOUBLE QUOTES ARE NOT VALID IN FILENAME
        .Pattern = "[" & Chr$(147) & Chr$(148) & "]"
        TempStr = .Replace(TempStr, "'")  ' or, for other future TempStr = .replace(TempStr,"""")
       

        ' Replace  "slanty single quotes" with "upright single quotes"
        .Pattern = "[" & Chr$(96) & Chr$(145) & Chr$(146) & "]"
        TempStr = .Replace(TempStr, "'")
       
        Const EscapeNext = "\"
        ' Replace all other invalid characters. the valide characters below were typed in sequence for a regular keyboard
        ' first pass with shift key held down, second pass without shift key.  Invalid file name characters were omitted.
        ' The EscapeNext character was needed because - and ] have special meaning in a character class.
        '  here are the same characters without the EscapeNext.
        '         ~!@#$%^&()_+{}`-=[];',. 23 characters altogether
        '                   9 keyable special characters being dropped.
        '
        '  space is intentionally left out of the pattern which makes it "invalid".  But since the
        '  regular expression replaces any sequence of one or more invalid characters with a space,
        '  the "invalid" spaces get replaced with a single space.
        '
        Const CharactersDroppedFromWindowsFileName = "*|:" & """" & "<>?" & EscapeNext & "\/"
        Const CharactersAllowedInWindowsFileName = "~!@#$%^&()_+{}`" & EscapeNext & "-=[" & EscapeNext & "];',."
        ' .Pattern = "[^a-zA-Z0-9" & CharactersDroppedFromWindowsFileName & CharactersAllowedInWindowsFileName" &"]+"
       
        .Pattern = "[^a-zA-Z0-9" & CharactersAllowedInWindowsFileName & "]+"
        TempStr = .Replace(TempStr, " ")
       
    End With

    TextToWindowsFilename = Trim(TempStr)
End Function
crud !  Near the bottom where I listed the 23 characters being kept, I should have listed the 9 characters being dropped

         *:"<>? \/  9 keyable special characters omitted.
I have used my textToWindowsFileName routine in different situations for quite a while. I just discovered one major flaw.

When the unicode text contains Japenese and Chinese characters, it usually converts them to question marks or periods ( ? or .) although sometimes they go to #^&% etc.  

The mis-conversion occurs at the line that says    TempStr = StrConv(StrConv(cell, vbFromUnicode), vbUnicode)

I want mxwiotcc¿.¿.¿.¿.¿.¿.¿.¿.¿.¿.¿.0ftp0

to be coverted to mxwiotcc ftp0

When I get some time, I will open a new question to address this problem, and I will post a cross reference here.
Interesting,  Expert Exchange does its own conversion.  I had pasted Chinese symbol but they got converted to  .¿.¿. The string of little noses looks like an advertisement for a cold medication.
turns out StrConv(cell, vbFromUnicode)

does not convert to ascii as I had assumed, so my routine is a pretty big failure.  It leaves lots of special characters as 8 bit character does instead of the 7 bits I had wanted.  So, lots of special characters get returned instead of dropped.

Also  many accented characters get dropped instead of being converted to their unaccented equivalent.

And most weird characters are converted to question marks or   .?   or ^? when they should be dropped.

I am not willing to devote any more time to making it work better, so I am just posting this warning for future researchers.