Converting Unicode punctuation to ASCII
Posted on 2008-06-19
I am reading in some text from a file. The text was saved as tab-delimited text from Excel, but may originally have been copied into Excel from Word. As a result (I assume), the text contains characters such as the left single quotation mark (decimal 8216, U+2018) rather than the ASCII single quote. Another example is Unicode 8230 (U+2026), the ellipsis (...) character, which Excel seems to insert for some reason. I would like to convert any such characters into their ASCII equivalents. I understand that not all Unicode characters have an ASCII equivalent, and those I will simply filter out, but does anyone know of a definitive list or table of which characters I can convert? Java code would be even better, of course. Thanks in advance.
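To make the question concrete, here is a minimal sketch of the kind of conversion I have in mind. It is not a definitive table: the class name `AsciiPunctuation`, the method `toAscii`, and the handful of mappings are my own assumptions, chosen by hand from the characters Word commonly produces. Anything that is neither in the map nor plain ASCII is simply dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: name and mapping table are illustrative assumptions,
// not a complete or authoritative list of convertible characters.
class AsciiPunctuation {

    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u2018', "'");   // left single quotation mark (8216)
        REPLACEMENTS.put('\u2019', "'");   // right single quotation mark
        REPLACEMENTS.put('\u201C', "\"");  // left double quotation mark
        REPLACEMENTS.put('\u201D', "\"");  // right double quotation mark
        REPLACEMENTS.put('\u2013', "-");   // en dash
        REPLACEMENTS.put('\u2014', "-");   // em dash
        REPLACEMENTS.put('\u2026', "..."); // horizontal ellipsis (8230)
        REPLACEMENTS.put('\u00A0', " ");   // no-break space
    }

    // Replaces mapped characters, keeps plain ASCII (including tabs and
    // newlines), and filters out everything else.
    static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = REPLACEMENTS.get(c);
            if (mapped != null) {
                out.append(mapped);
            } else if (c < 128) {
                out.append(c);
            }
            // Any other non-ASCII character is dropped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: 'quoted' and so on...
        System.out.println(toAscii("\u2018quoted\u2019 and so on\u2026"));
    }
}
```

Because tabs pass through untouched, the tab-delimited structure of the file should survive. What I am missing is a fuller, trustworthy set of entries for the map.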