Solved

Strange character mapping when running in a cmd shell

Posted on 2004-03-25
Last Modified: 2013-12-03
I recently ran into a strange problem.

We had sent an example of a command line invocation of our software to a customer, who cut and pasted it into a cmd shell (Windows XP). Sadly, it didn't work. The command line looked pretty normal, something like

prog.exe -opt1 -opt2 arg3 -opt3 arg3 etc.

The problem turned out to be that somewhere along the line between us and the customer, one of the dashes got converted from a regular ascii dash (0x2D) to an extended ascii dash (0x96). When the command line was pasted into the cmd shell, what the program got in the argv string was 8211 (0x2013). This completely messed up the string-handling functions and nothing worked of course.

The program is unicode, the main function declaration looks like

int wmain(int argc, wchar_t* argv[])

All string manipulations are done using the unicode versions of the functions.

So the question is:

Why/how did the extended dash character get converted from 0x96 to 0x2013, instead of 0x0096 like the regular ascii characters?

How do I map back from 0x2013 to 0x002D (or even 0x0096)? Obviously, I can run every extended ascii character through the command line, see what winds up in the strings, and build a big case statement to convert back to ascii, but I'm looking for something more algorithmic in nature.

OK, that's 2 questions. :)

Thanks!
Question by:wayside
 
LVL 44

Assisted Solution

by:Karl Heinz Kremer
Karl Heinz Kremer earned 250 total points
ID: 10679694
How did you send the command line to your customer? Email? Word document? Any other format? Is it possible that the dash got converted during the transmission of the data? Is it possible that your customer accidentally erased one character from the command line and tried to retype it?
 
LVL 86

Accepted Solution

by:
jkr earned 250 total points
ID: 10679856
What code pages were in use on your computer and on the client's?
 
LVL 14

Author Comment

by:wayside
ID: 10680200
> How did you send the command line to your customer?

It was emailed. I don't care so much why the original 0x2D dash got converted to a 0x96 dash as I am about how to handle it in my program if it happens again. I'm sure some Microsoft product "helped" along the way by changing the formatting.

The customer cut and pasted it into a .bat file, and if I look at that file in a hex editor I can see the 0x96.

The solution btw was to have the customer retype in the command from scratch and not cut-n-paste it. It took a field trip to the customer site to figure this out though, because we couldn't reproduce it in our office.

> What codepages were being used on yours and the client's computer?

We are both located in Massachusetts, so I would assume both systems are set to English (United States). Mine is, and I can now reproduce this on my XP Pro machine by forcibly inserting a 0x96 into the command line. I don't know how to tell which exact code page is being used. The code does nothing to set or change it as far as I know, so it should be using whatever is the default for the machine.

My understanding of code pages (which is likely incorrect) is that they are used to figure out what glyph to show for a given character code. I would have expected my program to receive the same character code regardless of what glyph is being used to display it, but apparently that's wrong. :)

Just for kicks I forcibly inserted a bunch of other characters in the range of 0x82 - 0xB8 (I tried to pick ones that had obvious equivalents from the normal ascii range) and got an interesting variety of stuff in my program.
 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10683593
Here is my take on this: I would not change the way the parameters are processed. It looks like a real dash is correctly recognized. When you type a dash on a keyboard, it also gets correctly recognized as a dash. It's very unlikely that you will run into this problem if you don't use buggy software to transfer the command line.

Just keep in mind that the mailer may change the way your data is encoded. One thing you can do is to make sure that you send emails containing any command line parameters as ASCII, and not as HTML encoded mail. Most (all?) mailers have a mechanism to convert an email to ASCII before it's sent.
 
LVL 14

Author Comment

by:wayside
ID: 10686910
> > What codepages were being used on yours and the client's computer?

After some further investigation into how code pages work, this indeed is the problem. (It took me a while to track down the actual code pages, which was a pain; you'd think they would be easy to find. They are at http://www.microsoft.com/globaldev/reference/wincp.mspx , but I digress...) Code page 1252 maps certain characters in the range 0x80-0x9F to characters higher up in the Unicode table. Two of these are the other dash characters: 0x96 -> 0x2013 and 0x97 -> 0x2014. Other characters in that range could also be substituted without anyone noticing, like "left double quotation mark" or "small tilde".
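To make the mapping concrete, here is a minimal, portable sketch of what a code page 1252 to Unicode conversion does for that range. The table values are the published Windows-1252 assignments; the names kCp1252High and decode_cp1252 are made up for illustration, and passing undefined bytes through unchanged is an assumption about how the conversion treats them, not something I verified in the shell or the C runtime.

```cpp
// Unicode code points for Windows-1252 bytes 0x80..0x9F.
// A 0 entry marks a byte that code page 1252 leaves undefined.
static const char16_t kCp1252High[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80-0x87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,      // 0x88-0x8F
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90-0x97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178  // 0x98-0x9F
};

// Decode a single Windows-1252 byte into a UTF-16 code unit.
// Assumption: undefined bytes are passed through with the same numeric value.
char16_t decode_cp1252(unsigned char byte) {
    if (byte >= 0x80 && byte <= 0x9F) {
        char16_t mapped = kCp1252High[byte - 0x80];
        return mapped != 0 ? mapped : static_cast<char16_t>(byte);
    }
    // 0x00-0x7F and 0xA0-0xFF have the same value in Unicode.
    return static_cast<char16_t>(byte);
}
```

With this table, decode_cp1252(0x2D) stays 0x002D, while decode_cp1252(0x96) yields 0x2013, which is exactly the difference seen in argv.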

I don't know where exactly the switch happened, in the shell or in the C runtime. I'm guessing in the C runtime, because printing the unicode characters out did not work correctly.

> It's very unlikely that you will run into this problem if you don't use buggy software to transfer the command line.

Problem is, programs like Microsoft Word (and maybe Outlook, who knows) will change certain characters depending on how they decide things need to be formatted, and apparently changing a regular dash to a different dash (called, I now know, em dash and en dash) is not that uncommon.

And I can't control what customers will do with the data once they get it; heck, there's no way I could even control what our own customer support people do. At least there is an awareness of the issue now, though.

So I may add some code to "sanitize" my input, and convert any of the (0x96, 0x2013, 0x2014) dashes back to a regular old 0x2D dash, and maybe the same for double quotes and tildes. I don't think I'm likely to see any of the other weird characters on the command line.
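A rough sketch of what that sanitizing could look like, assuming the arguments are already wide strings as they are in wmain. The function names to_ascii_equivalent and sanitize_argument are hypothetical, and the set of characters handled is just the handful mentioned above, not a complete list.

```cpp
#include <string>

// Map a few "typographic" characters back to their plain ASCII equivalents.
// Hypothetical helper; covers only the dashes, double quotes and tilde.
wchar_t to_ascii_equivalent(wchar_t ch) {
    switch (ch) {
        case 0x0096:  // raw code page 1252 en dash byte, if it arrives unconverted
        case 0x2013:  // en dash
        case 0x2014:  // em dash
            return L'-';
        case 0x201C:  // left double quotation mark
        case 0x201D:  // right double quotation mark
            return L'"';
        case 0x02DC:  // small tilde
            return L'~';
        default:
            return ch;  // everything else is left untouched
    }
}

// Return a copy of one command line argument with the characters above replaced.
std::wstring sanitize_argument(const std::wstring& arg) {
    std::wstring out = arg;
    for (wchar_t& ch : out) {
        ch = to_ascii_equivalent(ch);
    }
    return out;
}
```

The idea would be to run each argv[i] through sanitize_argument at the top of wmain, before any option parsing. The catch is that a file name that legitimately contains an en dash would get rewritten too, so it may be safer to apply this only to the leading character of option arguments.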

 
LVL 44

Expert Comment

by:Karl Heinz Kremer
ID: 10687173
MS Word _CAN_ convert the sequence -- to an em dash (you configure this under Tools > Auto Correction), but it does not by default.

My opinion still is that it's not necessary to modify the parameters read from the command line so that you can fix this rather unlikely problem in the future. I have never seen this happen (and I'm also using Outlook and Word, and I'm also mailing code segments and program parameters).

I would however try to recreate the problem so that it can be avoided in the future (e.g. by only using ASCII mail if an email contains information that needs to be pasted into a cmd tool).
 
LVL 14

Author Comment

by:wayside
ID: 10687378
> I would however try to recreate the problem so that it can be avoided in the future

Unfortunately this problem cropped up over two months ago; it just took until now for it to bubble up to me. At this point no one can remember how they constructed various emails or what they did with them when they got them.

At this point I'm inclined to either do nothing, or process only the en dash and em dash; the rest seem unlikely to ever be used on a command line.

Thanks for your comments.
 
LVL 86

Expert Comment

by:jkr
ID: 10689240
>>I don't know where exactly the switch happened, in the shell or in the C runtime

You can always adjust the locale of the C runtime to use the appropriate codepage, e.g.

// Set code page to English ANSI default
setlocale ( LC_ALL, "English.ACP");

 
LVL 14

Author Comment

by:wayside
ID: 10689431
> You can always adjust the locale of the C runtime to use the appropriate codepage

I would think that by the earliest time I can make a call to setlocale, say the first line of wmain(), the command line arguments have already been mapped using whatever code page is in effect.

In the case of en-dash and em-dash, all of the code pages I checked either don't map it, or map them to the same unicode values. So if I want to do something to handle this I don't think I need to worry about the locale.

Thanks for all the comments, points all around.
 
LVL 86

Expert Comment

by:jkr
ID: 10690055
>>I would think that by the earliest time I can make a call to setlocale

You can compile the settings into your program using

#pragma setlocale("...")