• Status: Solved

Handling Microsoft characters in textareas

Hi All,

I have people copying Microsoft Office documents into textarea fields on a web page.

I am hoping there is a good way to strip out those characters and/or substitute them, because when I redisplay the textarea fields they look junky.

Does anybody have a piece of code to do this in Java?  I think server side would be a better place to handle this than JavaScript, but I am open to all ideas.

I would like to put this into the next release of my site, so I could really use the help.

Thank you!!!

JohnE
0
 
objectsCommented:
Use a custom Document with your text area that uses the String replaceAll() method to strip out any unwanted characters.
0
 
aozarovCommented:
Not sure what you mean by junky.
I assume the paste from MS Word into the HTML text area looks fine?

You can try displaying the value back inside <pre> tags (to preserve \n, \t and other whitespace):
<pre>
Your document
</pre>
0
 
objectsCommented:
Didn't notice it was a web page. In that case just use replaceAll() in the server-side code to strip out everything you don't need.
If you are displaying the value in HTML (as opposed to in the field) then you will also want to replace newlines with <p>'s for that purpose.
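
A minimal sketch of that server-side approach, assuming the submitted text arrives as an ordinary Java String. The class and method names (TextAreaFilter, stripControls, toHtml) are made up for illustration. The choice of which characters count as "unwanted" is also an assumption: here only control characters other than tab and line breaks are dropped. The second method emits <br> for each line break; emitting <p> instead is a one-line change.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; the names are illustrative, not from any library. */
final class TextAreaFilter {
    // Assumption: "unwanted" means C0/C1 control characters other than
    // tab (0x09), line feed (0x0A) and carriage return (0x0D).
    private static final Pattern CONTROLS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F-\\x9F]");

    /** Removes the control characters matched by CONTROLS. */
    static String stripControls(String input) {
        return CONTROLS.matcher(input).replaceAll("");
    }

    /** Escapes HTML metacharacters, then turns each line break into <br>. */
    static String toHtml(String input) {
        String escaped = input.replace("&", "&amp;")
                              .replace("<", "&lt;")
                              .replace(">", "&gt;");
        // \r\n is listed first so a Windows line ending yields a single <br>.
        return escaped.replaceAll("\\r\\n|\\r|\\n", "<br>");
    }
}
```

Typical use would be to call stripControls() once when the form is submitted and toHtml() each time the stored value is written into a page outside the textarea.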
0
 
aozarovCommented:
>> newlines with <p>
I think <b> will serve this purpose better though as I said before
I recommend the <pre> as it preserves the whole structure (spaces, \t, ...)
0
 
aozarovCommented:
I meant <br> instead of <p> (not <b>, which makes text bold :-)
0
 
objectsCommented:
> I think <b> will serve this purpose better though as I said before

for what reason?
0
 
aozarovCommented:
>> <BR> vs <P>
Two reasons:
Though both will cause the browser to move to the next line:
1. The space between the lines will be different (<p> will have a bigger space)
2. <p> represents a new paragraph whereas <br> represents a new line (semantic)
0
 
objectsCommented:
> 1. The space between the lines will be different (<p> will have a bigger space)

the space can be whatever you want it to be

> 2. <p> represents a new paragraph whereas <br> represents a new line (semantic)

And a line break would indicate a new paragraph :)
0
 
aozarovCommented:
And of course the semantic difference will become visible when a CSS stylesheet gives each type of tag different attributes.
0
 
objectsCommented:
But you have no way of determining what the semantics are meant to be :-)
You could just as easily say all line breaks are new paragraphs, which in fact they are far more likely to be anyway.

0
 
aozarovCommented:
>> And a line break would indicate a new paragraph :)
And why is that? You can have as many line breaks as you like in one paragraph.

>> the space can be whatever you want it to be
Not sure what you mean, but try these two HTML files and maybe you will see what I mean.

a.html
<html>
      <body>
            Hello world1<p>
            Hello world2<p>
            Hello world3<p>
            Hello world4
      </body>
</html>

b.html
<html>
      <body>
            Hello world1<br>
            Hello world2<br>
            Hello world3<br>
            Hello world4
      </body>
</html>
0
 
aozarovCommented:
>> You could just as easily say all line breaks are new paragraphs,
You could, but then you lose the value of one tag.

>> which in fact they are going to be far more likely to be anyway
Maybe in the pages that you generate. I don't think you can speak for others.
0
 
objectsCommented:
Well now you're just being silly, you seem to enjoy arguing just for the sake of it :)
Sorry but I'm really not interested and have got more important things to do.
0
 
aozarovCommented:
>> Well now you're just being silly, you seem to enjoy arguing just for the sake of it :)
Actually I don't, but I think you are. (and please try to keep it clean).
0
 
objectsCommented:
I made a suggestion, you criticized it. Don't see how that can be interpreted as me arguing.
Try and stick to the question in future.
0
 
aozarovCommented:
>> I made a suggestion, you criticized it
I have every right to express my opinion, especially if I think it is relevant and in the interest of the one who asked the question.

>> Try and stick to the question in future.
Keep that advice to yourself.
0
 
objectsCommented:
Suggest you read your member agreement :)
0
 
aozarovCommented:
If you think that I have done something wrong then don't hesitate to report it.
0
 
CEHJCommented:
I would clean the string whatever markup you're going to use:



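      import java.util.BitSet;

      // Keeps tab, LF, CR, printable ASCII (0x20-0x7E) and the printable
      // ISO-8859-1 range (0xA0-0xFF); every other character is dropped.
      public static String isoClean(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            final int NUM_CHARS = 1 << 8;
            BitSet goodChars = new BitSet(NUM_CHARS);
            goodChars.set(0x09); // tab
            goodChars.set(0x0A); // line feed
            goodChars.set(0x0D); // carriage return
            for (int i = 0x20; i <= 0x7E; i++) {
                  goodChars.set(i);
            }
            for (int i = 0xA0; i <= 0xFF; i++) {
                  goodChars.set(i);
            }
            for (int i = 0; i < s.length(); i++) {
                  char c = s.charAt(i);
                  // BitSet.get() returns false for any index above 0xFF
                  if (goodChars.get(c)) {
                        sb.append(c);
                  }
            }
            // Return the cleaned text even when it is empty, so input made up
            // entirely of bad characters is not handed back unchanged.
            return sb.toString();
      }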
0
 
objectsCommented:
As I suggested earlier :)
0
 
johnikeAuthor Commented:
Wow thank you for all the responses.  I was not expecting this much help, but appreciate it.

Maybe I could have been more specific.   People are cutting MS Word text, including things like bullets, and pasting it into my text area.  I don't do any translation/removal in my web page, but likely should.

The result when I display the text area back is like this:

· State-of-the-art software
· Access to all exchanges
· Remote trading available
· Structured clearing agreements for computer program traders & early stage hedge funds
· Black box technology available

There are other characters than those above that get "junky".

It would be better if something coherent were put in their place, or else nothing.   I am thinking that since it is a paste, they won't be able to see any replacement until after they hit the submit button, so I was not sure if this was best done in JavaScript or back on the server side.  I was thinking server side.

I am looking at the code you gave and am wondering if it will work.
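
For putting "something coherent" in place of the Word characters rather than dropping them, a small lookup table is one option. This is only a sketch: the class name WordCharacters and the choice of plain-text replacements are assumptions, and the table covers just the common "smart" punctuation. The U+F0B7 entry is also an assumption: it is the private-use code point that a Symbol-font bullet may arrive as, and it should be checked against real pasted data.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper mapping common Word punctuation to plain ASCII. */
final class WordCharacters {
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u2018', "'");    // left single quotation mark
        REPLACEMENTS.put('\u2019', "'");    // right single quotation mark
        REPLACEMENTS.put('\u201C', "\"");   // left double quotation mark
        REPLACEMENTS.put('\u201D', "\"");   // right double quotation mark
        REPLACEMENTS.put('\u2013', "-");    // en dash
        REPLACEMENTS.put('\u2014', "--");   // em dash
        REPLACEMENTS.put('\u2026', "...");  // horizontal ellipsis
        REPLACEMENTS.put('\u2022', "*");    // bullet
        REPLACEMENTS.put('\u00B7', "*");    // middle dot
        REPLACEMENTS.put('\uF0B7', "*");    // assumption: Symbol-font bullet
        REPLACEMENTS.put('\u00A0', " ");    // non-breaking space
    }

    /** Replaces mapped characters and leaves everything else untouched. */
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Running clean() before a stricter filter such as isoClean() means bullets and quotes survive as readable ASCII instead of disappearing.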

0
 
johnikeAuthor Commented:
P.S.:  I am not having a problem with line breaks as that is already solved by using css, <pre></pre> tags, and a few other minor things.
0
 
CEHJCommented:
·      Let’s
·      Give
·      It a try here
0
 
CEHJCommented:
LOL. I tried to paste bullets into the text from Word - somehow they were converted to periods
0
 
johnikeAuthor Commented:
Likely because Experts Exchange already handles this problem that I face ;+)
0
 
johnikeAuthor Commented:
They do it server side which is what I need to do as well.
0
 
objectsCommented:
Not sure about JavaScript, perhaps ask in the JavaScript TA.
Server side it shouldn't be any problem; there are various methods to strip out the unwanted characters.
0
 
CEHJCommented:
>> Likely because experts exchange already handles this problem that I face ;+)

Actually it looked like it had already been converted as I pasted it in, and the same happened when I pasted into a text editor.
0
 
johnikeAuthor Commented:
Thank you CEHJ.

I just tried the same and you are right, it converted the bullets OK even on my site.

The odd thing is I am a developer on http://www.jobbank.com and there are situations where people are posting jobs based on copying them from Word and they are getting the odd characters I showed above.

I am hoping to find out from one of them how they got it to happen.  They had said it was from pasting from Word.  I am hoping I can get their Word document.

0
 
objectsCommented:
It could be lots of things, depending on the character encodings involved etc.
Perhaps try filtering out all non-ASCII characters; replaceAll() would make that an easy process.
Though the special characters may have relevance in the pasted text, which would then be lost.
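
To illustrate both points, here is a hedged sketch. The first method shows one possible way such junk can arise: text encoded as UTF-8 but decoded as windows-1252. Whether that is what happens on the site in question is not known. The second method is the blunt non-ASCII filter built on replaceAll(). The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of an encoding mismatch and an ASCII-only filter. */
final class EncodingDemo {
    /** Encodes as UTF-8, then decodes with the wrong charset on purpose. */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: the receiving side wrongly treats the bytes as windows-1252.
        return new String(utf8, Charset.forName("windows-1252"));
    }

    /** Drops every character outside the 7-bit ASCII range. */
    static String asciiOnly(String text) {
        return text.replaceAll("[^\\x00-\\x7F]", "");
    }
}
```

With this sketch, a single right single quotation mark (U+2019) passed through misdecode() comes back as three characters. Note also that asciiOnly() deletes accented letters along with the bullets, which is the loss of relevant characters mentioned above.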
0
 
objectsCommented:
eg. different bullets may be being used:

§      User Story
§      Developer Velocity
§      View Track Record


0
 
aozarovCommented:
What about having a warning on the preview page suggesting that if weird-looking characters appear, they should be replaced or removed before posting the content again?
That way you avoid messing with the user's content and its formatting yourself.
0
 
CEHJCommented:
It would be better to warn them to paste the content into Notepad first, before pasting it into your window. Then they won't mess with your app at all.
0
 
CEHJCommented:
:-)
0
