• Status: Solved
  • Priority: Medium
  • Security: Public

Finding a non-Unicode character in an XML file

Hi,

I have an XML file containing an invalid XML character that displays as a question mark in a diamond: �.

I want to find it, but my Java code below fails to. I don't get any errors, just "done".

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.commons.io.IOUtils;
import org.w3c.dom.Document;


public class RemoveInvalidXML
{
   
    public static String removeInvalidXMLCharacters(String s)
    {
        StringBuilder out = new StringBuilder();

        int codePoint;
        int i = 0;

        while (i < s.length())
        {
            // This is the unicode code of the character.
            codePoint = s.codePointAt(i);
            if ((codePoint == 0x9) ||
                    (codePoint == 0xA) ||
                    (codePoint == 0xD) ||
                    ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
                    ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
                    ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF)))
            {
              //  out.append(Character.toChars(codePoint));
               // System.out.println("0-");
            }
            else
            {
            	out.append(Character.toChars(codePoint));
            	System.out.println("Errorss");
            }
            i += Character.charCount(codePoint);
           // System.out.println("0+");
        }
        return out.toString();
    }

   /*
    public static String removeXMLMarkups(String s)
    {
        StringBuffer out = new StringBuffer();
        char[] allCharacters = s.toCharArray();
        for (char c : allCharacters)
        {
            if ((c == '\'') || (c == '<') || (c == '>') || (c == '&') || (c == '\"'))
            {
            	//System.out.println("1");
                continue;
               
            }
            else
            {
                out.append(c);
                //System.out.println("2");
            }
        }
        return out.toString();
    }
*/
    /**
     * @param args The arguments to the main function.
     */
    public static void main(String[] args)
    {
    	try{
    	DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();

        DocumentBuilder dBuilder = builderFactory.newDocumentBuilder();
    	Document document = dBuilder.parse(new FileInputStream("C:\\my.xml"));
    	
    	TransformerFactory tf = TransformerFactory.newInstance();
    	Transformer transformer = tf.newTransformer();
    	//transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    	StringWriter writer = new StringWriter();
    	transformer.transform(new DOMSource(document), new StreamResult(writer));
    	// String output = writer.getBuffer().toString().replaceAll("\n|\r", "");
    	String output = writer.getBuffer().toString();
    	
    	/*
    	InputStream is = RemoveInvalidXML.class.getResourceAsStream("C:\\trans_RO_8208_SRU - Copy.xml");
    	String str = new String("");
		try {
			str = IOUtils.toString(is);
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
    	 */   	    	
    	
        String x = RemoveInvalidXML.removeInvalidXMLCharacters(output);
        // String y = RemoveInvalidXML.removeXMLMarkups(x);
        System.out.println(x);
        // System.out.println(y);
        System.out.println("done");
    	}catch(Exception ex)
    	{
    		System.out.println("Error in conversion");
    	}
    }
}

Asked by: jazzIIIlove
7 Solutions
 
mccarl (IT Business Systems Analyst / Software Developer) commented:
Maybe the invalid character has already been replaced by that "question mark in a diamond" character. What does this if statement give you?
            if ((codePoint == 0x9) ||
                    (codePoint == 0xA) ||
                    (codePoint == 0xD) ||
                    ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
                    ((codePoint >= 0xE000) && (codePoint <= 0xFFFC)) ||
                    ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF)))

Explanation: That diamond question mark character has a code point of 0xFFFD, so I have changed the upper bound in the if statement above from 0xFFFD to 0xFFFC. The character no longer matches the "valid" branch, so let us know if the program now prints "Errorss".
 
jazzIIIlove (Author) commented:
Thanks, I will try it. Actually, my aim is to catch all the non-Unicode characters. Is there any way to do that?
 
mrcoffee365 commented:
No. The "question mark in a diamond" is not something that Java or the XML package magically produces for you. It is a bad display of the actual character. It is good to test for it, however, and for any other characters you are not going to read.

Removing invalid characters before you try to read the XML is the right thing to do. In the code you posted, you read the incoming file as an XML document first, and only then remove characters.

Read the file with normal Java file reading, remove the invalid characters, then apply XML structure to the resulting file.
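The "read the file with normal file I/O first" step could look roughly like the sketch below. It assumes the file is supposed to be UTF-8 (the thread never states the encoding), and the class and method names are made up for illustration. It decodes the raw bytes with a strict `CharsetDecoder` and records the byte offset of every sequence that is not valid UTF-8, instead of letting the decoder silently substitute something else.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
public class MalformedByteFinder {

    /** Returns the byte offsets at which the data is not valid UTF-8 (assumed encoding). */
    public static List<Integer> findMalformedOffsets(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // report, do not replace
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);               // scratch space, contents discarded
        List<Integer> offsets = new ArrayList<>();

        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position is at the start of the bad sequence.
                offsets.add(in.position());
                in.position(in.position() + result.length());    // skip it and keep going
            } else if (result.isOverflow()) {
                out.clear();                                      // make room for more output
            } else {
                break;                                            // underflow: all input consumed
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java MalformedByteFinder <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (int offset : findMalformedOffsets(data)) {
            System.out.printf("invalid UTF-8 at byte offset %d (0x%02X)%n",
                    offset, data[offset] & 0xFF);
        }
    }
}
```

If this reports nothing, the bytes are well-formed UTF-8, and any diamond character seen in a viewer is most likely a real U+FFFD stored in the file rather than a decoding problem.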
 
jazzIIIlove (Author) commented:
My first aim is to see them; then I can remove or replace them. So first I need help listing them from a file, if any exist.
 
mrcoffee365 commented:
Then I think you don't understand the problem. The XML package will not successfully read a document with incorrect characters. So you'll have to use something else, for example file I/O, to read the file and display it somewhere.
 
jazzIIIlove (Author) commented:
Hmm, but the code did just find the error for that character. I mean, I can see the "Errorss" output. Am I missing something?
 
jazzIIIlove (Author) commented:
I used < 0xFFFD instead of <= 0xFFFD.
 
mrcoffee365 commented:
Okay, is it fixed now?
 
jazzIIIlove (Author) commented:
For that character, yes, but I want to see the other non-Unicode characters if they exist. How can I modify that 'if' to catch them as well?
 
mrcoffee365 commented:
There's no perfect way to do this. Most people end up writing a list of unwanted characters, similar to mccarl's suggestion. I've seen Junidecode proposed as a solution but have not used it myself.
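One way to write such a list, as a rough sketch rather than a definitive answer: walk the already-decoded text by code point and flag anything outside the XML 1.0 character ranges used in the question's code, plus U+FFFD itself, plus code points the running JDK considers unassigned. The class name, method names, and report format here are invented for illustration. Note that `Character.isDefined` depends on the Unicode version shipped with the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
public class SuspiciousCharScanner {

    /** Same ranges as the if statement in the question (XML 1.0 allowed characters). */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns one line per suspicious code point, with its index in the string. */
    public static List<String> scan(String s) {
        List<String> findings = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);   // an unpaired surrogate comes back as its own value
            String reason = null;
            if (!isXml10Char(cp)) {
                reason = "not allowed in XML 1.0";
            } else if (cp == 0xFFFD) {
                reason = "replacement character";
            } else if (!Character.isDefined(cp)) {
                reason = "unassigned code point";
            }
            if (reason != null) {
                findings.add(String.format("index %d: U+%04X (%s)", i, cp, reason));
            }
            i += Character.charCount(cp);
        }
        return findings;
    }

    public static void main(String[] args) {
        // Sample input built from code points so the source file stays plain ASCII.
        String sample = "a" + (char) 0xFFFD + "b" + (char) 0x1;
        for (String line : scan(sample)) {
            System.out.println(line);
        }
    }
}
```

The scan works on a `String`, so it only sees what survived decoding. Bytes that were invalid in the file will already have been turned into U+FFFD by then, which is why that code point is reported separately.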
 
jazzIIIlove (Author) commented:
Where can I find that list of codes?
 
jazzIIIlove (Author) commented:
An idiotic question: can a non-Unicode character be a Unicode character? Or can a Unicode (Latin) character be non-Unicode?
 
mrcoffee365 commented:
The definition of a Unicode character is that it has a certain numeric value. So no, a number cannot be both itself and another number.

It's possible that you're thinking of various font displays and various forms of Unicode. If there are two ways to display a character, then it's possible that one of them is not in the Unicode set.

Why would you ask a question like that? Maybe you have a better one underneath?
 
mrcoffee365 commented:
In terms of finding Unicode characters, did you try searching? There are many sites that list Unicode values, for example:
http://www.ssec.wisc.edu/~tomw/java/unicode.html
 
jazzIIIlove (Author) commented:
Hi,

Thanks, but when I check the link, the very last entry, 0xFFFD (65533), which we thought was not Unicode, actually is in the Unicode list. So, is 0xFFFD a Unicode character?

Regards.
 
mrcoffee365 commented:
If you have access to www.google.com you can search for lists of Unicode characters.

You should also look into unprintable characters. Really, mccarl gave you a great deal of help right at the beginning; much of his list is unprintable characters. If you don't want to see 0xFFFD among the characters in your XML, then remove it before creating your XML document.

If you are interested in the international definition of Unicode characters, a search on google.com offers several places to look, e.g.:
http://www.w3.org/International/articles/definitions-characters/
 
mccarl (IT Business Systems Analyst / Software Developer) commented:
OK, here is where (I think) the problem is.

As I think you are aware, there are some code points which aren't valid Unicode. If these occur in a file and you open that file in some sort of editor or viewer, the viewer detects the invalid code point and displays the "question mark in a diamond" that you see. But those invalid code points are still in the file.

However, now consider what happens if this file has been opened and saved again, or processed in some other way. That operation may have physically replaced the invalid code points with the valid 0xFFFD "question mark in a diamond" code point. The file no longer has any invalid code points, but it still looks the same when you view it.

That is what I think is going on here. You have a file that may at some point have had invalid code points but is now a fully valid file, one that contains the 0xFFFD code point. That makes it "look" as though it might be invalid, but really it's not.

Does that make sense? And if so, and if this is actually what is happening, can you now explain what you would like to do with the file from here?
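The replacement scenario described above is easy to reproduce. The sketch below uses a made-up class name and sample bytes, and assumes UTF-8. It decodes a byte array containing an invalid byte with `new String(bytes, charset)`, which is documented to substitute the charset's replacement string for malformed input, and then encodes the result back to bytes. After that round trip the invalid byte is gone and a genuine U+FFFD is stored in its place.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not part of the original post.
public class ReplacementDemo {

    /** Lenient decoding: malformed input is silently replaced, here with U+FFFD. */
    public static String decodeLeniently(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ok", then 0xFF (never valid in UTF-8), then "!"
        byte[] raw = {0x6F, 0x6B, (byte) 0xFF, 0x21};

        String decoded = decodeLeniently(raw);
        System.out.printf("code point at index 2: U+%04X%n", decoded.codePointAt(2));

        // Writing the text back out stores U+FFFD as the three bytes EF BF BD.
        byte[] saved = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println("original length: " + raw.length + " bytes, saved length: " + saved.length + " bytes");
    }
}
```

A file that has been through a step like this parses cleanly as XML, which would explain why the program in the question ran through to "done" without an exception.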