Java XML parser can't parse escaped ASCII (e.g. "")
I am working with my developer to parse an XML file using Java. We keep getting the error below when the parser reaches the following XML data.
In the XML: <First__Name>Zm.</First__Name>
Error returned when executing the Java code:
org.xml.sax.SAXParseException: Illegal XML character: .
    at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1100)
    at org.apache.crimson.parser.InputEntity.parsedContent(InputEntity.java:593)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1973)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
    at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
    at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:634)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:333)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:448)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:143)
    at ParseEtResults.parseDocument(ParseEtResults.java:108)
    at ParseEtResults.runResultsParser(ParseEtResults.java:96)
    at ParseEtResults.main(ParseEtResults.java:228)
I do not have control over the XML or the source data, so I need a solution that will let me ignore or filter this and any other offending characters we discover while running the Java code. The XML file is large (over 1 GB), which may limit our ability to remove or change the character before parsing. Also, this is a daily process, not a one-off job. Any help would be appreciated.
Would you provide me with a sample of the code or an explanation of how the filter is used? I am assuming that it is a command line within the java script? I am not familiar with Java and am trying to find as complete a solution as I can to help my developer. Parsing these large XML files and outputting them into flat files for loading into a SQL database is the original problem I faced, as the XML files are too large for my ETL tool. The Java SAX parser is something I found and turned over to development, but it is not high on their priority list even though it is on mine, so any help I can get is much appreciated. PS. I am the Business Analyst, so I'm better at talking about the problem and pointing others in the direction of a solution than at actually building the solution myself.
I still have some concerns about the amount of memory used, and whether I'd need to filter in chunks and return a stream to parse. Is this a potential issue?
"I am assuming that it is a command line within the java script"
Java isn't a script and it has nothing to do with being a command line. I presume your development dept is capable of implementing this, and I would be very amazed if they require you to get this solution for them.
I think the idea of filtering before XML parsing is really the way to go. Probably the best way is to make a list of offending names containing the  character and replace these elements with the same name with '_' substituted by the valid '-' character. Search for the regexp [<]\s*bad__name and replace it by bad--name.
for an example. It's probably better to make sure that the input file does not consist of a single line if you use this approach. You won't have [additional] memory issues if this is not the case. Otherwise you must look at a different way to cache the characters.
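import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Wrap the file in the filter so the parser only ever sees cleaned characters.
SAXParserFactory parserFactory = SAXParserFactory.newInstance();
SAXParser parser = parserFactory.newSAXParser();
FileReader fileReader = new FileReader("test/hello.xml");
FilterReader filteredReader = new MyFilterReader(fileReader); // MyFilterReader is your own FilterReader subclass
InputSource inputSource = new InputSource(filteredReader);
parser.parse(inputSource, new DefaultHandler());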
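MyFilterReader in the snippet above is left for you to write, so here is one rough way it could look. Treat it as a sketch under assumptions, not a finished implementation: it assumes the offending characters are either raw control characters or numeric references such as &#26; or &#x1A; (made-up examples, since I can't see what is really in your file), and that replacing each one with a single space is acceptable. The REPLACEMENT and MAX_REF constants and the filter() helper are names I invented for illustration. It does not validate surrogate pairs, and it would also rewrite such references inside CDATA sections, where they are only literal text.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Sketch: replaces characters (raw, or written as &#N; / &#xN;) that XML 1.0 forbids.
class MyFilterReader extends FilterReader {
    private static final char REPLACEMENT = ' '; // assumption: blank out rather than delete
    private static final int MAX_REF = 16;       // assumption: longest reference body inspected
    private final PushbackReader source;
    private final StringBuilder pending = new StringBuilder(); // vetted, not yet returned
    private int pendingPos = 0;

    public MyFilterReader(Reader in) {
        super(new PushbackReader(in, 1));
        this.source = (PushbackReader) super.in;
    }

    // XML 1.0 "Char" production: tab, LF, CR, and most code points from 0x20 up.
    static boolean isLegalXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    private static boolean isHexDigit(int c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    @Override
    public int read() throws IOException {
        if (pendingPos < pending.length()) return pending.charAt(pendingPos++);
        int c = source.read();
        if (c == '&') return readReference();
        // Surrogate halves pass through untouched; pairing is not validated here.
        if (c != -1 && !isLegalXmlChar(c) && !Character.isSurrogate((char) c)) {
            return REPLACEMENT;
        }
        return c;
    }

    // Called after '&': reads what follows and blanks out illegal numeric references.
    private int readReference() throws IOException {
        pending.setLength(0);
        pendingPos = 0;
        while (pending.length() < MAX_REF) {
            int c = source.read();
            if (c == -1) break;
            boolean fits = pending.length() == 0 ? c == '#'
                    : c == ';' || (pending.length() == 1 && c == 'x') || isHexDigit(c);
            if (!fits) {
                source.unread(c); // let read() vet this character on the next call
                break;
            }
            pending.append((char) c);
            if (c == ';') break;
        }
        int len = pending.length();
        if (len >= 3 && pending.charAt(len - 1) == ';') {
            boolean hex = pending.charAt(1) == 'x';
            String digits = pending.substring(hex ? 2 : 1, len - 1);
            try {
                if (!isLegalXmlChar(Integer.parseInt(digits, hex ? 16 : 10))) {
                    pending.setLength(0);
                    return REPLACEMENT;
                }
            } catch (NumberFormatException e) {
                // Not a numeric reference this sketch understands: pass it through unchanged.
            }
        }
        return '&'; // the buffered characters follow on later reads
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Hypothetical helper for trying the filter on a small string.
    static String filter(String xml) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new MyFilterReader(new StringReader(xml))) {
            char[] buf = new char[8];
            for (int n; (n = r.read(buf, 0, buf.length)) != -1; ) out.append(buf, 0, n);
        }
        return out.toString();
    }
}
```

On the memory question: the filter only ever holds a handful of characters at a time, so the size of the file and whether it is a single line should not matter. For a file over 1 GB it is probably worth wrapping the FileReader in a BufferedReader before passing it to the filter, because the sketch reads one character at a time. Note also that FileReader decodes with the platform default encoding, so if the file is in a different encoding you would need an InputStreamReader with an explicit charset instead.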