05.09.2008 at 12:36PM PDT, ID: 23390595 | Points: 500
Java xml parser can't parse illegal xml character
Tags: Java xml parser can't parse escaped ascii (e.g. "&#x1d;")
I am working with my developer to parse an XML file using Java. We keep getting the error below on the following XML data.


In the XML:
<First__Name>Zm.</First__Name>

 
Error returned when executing the Java code:
org.xml.sax.SAXParseException: Illegal XML character:  &#x1d;.
        at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1100)
        at org.apache.crimson.parser.InputEntity.parsedContent(InputEntity.java:593)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1973)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
        at org.apache.crimson.parser.Parser2.content(Parser2.java:1926)
        at org.apache.crimson.parser.Parser2.maybeElement(Parser2.java:1654)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:634)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:333)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:448)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:143)
        at ParseEtResults.parseDocument(ParseEtResults.java:108)
        at ParseEtResults.runResultsParser(ParseEtResults.java:96)
        at ParseEtResults.main(ParseEtResults.java:228)

I do not have control over the XML or the source data, so I need a solution that will let me
ignore or filter this and any other offending characters we may discover while running the Java code. The XML file
is large (over 1 GB), which may or may not limit our ability to remove or change the character before
parsing. Also, this is a daily process and not just a one-off job. Any help would be appreciated.
Zone: Programming
Question Asked By: customerportfolios
Question Asked On: 05.09.2008
Participating Experts: 2
05.09.2008 at 12:51PM PDT, ID: 21535939

Rank: Genius

I would write a FilterReader to remove all control characters, other than perhaps \r, \n and \t.
 
05.09.2008 at 01:30PM PDT, ID: 21536218
Would you provide me with a sample of the code or an explanation of how the filter is utilized? I am assuming that it is a command line within the java script. I am not familiar with Java and am trying to find as complete a solution as I can to help my developer. Parsing these large XML files and outputting them into flat files for loading into a SQL database is the original problem I faced, as the XML files are too large for my ETL tool. The Java SAX parser is something I found and turned over to development, but it is not high on their priority list even though it is on mine, so any help I can get is very much appreciated. PS: I am the Business Analyst, so I'm better at talking about the problem and pointing others in the direction of a solution than at actually building the solution myself.

I still have some concerns about the amount of memory used, and whether I'd need to filter in chunks and return a stream to parse. Is this a potential issue?
 
05.09.2008 at 03:22PM PDT, ID: 21536746
"I am assuming that it is a command line within the java script"

Java isn't a script and it has nothing to do with being a command line. I presume your development dept is capable of implementing this, and I would be very amazed if they require you to get this solution for them.

I think the idea of filtering before XML parsing is really the way to go. Probably the best way is to make a list of offending names containing the &#x1d; character and replace these elements with the same name with '_' substituted by the valid '-' character. Search for the regexp [<]\s*bad__name and replace it by bad--name.

See

http://exampledepot.com/egs/java.util.regex/LineFilter2.html

for an example. It's probably better to make sure that the input file does not consist of a single line if you use this approach. You won't have [additional] memory issues if this is not the case. Otherwise you must look at a different way to cache the characters.
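
For illustration, here is a rough sketch of line-by-line regex filtering in Java. Note that it strips the illegal control characters themselves rather than renaming elements as described above, so it is a variation on the idea and not the linked example. The class name XmlLineFilter and its methods are made up for this sketch, and the character set is an assumption based on the XML 1.0 rules for legal characters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Pattern;

// Hypothetical helper: copies text line by line, dropping characters
// that an XML 1.0 parser rejects.
class XmlLineFilter {

    // Control characters other than tab, LF and CR, plus the two
    // non-characters at the top of the Basic Multilingual Plane.
    private static final Pattern ILLEGAL =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\uFFFE\\uFFFF]");

    // Returns the line with every illegal character removed.
    static String clean(String line) {
        return ILLEGAL.matcher(line).replaceAll("");
    }

    // Streams from in to out one line at a time, so only the current
    // line is held in memory. Line endings are rewritten to the
    // platform default, and the last line always gets a line ending.
    static void filter(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(clean(line));
            writer.newLine();
        }
        writer.flush();
    }
}
```

As noted above, this only keeps memory use low if the file really is broken into lines; a 1 GB file on a single line would be read into memory as one string.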
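import java.io.FileReader;
import java.io.FilterReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

SAXParserFactory parserFactory = SAXParserFactory.newInstance();
SAXParser parser = parserFactory.newSAXParser();

// MyFilterReader: a custom FilterReader subclass that drops illegal characters
FileReader fileReader = new FileReader("test/hello.xml");
FilterReader filteredReader = new MyFilterReader(fileReader);

InputSource inputSource = new InputSource(filteredReader);
parser.parse(inputSource, new DefaultHandler());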
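
The snippet above uses MyFilterReader without defining it. A minimal sketch of what such a class might look like follows, assuming the goal is the one suggested earlier in the thread: drop control characters except \t, \n and \r. The isAllowed helper and the exact set of rejected characters are assumptions of this sketch, not something given in the thread.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Sketch of a reader that silently drops characters an XML 1.0 parser
// would reject, such as the 0x1D control character from the error.
class MyFilterReader extends FilterReader {

    public MyFilterReader(Reader in) {
        super(in);
    }

    // Assumed rule: keep tab, LF, CR and everything from 0x20 upward,
    // except the non-characters 0xFFFE and 0xFFFF.
    static boolean isAllowed(char c) {
        if (c == '\t' || c == '\n' || c == '\r') {
            return true;
        }
        return c >= 0x20 && c != 0xFFFE && c != 0xFFFF;
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip over rejected characters until a good one or end of stream.
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the allowed characters to the front of the chunk.
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i];
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // Every character in this chunk was dropped: read another
            // chunk instead of returning 0.
        }
    }
}
```

Because the filter works in place on whatever buffer the parser hands it, memory use should not grow with the size of the file, and the input does not need to be split into lines or chunks beforehand.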