UTF8.txt file to UTF8 xml

I have a UTF-8 file that I need to read and convert to XML (also UTF-8).
The file is in Czech and I'm on a US system.
I have three questions.
I'm reading the file like this:
            input = new BufferedReader( new FileReader(aFile) );
            String line = null;
            int i = 0;
            while (( line = input.readLine()) != null && i < 100){
                i ++;
                System.out.println("l:" + line);
            }

1. When I print it out, I get lines like: Oznámení zadávacího řízenÃ. That does not look correct. (I don't know Czech, but I'd expect some Czech letters.) Any idea what I'm doing wrong?
2. I need to read the file line by line and remove "\t", "\n" and "  " from the start of each line. Is there any way I can print these "hidden" chars so I can see what the original file is using?
3. I have some text like this: "
  PD: 20060620
  ND: 121873-2006"
And I need to convert it to "<ti><country>UK</country><city>Cardiff</city><name>KOBO CESIE EEIG</name></ti>"
Any ideas how I can split the text like this? Just to make things worse, sometimes the TI: text has several lines separated with "\n" and several " " (blanks or spaces).

CEHJ Commented:
>>But how can I test if a line starts with 4 blanks?  

if (line.startsWith("    "))
You need to use a font that supports Czech characters.


The reader should be created with an explicit encoding:

            input = new BufferedReader( new InputStreamReader(new FileInputStream(aFile), "UTF8") );

You'll also need to have a font installed that supports Czech to display it.
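To see why the encoding matters, here is a minimal sketch (the class name MojibakeDemo and its helper methods are made up for illustration). It encodes a Czech word as UTF-8 and then decodes the same bytes twice: once with ISO-8859-1, which for these bytes behaves like the default encoding on a typical US Windows system, and once with UTF-8. The non-ASCII letters are written as Unicode escapes so the example does not depend on how the source file itself is saved.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decode UTF-8 bytes with the wrong charset, as FileReader would do
    // on a system whose default encoding is a single-byte Latin charset.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decode the same bytes with the charset they were written in.
    static String decodeCorrectly(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "Ozn\u00e1men\u00ed"; // the first word of the sample line
        // Each accented letter occupies two bytes in UTF-8, so the wrongly
        // decoded string is two characters longer than the original.
        System.out.println(word.length() + " -> " + misdecode(word).length());
        System.out.println(decodeCorrectly(word).equals(word));
    }
}
```

Each accented letter turns into two unrelated characters when decoded with the wrong charset, which is the pattern visible in the garbled output from the question.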

System.out.println(...) writes to a console or command line, so don't expect to see the foreign characters there unless the console's encoding and font support them. You can see them in an AWT or Swing component such as JTextArea, but only with a font that defines the glyphs for the language you are using. I would suggest converting the text file to XML first and only then checking whether the characters are recognizable.
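One way to inspect a line without depending on the console's font is to print every tab, line terminator, and non-ASCII character as an escape sequence. The following is only a sketch; VisibleChars and escape are hypothetical names, not part of any library.

```java
class VisibleChars {
    // Hypothetical helper: shows tab, CR and LF as two-character escapes
    // and every other control or non-ASCII character as a hex escape
    // of its code unit. Plain spaces are left as they are.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\t') {
                sb.append("\\t");
            } else if (c == '\n') {
                sb.append("\\n");
            } else if (c == '\r') {
                sb.append("\\r");
            } else if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing spaces easy to spot.
        System.out.println("[" + escape("\t  PD: 20060620 ") + "]");
    }
}
```

Note that BufferedReader.readLine() strips the line terminator, so "\n" and "\r" will only show up if the text is read some other way, for example character by character.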
>>Just to make things worse,

If it's not too big, you'd be better off reading it all into one String
kristian_gr (Author) Commented:
OK, this seems to work:
            BufferedWriter w = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File("c:/test.txt")), "UTF8"));
            BufferedReader input = new BufferedReader( new InputStreamReader(new FileInputStream(aFile), "UTF8") );
            String line = null;
            int i = 0;
            while (( line = input.readLine()) != null && i < 400){
                i ++;
//                line = line.trim();
                // copy the line to the output file unchanged
                w.write(line);
                w.newLine();
            }
            input.close();
            w.close();

If I open my test.txt in Word it seems to have the right characters.
Using trim is not a good idea. The file is formatted so that 4 blanks/spaces at the start of a line indicate that the line really belongs to the line above.
I therefore think I need to do something like: if (!<line starts with 4 blanks>) { w.newLine(); }
But how can I test if a line starts with 4 blanks?  
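The continuation rule described above (a line starting with 4 blanks belongs to the line before it) could be sketched roughly like this. ContinuationJoiner and join are hypothetical names, and joining with a single space is an assumption about the desired output, not something the file format dictates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ContinuationJoiner {
    // Merge every line that starts with four blanks into the preceding line.
    static List<String> join(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (current != null && line.startsWith("    ")) {
                // continuation: append the trimmed text to the open record
                current.append(' ').append(line.trim());
            } else {
                // a new record starts; store the previous one first
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(line);
            }
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "TI: KOBO", "    CESIE EEIG", "PD: 20060620");
        for (String record : join(input)) {
            System.out.println(record);
        }
    }
}
```

Here trim() is applied only to continuation lines, so the leading blanks that mark a continuation are consumed after they have been tested.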
kristian_gr (Author) Commented:
Some days even the easiest things are hard. This is one of those.
tnx CEHJ.

And since I know you are good at regexes, you probably have an idea about this:
Some of my lines start with XX: as in two uppercase chars and a ":". If that occurs I'd like to split the String at the first ":" into String[0] and String[1]. But the text in String[1] might also contain several ":".
e.g.: String test = "PT: testString : has some more text";
String[0] result: PT
String[1] result: "testString : has some more text";
if (line.matches("^[A-Z]{2}:.+")) {
    // limit 2: split only at the first ":", keep later colons in tokens[1]
    String[] tokens = line.split(":", 2);
}
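As a self-contained sketch of the same idea, the snippet below wraps the test and the split in a helper. TagSplitter and splitTag are hypothetical names, and trimming the value is an assumption about what is wanted, since split leaves the blank after the colon in place.

```java
class TagSplitter {
    // Returns {tag, value} for lines such as "PT: some text",
    // or null when the line does not start with two uppercase letters and ":".
    static String[] splitTag(String line) {
        if (!line.matches("[A-Z]{2}:.+")) {
            return null;
        }
        // The limit of 2 stops splitting after the first colon.
        String[] tokens = line.split(":", 2);
        return new String[] { tokens[0], tokens[1].trim() };
    }

    public static void main(String[] args) {
        String[] result = splitTag("PT: testString : has some more text");
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

String.matches already requires the whole line to match, so the leading "^" is optional. A line that consists of only the tag and the colon is not matched by this pattern because ".+" needs at least one character after the colon.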
how does that answer your original question?
the important thing was to specify the appropriate encoding.
Question has a verified solution.
