Solved

Convert text/html to text/plain;charset=UTF-8

Posted on 2003-10-22
Medium Priority
1,349 Views
Last Modified: 2007-12-19
Looking for a way to convert HTML files to text.

Does anyone know of software that does this? Everything I have found can only output ASCII.

I know a way to convert charsets with Java, so it might be enough for me to convert HTML to text while preserving the right charset.

Does anyone know?
Question by:mightyone
9 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 9597266
There's no good easy way, but it's worth trying

line = line.replaceAll("<[^>]+>", "");

on each line of the file
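
A minimal sketch of that suggestion, assuming the input file is UTF-8 and should stay UTF-8. The class and method names are hypothetical, and the `java.nio.file` calls are newer than the Java versions current when this thread was written. Because `[^>]+` matches every character except `>`, closing tags such as `</b>` are removed as well.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; names are illustrative only.
final class LineTagStripper {

    // Removes anything that looks like <...> from a single line.
    static String stripTags(String line) {
        return line.replaceAll("<[^>]+>", "");
    }

    // Reads the input as UTF-8, strips tags line by line, and writes UTF-8.
    // Assumption: the source file really is UTF-8; swap the constant otherwise.
    static void stripFile(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(stripTags(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the demo independent of the compiler's source encoding.
        System.out.println(stripTags("<p>Gr\u00fc\u00dfe aus <b>K\u00f6ln</b></p>"));
    }
}
```

Non-ASCII letters pass through untouched because the text is only ever handled as Java strings between an explicit UTF-8 decode and encode. Tags, comments, or scripts that span several lines are not handled by this per-line approach.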
 
LVL 35

Expert Comment

by:TimYates
ID: 9597270
nice! :-)
 
LVL 7

Expert Comment

by:tomboshell
ID: 9597854
Ummm, CEHJ, won't you need the '/' character in that replace call as well?

As for the concerns about ASCII: "ASCII" is more or less used as a generic term here, since the file could very well be UTF-8 or UTF-16, and if it is all written in English you won't notice the difference. Or, better said, it doesn't matter. It looks like you want the content without the markup, so keep it in the same character encoding. Take CEHJ's answer.
 
LVL 6

Author Comment

by:mightyone
ID: 9600816
Hmm, not quite what I need.

To tombo:
German, for example, has a few letters that are not in ASCII: "öäü".
No non-Western script is in ASCII either, so I will notice (I am German).

Just stripping the tags won't work either. I don't want font info as text or JavaScript as text; I just want the visible text.
So it is a bit more complicated, which is why I am looking for a library or tool.
I tested some tools, e.g. html2txt, but it only supports ASCII; same with several others.


Any ideas, anyone?
 
LVL 86

Expert Comment

by:CEHJ
ID: 9601363
>>ummmm, CEHJ won't you need the '/' character in that replace call also?

That's taken care of.

>>just stripping the tags won't work either, i need no font info as text

You won't get any font info. You're right though about the JavaScript - you'd have to have another regex for that.
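
One possible shape for that extra regex, as a sketch only. It has to run on the whole document rather than line by line, since script blocks usually span several lines. Treating `<style>` blocks the same way is an assumption added here, not something stated above, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class ScriptStripper {

    // (?i) ignores case, (?s) lets '.' match line breaks.
    // \1 refers back to whichever element name (script or style) was opened.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // The tag pattern from the earlier comment.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Drops script/style elements including their content, then the remaining tags.
    static String visibleText(String html) {
        String withoutBlocks = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        return TAG.matcher(withoutBlocks).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><script>var a = 1 < 2;\nalert(a);</script> there";
        System.out.println(visibleText(html));
    }
}
```

The order matters: if the tag pattern ran first, the script body would be left behind as ordinary text.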
 
LVL 6

Author Comment

by:mightyone
ID: 9602631
E.g. write a little letter in Word, add some bold letters, some bigger letters, some pictures, a table and a sound, and then save it as HTML.

Try stripping that and you'll find plenty of stuff no one wants (especially me...).


Still no further. *sniff*
 
LVL 17

Accepted Solution

by:
paulop1975 earned 200 total points
ID: 9602785
Try this program.

http://www.convertzone.com/doc2txt/help.htm

Seems good enough for the job.
:)

Fui (portuguese for "gone")
c(^.^)o

pAul0|PIm3NTA
 

Expert Comment

by:gsergiu
ID: 11230789
line = line.replaceAll("<[^>]+>", "");

This is just the main idea of the solution.

In HTML you may also have comments that span more than one line:
<!--
this is a comment
-->

You can have tags that span several lines:

<img src="very long link" target="target"
title="image title"
/>

And JavaScript:
<script>
.... JavaScript here
</script>


:))

If you want to implement this converter... have fun.
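
A rough sketch of how the first two cases could be approached, assuming the whole file has already been read into one string. The class and method names are hypothetical. In Java a negated character class such as `[^>]` also matches line breaks, so the same tag pattern covers multi-line tags once it is applied to the full document instead of single lines.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class WholeDocumentStripper {

    // (?s) lets '.' match line breaks so a comment may span several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");

    // [^>] matches newlines too, so multi-line tags are covered.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Removes comments first (they may contain '>'), then the tags.
    static String strip(String html) {
        String noComments = COMMENT.matcher(html).replaceAll("");
        return TAG.matcher(noComments).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "a<!--\nthis is a comment\n-->b<img src=\"x\"\ntitle=\"t\"\n/>c";
        System.out.println(strip(html));
    }
}
```

This still leaves script content in the output and breaks on attribute values that contain `>`, so it is a starting point rather than a full converter.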
 

Expert Comment

by:gsergiu
ID: 11230802
If you read character by character, you can probably eliminate
< ........ > blocks,
but you cannot repair HTML errors (unclosed tags).
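
The character-by-character idea could look roughly like this. It is a sketch with hypothetical names, not a complete HTML parser: it only tracks whether the current position is inside a `<...>` block.

```java
// Hypothetical helper class; names are illustrative only.
final class CharScanner {

    // Copies every character that is not inside a <...> block.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // end of the skipped block
            } else if (!inTag) {
                out.append(c);           // visible character, keep it
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<b>bold</b> text"));
    }
}
```

The unclosed-tag problem shows up directly: for input like `a <b unclosed` everything after the `<` is dropped, because no closing `>` ever ends the skipped block.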