Solved

Convert text/html to text/plain;charset=UTF8

Posted on 2003-10-22
Last Modified: 2007-12-19
I'm looking for a way to convert HTML files to text.

Does anyone know of software that does this? Everything I have found can only output ASCII.

I know a way to convert charsets with Java, so it might be enough for me to convert HTML to text while preserving the original charset.

Does anyone know?
Question by:mightyone
9 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 9597266
There's no good easy way, but it's worth trying

line = line.replaceAll("<[^>]+>", "");

on each line of the file
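
A minimal sketch of how that per-line replace might be wired up so that non-ASCII text survives: the input is decoded with whatever charset the caller names and the output is written as UTF-8. The class name `LineTagStripper` and both method names are hypothetical, and the input charset has to be known in advance; nothing here detects it.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; names are illustrative only.
final class LineTagStripper {

    // Removes anything that looks like a tag on a single line.
    // [^>]+ also matches '/', so closing tags such as </b> are covered.
    static String stripTags(String line) {
        return line.replaceAll("<[^>]+>", "");
    }

    // Decodes the input with the given charset, strips tags line by line,
    // and writes the result as UTF-8.
    static void convert(Path in, Charset inCharset, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(stripTags(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Example arguments (assumed): in.html ISO-8859-1 out.txt
        convert(Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
    }
}
```

Because the text is handled as Java strings throughout, letters such as "öäü" are not reduced to ASCII; only the decode and encode steps depend on the charset.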
 
LVL 35

Expert Comment

by:TimYates
ID: 9597270
nice! :-)
 
LVL 7

Expert Comment

by:tomboshell
ID: 9597854
Ummm, CEHJ, won't you need the '/' character in that replace call as well?

As for the concerns about ASCII: "ASCII" is more or less used as a generic term here, since the file could very well be UTF-8 or UTF-16, and if it is all written in English you won't notice the difference. Or, better said, it doesn't matter. It looks like you want the content without the markup, kept in the same character encoding. Take CEHJ's answer.

 
LVL 6

Author Comment

by:mightyone
ID: 9600816
Hmm, not quite what I need.

To tombo: German, for example, has a few letters which are not in ASCII ("öäü"), and non-Western scripts are not in ASCII at all, so I will notice (I am German).

Just stripping the tags won't work either. I need no font info as text and no JavaScript stuff as text; I just want the visible text. So it is a bit more complicated, and therefore I am looking for a library or tool. I tested some tools, e.g. html2txt, but it only supports ASCII, and the same goes for several others.

Does anyone have any idea?
 
LVL 86

Expert Comment

by:CEHJ
ID: 9601363
>>ummmm, CEHJ wont you need the '/' character in that replace call also?

That's taken care of.

>>just stripping the tags wont work either, i need no font info as text

You won't get any font info. You're right though about the JavaScript - you'd have to have another regex for that.
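
One way the extra regex for JavaScript might look, as a rough sketch rather than a robust solution. It removes whole `<script>` blocks before the generic tag pattern runs, and it has to see the document as one string because a script usually spans several lines. The class name `ScriptStripper` is made up, and covering `<style>` blocks as well is an added assumption, not something stated above.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class ScriptStripper {

    // (?is): case-insensitive, and '.' also matches line breaks.
    // .*? is reluctant, so each block ends at the first matching closing tag.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");

    // The generic tag pattern from the earlier comment.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String visibleText(String html) {
        // Drop script/style blocks including their content first,
        // then strip whatever tags remain.
        String withoutBlocks = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        return TAG.matcher(withoutBlocks).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi</p><script>\nalert('x < y');\n</script><b>there</b>";
        System.out.println(visibleText(html));
    }
}
```

The order matters: if the generic tag pattern ran first, the script source would be left behind as plain text.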
 
LVL 6

Author Comment

by:mightyone
ID: 9602631
E.g. write a little letter in Word, add some bold letters, some bigger letters, some pictures, a table and a sound, and then save it as HTML.

Try stripping that and you'll find plenty of stuff no one wants (especially me...).


Still not any further, sniff.
 
LVL 17

Accepted Solution

by:
paulop1975 earned 200 total points
ID: 9602785
Try this program.

http://www.convertzone.com/doc2txt/help.htm

Seems good enough for the job.
:)

Fui (portuguese for "gone")
c(^.^)o

pAul0|PIm3NTA
 

Expert Comment

by:gsergiu
ID: 11230789
line = line.replaceAll("<[^>]+>", "");

This is just the main idea of the solution.

In HTML you may also have comments that span more than one line:
<!--
this is a comment
-->

You can have tags split across several lines:

<img src="very long link" target="target"
title="image title"
/>

And JavaScript blocks:
<script>
.... java script here
</script>


:))

If you want to implement this converter... have fun.
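
For the multi-line comments and multi-line tags listed above, one possible workaround is to stop working line by line and apply the patterns to the whole document as a single string. The sketch below assumes the file fits in memory and that its charset is known; the class name `WholeDocumentStripper` is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; names are illustrative only.
final class WholeDocumentStripper {

    static String strip(String html) {
        return html
                // (?s) lets '.' cross line breaks, so a comment may span lines
                // and may itself contain '>' characters.
                .replaceAll("(?s)<!--.*?-->", "")
                // [^>] already matches line breaks, so a tag split over
                // several lines is removed once the input is one string.
                .replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) throws IOException {
        // Example arguments (assumed): in.html ISO-8859-1
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(strip(new String(bytes, Charset.forName(args[1]))));
    }
}
```

Comments are removed first on purpose: a comment containing `>` would otherwise be cut at that character by the tag pattern and its tail would remain in the output.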
 

Expert Comment

by:gsergiu
ID: 11230802
Probably, if you read char by char, you can eliminate the
< ........ > blocks,
but you cannot repair HTML errors (unclosed tags).
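
The char-by-char idea could be sketched as a tiny two-state scanner. This is only an illustration with a made-up class name, `CharByCharStripper`; as noted above, it has no way to recover from an unclosed tag and simply drops everything after the stray `<`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper; names are illustrative only.
final class CharByCharStripper {

    // Copies characters from in to out, skipping everything between '<' and '>'.
    static void strip(Reader in, Writer out) throws IOException {
        boolean insideTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                insideTag = true;          // start skipping
            } else if (c == '>' && insideTag) {
                insideTag = false;         // tag closed, resume copying
            } else if (!insideTag) {
                out.write(c);              // ordinary text, including a lone '>'
            }
        }
    }

    // Convenience wrapper for in-memory strings.
    static String strip(String html) {
        StringWriter out = new StringWriter();
        try {
            strip(new StringReader(html), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("1 > 0 is <i>true</i>"));
    }
}
```

Since it works on a `Reader` and a `Writer`, the charset is decided by whoever opens the streams, so the scanner itself is encoding-neutral.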