Solved

convert text/html to text/plain;charset=UTF8

Posted on 2003-10-22
Last Modified: 2007-12-19
Looking for a way to convert HTML files to text.

Does anyone know of software that does this? Everything I found can only output ASCII.

I know a way to convert charsets with Java, so it might be enough for me to convert HTML to text while preserving the right charset.

Does anyone know?
Question by:mightyone
9 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 9597266
There's no good easy way, but it's worth trying

line = line.replaceAll("<[^>]+>", "");

on each line of the file
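A minimal sketch of how that one-liner could be applied to a whole file while keeping non-ASCII characters intact. The class name `LineTagStripper`, the method names, and the choice of `java.nio.file` (Java 7 or later) are assumptions for illustration, not part of the original answer. The source charset has to be supplied by the caller, since the sketch does not try to detect it. Tags that are split across several lines will not be matched by a per-line pass.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: strips tags line by line and writes the result as UTF-8.
class LineTagStripper {

    // Removes everything that looks like <...> within a single line.
    static String stripTags(String line) {
        return line.replaceAll("<[^>]+>", "");
    }

    // Reads 'in' using the given source charset and writes plain text to 'out' as UTF-8.
    static void convert(Path in, Charset inCharset, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(stripTags(line));
                writer.newLine();
            }
        }
    }
}
```

For example, `LineTagStripper.stripTags("<b>fett</b>")` returns `fett`; the closing tag is matched too, because `/` is just another character that is not `>`.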
 
LVL 35

Expert Comment

by:TimYates
ID: 9597270
nice! :-)
 
LVL 7

Expert Comment

by:tomboshell
ID: 9597854
Ummm, CEHJ, won't you need the '/' character in that replace call as well?

As for the concerns about ASCII: both are (well, ASCII is more or less used as a generic term, since the file could very well be UTF-8 or UTF-16, and if it is all written in English you won't notice the difference). Or, better said, it doesn't matter. It looks like you want the content without the markup, kept in the same character encoding. Take CEHJ's answer.
 
LVL 6

Author Comment

by:mightyone
ID: 9600816
Hmm, not quite what I need.

To tombo:
German, for example, has a few letters which are not in ASCII ("öäü").
No non-Western script is in ASCII either, therefore I will notice (I am German).

Just stripping the tags won't work either: I need no font info as text and no JavaScript stuff as text, I just want the visible text.
So it is a bit more complicated, which is why I am looking for a library or tool.
I tested some tools, e.g. html2txt, but it only supports ASCII, and the same goes for several others.


Anyone any idea?
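The point about "öäü" can be checked directly in Java. The sketch below (class and method names are hypothetical) asks the US-ASCII encoder whether a string can be represented at all, and shows how many bytes the same string needs in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating why ASCII-only tools lose German umlauts.
class AsciiCheck {

    // True only if every character of 'text' exists in US-ASCII.
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Number of bytes 'text' occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

`fitsInAscii` returns false for "öäü", while UTF-8 encodes each of those three letters in two bytes. A tool that writes ASCII therefore has to drop or replace them, which is the behaviour described above.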
 
LVL 86

Expert Comment

by:CEHJ
ID: 9601363
>>ummmm, CEHJ wont you need the '/' character in that replace call also?

That's taken care of.

>>just stripping the tags wont work either, i need no font info as text

You won't get any font info. You're right though about the JavaScript - you'd have to have another regex for that.
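One way the "other regex" for JavaScript might look, as a sketch only. The class name `ScriptStripper` and the exact pattern are assumptions. The pattern has to run over the whole document rather than line by line, because a script block usually spans several lines, hence the `DOTALL` flag.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes <script> blocks first, then the remaining tags.
class ScriptStripper {

    // Matches an opening <script ...> tag, its content, and the closing tag.
    // DOTALL lets '.' cross line breaks; the match is lazy so it stops at the first </script>.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // The tag pattern from the earlier comment in this thread.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String visibleText(String html) {
        String withoutScripts = SCRIPT.matcher(html).replaceAll("");
        return TAG.matcher(withoutScripts).replaceAll("");
    }
}
```

The order matters: if the tags were stripped first, the script source would be left behind as text. The same pattern with `style` in place of `script` could probably deal with embedded style sheets, though that is not something the thread confirms.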
 
LVL 6

Author Comment

by:mightyone
ID: 9602631
For example, write a little letter in Word, add some bold letters, bigger letters, some pictures, a table and a sound, and then save it as HTML.

Try stripping that and you'll find plenty of stuff no one wants (especially me...).


Still not any further, sniff.
 
LVL 17

Accepted Solution

by:
paulop1975 earned 50 total points
ID: 9602785
Try this program.

http://www.convertzone.com/doc2txt/help.htm

Seems good enough for the job.
:)

Fui (portuguese for "gone")
c(^.^)o

pAul0|PIm3NTA
 

Expert Comment

by:gsergiu
ID: 11230789
line = line.replaceAll("<[^>]+>", "");

This is just the main idea of the solution.

In HTML you may also have comments that span more than one line:
<!--
this is a comment
-->

You can have a tag split across several lines:

<img src="very long link" target="target"
title="image title"
/>

And JavaScript blocks:
<script>
.... java script here
</script>


:))

If you want to implement this converter ... have fun.
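A rough sketch of how the three cases listed above (multi-line comments, tags split across lines, script blocks) could be handled by working on the whole document as one string instead of line by line. The class name, method names, and patterns are assumptions, and the file-reading helper assumes Java 7 or later; a real HTML parser would be more robust.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper: strips comments, script blocks and tags from a whole document.
class WholeDocumentStripper {

    // DOTALL lets '.' match line breaks, so multi-line comments and scripts are covered.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // [^>] already matches line breaks, so a tag split over several lines is one match.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static String strip(String html) {
        String s = COMMENT.matcher(html).replaceAll("");
        s = SCRIPT.matcher(s).replaceAll("");
        return TAG.matcher(s).replaceAll("");
    }

    // Reads the whole file with the given charset before stripping.
    static String stripFile(Path file, Charset charset) throws IOException {
        return strip(new String(Files.readAllBytes(file), charset));
    }
}
```

The result is returned as a Java string, so the caller is free to write it out in whatever encoding is needed, UTF-8 included.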
 

Expert Comment

by:gsergiu
ID: 11230802
If you read the input char by char you can probably eliminate the
< ........  > blocks,
but you cannot repair HTML errors that way (e.g. unclosed tags).
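The char-by-char idea could be sketched as a tiny two-state scanner. The class name `CharScanner` is hypothetical, and the sketch deliberately does nothing clever about malformed input, which makes the limitation mentioned above easy to see.

```java
// Hypothetical helper: drops everything between '<' and the next '>'.
class CharScanner {

    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false; // true while we are between '<' and '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c); // ordinary text outside any tag
            }
        }
        return out.toString();
    }
}
```

With an unclosed tag the scanner never leaves the "in tag" state, so all text after the stray `<` is silently dropped. The same happens to text that merely contains a literal `<`.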