jculkincys asked:

sed remove <html>

Hello

I am having trouble getting these sed commands to produce the desired results.
I have to remove "<html>" and "</html>" from all the .txt files in a directory.

Here is what I have so far:

sed -e 's/\<html\>//g' *.txt

Thanks for your help


pjedmond replied:

First, back up the directory:

cp -R sourcedir backupdir

Now make a working copy:

cp -R sourcedir workingdir

The sed statements that you need are as follows:

sed -e "s/<\/html>//g"  in order to remove </html>
and
sed -e "s/<html.*>//g" in order to remove <html ...> (the opening tag plus any other text inside it)
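
Note that .* is greedy, so <html.*> removes everything up to the last > on the line, which can include text that should be kept. A minimal sketch of a narrower pattern, assuming no > appears inside the tag's attributes; the function name strip_html_tags is made up for illustration:

```shell
# strip_html_tags: hypothetical helper; reads stdin, writes stdout.
# [^>]* stops at the first '>', so text after the tag survives.
strip_html_tags() {
  sed -e 's/<html[^>]*>//g' -e 's/<\/html>//g'
}

printf '%s\n' '<html lang="en"><b>keep</b>' 'text</html>' | strip_html_tags
# prints:
# <b>keep</b>
# text
```

With the greedy pattern, the first sample line would come out empty, because <b>keep</b> also ends in a >.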

We now have to automate it:

cd workingdir
mkdir outputdir

Check that the following line produces the desired output to remove the lines:

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\"  > outputdir/" $0 '}

You'll see that the produced command line takes the output and passes it through the 2 sed filtering processes, and then copies the file to the output directory. Once we are happy with the result which should look like this:
----------------------------8X--------------------------------------
cat 1.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g"  > outputdir/1.txt
cat 2.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g"  > outputdir/2.txt
cat 3.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g"  > outputdir/3.txt
cat 4.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g"  > outputdir/4.txt
----------------------------8X--------------------------------------

then we send the commands that we have printed out to a bash shell:

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\"  > outputdir/" $0 '} | bin bash

Check the output directory and files to see if it has done exactly what you want. If it has:

mv sourcedir sourcedirbackup

mv outputdir sourcedir

...and your source dir now has all the html bits that you needed removing removed:)
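
For comparison, the same read, filter and save step can also be written as a plain shell loop. This is a sketch of an alternative, not the method described above; it assumes it is run from inside workingdir and reuses the outputdir name from the steps above. The function name strip_all is made up:

```shell
# strip_all: hypothetical helper; run it from inside workingdir.
# Filters every .txt file in the current directory into outputdir/.
strip_all() {
  mkdir -p outputdir
  for f in *.txt; do
    [ -f "$f" ] || continue          # skip if the glob matched nothing
    sed -e 's/<html.*>//g' -e 's/<\/html.*>//g' "$f" > "outputdir/$f"
  done
}
```

Quoting "$f" keeps file names with spaces intact, which the generated command lines do not handle.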

HTH:)



OK - that looks long, but that's only because I've explained it really carefully, and provided examples:)
Of course, you'd spotted that deliberate mistake... hadn't you?
jculkincys (asker) replied:

Thanks for the in-depth answer

Hmm I must be missing something simple

I do
sed -e "s/<\/html>//g" *.txt
Then I do
grep "<\/html>" *.txt
in the same directory and I get a lot of results where "</html>" is still present.

Any ideas
Yes - all you are doing is reading in every *.txt file, and then copying it to the screen (with the appropriate bit removed). You are not altering it in the file.

Read through the solution above carefully, and you'll see that the file is being read, filtered and then saved to another file. The original files are then removed and replaced by the new ones (that have the html tags removed)
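
A quick way to see this on a throwaway file (a sketch; the temporary directory and the file name demo.txt are made up for the demonstration):

```shell
tmp=$(mktemp -d)                       # scratch directory for the demo
printf 'a</html>\n' > "$tmp/demo.txt"

# The filtered text goes to the screen only; the file is not touched.
sed -e 's/<\/html>//g' "$tmp/demo.txt"
before=$(grep -c '</html>' "$tmp/demo.txt" || true)

# Save the filtered text to a new file, then replace the original with it.
sed -e 's/<\/html>//g' "$tmp/demo.txt" > "$tmp/demo.txt.new"
mv "$tmp/demo.txt.new" "$tmp/demo.txt"
after=$(grep -c '</html>' "$tmp/demo.txt" || true)

echo "matching lines before: $before, after: $after"
```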

Just do it on 1 file first - the command line created is:

cat 1.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g"  > outputdir/1.txt

and you'll see that the appropriate tags are removed:

cat 1.txt

prints out the file "1.txt"

This is then piped to a sed expression. The first expression strips out the <html asdfsvas>, and the second strips out </html>. The final bit puts the end result into a new file (in the outputdir that you've previously created).

Obviously, you don't want to type this out for each file, so awk is used to build the command line from a list of *.txt files in your directory. Once happy that all of the command line is correct, you pipe the command lines to /bin/bash in order to execute them.

HTH:)
OK, now I think I understand better - thanks

this should work if I take out the "bin" right?

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\"  > outputdir/" $0 '} | bash
Yes, no problem, as long as bash is in your path. This technique (building the command from a find) is *EXTREMELY* POWERFUL. I normally get people to run the command first without the bash at the end so that they can check what they are running before it happens. Of course, always make a backup, and *ESPECIALLY* if the command is being run as root, be very careful when checking your commands.
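
One way to make the check-before-run habit concrete is to write the generated commands to a file, read them, and only then hand the file to bash. A sketch, assuming a scratch directory with one sample file; the file name cmds.sh is arbitrary:

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo
printf '<html>\nhi\n' > 1.txt
mkdir outputdir

# Step 1: generate the commands, but do not run them yet.
find *.txt | awk '{print "sed -e \"s/<html.*>//g\" " $0 " > outputdir/" $0}' > cmds.sh

# Step 2: inspect what would be run.
cat cmds.sh

# Step 3: run the file only once the commands look right.
bash cmds.sh
```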

I really like this type of question because it shows off the real power of Linux, in that you can build these fantastically complicated commands from the little bricks that exist:)

Another really useful trick for this type of thing is if you need to run the command regularly, then set the command up as an alias, or stick it in a shell script.

HTH:)
Yea I really like the answer too

Do you have any simpler useful examples of building a command from find?

find *.txt | awk {'print "cat " $0 '} | bash        (just prints the contents of all the .txt files, one after another)

is the simplest type of thing....you just build up the command as you need it. If you can work out the command for 1 file, then you build it up for all of the files.
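
Two more small generators in the same style, as a sketch; the .bak suffix and the choice of wc are arbitrary examples. Neither line runs anything until its output is piped to bash:

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo
touch a.txt b.txt

# Print one backup-copy command per file.
find *.txt | awk '{print "cp " $0 " " $0 ".bak"}'

# Print one line-count command per file.
find *.txt | awk '{print "wc -l " $0}'
```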

Key tricks are grepping for particular types of files before they are piped into the awk command line, and remembering that 'special' chars have to be escaped with a backslash when in the awk statement.

awk is incredibly powerful, you can write whole programs in it which can be put in a file and called from the command line. Likewise with sed. Just a case of understanding how each of the little tools work. When combined together in the right way, they become incredibly powerful:)

Thanks a lot
Would I need all 3 \\\ in this statement?

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\"  > outputdir/" $0 '} | bash

or would this work?

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\/html.*>//g\"  > outputdir/" $0 '} | bash
The awk cmd prints what is between the " ".
in order to get it to print \/  (which is needed to produce the / inside the sed statement) you need the 3 \\\

If in doubt try it out without the | bash at the end, then you can see what is produced. You can then pick one line out and try it manually if you want to confirm.
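
The dry run can be reduced to just the escaping in question. In this sketch two backslashes already print one: awk turns \\ into a single \, and the / that follows needs no escape of its own. The third backslash in \\\/ is what triggers the gawk warning about \/ quoted later in this thread.

```shell
# Inside an awk string, \\ prints one backslash; the following / is literal.
echo 1.txt | awk '{print "sed -e \"s/<\\/html>//g\" " $0}'
# prints: sed -e "s/<\/html>//g" 1.txt
```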

As always, make backups before doing this type of thing:)

HTH:)
Hmmm

Still having a little trouble

find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '} | bash


results in

find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '} | bash
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: {print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 }
awk: cmd. line:1:                                                               ^ backslash not last character on line
find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '}

errors

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\"  > $outputdir/" $0 '}

still errors

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig"  > $outputdir/" $0 '}

still errors

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\"   > $outputdir/" $0 '}

OK......

There is nothing wrong with your command....*EXCEPT* for the length. I'm getting a similar issue as well. You need to do the stripping as 2 separate processes. One to strip out the html bits, and then a second process to strip out the pre.
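
The awk message quoted above ("backslash not last character on line") also fits a cause other than length. In the pasted command, the closing-pre expression ends in /ig" with no backslash before the quote, which ends the awk string early and leaves the next \" outside any string. The closing head, pre and title expressions are also each missing one slash before ig. A reduced sketch with both points corrected, reusing the outputdir name from earlier in the thread (gen_pre_cmd is a made-up name):

```shell
# gen_pre_cmd: hypothetical helper; reads file names on stdin and prints
# one command line per file. Every quote inside the awk string is written
# as \" and every substitution ends in //ig.
gen_pre_cmd() {
  awk '{print "cat " $0 " | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\/pre.*>//ig\" > outputdir/" $0}'
}

echo 1.txt | gen_pre_cmd
# prints: cat 1.txt | sed -e "s/<pre.*>//ig" | sed -e "s/<\/pre.*>//ig" > outputdir/1.txt
```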

Alternatively, you can put the awk or sed elements into a script file (sedit):

-------------------------------------------8X-----------------------------------
#!/bin/sed -f
s/<html.*>//g
s/<\/html>//g

#Add other substitutions as necessary
-------------------------------------------8X-----------------------------------

chmod +x sedit

You may find this approach easier as you don't have to worry about all of the \ and "s

Now change your command line to be along the lines of

find *.* | awk {'print "./sedit " $0 '} | /bin/bash

And you should be away:)

HTH...
In fact, that looks a lot neater for really complex scripts! Almost elegant in fact!
I may go with the shorter version but I don't completely understand it yet

I have been working on it and I am really close

find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>//ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>//ig\" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>i//ig\" > $outputdir/"$0 '}

The issue I am having is with   -----------> > $outputdir/"$0 '}
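
One possible cause, assuming outputdir is meant to be a shell variable: inside the single-quoted awk program, $outputdir is printed as literal text, so it is only expanded later by the bash that runs the generated line, and only if the variable has been exported. A sketch that passes the value into awk with -v instead (the awk variable name out and the function name gen_cmds are made up):

```shell
outputdir=outputdir                    # example value; set this to the real target directory

# -v copies the shell value into the awk variable "out" before the program runs.
gen_cmds() {
  awk -v out="$outputdir" '{print "cat " $0 " | sed -e \"s/<html.*>//ig\" > " out "/" $0}'
}

echo 1.txt | gen_cmds
# prints: cat 1.txt | sed -e "s/<html.*>//ig" > outputdir/1.txt
```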

I think I will be able to get it to work this way hopefully

I really appreciate your help
I think that the problem is that the command line is longer than 255 chars, so you need to shorten the line by taking out some rules, *OR* do the process in a number of stages (First strip out the html bits, and then the pre bits)
Hmm I guess I would need to go with the shorter version but I don't quite understand how I would output the results.
Try it! - The output is EXACTLY the same process - you've merely moved the sed commands to a file rather than trying to put all the commands on the command line:)
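
Put together, one way the script-file version with saved output might look. This sketch calls sed -f sedit explicitly rather than relying on the #!/bin/sed line, so the location of sed does not matter; the scratch directory and sample file are made up:

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo
printf '<html>\nhello\n</html>\n' > 1.txt
mkdir outputdir

# The sed rules live in a file, so they need no awk or shell escaping.
cat > sedit <<'EOF'
s/<html.*>//g
s/<\/html>//g
EOF

# Generate one command per file; the redirect saves each result in outputdir.
find *.txt | awk '{print "sed -f sedit " $0 " > outputdir/" $0}' | bash
```

As before, leaving off the final | bash shows the generated lines without running them.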