Bash shell: grab the text between <title> and </title> in a web page, modify it, and store it in a variable


I want to grab the text between <title> and </title> in a web page and modify it. Afterwards, I need to store the modified text in a variable, using the Bash shell on Linux.

For example:

<title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title>

Copy this part: « En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr »
Edit it so that it looks like: « En Selle, Sakamichi – Grande Road – S02 Ép 08 »

It needs to use conditions: if the title contains specific words (e.g. Saison, Épisode, Episode), they need to be changed automatically to « S01 », as in Season 01, or « Ép01 », for Épisode 01.

It also has to be able to delete selected words (e.g. vostfr), and likewise special characters.

Thank you !
hexo dark Asked:
simon3270 Commented:
How about:
# Two-digit numbers are handled first, so the single-digit rules also match a number at the very end of the title.
TITLE=$(grep '<title>' /path/to/file.html | sed -e 's,^.*<title>,,' -e 's,</title>.*,,' | \
    sed -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g' \
        -e 's/Season \([0-9][0-9]\)/S\1/g' \
        -e 's/Épisode \([0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9]\)/S0\1/g' \
        -e 's/Season \([0-9]\)/S0\1/g' \
        -e 's/vostfr//g' \
        -e 's/badword//g' \
        -e 's/unwanted words//g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/[ ]*-[ ]*$//' | \
        tr -d "'"'"()^%~#/')

echo "TITLE is <${TITLE}>"


The "grep" finds any lines with "<title>" in them.

The first "sed" then removes the line up to and including the "<title>" bit, and from "</title>" to the end of the line.  This assumes that "<title>" and "</title>" are on the same text line.
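As a rough illustration of that extraction step, here is a minimal sketch run against the sample line from the question instead of a file. The variable names (`html`, `raw_title`, `multi`, `joined`) are made up for the example. The second half is only one possible workaround for a title split over several lines: it joins the lines with `tr` before `sed` sees them, which is an assumption about the page layout, not something the answer above does.

```shell
# Sample line taken from the question; a real page would come from a file.
html='<html><head><title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title></head>'

# Keep only lines containing <title>, then strip everything up to and
# including <title>, and everything from </title> onwards.
raw_title=$(printf '%s\n' "$html" | grep '<title>' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,')
echo "raw title is <${raw_title}>"

# Hypothetical page where the title is spread over three lines.
multi='<title>
Some Title
</title>'

# Join all lines into one (newlines become spaces) before stripping the tags.
joined=$(printf '%s\n' "$multi" | tr '\n' ' ' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,')
echo "joined title is <${joined}>"
```

The leading and trailing spaces that survive here are removed or compressed by the later steps of the pipeline.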

The next sed manipulates the contents.  For example, if "Épisode" is followed by a single digit, this is replaced by "Ép0" followed by the digit; for "Saison" or "Season", the replacement is "S0" followed by the digit.  If the same words are followed by two digits, the extra "0" is not inserted, so "Épisode 12" is replaced by "Ép" followed by "12".  The unaccented "Episode n" is not covered by the lines shown, but it can be handled the same way by copying the "Épisode" lines.  The next three lines remove any unwanted words (here "vostfr", "badword" and "unwanted words").  The next line compresses any multiple spaces to a single space character, and the last one removes a trailing " - " (e.g. if "vostfr" is removed in your example).
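To see these substitutions at work, here is a reduced sketch that applies them to the sample title from the question. The function name `normalize_title` and the second input are made up for the example. The rule ordering (two-digit rules first, then single-digit rules) is a choice made for this sketch, so that a number at the very end of the title is matched as well.

```shell
# Reads a title on stdin and writes the normalized title to stdout.
normalize_title() {
    sed -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g' \
        -e 's/Épisode \([0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9]\)/S0\1/g' \
        -e 's/vostfr//g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/[ ]*-[ ]*$//'
}

# Sample title from the question (single-digit season and episode).
first=$(printf '%s\n' 'En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr ' | normalize_title)
echo "<${first}>"

# Hypothetical title with a two-digit season and an episode number at the end.
second=$(printf '%s\n' 'Show - Saison 12 Épisode 3' | normalize_title)
echo "<${second}>"
```

The first call should print `<En Selle, Sakamichi - Grande Road - S02 Ép08>` and the second `<Show - S12 Ép03>`.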

The "tr" then removes the specified "special" characters.  The long sequence of single and double quotes at the start says that any single or double quotes are removed.  Any other characters can be added, then a final single quote ends the set of characters.
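The quoting is easier to follow when the argument is built separately and printed. This is only a sketch, and the names `chars`, `sample` and `stripped` are invented for it.

```shell
# "'" is a double-quoted single quote; '"()^%~#/' is a single-quoted string
# that starts with a double quote. The shell joins the two pieces into one word.
chars="'"'"()^%~#/'
printf 'tr receives: %s\n' "$chars"

# Hypothetical input containing several of the characters to delete.
sample="It's a \"test\" (50%) #1"
stripped=$(printf '%s\n' "$sample" | tr -d "$chars")
echo "<${stripped}>"
```

The first line of output shows the nine characters that `tr` will delete, and the second should read `<Its a test 50 1>`.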

The "echo" line displays the result - the "<" and ">" are there to show any leading/trailing spaces in the result.

To modify, add any other words that might be used (e.g. copy the episode and season lines for any that you want to modify in the same way, and add any other words you want to remove).  Just keep the format the same, with each line ending in a "\".  You can add almost any character to the set of characters to be deleted too.
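If the list of unwanted words grows, one alternative to adding a "-e" line per word is a small loop. This is a sketch of that idea, not part of the answer above. It assumes the words contain no characters that are special to sed (such as "/", "." or "*"), and the title used here is made up.

```shell
# Hypothetical title that has already been through the season/episode rules.
title='My Show - S02 Ép08 - vostfr badword unwanted words'

# Remove each unwanted word in turn (assumes no sed-special characters in the words).
for word in vostfr badword 'unwanted words'; do
    title=$(printf '%s\n' "$title" | sed -e "s/${word}//g")
done

# Same clean-up as before: compress runs of spaces, drop a trailing " - ".
title=$(printf '%s\n' "$title" | sed -e 's/[ ][ ]*/ /g' -e 's/[ ]*-[ ]*$//')
echo "<${title}>"
```

This should print `<My Show - S02 Ép08>`. The loop runs sed once per word, which is slower than a single sed call but easier to extend.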
hexo dark Author Commented:
It works on a local HTML file, but the web page is on the net :)

Edit: it works!!

I added this: wget -O test.htm weblink
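A possible variation that avoids the temporary file is to let wget write the page to standard output and pipe it into the extraction step. The sketch below wraps the grep and sed commands from the answer in a function named `extract_title` (a made-up name). The URL in the comment is a placeholder, and the demonstration feeds a hand-written page through the function so that it runs without network access.

```shell
# Reads HTML on stdin and prints the text between <title> and </title>
# (same one-line assumption as in the answer above).
extract_title() {
    grep '<title>' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,'
}

# With a real page (placeholder URL), no temporary file is needed:
#   TITLE=$(wget -q -O - 'https://example.com/page.html' | extract_title)

# Offline demonstration with a hand-written page.
TITLE=$(printf '%s\n' '<head>' '<title>Demo page</title>' '</head>' | extract_title)
echo "TITLE is <${TITLE}>"
```

Here "-q" silences wget's progress output and "-O -" sends the download to standard output instead of a file.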
hexo dark Author Commented:
Perfect, good!!! A++, 5 stars!!! Thanks ^^
hexo dark Author Commented:
Just one more question:
how do I remove this character ( - )?
Add \- to the tr command, so

    tr -d "'"'"()^%~#/\-'

The backslash is to stop the "-" being treated as indicating a character range. It is not strictly necessary if the "-" is the last character in the list of characters, but it would become necessary if another character were added to the end of the list, so to be on the safe side I have put it in!
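The difference between a range and an escaped "-" can be checked with a tiny test. This is a sketch with made-up variable names, and it assumes a tr that honours the backslash escape (GNU tr does).

```shell
# Unescaped: 'a-c' is the range a, b, c, so only the dashes survive.
as_range=$(printf '%s\n' 'a-b-c' | tr -d 'a-c')
echo "range:   <${as_range}>"

# Escaped: 'a\-c' is the three characters a, - and c, so only b survives.
as_literal=$(printf '%s\n' 'a-b-c' | tr -d 'a\-c')
echo "literal: <${as_literal}>"

# The extended set from the answer, applied to a title. Deleting " - " leaves
# double spaces, so they are compressed again afterwards.
no_dash=$(printf '%s\n' 'En Selle, Sakamichi - Grande Road - S02 Ép08' | \
    tr -d "'"'"()^%~#/\-' | sed -e 's/[ ][ ]*/ /g')
echo "<${no_dash}>"
```

Note that when "tr" is the last command in the pipeline, deleting the dashes leaves two spaces where each " - " used to be. The extra sed call here is one way to tidy that up.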
hexo dark Author Commented:
ty :)