Solved

Bash shell: grab the text between <title> and </title> in a web page, modify it, and store it in a variable

Posted on 2014-12-08
Last Modified: 2014-12-09
Hi,

I want to grab the text between <title> and </title> in a web page and modify it. Afterwards, I need to copy the modified text into a variable using the bash shell on Linux.

For example:

<title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title>

Copy this part: « En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr »
Edit it so that it looks like: « En Selle, Sakamichi - Grande Road - S02 Ép 08 »

It needs to use conditions. If the title contains a specific word (e.g. Saison, Épisode, Episode), that word should be changed automatically to « S01 » (as in Season 01) or « Ép01 » (for Épisode 01).

It also has to be able to delete a selection of words (e.g. vostfr), and likewise special characters.

Thank you !
Question by:hexo dark
LVL 19

Accepted Solution

by:
simon3270 earned 500 total points
ID: 40487931
How about:
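# Pull the text between <title> and </title>, then tidy it up.
# The first rule of the second sed appends a space, so that a number at the
# very end of the title still has a following character to match; the last
# rule trims any trailing spaces again.
TITLE=$(grep '<title>' /path/to/file.html | sed -e 's,^.*<title>,,' -e 's,</title>.*,,' | \
    sed -e 's/$/ /' \
        -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Episode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Season \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Episode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g' \
        -e 's/Season \([0-9][0-9]\)/S\1/g' \
        -e 's/vostfr//g' \
        -e 's/badword//g' \
        -e 's/unwanted words//g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/[ ]*-[ ]*$//' \
        -e 's/[ ]*$//' | \
        tr -d "'"'"()^%~#/')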

echo "TITLE is <${TITLE}>"


The "grep" finds any lines with "<title>" in them.

The first "sed" then removes the line up to and including the "<title>" bit, and from "</title>" to the end of the line. This assumes that "<title>" and "</title>" are on the same text line.
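If the title is split over several lines, the same-line assumption above no longer holds. One possible workaround, sketched here, is to join the lines before running sed. The function name extract_title is hypothetical, and the sketch assumes the page contains a single lowercase <title> element without attributes.

```bash
#!/usr/bin/env bash
# extract_title (hypothetical helper): read HTML on standard input and print
# the text between <title> and </title>, even when it spans several lines.
extract_title() {
    # Join all lines into one, then add a final newline so sed sees one complete line.
    { tr '\n' ' '; echo; } |
        # Keep only the title text, dropping whitespace around it.
        sed -n 's,^.*<title>[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*</title>.*,\1,p'
}

# Example with a title spread over three lines
printf '<head>\n<title>\n  En Selle, Sakamichi - Saison 2\n</title>\n</head>\n' | extract_title
```

Nothing is printed when no title is found, so the caller can test for an empty result.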

The next sed manipulates the contents. For example, if "Épisode" is followed by a single digit, it is replaced by "Ép0" followed by the digit. The same is done for "Episode n", and for Saison or Season the replacement is "S0" followed by the digit. If the same words are followed by two digits, the extra "0" is not inserted, so "Épisode 12" is replaced by "Ép" followed by "12". The next three lines remove any unwanted words (here "vostfr", "badword" and "unwanted words"). The next line compresses any multiple spaces to a single space character, and the one after that removes a trailing " - " (e.g. the one left behind when "vostfr" is removed in your example).
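To see the single-digit and two-digit rules side by side, here is a small sketch that applies only the Saison/Épisode rules to two sample titles. The function name pad_numbers is hypothetical and the second title is made up for illustration.

```bash
#!/usr/bin/env bash
# pad_numbers (hypothetical helper): apply only the season/episode rules
# to each line of standard input.
pad_numbers() {
    sed -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g'
}

printf '%s\n' 'Grande Road - Saison 2 Épisode 8 - vostfr' \
              'Grande Road - Saison 2 Épisode 12 - vostfr' | pad_numbers
# prints:
#   Grande Road - S02 Ép08 - vostfr
#   Grande Road - S02 Ép12 - vostfr
```

The single-digit rules must come first: they only fire when the digit is followed by a non-digit, so "Épisode 12" is left for the two-digit rule.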

The "tr" then removes the specified "special" characters.  The long sequence of single and double quotes at the start says that any single or double quotes are removed.  Any other characters can be added, then a final single quote ends the set of characters.
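The quoting can be hard to read, so here is a short sketch that builds the same character set in two ways and checks that they agree. The variable names set_a and set_b are made up for this example, and the second spelling relies on bash's $'...' quoting, so it will not work in a plain POSIX sh.

```bash
#!/usr/bin/env bash
# Two spellings of the same set:  ' " ( ) ^ % ~ # /
set_a="'"'"()^%~#/'     # a double-quoted ' followed by a single-quoted "()^%~#/
set_b=$'\'"()^%~#/'     # bash ANSI-C quoting, where \' is a literal single quote

[ "$set_a" = "$set_b" ] && echo "both spellings give: $set_a"

# Delete every character of the set from a sample string
printf '%s\n' "It's (a) \"test\" #1" | tr -d "$set_b"
# prints: Its a test 1
```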

The "echo" line displays the result - the "<" and ">" are there to show any leading/trailing spaces in the result.

To modify, add any other words that might be used (e.g. copy the episode and season lines for any that you want to modify in the same way, and add any other words you want to remove). Just keep the format the same, with each line ending in a "\". You can add almost any character to the set of characters to be deleted too.
 
LVL 2

Author Comment

by:hexo dark
ID: 40487956
It works on a local HTML file, but the web page is on the net :)

Edit: it works!!

I added this: wget -O test.htm weblink
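As an alternative to saving the page first, the download can be piped straight into the extraction step. This is only a sketch: title_filter and fetch_title are hypothetical helper names, the URL is a placeholder, and it assumes wget is installed and that the title sits on one line.

```bash
#!/usr/bin/env bash
# title_filter (hypothetical helper): the grep/sed extraction step, reading
# the HTML from standard input instead of from a file.
title_filter() {
    grep '<title>' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,'
}

# fetch_title (hypothetical helper): download a page and print its title.
fetch_title() {
    # -q silences progress output; -O - writes the page to standard output
    wget -q -O - "$1" | title_filter
}

# Usage (placeholder URL):
#   TITLE=$(fetch_title 'http://example.com/page.html')
```

The remaining sed and tr steps can then be applied to the result in the same way as before.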
 
LVL 2

Author Closing Comment

by:hexo dark
ID: 40487964
Perfect, works great!!! A++, 5 stars!!! Thank you ^^

LVL 2

Author Comment

by:hexo dark
ID: 40487973
Just one more question:
how do I remove this character ( - )?
 
LVL 19

Expert Comment

by:simon3270
ID: 40488605
Add \- to the tr command, so

    tr -d "'"'"()^%~#/\-'

The backslash is to stop the "-" being treated as indicating a character range. It is not strictly necessary if the "-" is the last character in the list of characters, but it would become necessary if another character were added to the end of the list, so to be on the safe side I have put it in!
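A short sketch of the difference between a "-" at the end of the set and a "-" in the middle, using made-up sample strings. The second tr in the first pipeline is an addition for this example: it squeezes the double space left behind once the dash is gone.

```bash
#!/usr/bin/env bash
# A '-' placed last cannot start a range, so tr takes it literally.
printf '%s\n' 'Grande Road - S02 (Ép08)' | tr -d '()-' | tr -s ' '
# prints: Grande Road S02 Ép08

# A '-' between two characters forms a range: '(-/' means every character
# from '(' to '/', which includes ',' and '.' as well as '-'.
printf '%s\n' 'a,b.c-d' | tr -d '(-/'
# prints: abcd
```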
 
LVL 2

Author Comment

by:hexo dark
ID: 40489164
ty :)