Solved

Bash shell: grab the text between <title> and </title> in a web page, modify it, and store it in a variable

Posted on 2014-12-08
Last Modified: 2014-12-09
Hi,

I want to grab the text between <title> and </title> in a web page, modify it, and then store the modified text in a variable, using the bash shell on Linux.

For example:

<title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title>

Copy this part: « En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr »
Edit it so that it looks like: « En Selle, Sakamichi - Grande Road - S02 Ép 08 »

It needs to use conditions: if the title contains specific words (e.g. Saison, Épisode, Episode), they must be changed automatically to « S01 » (as in Season 01) or « Ép01 » (as in Épisode 01).

It must also be able to delete selected words (e.g. vostfr), as well as special characters.

Thank you!
Question by:hexo dark
6 Comments
 
LVL 19

Accepted Solution

by:
simon3270 earned 500 total points
ID: 40487931
How about:
TITLE=$(grep '<title>' /path/to/file.html | sed -e 's,^.*<title>,,' -e 's,</title>.*,,' | \
    sed -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Episode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Season \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Episode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g' \
        -e 's/Season \([0-9][0-9]\)/S\1/g' \
        -e 's/vostfr//g' \
        -e 's/badword//g' \
        -e 's/unwanted words//g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/[ ]*-[ ]*$//' | \
        tr -d "'"'"()^%~#/')

echo "TITLE is <${TITLE}>"


The "grep" finds any lines with "<title>" in them.

The first "sed" then removes the line up to and including the "<title>" bit, and from "</title>" to the end of the line. This assumes that "<title>" and "</title>" are on the same text line.
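
As a minimal sketch of that first stage, the snippet below runs the same grep and sed pair on a sample line instead of a file. The sample HTML and the variable names `html` and `raw` are made up for illustration.

```shell
# Hypothetical sample line standing in for /path/to/file.html
html='<html><head><title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title></head>'

# Keep only lines containing <title>, then strip everything up to and
# including <title>, and everything from </title> to the end of the line.
raw=$(printf '%s\n' "$html" | grep '<title>' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,')

# The angle brackets make the trailing space visible.
echo "raw is <${raw}>"
```

Note that the trailing space before "</title>" survives this stage; it is only dealt with by the later clean-up rules.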

The next sed manipulates the contents. For example, if "Épisode" is followed by a single digit, it is replaced by "Ép0" followed by the digit. The same is done for "Episode n", and for "Saison" or "Season", which become "S0" followed by the digit. If the same words are followed by two digits, the extra "0" is not inserted, so "Épisode 12" is replaced by "Ép" followed by "12". The next three lines remove any unwanted words (here "vostfr", "badword" and "unwanted words"). The next line compresses any run of spaces to a single space character, and the last one removes a trailing " - " (e.g. the one left behind when "vostfr" is removed in your example).
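
To see those rules at work, here is a small sketch that applies a subset of them to the title from the question. The variable names `raw` and `cleaned` are only for illustration, and only the rules that this particular title triggers are included.

```shell
# The text between <title> and </title> from the question, trailing space included
raw='En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr '

cleaned=$(printf '%s\n' "$raw" | sed \
    -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
    -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
    -e 's/vostfr//g' \
    -e 's/[ ][ ]*/ /g' \
    -e 's/[ ]*-[ ]*$//')

# Step by step: "Épisode 8 " -> "Ép08 ", "Saison 2 " -> "S02 ",
# "vostfr" is deleted, runs of spaces collapse, the trailing " - " goes.
echo "cleaned is <${cleaned}>"
# prints: cleaned is <En Selle, Sakamichi - Grande Road - S02 Ép08>
```

One thing to be aware of: the single-digit patterns end in `[^0-9]`, so they need some character after the digit. A title that ends directly in "Épisode 8", with nothing after the 8, would not be rewritten by these rules.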

The "tr" then removes the specified "special" characters.  The long sequence of single and double quotes at the start says that any single or double quotes are removed.  Any other characters can be added, then a final single quote ends the set of characters.
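
The quoting can be checked on its own with a short sketch like this one. The sample string and the variable names are invented for the demonstration.

```shell
# The tr argument is two quoted pieces glued together by the shell:
#   "'"          -> a literal single quote
#   '"()^%~#/'   -> a literal double quote followed by ( ) ^ % ~ # /
sample="L'été (2014) #1 \"final\""

stripped=$(printf '%s\n' "$sample" | tr -d "'"'"()^%~#/')

echo "stripped is <${stripped}>"
# prints: stripped is <Lété 2014 1 final>
```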

The "echo" line displays the result - the "<" and ">" are there to show any leading/trailing spaces in the result.

To modify it, add any other words that might be used: copy the episode and season lines for any words you want to rewrite in the same way, and add lines for any other words you want to remove. Just keep the format the same, with each line ending in a "\". You can also add almost any character to the set of characters to be deleted.
 
LVL 2

Author Comment

by:hexo dark
ID: 40487956
It works on a local HTML file, but the web page is on the net :)

Edit: it works!!

I added this: wget -O test.htm weblink
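
For reference, a possible way to skip the temporary file is to let wget write the page to standard output and pipe it straight into the extraction. This is only a sketch: the function name `extract_title` and the URL in the comment are hypothetical, and the demonstration uses a made-up page so that it runs offline.

```shell
# Read HTML on standard input and print the text between <title> and </title>
# (same grep/sed pair as the accepted solution; both tags assumed on one line).
extract_title() {
    grep '<title>' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,'
}

# Offline demonstration with a made-up page:
TITLE=$(printf '<head>\n<title>Example page</title>\n</head>\n' | extract_title)
echo "TITLE is <${TITLE}>"

# With a real page (hypothetical URL), "-O -" sends the download to stdout
# and "-q" silences the progress output:
#   TITLE=$(wget -q -O - 'http://example.com/page.html' | extract_title)
```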
 
LVL 2

Author Closing Comment

by:hexo dark
ID: 40487964
Perfect, great!!! A++, 5 stars!!! Thanks ^^
 
LVL 2

Author Comment

by:hexo dark
ID: 40487973
Just one more question: how do I remove this character ( - )?
 
LVL 19

Expert Comment

by:simon3270
ID: 40488605
Add \- to the tr command, so

    tr -d "'"'"()^%~#/\-'

The backslash stops the "-" from being treated as a character range indicator. It is not strictly necessary if the "-" is the last character in the list, but it would become necessary if another character were added to the end of the list, so to be on the safe side I have put it in!
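
A quick sketch of the effect, assuming GNU tr (how a backslash before "-" is handled may differ in other tr implementations). The sample string and the variable names are illustrative, and the final sed is an addition that is not in the command above: in the accepted solution the tr runs after the space-compressing sed, so deleting the hyphens leaves doubled spaces behind.

```shell
sample='En Selle, Sakamichi - Grande Road - S02 Ép08'

# Delete the special characters, now including the hyphen, then collapse
# the doubled spaces that the deleted " - " separators leave behind.
no_dash=$(printf '%s\n' "$sample" | tr -d "'"'"()^%~#/\-' | sed -e 's/[ ][ ]*/ /g')

echo "no_dash is <${no_dash}>"
# prints: no_dash is <En Selle, Sakamichi Grande Road S02 Ép08>
```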
 
LVL 2

Author Comment

by:hexo dark
ID: 40489164
ty :)