Solved

Bash shell : grab the name in a web page between the <title> and </title> and then modify that information to a variable

Posted on 2014-12-08
6
123 Views
Last Modified: 2014-12-09
Hi,

What I want to be able to do is to grab the name in a web page between the <title> and </title> and then modify that information. And then afterwards, I need to be able to copy the modified information to a variable with bash shell linux.

For exemple :

<title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title>

Copy this part « En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 – vostfr »
Edit it so that it looks like « En Selle, Sakamichi – Grande Road – S02 Ép 08

It needs to use conditions. If the title uses a specific word. Ex : Saison, Épisode, Episode. They need to be changed automatically to « S01 » as in Season 01 or « Ép01 » for Épisode 01.

Also, it has to be able to delete a selection of words. Ex : vostfr. Same goes for special characters

Thank you !
0
Comment
Question by:hexo dark
  • 4
  • 2
6 Comments
 
LVL 19

Accepted Solution

by:
simon3270 earned 500 total points
Comment Utility
How about:
TITLE=$(grep '<title>' /path/to/file.html | sed -e 's,^.*<title>,,' -e 's,</title>.*,,' | \
    sed -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Season \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g' \
        -e 's/Season \([0-9][0-9]\)/S\1/g' \
        -e 's/vostfr//g' \
        -e 's/badword//g' \
        -e 's/unwanted words//g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/[ ]*-[ ]*$//' | \
        tr -d "'"'"()^%~#/')

echo "TITLE is <${TITLE}>"

Open in new window

The "grep" finds any lines with "<title>" in them.

The first "sed" then removes the line up to and including the "<title>" bit, and from "</title>" to the end of the line.  This assumes that "<title> and "</title>" are on the same text line.

The next sed manipulates the contents.  For example, if "Épisode" is followed by a single digit, this is replaced by "Ép0" followed by the digit.  The same is done foe "Episode n", and for Saison or Season, "S0" followed by the digit.  If the same words are followed by two digits, the extra "0" is not inserted, so "Épisode 12" is replace by "Ép" followed by the "12".  the next three lines remove any unwanted words (here "vostrfr", "badword" and "unwanted words").  The next line compresses any multiple spaces to a single space character, and the next removes a trailing " - " (e.g. if "vostfr" is removed in your example).

The "tr" then removes the specified "special" characters.  The long sequence of single and double quotes at the start says that any single or double quotes are removed.  Any other characters can be added, then a final single quote ends the set of characters.

The "echo" line displays the result - the "<" and ">" are there to show any leading/trailing spaces in the result.

To modify, add any other words that might be used (e.g. copy the episode and season lines for any that you want to modify in the same way, and add any other words you want to remove).  Just keep the format the same, with each line ending in a "\".  You can ad almost any character the the set of characters to be deleted too.
0
 
LVL 2

Author Comment

by:hexo dark
Comment Utility
works on a local html files  the web page is on the net :)

edit it works !!

i add this wget -O test.htm weblink
0
 
LVL 2

Author Closing Comment

by:hexo dark
Comment Utility
nikel good !!! a++ 5 star !!! ty ^^
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 2

Author Comment

by:hexo dark
Comment Utility
juste one more question
how to remove this character ( - )
0
 
LVL 19

Expert Comment

by:simon3270
Comment Utility
Add \- to the tr command, so

    tr -d "'"'"()^%~#/\-'

The backslash is to stop the "-" being treated as indicating a character range. It is not strictly necessary if the "-" is the last character in the list of characters, but it would become necessary if another character were added to the end of the  list, so to be on the safe side I have put it in!
0
 
LVL 2

Author Comment

by:hexo dark
Comment Utility
ty :)
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

This document is written for Red Hat Enterprise Linux AS release 4 and ORACLE 10g.  Earlier releases can be installed using this document as well however there are some additional steps for packages to be installed see Metalink. Disclaimer: I hav…
The purpose of this article is to demonstrate how we can upgrade Python from version 2.7.6 to Python 2.7.10 on the Linux Mint operating system. I am using an Oracle Virtual Box where I have installed Linux Mint operating system version 17.2. Once yo…
In this tutorial you'll learn about bandwidth monitoring with flows and packet sniffing with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're interested in additional methods for monitoring bandwidt…
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now