Extracting ALT Text from Multiple HTML Pages


I am trying to find out how I can:

Extract the ALT text from an IMG tag inside an identified div
Do this across multiple HTML files

For example, I need to extract ONLY from the following div, whose id is content. There are other divs and other ALT text on the page.

<div id ="content"><img src="../../images/nameofimage.jpg" alt="Tool Box"></div>

Any tool or technology will work (regex, Dreamweaver, or Perl).
Lucas Bishop (Click Tracker) commented:
You could build a basic scraper for this specific div tag using Kimono Labs.

It's a fairly straightforward system if you watch this video.

Here is their Chrome extension. You'll want to switch it into the "Data Model" view, so you can access the code of the site, instead of the rendered view. Here is a basic overview on how to extract html elements.

I've used this before to do something similar where I was pulling a specific ASIN number from Amazon search results to identify search rankings of specific products. Made it very easy to deep dive into Amazon's rankings without having to visit the site at all.
Duy Pham (Freelance IT Consultant) commented:
It might not be related to this topic, but you can easily do that using HtmlAgilityPack in C#.
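using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("page.html"); // hypothetical path; load each HTML file in turn

// get all img elements that have an alt attribute inside the div with id="content"
// (SelectNodes returns null when nothing matches)
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//div[@id='content']//img[@alt]");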

Just my 2 cents.
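
For anyone without a C# setup, the same idea (only look at img tags nested inside the div with id content) can be sketched in Python using just the standard library. This is a minimal sketch, not the poster's method: the names ContentAltParser, extract_alts and extract_from_folder are made up for illustration, and it assumes the files are readable as UTF-8 and that div tags are properly closed.

```python
from html.parser import HTMLParser
from pathlib import Path


class ContentAltParser(HTMLParser):
    """Collect alt text of <img> tags nested inside <div id="content">."""

    def __init__(self, div_id="content"):
        super().__init__()
        self.div_id = div_id
        self.depth = 0   # <div> nesting depth inside the target div; 0 means outside
        self.alts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # nested div inside the target div
            elif attrs.get("id") == self.div_id:
                self.depth = 1           # entered the target div
        elif tag == "img" and self.depth > 0 and attrs.get("alt") is not None:
            self.alts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_alts(html, div_id="content"):
    """Return the alt texts found inside the div with the given id."""
    parser = ContentAltParser(div_id)
    parser.feed(html)
    parser.close()
    return parser.alts


def extract_from_folder(folder, pattern="*.html"):
    """Map each matching file name in a folder to its list of alt texts."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract_alts(text)
    return results
```

Calling extract_from_folder on the directory that holds the pages gives one list of alt texts per file, which covers the "multiple HTML files" part of the question.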
Here's a (somewhat) manual method using Notepad++ (N++)

- Backup your files
- Open your files in N++
- Ctrl-F

Find tab...
Find what: (<div.+alt=")(.+)(".+)
Regular expression
Wrap around

Mark tab...
Bookmark line
Wrap around
Mark all
This marks all relevant lines.

From menu...Search, Bookmark, Remove unmarked lines

Find tab...
Find what: (<div.+alt=")(.+)(".+)
Regular expression
Wrap around

Replace tab...
Replace with: \2
Replace all

If that works satisfactorily, you can use the Macro to record the sequence...
From menu...Macro, Start recording.
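Do the first sequence (mark the lines, then remove unmarked lines).
Do the second sequence (find and replace with \2).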
Stop recording.
Save current recorded macro to a name

Pick each file.
Run macro: Macro, pick saved name
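
If recording a macro per file gets tedious, the line-by-line regex idea above can be sketched in Python. This is only a rough equivalent under stated assumptions: the pattern is narrowed to the div with id content (the Notepad++ pattern above matches any div), the name alts_from_lines is made up for illustration, and like the Notepad++ method it only works when the div and the img sit on the same line.

```python
import re

# Match a line containing <div id="content" ...> followed by an alt="..." attribute.
# Whitespace around "=" is tolerated, as in the sample markup (<div id ="content">).
ALT_IN_CONTENT_DIV = re.compile(r'<div\s+id\s*=\s*"content".*?\balt="([^"]*)"')


def alts_from_lines(text):
    """Return the first alt value on each line that contains the content div."""
    found = []
    for line in text.splitlines():
        match = ALT_IN_CONTENT_DIV.search(line)
        if match:
            found.append(match.group(1))
    return found
```

Reading each file and passing its text to alts_from_lines replaces the mark, remove and replace steps in one pass. A regex like this is fragile on real-world HTML, so treat it as a quick fix rather than a general parser.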
rgarimella (Author) commented:
Pham, is it possible to enter the code in Dreamweaver, or would I need Visual Studio to test it?
Duy Pham (Freelance IT Consultant) commented:
@rgarimella: Dreamweaver? I suppose you are doing the extraction from inside a website or web application. In that case you can use jQuery, which is just as simple as the HtmlAgilityPack code above:
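            // get html content from a url
            $.ajax({
                method: 'GET',   // use POST if needed
                url: '<url_to_extract_alt_texts>',
                dataType: 'html',
                success: function (result) {
                    // wrap the response in a container so find() also sees top-level elements
                    $('<div>').html(result).find('div[id="content"] img[alt]').each(function (idx, obj) {
                        // do something with the extracted alt text
                        console.log($(obj).attr('alt'));
                    });
                }
            });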

Question has a verified solution.