nicholassolutions
asked on
displaying URLs with lynx -dump
Hello,
I am programming a PHP script in which I need to convert web pages to plain text. Currently, I am using something like this:
$text =`lynx -cfg lcfg.cfg -dump http://www.google.com/`;
which simply assigns the text dumped by lynx to the variable $text. That is, it is equivalent to running this from the shell:
lynx -cfg lcfg.cfg -dump http://www.google.com/
This produces the following output:
Google
Web [1]Images [2]Groups [3]News [4]Froogle [5]more »
__________________________ __________ __________ _________
Google Search I'm Feeling Lucky [6]Advanced Search
[7]Preferences
[8]Language Tools
[9]Advertising Programs - [10]Business Solutions - [11]About Google
©2004 Google - Searching 4,285,199,774 web pages
References
1. http://www.google.com/imghp?hl=en&tab=wi&ie=UTF-8
2. http://www.google.com/grphp?hl=en&tab=wg&ie=UTF-8
3. http://www.google.com/nwshp?hl=en&tab=wn&ie=UTF-8
4. http://www.google.com/froogle?hl=en&tab=wf&ie=UTF-8
5. http://www.google.com/options/index.html
6. http://www.google.com/advanced_search?hl=en
7. http://www.google.com/preferences?hl=en
8. http://www.google.com/language_tools?hl=en
9. http://www.google.com/ads/
10. http://www.google.com/services/
11. http://www.google.com/about.html
This is great, except I would like the URLs to be included in the text itself. For example, instead of
[1]Images
I would like something like
Images[http://www.google.com/imghp?hl=en&tab=wi&ie=UTF-8]
Does anyone know a command-line flag or configuration that would let me do something like this with lynx? I am new to lynx, so I may need a little help getting it to work.
Thanks in advance for your help.
Cheers,
Matt
ASKER CERTIFIED SOLUTION
SOLUTION
ASKER
Yes, I'd thought of that too. Since each referenced link appears as e.g. [1]link1, [2]link2, etc., it is not too hard to tack on the URLs given the References list. I was just looking for the "easy way out". Actually, I was concerned about pages containing bracketed numbers confusing my parser, or at least that is my story ;)
Thanks to both of you for your help -- I'll assign pts shortly.
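For anyone who lands here later, the post-processing approach described above can be sketched as a small shell filter. This is only a sketch under some assumptions: the function name inline_refs is made up for illustration, the dump is assumed to end with a "References" section formatted like the output in the question, and each [n] marker is replaced in place (so the URL lands before the link text rather than after it, since the dump does not mark where the link text ends). As noted above, literal bracketed numbers in the page text can still be mistaken for markers if they happen to match a reference number.

```shell
#!/bin/sh
# inline_refs (hypothetical helper): reads `lynx -dump` output on stdin,
# replaces each [n] marker with [URL] using the trailing "References"
# list, and prints the page body without that list.
inline_refs() {
  awk '
    # Buffer every line; the reference list only appears at the end.
    { lines[NR] = $0 }
    # Remember the last line that consists solely of "References".
    /^[ \t]*References[ \t]*$/ { ref_start = NR }
    END {
      # No reference list found: treat the whole input as body text.
      if (ref_start == 0) ref_start = NR + 1

      # Parse entries of the form "  12. http://example.com/".
      for (i = ref_start + 1; i <= NR; i++) {
        if (lines[i] ~ /^[ \t]*[0-9]+\.[ \t]+[^ \t]/) {
          split(lines[i], parts, " ")
          n = parts[1]
          sub(/\.$/, "", n)
          url[n] = parts[2]
        }
      }

      # Rewrite the body. Plain concatenation is used instead of gsub
      # so that "&" inside a URL is never treated as a special character.
      for (i = 1; i < ref_start; i++) {
        out = ""
        rest = lines[i]
        while (match(rest, /\[[0-9]+\]/)) {
          n = substr(rest, RSTART + 1, RLENGTH - 2)
          out = out substr(rest, 1, RSTART - 1)
          if (n in url) {
            out = out "[" url[n] "]"
          } else {
            # Unknown number: leave the bracketed text untouched.
            out = out substr(rest, RSTART, RLENGTH)
          }
          rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
      }
    }
  '
}
```

With the function defined, the command from the question would be piped through it as lynx -cfg lcfg.cfg -dump http://www.google.com/ | inline_refs, and the same pipeline should work inside the PHP backticks if the function is saved as a standalone script.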