wsyy
asked on
How to save a web page with its url as file name
Hi,
I want to save a web page with its url being the file name. However, there are quite a lot of characters that can't be accepted by either Ubuntu or Windows file systems. Please see below:
http://www.bestbuy.com/site/HP+-+Laptop+/+AMD+Phenom%26%23153%3B+II+Processor+/+15.6%22+Display+/+3GB+Memory+/+320GB+Hard+Drive+-+Biscotti/1945374.p?id=1218301987141&skuId=1945374
I want to know how to convert such a url to an acceptable file name.
Thanks
You don't need to use UTF:
This explanation is from the first link which gurvinder posted:
The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
The special characters ".", "-", "*", and "_" remain the same.
The space character " " is converted into a plus sign "+".
All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.
For example, using UTF-8 as the encoding scheme, the string "The string ü@foo-bar" would get converted to "The+string+%C3%BC%40foo-bar" because in UTF-8 the character ü is encoded as two bytes C3 (hex) and BC (hex), and the character @ is encoded as one byte 40 (hex).
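The rules quoted above can be sketched in a few lines. This is a hand-written illustration in Python, not the library the quoted documentation describes; the names `form_encode` and `url_to_filename` are made up for this example. One assumption is added on top of the quoted rules: "*" is left unchanged by them, but Windows does not allow "*" in file names, so the file-name helper escapes it as well.

```python
# Characters that the quoted rules leave unchanged.
SAFE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789.-*_"
)


def form_encode(text, encoding="utf-8"):
    """Apply the quoted rules: keep safe characters, turn a space into '+',
    and write every other character as %XY for each of its encoded bytes."""
    out = []
    for ch in text:
        if ch in SAFE:
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode(encoding))
    return "".join(out)


def url_to_filename(url, encoding="utf-8"):
    """Hypothetical helper: encode the URL, then also escape '*',
    which the quoted rules keep but Windows rejects in file names."""
    return form_encode(url, encoding).replace("*", "%2A")


print(form_encode("The string ü@foo-bar"))
print(url_to_filename("http://www.example.com/a b/page.p?id=1&sku=2"))
```

Note that a long URL can still produce a name longer than the file system allows, so the result may need to be shortened.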
ASKER
for_yan, thanks for the additional input.
If the URL contains Chinese words, I do want to keep the Chinese words in the file name. Do I need to use "gb2312" or "gb18030", or can I just keep using "utf-8"?
The reason I ask is that I don't know whether the URL (which comes in as input from another application) is encoded in UTF-8 or not.
I'm not sure, you can give it a try. Are Chinese characters OK to have in the file names?
ASKER
Yes, it is OK to have Chinese characters in the file name.
Then just try both ways. I cannot try it myself because I don't have any Chinese characters to test with.
use big5
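To see why the choice of encoding matters when trying the options named in this thread, here is a small Python sketch that percent-encodes the same Chinese text under each one. The helper name `percent_bytes` and the sample text are made up for illustration. The same characters become different bytes under each encoding, so the "%xy" sequences differ too.

```python
def percent_bytes(text, encoding):
    """Percent-encode every byte of text under the given encoding."""
    return "".join("%{:02X}".format(b) for b in text.encode(encoding))


# Compare the encodings mentioned in this thread on the same sample text.
for enc in ("utf-8", "gb2312", "gb18030", "big5"):
    print(enc, percent_bytes("中文", enc))
```

Whichever encoding is used to write the name has to be used again to read it back, otherwise the Chinese characters will not come out the same.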
ASKER
What if the URL contains Chinese characters? How will encoding with "utf-8" affect the result?
Was "utf-8" picked arbitrarily, or should I detect the encoding of the URL first?
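One possible way to approach the detection question, sketched in Python and not confirmed by anyone in this thread: turn the "%xy" sequences back into raw bytes, try to read them strictly as UTF-8, and only fall back to another encoding if that fails. The function name `decode_url_text` and the default fallback of "gb18030" are assumptions made for this example. It is a heuristic, since a byte sequence from another encoding can occasionally also be valid UTF-8.

```python
from urllib.parse import unquote_to_bytes


def decode_url_text(url, fallback="gb18030"):
    """Percent-decode url to bytes, then guess the text encoding:
    strict UTF-8 first, the fallback encoding if that fails."""
    raw = unquote_to_bytes(url)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the fallback (an assumption, not a fact
        # about the URL); undecodable bytes are replaced rather than raising.
        return raw.decode(fallback, errors="replace")


print(decode_url_text("%E4%B8%AD%E6%96%87"))  # bytes written as UTF-8
print(decode_url_text("%D6%D0%CE%C4"))        # bytes written as GB2312
```

The decoded text keeps the Chinese characters readable, but it can still contain characters such as ":", "/" and "?" that file systems reject, so those would need to be replaced or escaped before using it as a file name. This sketch also leaves "+" as it is and does not turn it back into a space.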