
Re: UTF-8 links in big5 page



In [emacs-w3m : No.11648] jidanni@xxxxxxxxxxx wrote:
> In http://sex.ncu.edu.tw/activities/recent.htm
> emacs-w3m thinks a link is
> 404
> http://sex.ncu.edu.tw/activities/documents/%B3%B7%A6%5A%A8%67%AD%B7(%C1%BF%AE%79).pdf
> Firefox thinks it is
> 400
> http://sex.ncu.edu.tw/activities/documents/%E9%9B%AA%E5%90%8E%E7%8B%82%E9%A2%A8(%E8%AC%9B%E5%BA%A7).pdf

> Maybe emacs-w3m is right (big5), but Firefox gets us the PDF.

AFAIK sites fall into three groups: some require the browser to
encode a url with the same charset as the page that links to it,
some accept either the page's charset or utf-8, and some require
utf-8 unconditionally.  This site is the last case, whereas
emacs-w3m always behaves as if it were the first.  I don't know
which is the majority, but I think we need an option to change
the behavior site by site anyway.  I'll work on this.  Maybe
always using utf-8 will become the default.  Here is a makeshift
workaround:

(defadvice w3m-url-transfer-encode-string (before modify-charset
                                                  (url &optional coding)
                                                  activate)
  "Use `utf-8' to encode urls to retrieve for http://*.ncu.edu.tw/."
  (when (string-match
	 "\\`https?://\\(?:[^./?#]+\\.\\)*ncu\\.edu\\.tw/"
	 url)
    (setq coding 'utf-8)))

If you try the workaround, you should also add the following so
that urls on that site are decoded as utf-8 when displayed:

(add-to-list
 'w3m-show-decoded-url
 '("\\`https?://\\(?:[^./?#]+\\.\\)*ncu\\.edu\\.tw/" . utf-8))
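
For reference, the difference between the two urls quoted above can
be reproduced with a small sketch like the one below.  The helper
`my-percent-encode-non-ascii' is a hypothetical name made up for this
illustration and is not part of emacs-w3m; the file name is simply
what the Firefox url decodes to as utf-8.  It only shows how the same
characters turn into different percent-escapes depending on the
coding system, not how emacs-w3m encodes urls internally.

```elisp
;; Hypothetical helper for illustration only; not an emacs-w3m function.
(defun my-percent-encode-non-ascii (string coding)
  "Percent-encode each non-ASCII character of STRING as bytes in CODING.
ASCII characters are left untouched."
  (mapconcat
   (lambda (char)
     (if (< char 128)
         (string char)
       ;; Encode the single character, then write every byte as %XX.
       (mapconcat (lambda (byte) (format "%%%02X" byte))
                  (encode-coding-string (string char) coding)
                  "")))
   string ""))

;; What emacs-w3m requested (page charset, big5):
(my-percent-encode-non-ascii "雪后狂風(講座).pdf" 'big5)
;; => "%B3%B7%A6%5A%A8%67%AD%B7(%C1%BF%AE%79).pdf"

;; What Firefox requested (utf-8):
(my-percent-encode-non-ascii "雪后狂風(講座).pdf" 'utf-8)
;; => "%E9%9B%AA%E5%90%8E%E7%8B%82%E9%A2%A8(%E8%AC%9B%E5%BA%A7).pdf"
```

Both strings name the same file; the server in question simply
accepts only the second form.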