
Re: Any treat-dumbquotes for eight-bit-controls?



Forwarding the reply from Sergei.

-- 
有沢 明宏

;; Thank you for the custom improvements > Yamaoka-san
--- Begin Message ---
>>>>> "ARISAWA" == ARISAWA Akihiro <ari@mbf.ocn.ne.jp> writes:

>>>>> In [emacs-w3m : No.06558] 
  >>> Sergei Pokrovsky <pok@nbsp.nsk.su> wrote:

  >> The main problem is in the eight-bit-controls, the code interval
  >> 0x80..0x9F which is used in the MS Windows codepages.  The utf-8
  >> console version of w3m renders them quite well, but they remain a
  >> problem within emacs.  gnus-article-treat-dumbquotes somehow solves

  ARISAWA> Emacs-w3m cannot treat pages written by a coding-system
  ARISAWA> which Emacs doesn't support. And Emacs21 does not support
  ARISAWA> windows-1252 by default.

Yes, I know.  Or, to be more precise, it does not support the
eight-bit-controls.

  ARISAWA> I created windows-1252 coding-system. How about using this?
  ARISAWA> http://www.nijino.com/ari/emacs/cp1252.el

  ARISAWA> ;; Emacs cannot detect windows-1252 automatically, so you may need
  ARISAWA> ;; to specify the coding-system by typing "C c windows-1252 [RET]"
  ARISAWA> ;; in "*w3m*" buffer.

Hm, thank you, but I am not sure how I should use it.

But in the meantime I've realized that the problem is worse than I
imagined, and that it is not emacs-w3m's fault.

Here is an example -- AltaVista's entry page
http://www.altavista.com/

I see in the 4th line from the bottom:
,----
| Translate    Toolbar   Yellow Pages    People Finder    More \233\233
`----

These two \233 are intended to be
155:SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
but actually the page source contains: "More &#155;&#155;".

Now, these &#155; are an obvious violation of the HTML standard: the
web-page author intended them to represent the windows-1252 code, but
according to the Standard a numeric entity must be interpreted as a
UCS code point.  The browser (w3m) interprets it correctly, as a UCS
control code, which has no representation in the console; emacs-w3m
also interprets it correctly (i.e. according to the Standard) and
renders it as an octal sequence.

The problem is that such violations are rather numerous; it is
scandalous that even such a solid server as AltaVista is breaking the
standard.  Thus it may make sense to make an exception for pages in
windows-* encodings, and interpret their numeric entities from the
eight-bit-controls range as single-byte codes (leaving the
interpretation of the other numeric entities according to the
Unicode tables, which is very important).

But that should be treated by w3m rather than by emacs-w3m; emacs-w3m
should receive the expected values according to the encoding that it
requests (in my case, in utf-8).
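The exception proposed above could be sketched roughly as follows.
This is only an illustration in Emacs Lisp, not code from w3m or
emacs-w3m: the names my-cp1252-c1-table and my-fix-c1-entities are
hypothetical, the table is deliberately partial, and it handles only
decimal entities.  Entities outside the eight-bit-controls range are
left untouched.

```elisp
;; Hypothetical helper; not part of w3m or emacs-w3m.
(defvar my-cp1252-c1-table
  '((145 . #x2018)   ; LEFT SINGLE QUOTATION MARK
    (146 . #x2019)   ; RIGHT SINGLE QUOTATION MARK
    (147 . #x201C)   ; LEFT DOUBLE QUOTATION MARK
    (148 . #x201D)   ; RIGHT DOUBLE QUOTATION MARK
    (150 . #x2013)   ; EN DASH
    (151 . #x2014)   ; EM DASH
    (155 . #x203A))  ; SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
  "Partial map from windows-1252 codes in 128..159 to UCS code points.")

(defun my-fix-c1-entities (html)
  "Return HTML with decimal entities found in `my-cp1252-c1-table' remapped.
Entities whose code is not in the table are returned unchanged."
  (replace-regexp-in-string
   "&#\\([0-9]+\\);"
   (lambda (match)
     ;; The match data refer to MATCH itself while this function runs.
     (let* ((code (string-to-number (match-string 1 match)))
            (ucs (cdr (assq code my-cp1252-c1-table))))
       (if ucs (format "&#%d;" ucs) match)))
   html t t))
```

For instance, (my-fix-c1-entities "More &#155;&#155;") rewrites both
entities to &#8250;, the UCS code point of the intended character.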

Actually, what cures the problem in emacs is setting

(standard-display-ascii 150 "--")
(standard-display-ascii 151 "---")
(standard-display-ascii 155 ">")
...
etc.

I guess there should be a package for doing this, though I do not
know of any such thing.
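Lacking such a package, the settings above can be driven from a small
table.  The following is a minimal sketch only: my-c1-display-alist
and my-c1-display-setup are hypothetical names, and the entries for
145..148 are my own guesses at reasonable ASCII substitutes rather
than anything tested on real pages.

```elisp
(require 'disp-table)   ; provides `standard-display-ascii'

;; Hypothetical table; only 150, 151 and 155 come from the settings above.
(defvar my-c1-display-alist
  '((145 . "`")     ; left single quote (assumed substitute)
    (146 . "'")     ; right single quote (assumed substitute)
    (147 . "\"")    ; left double quote (assumed substitute)
    (148 . "\"")    ; right double quote (assumed substitute)
    (150 . "--")    ; en dash
    (151 . "---")   ; em dash
    (155 . ">"))    ; single right-pointing angle quotation mark
  "Alist of (CODE . STRING): display character CODE as the ASCII STRING.")

(defun my-c1-display-setup ()
  "Install every entry of `my-c1-display-alist' in the standard display table."
  (dolist (pair my-c1-display-alist)
    (standard-display-ascii (car pair) (cdr pair))))

(my-c1-display-setup)
```

This changes only how the characters are displayed; the buffer text
itself still contains the eight-bit-controls.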

What confused me was that I did not see the octals in the console
session, and also the strange fact that w3m correctly renders
&larr; and &rarr; (as left and right arrows) while emacs-w3m does
not -- see the bottom (line +83 = -13) of another search engine's page,
http://www.yandex.ru/yandsearch?nl=0&stype=&text=emacs

  >> Basically it works for me, with 2 exceptions: the eight-bit-controls
  >> I've mentioned above, and the bookmark list which I'd like to keep in
  >> utf-8 (as this is done from outside Emacs), but which is spoiled if I
  >> try to add an item from within Emacs-w3m (the new entry is added in
  >> some unreadable encoding).

  ARISAWA> Is the following setting helpful to you?
  ARISAWA> (setq w3m-bookmark-file-coding-system 'utf-8)

Yes it is, thank you.

-- 
Sergei
--- End Message ---