
Re: windows-1252



>>>>> In [emacs-w3m : No.08035] David Hansen wrote:

>> Hooray!  I realized the reason why characters in 0x80..0x9f
>> aren't decoded by windows-1252.  So, I added the decoder to
>> decode entities like `&#128;' before decoding the contents by
>> windows-1252.  Thanks.

> Huh?!  0x80 - 0x9F are the only differences to latin-1.  No need
> for windows-1252 support then.  Or am I missing something?

Hm, people sometimes misunderstand my poor English even if it
took hours to write. ;-)  Let me explain it again.

In the Washington Post pages, many characters are encoded as
what we call `entities', for example, "&#149;".  When
displaying those pages, emacs-w3m does the following:

First, emacs-w3m decodes the whole raw data using the charset
specified for the page.  If the charset is iso-8859-1 or
windows-1252, emacs-w3m uses windows-1252 for decoding.  Keep in
mind that entities aren't decoded at this stage.
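
To illustrate why the choice of charset matters in this step,
here is a minimal sketch.  It is not emacs-w3m's actual code;
`my-decode-byte' is a made-up helper, and only the standard
`decode-coding-string' and `unibyte-string' functions are assumed:

```elisp
;; Hypothetical helper, for illustration only.
(defun my-decode-byte (byte coding-system)
  "Decode the single raw BYTE with CODING-SYSTEM and return the character."
  (aref (decode-coding-string (unibyte-string byte) coding-system) 0))

(my-decode-byte #x95 'windows-1252) ; => 8226 (U+2022 BULLET)
(my-decode-byte #x95 'iso-8859-1)   ; => 149  (U+0095, a C1 control character)
```

The same raw byte becomes a visible bullet only when it is decoded
as windows-1252.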

Finally, the contents are fontified and then entities in the
contents are decoded using `w3m-decode-entities'.  What
`w3m-decode-entities' does then is:

(with-temp-buffer
  (insert "&#149;")
  (w3m-decode-entities)
  (buffer-string))
 => "\x95"

The result is no more than another representation of the number
149.  You might see a human-readable character if you've set the
display table, though.  On the other hand, `w3m-decode-entities'
decodes iso-10646 characters, such as "&#8226;", correctly.  The
way you proposed in [emacs-w3m:08003] was to replace "\x95" with
"&#8226;" before performing `w3m-decode-entities', but that has
no effect on "&#149;".
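
The difference between the two kinds of numeric entities can be
sketched as follows.  This is only a naive stand-in, not the
real `w3m-decode-entities'; the name
`my-naive-decode-numeric-entities' is made up, and it assumes
that the number in the entity is taken directly as a character
code:

```elisp
;; Hypothetical, simplified decoder: the number becomes the
;; character code as is, with no windows-1252 mapping.
(defun my-naive-decode-numeric-entities (string)
  "Replace each decimal entity such as \"&#149;\" in STRING by that character."
  (replace-regexp-in-string
   "&#\\([0-9]+\\);"
   (lambda (m) (string (string-to-number (match-string 1 m))))
   string t t))

(my-naive-decode-numeric-entities "&#8226;") ; => "•" (U+2022)
(my-naive-decode-numeric-entities "&#149;")  ; => "\x95" as a character (U+0095)
```

Under that assumption, "&#8226;" comes out as a bullet, while
"&#149;" comes out as character 149, which is not a printable
character in iso-10646.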

What I committed today replaces "&#149;" with "\x95" in the raw
data before decoding it with windows-1252.
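
Roughly, the idea looks like the sketch below.  It is not the
committed code itself; `my-c1-entities-to-bytes' and
`my-decode-page' are made-up names, and I assume here that the
raw data is a unibyte string and that every entity in the range
128..159 is handled the same way:

```elisp
;; Hypothetical helpers, for illustration only.
(defun my-c1-entities-to-bytes (raw)
  "In the unibyte string RAW, replace \"&#128;\"..\"&#159;\" by the raw byte."
  (replace-regexp-in-string
   "&#\\(12[89]\\|1[34][0-9]\\|15[0-9]\\);"
   (lambda (m) (unibyte-string (string-to-number (match-string 1 m))))
   raw t t))

(defun my-decode-page (raw)
  "Turn 128..159 entities in RAW into bytes, then decode as windows-1252."
  (decode-coding-string (my-c1-entities-to-bytes raw) 'windows-1252))

(my-decode-page "a &#149; b") ; => "a • b"
```

Because the entity becomes an ordinary byte before decoding, the
windows-1252 decoder maps it to the proper character, and entities
outside that range are left for `w3m-decode-entities' as before.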