Re: windows-1252
>>>>> In [emacs-w3m : No.08035] David Hansen wrote:
>> Hooray! I realized the reason why characters in 0x80..0x9f
>> aren't decoded by windows-1252. So, I added the decoder to
>> decode entities like `&#128;' before decoding the contents by
>> windows-1252. Thanks.
> Huh?! 0x80 - 0x9F are the only differences to latin-1. No need
> for windows-1252 support then. Or am I missing something?
Hm, people sometimes misunderstand my poor English even if it
took hours to write. ;-) Let me explain it again.
The Washington Post pages contain many characters encoded as
what we call `entities', for example, "&#149;". When displaying
those pages, emacs-w3m does the following:
First, emacs-w3m decodes the whole raw data using the charset
that is specified for the page. If the charset is iso-8859-1 or
windows-1252, emacs-w3m uses windows-1252 for decoding. Keep in
mind that entities aren't decoded at this point.
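As a rough illustration of this first step (not the actual
emacs-w3m code; the helper name `my-page-coding-system' is made
up here, and an Emacs that provides the `windows-1252' coding
system is assumed), decoding leaves the entity text untouched:

```elisp
;; Hypothetical helper, not part of emacs-w3m: pick the coding system
;; for a page, treating iso-8859-1 as windows-1252.
(defun my-page-coding-system (charset)
  "Return the coding system used to decode a page labeled CHARSET."
  (let ((name (downcase charset)))
    (if (member name '("iso-8859-1" "windows-1252"))
        'windows-1252
      (intern name))))

;; "\351" is the raw byte 0xE9.  The entity is plain ASCII text, so it
;; passes through the decoder unchanged.
(decode-coding-string "caf\351 &#149;"
                      (my-page-coding-system "ISO-8859-1"))
```

The byte 0xE9 is decoded to a character here, while "&#149;" is
still the same six ASCII characters afterwards.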
Finally, the contents are fontified and then entities in the
contents are decoded using `w3m-decode-entities'. What
`w3m-decode-entities' does then is:
(with-temp-buffer
  (insert "&#149;")
  (w3m-decode-entities)
  (buffer-string))
 => "\x95"
It is no more than another representation of the number 149.
You might see a human-readable character if you've set up the
display table, though. On the other hand, `w3m-decode-entities'
decodes iso-10646 characters, such as "&#8226;", correctly. The
way you proposed in [emacs-w3m:08003] was to replace "\x95" with
"&#8226;" before performing `w3m-decode-entities', but that has
no effect on "&#149;", which only becomes "\x95" afterwards.
The fix I committed today replaces "&#149;" with "\x95" in the
raw data before decoding it as windows-1252.
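
The idea can be sketched roughly as follows. This is only an
illustration, not the code committed to emacs-w3m: the function
name `my-decode-cp1252-page' is hypothetical, RAW is assumed to
be a unibyte string, and only decimal entities in 128..159 are
handled.

```elisp
;; Hypothetical sketch, not the code committed to emacs-w3m.
(defun my-decode-cp1252-page (raw)
  "Decode RAW, a unibyte string, as windows-1252.
Decimal entities &#128; through &#159; are first turned into the
corresponding raw bytes so that the decoder handles them as well."
  (decode-coding-string
   (replace-regexp-in-string
    "&#\\(1[2-5][0-9]\\);"
    (lambda (entity)
      (let ((n (string-to-number (match-string 1 entity))))
        (if (<= 128 n 159)
            (unibyte-string n)   ; e.g. "&#149;" becomes the byte 0x95
          entity)))              ; 120..127 are left alone
    raw t t)
   'windows-1252))

;; "&#149;" is decoded together with the raw byte "\351", while
;; "&#8226;" does not match and is left for the later entity pass.
(my-decode-cp1252-page "caf\351 &#149; &#8226;")
```

Doing the replacement on the raw data means a single
windows-1252 decoding pass covers both literal 0x80..0x9f bytes
and the entities that name them.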