
Re: windows-1252

From: Katsumi Yamaoka <yamaoka@xxxxxxx>
Date: Fri, 08 Apr 2005 20:32:43 +0900
X-ml-name: emacs-w3m
X-mail-count: 08036
References: <87mzst5j2r.fsf@denkblock.local> <b9yr7i5mbpr.fsf@jpl.org> <b9yfyykbdfe.fsf@jpl.org> <ufyykfj3i.wl%t_chou@cec-ltd.co.jp> <878y4bzzan.fsf@denkblock.local> <87psxn270r.fsf@robotron.ath.cx> <87r7hxqbh4.fsf_-_@robotron.ath.cx> <b9y8y42og17.fsf@jpl.org> <87wtrm8xcn.fsf@puyo.nijino.com> <b9yekdrcnvz.fsf@jpl.org> <b9y1x9qisx2.fsf@jpl.org> <b9yzmwcm6yk.fsf@jpl.org> <b9yekdnqpqg.fsf@jpl.org> <87sm22n3zh.fsf@robotron.ath.cx> <b9ymzs9dd9x.fsf@jpl.org> <873bu14n8p.fsf@robotron.ath.cx>

>>>>> In [emacs-w3m : No.08035] David Hansen wrote:

>> Hooray!  I realized the reason why characters in 0x80..0x9f
>> aren't decoded by windows-1252.  So, I added the decoder to
>> decode entities like `&#128;' before decoding the contents by
>> windows-1252.  Thanks.

> Huh?!  0x80 - 0x9F are the only differences to latin-1.  No need
> for windows-1252 support then.  Or am I missing something?

Hm, people sometimes misunderstand my poor English even when it
took me hours to write. ;-)  Let me explain again.

The Washington Post pages contain many characters written as
numeric character references (what we call `entities'), for
example "&#149;".  When displaying those pages, emacs-w3m does
the following:

First, emacs-w3m decodes the whole raw data using the charset
specified for the page.  If that charset is iso-8859-1 or
windows-1252, emacs-w3m uses windows-1252 for decoding.  Keep in
mind that entities aren't decoded at this stage.
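
To illustrate why the choice of decoder matters for bytes in
0x80..0x9f, here is a minimal sketch that uses only built-in Emacs
functions.  It is not emacs-w3m code: the helper name
`my-decode-byte-both-ways' is made up for this example, and it
assumes an Emacs that provides `unibyte-string' and the
`windows-1252' coding system (Emacs 23 or later).

```elisp
;; Hypothetical helper for illustration only, not part of emacs-w3m.
(defun my-decode-byte-both-ways (byte)
  "Return the code points BYTE decodes to as iso-8859-1 and windows-1252."
  (let ((raw (unibyte-string byte)))   ; one raw byte, as read from the network
    (list (string-to-char (decode-coding-string raw 'iso-8859-1))
          (string-to-char (decode-coding-string raw 'windows-1252)))))

(my-decode-byte-both-ways #x95)
 ;; => (149 8226), i.e. U+0095 (a C1 control) versus U+2022 (BULLET)
```

Since iso-8859-1 and windows-1252 agree outside 0x80..0x9f, treating
both as windows-1252 only changes how those 32 bytes are decoded.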

Finally, the contents are fontified, and then the entities in
the contents are decoded using `w3m-decode-entities'.  Here is
what `w3m-decode-entities' does with "&#149;":

(with-temp-buffer
  (insert "&#149;")
  (w3m-decode-entities)
  (buffer-string))
 => "\x95"

The result is no more than another representation of the number
149 (0x95).  You might see a human-readable character if you've
set up the display table, though.  On the other hand,
`w3m-decode-entities' decodes ISO 10646 characters, such as
"&#8226;", correctly.  The way you proposed in [emacs-w3m:08003]
was to replace "\x95" with "&#8226;" before performing
`w3m-decode-entities', but that has no effect on "&#149;",
because at that point it is still the literal text of an entity,
not the byte 0x95.
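
The difference between the two entities can be sketched with a
naive decoder.  This is only an assumption about the general idea,
not the real `w3m-decode-entities'; the name
`my-naive-decode-entity' is hypothetical.

```elisp
;; Hypothetical helper, NOT the real `w3m-decode-entities': it simply
;; turns "&#N;" into the character whose code point is N.
(defun my-naive-decode-entity (entity)
  "Return a one-character string for the decimal ENTITY, or nil."
  (when (string-match "\\`&#\\([0-9]+\\);\\'" entity)
    (string (string-to-number (match-string 1 entity)))))

(my-naive-decode-entity "&#149;")   ; code point #x95, a C1 control
(my-naive-decode-entity "&#8226;")  ; code point #x2022, BULLET
```

Taken literally, 149 names a control character in ISO 10646; only
the windows-1252 reading of the byte 0x95 yields the bullet.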

What I committed today replaces "&#149;" with "\x95" in the raw
data before it is decoded by windows-1252, so the entity is
decoded exactly like a literal 0x95 byte.
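
The idea can be sketched roughly as follows.  This is not the code
that was committed; `my-decode-windows-1252-with-entities' is a
hypothetical name, and the sketch assumes that RAW is a unibyte
string and that the `windows-1252' coding system is available.

```elisp
;; Hypothetical helper illustrating the idea; the function actually
;; committed to emacs-w3m may look quite different.
(defun my-decode-windows-1252-with-entities (raw)
  "Decode unibyte string RAW as windows-1252.
Numeric entities &#128; .. &#159; are first replaced with the raw
bytes 0x80 .. 0x9f so that the decoder treats them like literal bytes."
  (with-temp-buffer
    (set-buffer-multibyte nil)          ; work on raw bytes
    (insert raw)
    (goto-char (point-min))
    ;; The regexp matches 120..159; the range check keeps 128..159 only.
    (while (re-search-forward "&#\\(1[2-5][0-9]\\);" nil t)
      (let ((code (string-to-number (match-string 1))))
        (when (and (>= code 128) (<= code 159))
          (replace-match (unibyte-string code) t t))))
    (decode-coding-string (buffer-string) 'windows-1252)))

(my-decode-windows-1252-with-entities "a &#149; b")
 ;; => "a", a space, U+2022 (BULLET), a space, "b"
```

Entities outside 128..159, such as "&#8226;", are left alone here
and are still handled later by the ordinary entity decoding.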
