[emacs-w3m/emacs-w3m] Refactor codebase for utf-8 (#9)

From: Boruch Baum <notifications@xxxxxxxxxx>
Date: Sun, 03 Feb 2019 07:59:18 -0800
X-ml-name: emacs-w3m
X-mail-count: 13122

The code uses character encodings inconsistently. Perhaps because
parts of the code were written so long ago, before utf-8 became a
de facto world-wide standard, other character encodings were set for
various files. The purpose of this commit is to try to make the
project consistently utf-8 throughout.

This commit was originally applied four months ago in the git
repository which I was using for development before the project had its
own official git repository. During the intervening four months it was
publicly available for testing and was the version which I was using.
I received no complaints, and observed nothing suspicious; however, I
don't use Japanese, and I don't know whether anyone else bothered to
test.

In practice, much of the work should be easy to test just by using the
menu system in Japanese, and by using emacs-w3m for the various
shimbuns.

The character sets that had been in use included those which the W3
Consortium says are to be especially avoided
https://www.w3.org/International/questions/qa-choosing-encodings#avoid.
Most files that declare an explicit encoding are set to iso-2022-7bit,
and file w3m-bug.el is encoded as 'euc-japan'.

Since this was a huge and mind-numbing task, I automated it.

Step 1 was a few sed commands to change the first line of the *.el files.
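
The sed commands themselves are not reproduced in this message. As a
rough sketch only, assuming each file declares its encoding in a
"coding:" cookie on its first line, such a step might look like the
following. The helper name set_utf8_cookie is hypothetical and is not
taken from the commit.

#+BEGIN_SRC sh
# Hypothetical helper, not taken from the commit: rewrite the coding
# cookie on line 1 of FILE so that it declares utf-8.
set_utf8_cookie() {
    tmp=$(mktemp) || return 1
    # Only line 1 is touched; the bracket expression matches names
    # such as iso-2022-7bit and euc-japan.
    if sed '1s/coding: *[A-Za-z0-9_-]*/coding: utf-8/' "$1" > "$tmp"; then
        cat "$tmp" > "$1"   # overwrite in place, keeping permissions
    fi
    rm -f "$tmp"
}
#+END_SRC

It could be applied with: for file in *.el; do set_utf8_cookie "$file"; done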

Step 2 was to run iconv on the files.

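#+BEGIN_SRC sh
# Convert every *.el file from iso-2022-jp to utf-8, writing FILE.new.
# -c makes iconv drop characters it cannot convert instead of stopping.
for file in *.el; do
    echo "$file"
    iconv -c -f iso-2022-jp -t utf-8 "$file" > "$file.new"
done
#+END_SRC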
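
Because -c tells iconv to discard characters it cannot convert, a
variant that omits the flag and records which files fail may make
problem files easier to spot. The following is only a sketch; the
helper name convert_strict is hypothetical, and it assumes an iconv
that exits non-zero on an invalid input sequence.

#+BEGIN_SRC sh
# Hypothetical helper: convert FILE from encoding FROM to utf-8 without
# -c, so iconv stops at the first sequence it cannot convert.
convert_strict() {
    from=$1 file=$2
    if iconv -f "$from" -t utf-8 "$file" > "$file.new"; then
        return 0
    fi
    rm -f "$file.new"                 # discard the partial output
    echo "FAILED: $file" >&2
    return 1
}
#+END_SRC

It could be applied with: for file in *.el; do convert_strict iso-2022-jp "$file"; done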

In general, the conversion operation succeeded, in that it did
transform Japanese text from unintelligible ASCII escape sequences
into Japanese characters (likewise unintelligible to me, but with
samples verified by google translate).

iconv complained about some files when run without the -c flag to
force completion, but it doesn't seem important:

#+BEGIN_QUOTE
sb-dennou-new.el
iconv: illegal input sequence at position 47

sb-kyoko-np.el
iconv: illegal input sequence at position 1169

sb-nikkangendai.el
iconv: illegal input sequence at position 1463

sb-tech-on.el
iconv: illegal input sequence at position 2723

sb-wikimedia.el
iconv: illegal input sequence at position 4937

shimbun.el
iconv: illegal input sequence at position 61716
#+END_QUOTE
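
To see what iconv objected to, the bytes around a reported position
can be dumped. This is a sketch only: the helper name bytes_at is
hypothetical, and it assumes the reported position is a zero-based
byte offset into the input file.

#+BEGIN_SRC sh
# Hypothetical helper: print COUNT bytes of FILE starting at the
# zero-based byte OFFSET, e.g. to inspect a position iconv reported.
bytes_at() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}
#+END_SRC

For example, bytes_at sb-dennou-new.el 40 16 | od -c would show the
bytes surrounding position 47 in that file.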

Step 3 was to eyeball the results, and edit obvious errors.

Latin diacritics were sometimes clobbered by the iconv operation,
for example é ó ú Á ź in lists of month names for European languages.
On a few rare occasions, '@' was clobbered in regexes and in
comments.
Some files, such as w3m-bug.el, were fine without iconv, and
iconv ruined them.
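
One possible way to avoid converting a file from the wrong source
encoding is to read the encoding each file declares and choose the
iconv source from that, rather than assuming iso-2022-jp for every
file. The sketch below is not what the commit did. The helper names
are hypothetical, and the mapping from Emacs coding names to iconv
names is an assumption (iso-2022-7bit is mapped to ISO-2022-JP only
because that is what the Step 2 loop used).

#+BEGIN_SRC sh
# Print the coding name declared on line 1 of FILE, or nothing if absent.
declared_coding() {
    sed -n '1s/.*coding: *\([A-Za-z0-9_-]*\).*/\1/p' "$1"
}

# Assumed mapping from the Emacs coding names mentioned above to iconv
# encoding names; unknown or missing names are rejected.
iconv_name_for() {
    case $1 in
        iso-2022-7bit) echo ISO-2022-JP ;;
        euc-japan)     echo EUC-JP ;;
        utf-8)         echo UTF-8 ;;
        *)             return 1 ;;
    esac
}
#+END_SRC

A file would then be converted only when its declared encoding is
recognised, for instance:
from=$(iconv_name_for "$(declared_coding "$file")") && iconv -f "$from" -t utf-8 "$file" > "$file.new"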

You can view, comment on, or merge this pull request online at:

https://github.com/emacs-w3m/emacs-w3m/pull/9

Commit Summary

Refactor codebase for utf-8


Follow-Ups:
- Re: [emacs-w3m/emacs-w3m] Refactor codebase for utf-8 (#9)
  - From: Masatoshi TSUCHIYA
- Re: [emacs-w3m/emacs-w3m] Refactor codebase for utf-8 (#9)
  - From: Masatoshi TSUCHIYA
- Re: [emacs-w3m/emacs-w3m] Refactor codebase for utf-8 (#9)
  - From: Boruch Baum
- Re: [emacs-w3m/emacs-w3m] Refactor codebase for utf-8 (#9)
  - From: Masatoshi TSUCHIYA