The code uses character encodings inconsistently. Perhaps because
parts of the code were written long ago, before utf-8 became a
de facto worldwide standard, other character encodings were set for
various files. The purpose of this commit is to make the project
consistently utf-8 throughout.
This commit was originally applied four months ago in the git
repository which I was using for development before the project had
its own official git repository. During the intervening four months it
was publicly available for testing and was the version which I was using.
I received no complaints, and observed nothing suspicious; however, I
don't use Japanese, and I don't know whether anyone else bothered to
test.
In practice, much of the work should be easy to test just by using the
menu system in Japanese, and by using emacs-w3m for the various
shimbuns.
The character sets that had been in use included those which the W3
Consortium says are to be especially avoided:
https://www.w3.org/International/questions/qa-choosing-encodings#avoid
Most files that have an explicit encoding are set to iso-2022-7bit, and
file w3m-bug.el is encoded for 'euc-japan'.
Since this was a huge and mind-numbing task, I automated it.
Step 1 was a few sed commands to change the first line of the *.el
files.
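The exact sed commands are not recorded here. As a rough sketch only, assuming the first line of each file carries an Emacs-style `coding:` cookie (for example `-*- coding: iso-2022-7bit; -*-`), something like the following would rewrite it. The function name `set_utf8_cookie` is hypothetical, and the temporary file is used instead of `sed -i` because that flag differs between sed implementations.

```sh
# Hypothetical helper: rewrite the coding cookie on line 1 of one file.
# Assumes the cookie looks like "coding: <name>"; other lines are untouched.
set_utf8_cookie() {
    file=$1
    # The "1" address restricts the substitution to the first line.
    sed '1s/coding:[[:space:]]*[A-Za-z0-9_-]*/coding: utf-8/' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

It could then be applied with a loop such as `for file in *.el ; do set_utf8_cookie "$file" ; done`.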
Step 2 was to run iconv on the files.
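#+BEGIN_SRC sh
# Convert each file to utf-8, writing the result next to the original.
# -c silently drops characters that cannot be converted.
for file in *.el ; do
    echo "$file"
    iconv -c -f iso-2022-jp -t utf-8 "$file" > "$file".new
done
#+END_SRC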
In general, the conversion operation succeeded, in that it did
transform Japanese text from unintelligible ASCII escape sequences
into Japanese characters (likewise unintelligible to me, but with
samples verified by Google Translate).
iconv complained about some files when run without the -c flag to
force completion, but it doesn't seem important:
#+BEGIN_QUOTE
sb-dennou-new.el
iconv: illegal input sequence at position 47
sb-kyoko-np.el
iconv: illegal input sequence at position 1169
sb-nikkangendai.el
iconv: illegal input sequence at position 1463
sb-tech-on.el
iconv: illegal input sequence at position 2723
sb-wikimedia.el
iconv: illegal input sequence at position 4937
shimbun.el
iconv: illegal input sequence at position 61716
#+END_QUOTE
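A list like the one quoted above can be regenerated by running iconv without -c and noting which files make it exit with an error. This is only a sketch of that idea, not the command actually used; the function name `list_unclean` is hypothetical, and the source encoding is passed in as an argument rather than assumed.

```sh
# Hypothetical helper: print the names of files that iconv cannot
# convert cleanly. Without -c, invalid input makes iconv exit non-zero.
list_unclean() {
    from=$1
    shift
    for file in "$@"; do
        # Converted output is discarded; only the exit status matters here.
        if ! iconv -f "$from" -t utf-8 "$file" > /dev/null 2>&1; then
            printf '%s\n' "$file"
        fi
    done
}
```

For this project it might be called as `list_unclean iso-2022-jp *.el`, which would give a short list of files to inspect by hand before trusting the -c output.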
Step 3 was to eyeball the results, and edit obvious errors.
Latin diacritics were sometimes clobbered by the iconv operation, for
example é ó ú Á ź in lists of month names for European languages.
On a few rare occasions, '@' was clobbered in regexes and in
comments.
Some files, such as w3m-bug.el, were fine without iconv, and iconv
ruined them.
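One cheap check that might have flagged some of these files in advance: iso-2022-jp is a 7-bit encoding, so a file containing any byte above 0x7f cannot be pure iso-2022-jp (euc-japan text, for instance, uses such bytes), and running it through `iconv -c -f iso-2022-jp` would drop those bytes. The sketch below is an assumption about how such a pre-check could look, not part of the original procedure; `has_8bit_bytes` is a hypothetical name.

```sh
# Hypothetical helper: succeed if the file contains any byte above 0x7f.
# Such a file is not pure iso-2022-jp and should be converted separately.
has_8bit_bytes() {
    # Delete every 7-bit byte and count what is left.
    n=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    [ "$n" -gt 0 ]
}
```

A conversion loop could then skip, or merely report, any file for which `has_8bit_bytes "$file"` succeeds.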
https://github.com/emacs-w3m/emacs-w3m/pull/9