Re: Offline mode for shimbun retrieval

Katsumi Yamaoka <yamaoka@xxxxxxx> writes:
>> The filenames used for saving the shimbuns are generated through a md5
>> of the URL (truncated to the first 10 chars).
> Isn't the truncation unnecessary?  I'm worried about the confliction
> of file names.

The truncation is purely cosmetic, so just remove the two `substring'
calls if you don't like it. But if I remember correctly, with a 40-bit
hash (10 hex digits) we can expect the first collision only after
roughly 2^20 items, so I figured this should be more than enough for
the purpose.
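
To put a number on that, here is a rough sketch of the naming scheme and
of the usual birthday approximation. The function names
`my-shimbun-local-name' and `my-collision-probability' are made up for
this mail and are not part of the patch; the estimate also assumes the
truncated md5 behaves like a uniformly random 40-bit value.

```elisp
;; Hypothetical helpers for illustration only; not part of the patch.

(defun my-shimbun-local-name (url &optional full)
  "Return the local file name for URL: md5 hex digest plus \"_shimbun\".
Keep only the first 10 hex digits (40 bits) unless FULL is non-nil."
  (let ((hash (md5 url)))
    (concat (if full hash (substring hash 0 10)) "_shimbun")))

(defun my-collision-probability (n bits)
  "Approximate chance that N random BITS-bit hashes contain a collision.
Uses the birthday approximation 1 - exp(-N(N-1) / 2^(BITS+1))."
  (let ((n (float n)))
    (- 1.0 (exp (- (/ (* n (1- n)) (expt 2.0 (1+ bits))))))))

;; Around 2^20 feeds a collision becomes likely (about 0.39) ...
(my-collision-probability (expt 2 20) 40)
;; ... while for 100 feeds the chance is a few times 1e-9.
(my-collision-probability 100 40)
```

So for any realistic number of subscribed shimbuns the truncated name
should be safe, and passing a non-nil FULL in the sketch corresponds to
dropping the `substring' calls.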

>> +(defcustom shimbun-local-path temporary-file-directory
>> +  "Directory where local shimbun files are stored.
>> +Default is the system's temporary directory."
>> +  :group 'shimbun
>> +  :type 'directory)
> `temporary-file-directory' is available only in Emacs.  For XEmacs
> users, it should be the return value of the `(temp-directory)'
> like this:
> (defcustom shimbun-local-path (if (featurep 'xemacs)
> 				  (temp-directory)
> 				temporary-file-directory)
>   ...)
> But how about making it default to the value of
> `w3m-default-save-directory'?

Yes, I hadn't thought of that one. I changed it to
w3m-default-save-directory; I had to move the (require 'w3m) further up
for that, though. I also removed the 'umask' command from the generated
script, since it only really makes sense when using /tmp.

>> +(defun nnshimbun-generate-download-script (&optional async)
>> +  "Generate download script for all subscribed schimbuns.
>> +Output will be put in a new buffer.  If called with a prefix,
>> +puts a '&' after each curl command."
>                          ^^^^
> Is curl faster than w3m? ;-)  I guess it's true because curl is
> much smaller than w3m.  (If you make it customizable like mm-url.el
> does, it seems to be better to do it in shimbun.el since Wanderlust
> users and Mew users will use it in the future.)

Ah, I forgot to change the doc string... Yes, my first version used
curl, but I switched to w3m to avoid another dependency. I don't think
it makes a big speed difference, though. I don't plan to make this
customizable, since w3m, curl and wget differ in how they include the
header information in the file (needed for extracting the
Content-Type/charset). In curl, you can do this via "-w
%{content_type}", but the value is appended at the end of the file.
Besides, I think w3m does its job just fine. :-)
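
To illustrate why the header placement matters, here is a minimal,
standalone sketch of the extraction step. It assumes the saved file
starts with the header lines, followed by an empty line and then the
body, which is the layout the patch relies on. The function name
`my-demo-split-header' and the sample text are made up for
illustration, and the regexp is a simplified variant, not the one in the
patch.

```elisp
;; Hypothetical sketch; assumes "headers, empty line, body" layout.

(defun my-demo-split-header ()
  "Strip the header block from the current buffer.
Return (TYPE . CHARSET); either element may be nil.  The header is
assumed to end at the first empty line.  Return nil if there is none."
  (goto-char (point-min))
  (when (re-search-forward "^$" nil t)
    (let ((body-start (match-beginning 0))
          (case-fold-search t)
          type charset)
      ;; Only look at the header block, never at the body.
      (save-restriction
        (narrow-to-region (point-min) body-start)
        (goto-char (point-min))
        (when (re-search-forward
               (concat "^Content-Type:[ \t]*\\([^; \t\n]+\\)"
                       "\\(?:[; \t]+charset=\"?\\([^\"; \t\n]+\\)\\)?")
               nil t)
          (setq type (match-string 1)
                charset (match-string 2))))
      ;; Like the patch, delete everything before the empty line.
      (delete-region (point-min) body-start)
      (cons type charset))))

(with-temp-buffer
  ;; Made-up sample input, not real w3m output.
  (insert "Content-Type: text/xml; charset=UTF-8\n"
          "Content-Length: 7\n"
          "\n"
          "<rss/>\n")
  (my-demo-split-header))
```

With the header at the top, a single forward search for the empty line
is enough to separate it from the body; a value appended at the end of
the file would need a different and more fragile approach.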

Regarding speed: now that retrieving the feeds isn't the bottleneck
anymore, byte compilation makes some difference, and it seems the
Makefile doesn't compile the shimbuns. For example, shimbun-rss-find-el
in sb-rss.el can take some time on bigger feeds, and byte compilation
makes it about twice as fast. Otherwise, most of the time is spent in
xml-parse-tag, which we cannot really do much about.
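
If you want to check the effect of byte compilation yourself, something
along these lines should do. This is only a sketch:
`my-demo-count-tag' is a made-up stand-in for a list-walking helper, not
the real shimbun-rss-find-el, and the data is synthetic, so the ratio
you see will differ from the one for real feeds.

```elisp
(require 'benchmark)

;; Hypothetical stand-in for a tree-walking helper; not shimbun code.
(defun my-demo-count-tag (tag tree)
  "Count occurrences of symbol TAG in the nested list TREE."
  (let ((count 0))
    (dolist (node tree)
      (cond ((eq node tag) (setq count (1+ count)))
            ((consp node)
             (setq count (+ count (my-demo-count-tag tag node))))))
    count))

(defvar my-demo-tree
  (let (tree)
    (dotimes (_ 2000)
      (push (list 'item (list 'title "t") (list 'link "l")) tree))
    tree)
  "Synthetic feed-like data, for illustration only.")

(defun my-demo-time ()
  "Return elapsed seconds for 50 runs of `my-demo-count-tag'."
  (car (benchmark-run 50 (my-demo-count-tag 'title my-demo-tree))))

;; Time the interpreted definition, compile it, then time it again.
(let ((interpreted (my-demo-time)))
  (byte-compile 'my-demo-count-tag)
  (message "interpreted: %.3fs  compiled: %.3fs"
           interpreted (my-demo-time)))
```

Evaluate it from a plain .el buffer so that the first timing really
measures the interpreted definition.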

I've attached a new version of the patch. BTW, if you plan on including
this, please use dengste@xxxxxx as address in the ChangeLog. I can also
write something up for the documentation.

Index: shimbun.el
RCS file: /storage/cvsroot/emacs-w3m/shimbun/shimbun.el,v
retrieving revision 1.194
diff -u -r1.194 shimbun.el
--- shimbun.el	23 Jul 2008 08:25:51 -0000	1.194
+++ shimbun.el	26 Nov 2008 13:15:15 -0000
@@ -78,6 +78,7 @@
 (require 'eword-encode)
 (require 'luna)
 (require 'std11)
+(require 'w3m)
   (luna-define-class shimbun ()
@@ -185,6 +186,21 @@
 			 :match (lambda (widget value) (natnump value))
 			 :value 1)))
+(defcustom shimbun-use-local nil
+  "Specifies if local files should be used (\"offline\" mode).
+This way, you can use an external script to retrieve the
+necessary HTML/XML files.  For an example, see
+`nnshimbun-generate-download-script'.  If a local file for a URL
+cannot be found, the URL will silently be retrieved as usual."
+  :group 'shimbun
+  :type 'boolean)
+(defcustom shimbun-local-path w3m-default-save-directory
+  "Directory where local shimbun files are stored.
+Default is the value of `w3m-default-save-directory'."
+  :group 'shimbun
+  :type 'directory)
 (defun shimbun-servers-list ()
   "Return a list of shimbun servers."
   (let (servers)
@@ -219,20 +235,43 @@
   (shimbun-mua-shimbun-internal mua))
 ;;; emacs-w3m implementation of url retrieval and entity decoding.
-(require 'w3m)
 (defun shimbun-retrieve-url (url &optional no-cache no-decode
 				 referer url-coding-system)
   "Rertrieve URL contents and insert to current buffer.
 Return content-type of URL as string when retrieval succeeded.
 Non-ASCII characters `url' are escaped based on `url-coding-system'."
-  (let (type)
-    (if (and url
-	     (setq type (w3m-retrieve
-			 (w3m-url-transfer-encode-string url url-coding-system)
-			 nil no-cache nil referer)))
+  (let (type charset fname)
+    (if (and url 
+	     shimbun-use-local
+	     shimbun-local-path
+	     (file-regular-p 
+	      (setq fname (concat (file-name-as-directory
+				   (expand-file-name shimbun-local-path))
+				  (substring (md5 url) 0 10)
+				  "_shimbun"))))
+	;; get local file contents
+	(progn
+	  (let ((coding-system-for-read 'no-conversion))
+	    (insert-file-contents fname))
+	  (when (re-search-forward "^$" nil t)
+	    (let ((pos (match-beginning 0)))
+	      (re-search-backward 
+	       "^Content-Type: \\(.*?\\)\\(?:[ ;]+\\|$\\)\\(charset=\\(.*\\)\\)?"
+	       nil t)
+	      (setq type (match-string 1)
+		    charset (match-string 3))
+	      (delete-region (point-min) pos))))
+      ;; retrieve URL
+      (when url
+	(setq type (w3m-retrieve
+		    (w3m-url-transfer-encode-string url url-coding-system)
+		    nil no-cache nil referer))))
+    (if type
 	  (unless no-decode
-	    (w3m-decode-buffer url)
+	    (if charset
+		(w3m-decode-buffer url charset type)
+	      (w3m-decode-buffer url))
 	    (goto-char (point-min)))
       (unless no-decode
Index: nnshimbun.el
RCS file: /storage/cvsroot/emacs-w3m/shimbun/nnshimbun.el,v
retrieving revision 1.62
diff -u -r1.62 nnshimbun.el
--- nnshimbun.el	17 Oct 2007 11:15:58 -0000	1.62
+++ nnshimbun.el	26 Nov 2008 13:15:15 -0000
@@ -993,6 +993,34 @@
 		(gnus-group-make-group grp (list 'nnshimbun server)))))
 	(message "No group is found in nnshimbun+%s:" server)))))
+(defun nnshimbun-generate-download-script (&optional async)
+  "Generate download script for all subscribed shimbuns.
+Output will be put in a new buffer.  If called with a prefix,
+puts a '&' after each w3m command."
+  (interactive "P")
+  (switch-to-buffer
+   (get-buffer-create "*shimbun download script*"))
+  (erase-buffer)
+  (insert
+   (concat "#!/bin/sh\n# shimbun download script\n\n"
+	   "W3M=" (or w3m-command "/usr/bin/w3m")
+	   "\nOPTS=\"-no-cookie -o accept_encoding=identity -dump_both\"\n\n"))
+  (let ((path (file-name-as-directory
+		(expand-file-name shimbun-local-path)))
+	url fname)
+    ;; get all subscribed shimbun groups
+    (dolist (cur gnus-newsrc-alist)
+      (when (and (eq (car-safe (nth 4 cur)) 'nnshimbun)
+		 (<= (nth 1 cur) gnus-level-subscribed))
+	(when (string-match "nnshimbun\\+\\(.+\\):\\(.+\\)" (car cur))
+	  (nnshimbun-possibly-change-group (match-string 2 (car cur))
+					   (match-string 1 (car cur)))
+	  (setq url (shimbun-index-url nnshimbun-shimbun))
+	  (setq fname (concat path (substring (md5 url) 0 10) "_shimbun"))
+	  (insert
+	   (concat "$W3M $OPTS " (shell-quote-argument url) " > "
+		   (shell-quote-argument fname) (if async " &\n" "\n"))))))))
 (provide 'nnshimbun)
 ;;; nnshimbun.el ends here