[Date Prev][Date Next][Thread Prev][][Date Index][Thread Index]

Re: sb-atom and sb-rss fetching already deleted articles



Katsumi Yamaoka <yamaoka@xxxxxxx> writes:
>>>>>> In [emacs-w3m : No.11162] David Engster wrote:
>> I've implemented the approach I quoted above and attached a patch
>> against current CVS.  The advantage of this approach is that it works
>> more generally, without the need for an extra parameter.
>
> Oh, I understood what you meant at last.  It picks articles of
> which the date is newer than the ones having already been fetched,
> and saves articles of which the date cannot be identified.  That's
> an excellent idea!

I had to change this a bit. Turns out there are feeds out there which
happily introduce new items with *older* dates (the 'topthemen' from the
sueddeutsche-de shimbun, for instance). To make those items show up, I
now compare with the date of the *oldest* existing article in the
group. This should still make sure that the article wasn't expired and
is actually a new one.

I also made sure that the returned items are sorted by ascending
date. This is to make sure that newer articles get higher article
numbers in Gnus, since in the expiry process, it is the article with the
highest article number which is kept and never deleted.

It's still not perfect, though. Imagine a feed introduces one new item
with an old date; it will get the highest article number in Gnus. If you
now completely expire the group, only this article will remain. When you
now refresh the group, all items with a newer date will again show
up. However, I think this is a very rare case - it would need a feed
which introduces new items with old dates and a pretty short expiry
time. If this really turns out to be problem, it could still be fixed in
nnshimbun.el.

>> So far it has worked for me, but I obviously couldn't test some of
>> the special cases for 'hankaku' which are implemented there.
>
> Looks no problem, except for `float-time'.  It is not available in
> XEmacs, so we will have to provide it as `w3m-float-time', like Gnus
> does.

OK, I changed float-time to w3m-float-time. Since shimbun.el requires
'w3m, this function should always be present.

>> Maybe you could take a look at the patch and see if it works for you.  I
>> could implement the same for sb-atom.el if you should decide to apply
>> it.
>
> Please go ahead.

I attached a patch to the version I'm currently using. It works quite
well for me, but a bit more testing would surely be good.

Regards,
David
Index: ChangeLog
===================================================================
RCS file: /storage/cvsroot/emacs-w3m/shimbun/ChangeLog,v
retrieving revision 1.231
diff -u -r1.231 ChangeLog
--- ChangeLog	25 Feb 2010 02:11:10 -0000	1.231
+++ ChangeLog	3 Mar 2010 21:47:25 -0000
@@ -1,3 +1,17 @@
+2010-03-03  David Engster  <dengste@xxxxxx>
+
+	* sb-rss-blogs.el (shimbun-get-headers): Replace call to
+	shimbun-rss-get-headers with luna-call-next-method.
+
+	* sb-rss.el (shimbun-get-headers): Revert 5th arg `quit-immediately'.
+	(shimbun-rss-get-headers): Delete those headers which only appear to
+	be new but are actualy older than the ones we already have.  Sort the
+	headers by date if possible.  Added doc-string.
+	(shimbun-rss-get-headers-1): New helper function to fetch all items.
+
+	* sb-atom.el (shimbun-atom-get-headers, shimbun-atom-get-headers-1):
+	Same changes as for sb-rss.el.
+
 2010-02-25  Katsumi Yamaoka  <yamaoka@xxxxxxx>
 
 	* sb-rss.el (shimbun-rss-get-headers): Add 5th arg `quit-immediately'
Index: sb-rss.el
===================================================================
RCS file: /storage/cvsroot/emacs-w3m/shimbun/sb-rss.el,v
retrieving revision 1.50
diff -u -r1.50 sb-rss.el
--- sb-rss.el	25 Feb 2010 02:11:10 -0000	1.50
+++ sb-rss.el	3 Mar 2010 21:47:25 -0000
@@ -173,23 +173,69 @@
   (shimbun-rss-get-headers shimbun range t))
 
 (defun shimbun-rss-get-headers (shimbun &optional range
-					need-descriptions need-all-items
-					quit-immediately)
-  (let ((xml (condition-case err
-		 (xml-parse-region (point-min) (point-max))
-	       (error
-		(message "Error while parsing %s: %s"
-			 (shimbun-index-url shimbun)
-			 (error-message-string err))
-		nil)))
-	(ignored-subject (luna-slot-value shimbun 'ignored-subject))
-	dc-ns rss-ns author hankaku headers)
-    (when xml
-      (setq dc-ns (shimbun-rss-get-namespace-prefix
-		   xml "http://purl.org/dc/elements/1.1/")
-	    rss-ns (shimbun-rss-get-namespace-prefix
-		    xml "http://purl.org/rss/1.0/")
-	    author
+					need-descriptions need-all-items)
+  "Get headers from rss feed described by SHIMBUN.
+RANGE is currently ignored.  If NEED-DESCRIPTIONS, include node
+text as description.  By default, only existing and new items
+from the feed are returned, i.e., those items which are newer
+than the oldest one in the shimbun.  If NEED-ALL-ITEMS is
+non-nil, all items from the feed are returned.  If the entries
+from the feed have date information, the result is sorted by
+ascending date."
+  (let* ((xml (condition-case err
+		  (xml-parse-region (point-min) (point-max))
+		(error
+		 (message "Error while parsing %s: %s"
+			  (shimbun-index-url shimbun)
+			  (error-message-string err))
+		 nil)))
+	 header headers oldheaders newheaders oldest)
+    (dolist (tmp (shimbun-rss-get-headers-1 xml shimbun need-descriptions))
+      (let* ((date (shimbun-header-date tmp))
+	     (ftime
+	      (when (and (stringp date)
+			 (> (length date) 1))
+		(w3m-float-time (date-to-time date)))))
+	(push (list tmp ftime) headers)))
+    (when headers
+      (if (or need-all-items
+	      ;; If there's a header without date information, we
+	      ;; return everything, just to be safe.
+	      (memq nil (mapcar 'cadr headers)))
+	  (mapcar 'car headers)
+	;; Otherwise, sort according to date.
+	(setq headers
+	      (sort headers (lambda (a b)
+			      (> (cadr a) (cadr b)))))
+	(while headers
+	  (setq header (pop headers))
+	  (if (shimbun-search-id shimbun (shimbun-header-id (car header)))
+	      (push header oldheaders)
+	    (push header newheaders)))
+	(if (null oldheaders)
+	    ;; All items are new
+	    (mapcar 'car newheaders)
+	  ;; Delete all items which are older than the ones we already
+	  ;; have
+	  (setq oldest (cadr (car oldheaders)))
+	  (while (and newheaders
+		      (> oldest (cadr (car newheaders))))
+	    (setq newheaders (cdr newheaders)))
+	  (append
+	   (mapcar 'car newheaders)
+	   (mapcar 'car oldheaders)))))))
+
+(defun shimbun-rss-get-headers-1 (xml shimbun need-descriptions)
+  "Retrieve all items found in XML for SHIMBUN and return headers.
+If NEED-DESCRIPTIONS, include node text as description."
+  (when xml
+    (let  ((dc-ns (shimbun-rss-get-namespace-prefix
+		   xml "http://purl.org/dc/elements/1.1/"))
+	   (rss-ns (shimbun-rss-get-namespace-prefix
+		    xml "http://purl.org/rss/1.0/"))
+	   (ignored-subject (luna-slot-value shimbun 'ignored-subject))
+	   author hankaku headers)
+      (setq author
 	    (catch 'found-author
 	      (dolist (channel
 		       (shimbun-rss-find-el (intern (concat rss-ns "channel"))
@@ -203,29 +249,21 @@
 				  '(body nil))
 		      (generate-new-buffer " *temp*")))
       (unwind-protect
-	  (catch 'done
-	    (dolist (item (shimbun-rss-find-el (intern (concat rss-ns "item"))
-					       xml)
-			  headers)
-	      (let ((url (and (listp item)
-			      (eq (intern (concat rss-ns "item")) (car item))
-			      (shimbun-rss-node-text rss-ns 'link (cddr item)))))
-		(when url
-		  (let* ((date (or (shimbun-rss-get-date shimbun url)
-				   (shimbun-rss-node-text dc-ns 'date item)
-				   (shimbun-rss-node-text rss-ns 'pubDate item)))
-			 (id (shimbun-rss-build-message-id shimbun url date))
-			 (subject (shimbun-rss-node-text rss-ns 'title item)))
-		    (when (and id
-			       (or need-all-items
-				   (if (shimbun-search-id shimbun id)
-				       (if quit-immediately
-					   (throw 'done headers)
-					 nil)
-				     t))
-			       (if (and ignored-subject subject)
-				   (not (string-match ignored-subject subject))
-				 t))
+	  (dolist (item (shimbun-rss-find-el (intern (concat rss-ns "item"))
+					     xml)
+			headers)
+	    (let ((url (and (listp item)
+			    (eq (intern (concat rss-ns "item")) (car item))
+			    (shimbun-rss-node-text rss-ns 'link (cddr item)))))
+	      (when url
+		(let* ((date (or (shimbun-rss-get-date shimbun url)
+				 (shimbun-rss-node-text dc-ns 'date item)
+				 (shimbun-rss-node-text rss-ns 'pubDate item)))
+		       (id (shimbun-rss-build-message-id shimbun url date))
+		       (subject (shimbun-rss-node-text rss-ns 'title item)))
+		  (when id
+		    (unless (and ignored-subject subject
+				 (string-match ignored-subject subject))
 		      (push
 		       (shimbun-create-header
 			0
Index: sb-atom.el
===================================================================
RCS file: /storage/cvsroot/emacs-w3m/shimbun/sb-atom.el,v
retrieving revision 1.9
diff -u -r1.9 sb-atom.el
--- sb-atom.el	18 Feb 2010 22:19:58 -0000	1.9
+++ sb-atom.el	3 Mar 2010 21:47:26 -0000
@@ -51,76 +51,116 @@
 
 (defun shimbun-atom-get-headers (shimbun &optional range
 					 need-summaries need-all-entries)
-  (let ((xml (condition-case err
-		 (xml-parse-region (point-min) (point-max))
-	       (error
-		(message "Error while parsing %s: %s"
-			 (shimbun-index-url shimbun)
-			 (error-message-string err))
-		nil)))
-	headers)
-    (when xml
-      (let* ((atom-ns (shimbun-rss-get-namespace-prefix
-		       xml "http://www.w3.org/2005/Atom"))
-	     (dc-ns (shimbun-rss-get-namespace-prefix
-		     xml "http://purl.org/dc/elements/1.1/"))
-	     (author-node (shimbun-rss-find-el
-			   (intern (concat atom-ns "author")) xml))
-	     (fn `(lambda (item) (shimbun-rss-node-text ,atom-ns 'name item)))
-	     (author (when (consp author-node)
-		       (mapconcat fn author-node ",")))
-	     url)
-	(catch 'done
-	  (dolist (entry (shimbun-rss-find-el
-			  (intern (concat atom-ns "entry")) xml))
-	    (setq url
-		  (catch 'url
-		    (dolist (link (shimbun-rss-find-el
-				   (intern (concat atom-ns "link")) entry))
-		      (when (string= (shimbun-atom-attribute-value
-				      (intern (concat atom-ns "rel")) link)
-				     "alternate")
-			(throw 'url (shimbun-atom-attribute-value
-				     (intern (concat atom-ns "href")) link))))))
-	    (unless url
-	      (setq url (shimbun-atom-attribute-value
-			 (intern (concat atom-ns "href"))
-			 (car (shimbun-rss-find-el
-			       (intern (concat atom-ns "link")) entry)))))
-	    (when url
-	      (let* ((date (or (shimbun-rss-get-date shimbun url)
-			       (shimbun-rss-node-text atom-ns 'updated entry)
-			       (shimbun-rss-node-text atom-ns 'published entry)
-			       (shimbun-rss-node-text atom-ns 'modified entry)
-			       (shimbun-rss-node-text atom-ns 'created entry)
-			       (shimbun-rss-node-text atom-ns 'issued entry)
-			       (shimbun-rss-node-text dc-ns 'date entry)))
-		     (author-node (shimbun-rss-find-el
-				   (intern (concat atom-ns "author")) entry))
-		     (author (or (and (consp author-node)
-				      (mapconcat fn author-node ","))
-				 (shimbun-rss-node-text dc-ns 'creator entry)
-				 (shimbun-rss-node-text dc-ns 'contributor entry)
-				 author))
-		     (id (shimbun-rss-build-message-id shimbun url date)))
-		(when (and id
-			   (not need-all-entries)
-			   (shimbun-search-id shimbun id))
-		  (throw 'done headers))
-		(when id
-		  (push (shimbun-create-header
-			 0
-			 (or (shimbun-rss-node-text atom-ns 'title entry)
-			     (shimbun-rss-node-text dc-ns 'subject entry))
-			 (or author (shimbun-from-address shimbun))
-			 (shimbun-rss-process-date shimbun date)
-			 id "" 0 0 url
-			 (when need-summaries
-			   (let ((summary (shimbun-rss-node-text
-					   atom-ns 'summary entry)))
-			     (when summary
-			       (list (cons 'summary summary))))))
-			headers)))))))
+  "Get headers from atom feed described by SHIMBUN.
+RANGE is currently ignored.  If NEED-SUMMARIES, include node text
+as summary.  By default, only existing and new items from the
+feed are returned, i.e., those items which are newer than the
+oldest one in the shimbun.  If NEED-ALL-ENTRIES is non-nil, all
+items from the feed are returned.  If the entries from the feed
+have date information, the result is sorted by ascending date."
+  (let* ((xml (condition-case err
+		  (xml-parse-region (point-min) (point-max))
+		(error
+		 (message "Error while parsing %s: %s"
+			  (shimbun-index-url shimbun)
+			  (error-message-string err))
+		 nil)))
+	 headers header newheaders oldheaders oldest)
+    (dolist (tmp (shimbun-atom-get-headers-1 xml shimbun need-summaries))
+      (let* ((date (shimbun-header-date tmp))
+	     (ftime
+	      (when (and (stringp date)
+			 (> (length date) 1))
+		(w3m-float-time (date-to-time date)))))
+	(push (list tmp ftime) headers)))
+    (when headers
+      (if (or need-all-entries
+	      ;; If there's a header without date information, we
+	      ;; return everything, just to be safe.
+	      (memq nil (mapcar 'cadr headers)))
+	  (mapcar 'car headers)
+	;; Otherwise, sort according to date.
+	(setq headers
+	      (sort headers (lambda (a b)
+			      (> (cadr a) (cadr b)))))
+	(while headers
+	  (setq header (pop headers))
+	  (if (shimbun-search-id shimbun (shimbun-header-id (car header)))
+	      (push header oldheaders)
+	    (push header newheaders)))
+	(if (null oldheaders)
+	    ;; All items are new
+	    (mapcar 'car newheaders)
+	  ;; Delete all items which are older than the ones we already
+	  ;; have
+	  (setq oldest (cadr (car oldheaders)))
+	  (while (and newheaders
+		      (> oldest (cadr (car newheaders))))
+	    (setq newheaders (cdr newheaders)))
+	  (append
+	   (mapcar 'car newheaders)
+	   (mapcar 'car oldheaders)))))))
+
+(defun shimbun-atom-get-headers-1 (xml shimbun need-summaries)
+  "Retrieve all items found in XML for SHIMBUN and return headers.
+If NEED-SUMMARIES, include node text as summary."
+  (when xml
+    (let* ((atom-ns (shimbun-rss-get-namespace-prefix
+		     xml "http://www.w3.org/2005/Atom"))
+	   (dc-ns (shimbun-rss-get-namespace-prefix
+		   xml "http://purl.org/dc/elements/1.1/"))
+	   (author-node (shimbun-rss-find-el
+			 (intern (concat atom-ns "author")) xml))
+	   (fn `(lambda (item) (shimbun-rss-node-text ,atom-ns 'name item)))
+	   (author (when (consp author-node)
+		     (mapconcat fn author-node ",")))
+	   url headers)
+      (dolist (entry (shimbun-rss-find-el
+		      (intern (concat atom-ns "entry")) xml))
+	(setq url
+	      (catch 'url
+		(dolist (link (shimbun-rss-find-el
+			       (intern (concat atom-ns "link")) entry))
+		  (when (string= (shimbun-atom-attribute-value
+				  (intern (concat atom-ns "rel")) link)
+				 "alternate")
+		    (throw 'url (shimbun-atom-attribute-value
+				 (intern (concat atom-ns "href")) link))))))
+	(unless url
+	  (setq url (shimbun-atom-attribute-value
+		     (intern (concat atom-ns "href"))
+		     (car (shimbun-rss-find-el
+			   (intern (concat atom-ns "link")) entry)))))
+	(when url
+	  (let* ((date (or (shimbun-rss-get-date shimbun url)
+			   (shimbun-rss-node-text atom-ns 'updated entry)
+			   (shimbun-rss-node-text atom-ns 'published entry)
+			   (shimbun-rss-node-text atom-ns 'modified entry)
+			   (shimbun-rss-node-text atom-ns 'created entry)
+			   (shimbun-rss-node-text atom-ns 'issued entry)
+			   (shimbun-rss-node-text dc-ns 'date entry)))
+		 (author-node (shimbun-rss-find-el
+			       (intern (concat atom-ns "author")) entry))
+		 (author (or (and (consp author-node)
+				  (mapconcat fn author-node ","))
+			     (shimbun-rss-node-text dc-ns 'creator entry)
+			     (shimbun-rss-node-text dc-ns 'contributor entry)
+			     author))
+		 (id (shimbun-rss-build-message-id shimbun url date)))
+	    (when id
+	      (push (shimbun-create-header
+		     0
+		     (or (shimbun-rss-node-text atom-ns 'title entry)
+			 (shimbun-rss-node-text dc-ns 'subject entry))
+		     (or author (shimbun-from-address shimbun))
+		     (shimbun-rss-process-date shimbun date)
+		     id "" 0 0 url
+		     (when need-summaries
+		       (let ((summary (shimbun-rss-node-text
+				       atom-ns 'summary entry)))
+			 (when summary
+			   (list (cons 'summary summary))))))
+		    headers)))))
       headers)))
 
 (defun shimbun-atom-attribute-value (attribute node)
Index: sb-rss-blogs.el
===================================================================
RCS file: /storage/cvsroot/emacs-w3m/shimbun/sb-rss-blogs.el,v
retrieving revision 1.3
diff -u -r1.3 sb-rss-blogs.el
--- sb-rss-blogs.el	25 Feb 2010 02:11:10 -0000	1.3
+++ sb-rss-blogs.el	3 Mar 2010 21:47:26 -0000
@@ -80,7 +80,7 @@
 						  shimbun-rss-blogs)
 						 &optional range)
   (let ((group (shimbun-current-group-internal shimbun))
-	(headers (shimbun-rss-get-headers shimbun range t nil t))
+	(headers (luna-call-next-method))
 	(type (sb-rss-blogs-guess-type-from-rss))
 	from)
     (cond