Re: sb-zeit-de sometimes gets only part of an article
Andreas Seltenreich writes:
> Elias Oltmanns writes:
>
>> since there is going to be another release soon, I just wanted to
>> point out that sb-zeit-de sometimes gets only the first part of an
>> article. Unfortunately, I haven't quite worked out the pattern yet.
>
> I suspect this is happening when an advertisement is placed in the
> middle of an article. I'll take a closer look at it...
Right, adjusting shimbun-zeit-de-content-end fixed the problem with
articles containing advertisements. I also added a
shimbun-clear-contents method to strip advertisements and web bugs
from the article body. Patch attached.
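For illustration, here is a rough sketch of what that clean-up amounts
to. Note that my-remove-tags and my-strip-ads are hypothetical
stand-ins written for this example; they are not the real
shimbun-remove-tags, whose implementation may well differ. The sketch
only assumes that removal works on the current buffer and deletes
everything from a start match through an optional end match.

```elisp
;; Sketch only: `my-remove-tags' is a hypothetical stand-in for
;; `shimbun-remove-tags', not its actual implementation.
(defun my-remove-tags (begin-regexp &optional end-regexp)
  "Delete each match of BEGIN-REGEXP in the current buffer.
If END-REGEXP is given, delete through the next match of END-REGEXP."
  (let ((case-fold-search t))
    (goto-char (point-min))
    (while (re-search-forward begin-regexp nil t)
      (let ((start (match-beginning 0)))
        (if (or (null end-regexp)
                (re-search-forward end-regexp nil t))
            ;; Point is now at the end of the span to drop.
            (delete-region start (point))
          ;; No closing match: leave the rest alone and stop.
          (goto-char (point-max)))))))

;; Hypothetical helper: apply two of the patterns to a string.
(defun my-strip-ads (html)
  "Return HTML without doubleclick.net links and 1px images."
  (with-temp-buffer
    (insert html)
    (my-remove-tags "<a[^>]*doubleclick\\.net" "</a>")
    (my-remove-tags "<img[^>]*\\(width\\|height\\)=\"1px\"[^>]*>")
    (buffer-string)))
```

Calling my-strip-ads on a fragment with an ad link and a 1px image
should leave only the surrounding article markup.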
regards,
Andreas
Changes:
* sb-zeit-de.el (shimbun-zeit-de-content-end): Don't match on
advertisements.
(shimbun-clear-contents): New :after method; removes web bugs and
advertisements.
Index: sb-zeit-de.el
===================================================================
RCS file: /storage/cvsroot/emacs-w3m/shimbun/sb-zeit-de.el,v
retrieving revision 1.2
diff -c -r1.2 sb-zeit-de.el
*** sb-zeit-de.el 9 Mar 2005 01:44:35 -0000 1.2
--- sb-zeit-de.el 23 Mar 2005 23:58:07 -0000
***************
*** 33,39 ****
(defvar shimbun-zeit-de-content-start "title\">")
(defvar shimbun-zeit-de-content-end
! "navigation[^>]*>[^A]\\|</p></p></td>\\|\<script\\|</body>\\|</html>")
(defvar shimbun-zeit-de-from-address "DieZeit@zeit.de")
(luna-define-method shimbun-headers :before ((shimbun shimbun-zeit-de)
--- 33,42 ----
(defvar shimbun-zeit-de-content-start "title\">")
(defvar shimbun-zeit-de-content-end
! (concat
! "</body>\\|</html>\\|navigation[^><]*>[^A]\\|"
! "<script language=\"JavaScript1\\.2\" type=\"text/javascript\">"))
!
(defvar shimbun-zeit-de-from-address "DieZeit@zeit.de")
(luna-define-method shimbun-headers :before ((shimbun shimbun-zeit-de)
***************
*** 79,84 ****
--- 82,97 ----
(luna-define-method shimbun-index-url ((shimbun shimbun-zeit-de))
"http://newsfeed.zeit.de/")
+
+ (luna-define-method shimbun-clear-contents :after ((shimbun shimbun-zeit-de)
+ header)
+
+ ;; remove advertisements and 1-pixel-images aka webbugs
+ (shimbun-remove-tags "<a[^>]*doubleclick.net" "</a>")
+ (shimbun-remove-tags "<IFRAME[^>]*doubleclick.net[^>]*>")
+ (shimbun-remove-tags "<img[^>]*doubleclick.net[^>]*>")
+ (shimbun-remove-tags "<img[^>]*\\(width\\|height\\)=\"1px\"[^>]*>")
+ (shimbun-remove-tags "<tr><td[^>]*>Anzeige</td></tr>"))
(provide 'sb-zeit-de)