
9.5.2 Getting web page and header information

First, identify the target web page URL from which subjects and other information will be gathered. If a web site uses frames, the target is just one of the framed pages. Second, write the body of the shimbun-index-url method using the luna-define-method form in your `sb-foobar.el' file. Also define the user-customizable variable shimbun-foobar-groups, which is explained later(12).

 
(defvar shimbun-foobar-url "http://www.foobar.net")

(luna-define-method shimbun-index-url ((shimbun shimbun-foobar))
  shimbun-foobar-url)

(defvar shimbun-foobar-groups '("news"))

Once the shimbun-index-url method is defined, the shimbun-headers method can fetch the web page source, because the `shimbun.el' module already provides a default shimbun-headers method. After fetching the page source, shimbun-headers calls the shimbun-get-headers method to gather header information. The `shimbun.el' module does not provide shimbun-get-headers, so you have to define it in your `sb-foobar.el' file.

Now look carefully at the page source and write the shimbun-get-headers method in your `sb-foobar.el' file.

Create a regular expression that can gather the header information. The minimum information needed for each article is its subject, date, author, URL and message-id. The MUA uses them as Subject, Date, From, Xref and Message-ID.

If you want to make an article from a line in a web page source, like:

 
<a href="053003.html">some talks on May 30(posted by Mikio &lt;foo@bar.net&gt;)</a>

use the following regexp:

 
"<a href=\"\\(\\([0-9][0-9][0-9][0-9]\\)[0-9][0-9]\\.html\\)\">\\([^<(]+\\)(posted by \\([^<]+\\))</a>"

(match-string 1) gives the value for Xref, and (match-string 2) gives a value from which Date can be derived. (match-string 3) gives Subject and (match-string 4) gives From. You can modify these values further to show additional information in the MUA. See the `sb-muchy.el' file, which builds its own subject format from a web page source.
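
As an illustration, the regexp can be tried against the sample line outside of any `shimbun' module. This is only a sketch: the names my-foobar-header-regexp and my-foobar-parse-line are hypothetical and are not part of `shimbun.el'.

```elisp
;; Sketch only: both names below are hypothetical helpers.
(defvar my-foobar-header-regexp
  (concat "<a href=\"\\(\\([0-9][0-9][0-9][0-9]\\)[0-9][0-9]\\.html\\)\">"
          "\\([^<(]+\\)(posted by \\([^<]+\\))</a>")
  "The regexp from the text, split in two for readability.")

(defun my-foobar-parse-line (line)
  "Return a plist of header fields found in LINE, or nil if no match."
  (when (string-match my-foobar-header-regexp line)
    (list :xref (match-string 1 line)        ; relative URL of the article
          :month-day (match-string 2 line)   ; raw material for Date
          :subject (match-string 3 line)
          :from (match-string 4 line))))

(my-foobar-parse-line
 (concat "<a href=\"053003.html\">some talks on May 30"
         "(posted by Mikio &lt;foo@bar.net&gt;)</a>"))
```

For the sample line, the :xref value is "053003.html" and the :month-day value is "0530". The :from value still contains the `&lt;' and `&gt;' entities, so a real module would probably want to decode them before use.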

If the URL of an article is a relative path as above, use shimbun-expand-url to expand it before putting it into the header. If articles do not have unique URLs of their own (i.e. the URL of the headers page and the URL of the articles are the same), you have to make Emacs remember the body of each article while gathering header information. For more detail, see the files `sb-palmfan.el', `sb-dennou.el' and `sb-tcup.el'.
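
To show what expanding a relative path means, the sketch below uses url-expand-file-name from Emacs' own `url-expand.el' library instead of shimbun-expand-url, so that it can be evaluated without loading `shimbun.el'. The variable my-foobar-index-url and the function my-foobar-absolute-url are hypothetical names introduced for this example.

```elisp
(require 'url-expand)

;; Hypothetical index page; relative links are resolved against it.
(defvar my-foobar-index-url "http://www.foobar.net/ml/index.html")

(defun my-foobar-absolute-url (relative)
  "Resolve RELATIVE against `my-foobar-index-url'."
  (url-expand-file-name relative my-foobar-index-url))

;; The relative link from the sample line becomes an absolute URL
;; in the same directory as the index page.
(my-foobar-absolute-url "053003.html")
```

In a real module you would call shimbun-expand-url as described above; this sketch only demonstrates the idea of resolving a link against the page it was found on.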

Sometimes the Date cannot be determined from the web page source alone while gathering header information. If so, just set its value to the null string, "". If the Date can be identified only from the contents of an article, you can set it at that time in the shimbun-make-contents method. You may also use a fixed From for a web site (e.g. "webmaster@foobar.net").

Be careful when you build a message-id. Make sure it is unique, otherwise you may not be able to read some articles in the `shimbun'(13). Ensure uniqueness by building the message-id from date information, the domain of the page and/or a part of the URL of the page. Also, use `@' rather than `:' in the message-id so that inline images can be displayed. See RFC2387 and RFC822 for more detail.
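
One possible way to combine these parts is sketched below. The function name my-foobar-make-message-id and the exact layout of the id are assumptions made for this example, not a convention required by `shimbun.el'.

```elisp
;; Sketch only: a hypothetical helper, not part of shimbun.el.
(defun my-foobar-make-message-id (file group)
  "Build a message-id from the article FILE name and its GROUP.
The local part joins a piece of the URL and the group name; the
domain part is the domain of the site."
  (format "<%s.%s@foobar.net>"
          (file-name-sans-extension file) ; "053003.html" becomes "053003"
          group))

(my-foobar-make-message-id "053003.html" "ml")
;; evaluates to "<053003.ml@foobar.net>"
```

Including the group name keeps ids distinct when two groups happen to use the same file name for different articles.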

Put this information into a header using the shimbun-create-header function of the `shimbun.el' module.

A skeleton of shimbun-get-headers in your `sb-foobar.el' file looks like this:

 
(luna-define-method shimbun-get-headers ((shimbun shimbun-foobar)
                                         &optional range)
  (let ((regexp "....")
        subject from date id url headers)
    ...
    (catch 'stop
      (while (re-search-forward regexp nil t nil)
        ...
        (when (shimbun-search-id shimbun id)
          (throw 'stop nil))
        (push (shimbun-create-header
               0 subject from date id "" 0 0 url)
              headers)))
    headers))

Note that the `shimbun-foobar' instance is available inside the method through the local variable shimbun.

Now we will explain the user variable shimbun-foobar-groups.

Assume that http://www.foobar.net has two groups of articles, each with its own web page from which the `shimbun' module gathers header information. For example, what's-new information for the web site is at http://www.foobar.net/whatsnew/index.html, and the archive lists of email messages posted to the ML are at http://www.foobar.net/ml/index.html. In such a case you may want to access the groups as the `shimbun' folders `foobar.whatsnew' and `foobar.ml'. If so, put the following S-expressions in the `sb-foobar.el' file.

 
(defvar shimbun-foobar-url "http://www.foobar.net")

(defvar shimbun-foobar-group-path-alist
  '(("whatsnew" . "/whatsnew/index.html")
    ("ml" . "/ml/index.html")))

(defvar shimbun-foobar-groups
  (mapcar 'car shimbun-foobar-group-path-alist))

(luna-define-method shimbun-index-url ((shimbun shimbun-foobar))
  (concat shimbun-foobar-url
          (cdr (assoc (shimbun-current-group-internal shimbun)
                      shimbun-foobar-group-path-alist))))

You can get the current group with shimbun-current-group-internal. You can use it in the shimbun-get-headers method (or in other methods) to change the behavior according to the current group.
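
For instance, each group's index page may need its own regexp. The sketch below keeps the per-group choice in an alist; the variable my-foobar-group-regexp-alist, the function my-foobar-regexp-for-group and both regexps are hypothetical. Inside a real method, the GROUP argument would be the value of (shimbun-current-group-internal shimbun).

```elisp
;; Sketch only: hypothetical names and regexps.
(defvar my-foobar-group-regexp-alist
  '(("whatsnew" . "<li><a href=\"\\([^\"]+\\)\">\\([^<]+\\)</a>")
    ("ml" . "<a href=\"\\([0-9]+\\.html\\)\">\\([^<]+\\)</a>"))
  "Alist mapping a group name to the regexp for its index page.")

(defun my-foobar-regexp-for-group (group)
  "Return the header regexp for GROUP, or signal an error."
  (or (cdr (assoc group my-foobar-group-regexp-alist))
      (error "Unknown group: %s" group)))

;; Pick the regexp for the `ml' group and try it on a sample line.
(string-match (my-foobar-regexp-for-group "ml")
              "<a href=\"0001.html\">hello</a>")
```

Keeping the regexps in an alist next to the group names means a new group can be added without touching the method body.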

Each `shimbun' module needs at least one group. There is no special rule for naming a group, but if you cannot think of a good name, use `news' or `main'.



This document was generated by TSUCHIYA Masatoshi on November, 3 2005 using texi2html