Re: Extract real urls in Google search

From: Katsumi Yamaoka <yamaoka@xxxxxxx>
Date: Mon, 04 Jun 2012 08:29:15 +0900
X-ml-name: emacs-w3m
X-mail-count: 11838
References: <87ehpwzh1z.fsf@xxxxxxxxxxx>

In [emacs-w3m : No.11837] jidanni@xxxxxxxxxxx wrote:
> Even though I use the functions in
> http://jidanni.org/comp/configuration/.emacs-w3m
> still the links in
> http://www.google.com.tw/search?q=%E9%AB%98%E9%9B%84%E5%9C%96%E6%9B%B8%E9%A4%A8&ie=utf-8&oe=utf-8
> have
> http://www.google.com.tw/url?q=htt... attached.

Ok.  The regexp needs to be improved.  Try this, or use the latest
emacs-w3m CVS:

--8<---------------cut here---------------start------------->8---
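(eval-after-load "w3m-filter"
  '(progn
     ;; Run the filter on any Google host, over http or https.
     (nconc w3m-filter-rules
            '(("\\`https?://[a-z]+\\.google\\." w3m-filter-google)))
     (defun w3m-filter-google (url)
       "Extract real urls in Google search."
       (goto-char (point-min))
       ;; Group 1 is the anchor tag up to and including the opening quote
       ;; of href; group 2 is the still-encoded target taken from the
       ;; imgurl= or q= parameter.
       (while (re-search-forward "\\(<a[\t\n ]+\\(?:[^\t\n >]+[\t\n ]+\\)*\
href=\"\\)/\\(?:imgres\\?imgurl\\|url\\?q\\)=\\([^&]+\\)[^>]+>"
                                 nil t)
         ;; Replace the whole tag with one that points at the decoded target.
         (insert (w3m-url-decode-string
                  (prog1
                      (concat (match-string 1) (match-string 2) "\">")
                    (delete-region (match-beginning 0) (match-end 0)))))))))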
--8<---------------cut here---------------end--------------->8---
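
To check the regexp without loading a Google page, something like the
following sketch may help.  It applies the same regexp to a string in a
temporary buffer.  The function name my-w3m-google-demo and the sample
link are made up for illustration, and url-unhex-string (from url-util)
stands in for w3m-url-decode-string so that it runs without emacs-w3m;
the two may differ for non-ASCII URLs.

--8<---------------cut here---------------start------------->8---
(require 'url-util)  ; provides `url-unhex-string'

(defun my-w3m-google-demo (html)
  "Return HTML with Google redirect links replaced by their targets.
Hypothetical helper for testing only, not part of emacs-w3m."
  (with-temp-buffer
    (insert html)
    (goto-char (point-min))
    (while (re-search-forward "\\(<a[\t\n ]+\\(?:[^\t\n >]+[\t\n ]+\\)*\
href=\"\\)/\\(?:imgres\\?imgurl\\|url\\?q\\)=\\([^&]+\\)[^>]+>"
                              nil t)
      ;; Assumption: `url-unhex-string' replaces `w3m-url-decode-string'.
      (insert (url-unhex-string
               (prog1
                   (concat (match-string 1) (match-string 2) "\">")
                 (delete-region (match-beginning 0) (match-end 0))))))
    (buffer-string)))

;; Made-up sample link in the shape the regexp expects.
(my-w3m-google-demo
 "<a class=\"l\" href=\"/url?q=http://example.org/a%3Fb&amp;sa=U\">x</a>")
;; Should return:
;; <a class="l" href="http://example.org/a?b">x</a>
--8<---------------cut here---------------end--------------->8---

Note that everything in the tag after the first `&' is dropped, so any
attributes that follow href are lost in the rewritten link.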

> I also notice an interesting issue.
> If I browse
> httP://jidanni.org/comp/ instead of
> http://jidanni.org/comp/
> many of the link destinations in that page get messed up!

What differs between them?  I tried your .emacs-w3m and saw no
difference.

Follow-Ups:
- httP vs http
  - From: jidanni

References:
- Extract real urls in Google search
  - From: jidanni
