
BUG: emacs-w3m regexp stack overflow with long lines??



I've just started using emacs-w3m.  It's wonderful!  However, here's a
bug that I encountered on the first page I tried it on.

The bug is fully reproducible. Unfortunately, I was not able to send
the HTML with the bug report, but I have shown how to reproduce the
bug.


---+ BRIEF

Is an error such as

  error in process sentinel: w3m-form-parse-and-fontify: Stack overflow in regexp matcher

already known?

It occurs for a fairly complex web page - the preview of a wiki page.

---+ ALTERNATE

emacs-w3m seems to depend on regexps that cause Emacs' regexp stack
to overflow when presented with single lines of roughly 69K bytes.

(This may depend on the structure of the line, and/or its presence
in a particular hidden input element.)


---+ DETAIL

I have just installed w3m and emacs/w3m. I am able to browse simple
web pages, possibly even complicated ones.

My main purpose in trying emacs/w3m is to be able to access wiki pages
(e.g. www.twiki.org) from inside emacs. Wikis are a simple system for
editable web pages. I am able to browse my wiki pages. I am able to
edit a very small wiki page.  

However, I have problems when I try to edit a fairly complicated wiki
page - my blog. 1793 lines, 85718 bytes.

---++ Bug Scenario

Step 1 - OK: I can view this wiki page. 
    Buffer menu line
    . % *w3m*       85297  w3m      TWiki . Glew . GlewBlog

Step 2 - OK: When I start trying to edit this wiki page, I am given a web 
page with a 1793 line long text area. Well, actually, only 16 lines
of the textarea are displayed, but I assume the others are there.

    Buffer menu line
    . % *w3m*        2616  w3m      TWiki . Glew . GlewBlog (edit)

Step 3 - OK: when I hit return in the textarea, I correctly enter the
emacs/w3m emacs editor window/buffer.

    Buffer menu lines
    .*  *w3m form textarea*    82925 w3m form textarea 
      % *w3m*        2616  w3m      TWiki . Glew . GlewBlog (edit)


Step 4 - OK: I can save the page from emacs/w3m via C-c C-c.  This
puts me back into the wiki page with the 1793 line long text area.

    Buffer menu line:
    . % *w3m*        2616  w3m      TWiki . Glew . GlewBlog (edit)


Step 5 - Problem:  when I hit the preview button on the wiki page of
step 4, I end up with an error.

    Error message in minibuffer (extracted from *Messages* buffer):

   error in process sentinel: w3m-form-parse-and-fontify: Stack overflow in regexp matcher
   error in process sentinel: Stack overflow in regexp matcher

    Buffer menu line:
    .*% *w3m*      183853  w3m      TWiki . Glew . GlewBlog (edit)

    The *w3m* buffer contains partially formatted text that looks like

       |<base href="https://dpg-or.pdx.intel.com/Tools/TWiki/bin/view/Glew/GlewBlog"><pre_int><img_alt src="/Tools/TWiki/pub/wikiHome.gif" hseq="1" title="TWiki Home">TWiki Ho</img_alt></pre_int>                                        <_SYMBOL TYPE=32>•</_SYMBOL> To save changes: Press the [Save Changes] button.      
       |         TWiki . Glew . GlewBlog (preview)      <_SYMBOL TYPE=32>•</_SYMBOL> To make more changes: Go back in your browser.         
       |                                                <_SYMBOL TYPE=32>•</_SYMBOL> To cancel: Go back twice.                              
       |                                                                                                         
       |         Note: This is a preview. Do not forget to save your changes.                                    
       |
       |
       |Glew's pseudo-wiki-blog
       |
       |GlewBlogTOC - Table of Contents
       |
       |Friday December 10, 2004

    (The pipe symbol | indicates beginning of line, and is not in the buffer.)

    I.e. the buffer appears to be partially formatted.

    The w3m process is still running - at least, if I try to "g" to go
    to a new page I get told that I cannot start the asynchronous
    process twice.

   Cannot run two w3m processes simultaneously (Type `C-c C-k' to stop asynchronous process)

---++ Reproducing the bug for you

Unfortunately, my blog is on an Intel internal web site.  Not only can
outsiders not access it, but my blog may contain stuff that I would
get in trouble for releasing to the outside world.

I'm recording the above, as I first encountered the bug, because it's
better than nothing.  I understand that it would be nicer to provide
you with an example where you can reproduce the bug, and I will be
attempting to do so.

(However, I am not one of those who say that a non-reproducible bug
report is worthless. Reproducible bug reports are best, but sometimes
reproducing a bug is hard. Sometimes a good description will allow
somebody more expert than the bug reporter to locate, reproduce,
and/or fix the bug.)

...

I have reproduced the bug at http://www.twiki.org.

Specifically, page
http://twiki.org/cgi-bin/view/Sandbox/SandBoxW3m
holds the bug description (a version of this email),
http://twiki.org/cgi-bin/view/Sandbox/EmacsW3mBugReport
and a page that demonstrates the bug,
http://twiki.org/cgi-bin/view/Sandbox/EmacsW3mBugDemo

The page that demonstrates the bug can be edited,
but the edited page cannot be previewed.

...

I was able to edit and preview the demo page using
w3m, but not using emacs-w3m. From w3m I was
able to save the HTML.

w3m-find-file on the saved HTML reproduced the bug.
Unfortunately, the HTML that causes the bug is too large
to attach to this wiki or to mail (as you will see below).

Binary search revealed that the problem was a hidden input
element.

<verbatim>
   <input type="hidden" name="text" value="TBD deleted large text" />
</verbatim>

This element contained the entire text of the page,
converted to a single line.

Binary search revealed that the stack overflow occurred between a line
length of 68K and 69K bytes.


I.e. apparently the regexps used by emacs-w3m cause a stack overflow
for line lengths of approximately 69K bytes.
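
Since the offending HTML is too large to attach, here is a minimal
sketch (in Python) of how a comparable test page might be generated
and how the threshold could be bisected.  The names make_test_page
and bisect_threshold, the filler text, and the surrounding form markup
are all assumptions made for illustration; the real page's content was
different, and the overflow may depend on the structure of the line,
so a synthetic page is not guaranteed to trigger the bug.

```python
import html


def make_test_page(value_bytes: int, filler: str = "x ") -> str:
    """Return a small HTML page whose hidden input value is value_bytes
    ASCII characters long and sits on a single line.

    Hypothetical helper: the filler is arbitrary, not the original text.
    """
    reps = value_bytes // len(filler) + 1
    value = (filler * reps)[:value_bytes]
    return (
        '<html><body><form action="#">\n'
        '<input type="hidden" name="text" value="'
        + html.escape(value, quote=True)
        + '" />\n'
        "</form></body></html>\n"
    )


def bisect_threshold(fails, lo: int, hi: int) -> int:
    """Return the smallest size in (lo, hi] for which fails(size) is true.

    Assumes fails(lo) is false, fails(hi) is true, and that failure is
    monotonic in size.  fails is supplied by the tester, e.g. by opening
    the generated file and noting whether the error appears.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For example, writing make_test_page(68 * 1024) and
make_test_page(69 * 1024) to two files and opening each with
w3m-find-file would show whether a synthetic line behaves like the
real one.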


COMMENT: that's a pretty long line!  However, I anticipate and
disagree with statements such as "nobody in their right mind would
create such a long line". TWiki apparently does. It is automatically
generated code.

Also note that an HTML document need have no newlines
at all. Line length limited regexp parsing is dangerous.

I have not yet investigated what might be done to emacs-w3m to avoid
this problem.  Usually what needs to be done in such cases is to
replace a single powerful regexp with simpler regexps. Possibly also
to skip long lines.
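
To illustrate the general idea only, here is a sketch in Python (not
Emacs Lisp, and not based on the actual emacs-w3m regexps, which I
have not looked at).  Instead of one regexp that must match a whole
tag including its quoted attribute values, it finds the end of a tag
with a linear scan, jumping over each quoted value with a plain
substring search, so the work does not grow a matcher stack with the
length of the value.  The function name scan_tag_end is hypothetical.

```python
def scan_tag_end(text: str, start: int) -> int:
    """Return the index just past the '>' that closes the tag beginning
    at text[start], or -1 if the tag is unterminated.

    A '>' inside a quoted attribute value does not end the tag.
    """
    i = start
    n = len(text)
    while i < n:
        ch = text[i]
        if ch == ">":
            return i + 1
        if ch == '"' or ch == "'":
            # Skip the whole quoted value in one substring search.
            close = text.find(ch, i + 1)
            if close == -1:
                return -1
            i = close + 1
        else:
            i += 1
    return -1
```

Once the tag boundaries are known, simpler patterns can be applied to
short pieces such as the attribute names, and an oversized value can
be skipped or handled separately.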