Mnemonic XML/HTML/DTD parser

Overview

This module listens for msg_network_status messages of type content and starts listening for msg_network_data messages as soon as a text/html, text/xml or text/css stream has been announced. It converts these ASCII streams into DOM::Document trees, which are announced using msg_dom_create_event messages.
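
The gating described above can be sketched roughly as follows. This is only an illustration in Python, not the module's actual code: the class name XmlHandlerSketch, the callback names and the idea of keying streams by a connection id are all assumptions made for the example.

```python
# MIME types the handler reacts to, as listed in the overview.
PARSED_MIME_TYPES = {"text/html", "text/xml", "text/css"}


class XmlHandlerSketch:
    """Hypothetical stand-in for the xmlhandler's message callbacks."""

    def __init__(self):
        # connection id (assumed key) -> chunks received so far
        self.buffers = {}

    def on_network_status(self, connection, status_type, mime_type):
        # Only a "content" announcement for a parsable type opens a stream.
        if status_type == "content" and mime_type in PARSED_MIME_TYPES:
            self.buffers[connection] = []

    def on_network_data(self, connection, chunk):
        # Data for connections that were never announced is ignored.
        if connection in self.buffers:
            self.buffers[connection].append(chunk)


handler = XmlHandlerSketch()
handler.on_network_status(1, "content", "text/html")
handler.on_network_status(2, "content", "image/png")
handler.on_network_data(1, "<HTML>")
handler.on_network_data(2, "\x89PNG")
```

Only connection 1 ends up with buffered data; the image stream is never picked up.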

The xmlhandler does not spawn new threads, since the network layer is expected to have started one thread for each connection already.

As the xmlhandler builds the DOM tree, it sends out msg_dom_mutation_event messages to notify listeners. The W3C event model is not very smart, as it has separate registration points for every node. We could do that, but it would correspond to having keyed messages, and we don't like that. Instead, you register for the message TYPE only, rather than for TYPE and TARGET, and do the target filtering in the receiver. In particular, the display module will listen to mutation events and use the pointers stored in them to find the elements of the DOM that have changed.

This is only an initial version, which focuses on making a correct DOM tree from all that broken HTML out there. It will probably still fail most of the tests in the OASIS test suite.
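
To make the registration model concrete, here is a minimal sketch of "register per message type, filter on target in the receiver". It is written in Python for brevity; MessageBus, DisplayListener and the use of strings as node targets are hypothetical and only stand in for the real message core and DOM pointers.

```python
from collections import defaultdict


class MessageBus:
    """Hypothetical dispatcher: listeners register per message type only."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, msg_type, callback):
        self._listeners[msg_type].append(callback)

    def send(self, msg_type, **payload):
        # Every listener of this type gets the message, whatever the target.
        for callback in self._listeners[msg_type]:
            callback(payload)


class DisplayListener:
    """Receives every mutation event and filters on the target itself."""

    def __init__(self, watched_nodes):
        self.watched = set(watched_nodes)
        self.dirty = []

    def on_mutation(self, event):
        # Target filtering happens here, in the receiver, not in the bus.
        if event["target"] in self.watched:
            self.dirty.append(event["target"])


bus = MessageBus()
display = DisplayListener(watched_nodes={"body", "title"})
bus.register("msg_dom_mutation_event", display.on_mutation)

bus.send("msg_dom_mutation_event", target="title")
bus.send("msg_dom_mutation_event", target="script")  # ignored by the receiver
```

The bus stays free of per-node bookkeeping; the cost is that each receiver sees events it may not care about.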

Namespaces

Some XML namespace terminology first. In the example

<foo xmlns:bar="http://bla">
<bar:one>
  • foo is a local name
  • one is a local name
  • bar is a namespace prefix
  • http://bla is a namespace identifier

From the working draft, here's a larger example to illustrate most of the namespace aspects one will ever use:

<?xml version="1.0"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
    <title>Cheaper by the Dozen</title>
    <isbn:number>1568491379</isbn:number>
    <notes>
      <!-- make HTML the default namespace 
              for some commentary -->
      <p xmlns='urn:w3-org-ns:HTML'>
          This is a <i>funny</i> book!
      </p>
    </notes>
</book>
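
As an illustration of how the prefixes and default namespaces in this example resolve, the following sketch runs it through Python's standard xml.etree.ElementTree parser (not the Mnemonic parser), which reports every element as {namespace}localname. The helper split_tag is a name made up for this example, and the comments from the original document are left out for brevity.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
    <title>Cheaper by the Dozen</title>
    <isbn:number>1568491379</isbn:number>
    <notes>
      <p xmlns='urn:w3-org-ns:HTML'>
          This is a <i>funny</i> book!
      </p>
    </notes>
</book>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


root = ET.fromstring(BOOK_XML)
# Walk the tree in document order and resolve each element's namespace.
resolved = [split_tag(element.tag) for element in root.iter()]
for namespace, local in resolved:
    print(f"{local:8} {namespace}")
```

Note that <i> carries no namespace declaration of its own; it inherits the default namespace set on the enclosing <p>.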

Parsing of HTML

The parser follows the specifications of the HTML DTD, including

  • Automatic insertion of optional start tags. Example: a file not enclosed in <HTML>...</HTML> will get those tags inserted automatically. This only works one level deep though (e.g. it will choke if you forget both the HTML and BODY opening tags).

  • Automatic insertion of optional end tags (i.e. you can forget about </LI> tags; they will be added automatically).

All of this is smart parsing, but it is required by the specs. On the other hand, the parser also applies various smart tricks to parse illegal HTML; see below.
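
The optional end tag rule can be sketched with a stack of open elements. The Python below is a simplified illustration, not the parser's implementation: the IMPLIED_END table is a hypothetical, heavily reduced stand-in for what the HTML DTD specifies, and the regular expression tokenizer ignores comments, entities and other details.

```python
import re

# Hypothetical, reduced table: element -> start tags that implicitly close it.
IMPLIED_END = {"LI": {"LI"}, "P": {"P"}}
OPTIONAL_END = set(IMPLIED_END)

# A tag (with optional leading slash) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def insert_optional_end_tags(html):
    out, stack = [], []
    for match in TOKEN.finditer(html):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
            continue
        name = name.upper()
        if closing:
            # Close open elements with optional end tags sitting on top.
            while stack and stack[-1] != name and stack[-1] in OPTIONAL_END:
                out.append(f"</{stack.pop()}>")
            if stack and stack[-1] == name:
                stack.pop()
        elif stack and name in IMPLIED_END.get(stack[-1], ()):
            # A new start tag implicitly ends the current element.
            out.append(f"</{stack.pop()}>")
            stack.append(name)
        else:
            stack.append(name)
        out.append(match.group(0))
    return "".join(out)


repaired = insert_optional_end_tags("<UL><LI>one<LI>two</UL>")
print(repaired)
```

For the input shown, both missing </LI> tags are inserted: one before the second <LI> and one before </UL>.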

Automatic repair of broken HTML

The following corrections are automatically made to broken HTML streams in order to convert them to valid DOM trees:

  • When an element is encountered that is not allowed as a child of the current node, and the current node's end tag is not optional, the parser assumes that the end tag was forgotten and inserts it automatically. Example:
    <A HREF="a">foo
    <A HREF="b">bar</A>
    will be converted to
    <A HREF="a">foo</A>
    <A HREF="b">bar</A>
  • When a closing tag is encountered that does not match the current opening tag, additional closing tags are inserted. Example:
    <HEAD>
    <TITLE>foobar
    </HEAD>
    will be converted to
    <HEAD>
    <TITLE>foobar</TITLE>
    </HEAD>
    This is only done when the closing tag that is encountered (i.e. </HEAD>) actually matches an opening tag encountered earlier. If it does not, the stray closing tag is discarded. Example:
    <HEAD>
    <TITLE>foobar
    </FONT>
    will be converted to
    <HEAD>
    <TITLE>foobar
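
The closing tag rule from the second item can be sketched as a lookup on the stack of open elements. This Python fragment is an illustration under simplifying assumptions, not the parser's code; repair_closing_tag and the list-based stack are hypothetical names and structures.

```python
def repair_closing_tag(open_stack, closing_name):
    """Return the end tags to emit for one closing tag; mutates open_stack.

    open_stack lists the currently open elements, innermost last.
    """
    if closing_name not in open_stack:
        # No matching opening tag was seen earlier: discard the stray tag.
        return []
    emitted = []
    while open_stack:
        name = open_stack.pop()
        emitted.append(f"</{name}>")  # inserted for inner elements, real for the match
        if name == closing_name:
            break
    return emitted


# <HEAD><TITLE>foobar</HEAD>: </TITLE> is inserted before </HEAD>.
stack = ["HEAD", "TITLE"]
matched = repair_closing_tag(stack, "HEAD")

# <HEAD><TITLE>foobar</FONT>: the stray </FONT> is dropped.
other_stack = ["HEAD", "TITLE"]
stray = repair_closing_tag(other_stack, "FONT")
```

In the first case the stack is emptied and two end tags are produced; in the second nothing is emitted and both elements stay open.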
    

Explanation of the test_xmlparser output

When you run `test_xmlparser [filename]', the output is a textual representation of the DOM tree that the parser created from the XML/HTML file. Some things you may see in the output:

 [parse_error]:  
   A '/' in a tag did not get followed by '>'
 #document
    HTML
       HEAD
          TITLE
             #text Slash...
       BODY BGCOLOR="#000000" LINK="#006666" 
            TEXT="#333333" VLINK="#000000"
          #text  
  1. [parse_error]: is an exception, thrown because the parser got seriously confused.
  2. #document is the document's root node
  3. HTML, HEAD, TITLE and BODY are what you guessed they are, nested as indicated.
  4. #text indicates a text node, with the first few characters of the text displayed after it (e.g. `Slash...' indicates that this text node starts with `Slash' but has more text following it, while an empty `#text' is a whitespace text node).
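
For readers who want to reproduce a similar dump without building Mnemonic, here is a rough sketch using Python's standard xml.dom.minidom. It is not test_xmlparser itself: the function dump, the three-space indent, the five-character text preview and the sample input are assumptions chosen to resemble the output above, and minidom only accepts well-formed XML.

```python
from xml.dom import minidom


def dump(node, depth=0, lines=None):
    """Collect an indented, one-node-per-line view of a DOM tree."""
    lines = [] if lines is None else lines
    indent = "   " * depth
    if node.nodeType == node.TEXT_NODE:
        text = node.data.strip()
        # Show only the first few characters of longer text nodes.
        preview = text[:5] + "..." if len(text) > 5 else text
        lines.append(f"{indent}#text {preview}".rstrip())
    elif node.nodeType == node.ELEMENT_NODE:
        attrs = " ".join(f'{k}="{v}"' for k, v in sorted(node.attributes.items()))
        lines.append(f"{indent}{node.nodeName} {attrs}".rstrip())
    else:
        # The document node reports its name as "#document".
        lines.append(f"{indent}{node.nodeName}")
    for child in node.childNodes:
        dump(child, depth + 1, lines)
    return lines


SAMPLE = ('<HTML><HEAD><TITLE>Slashdot</TITLE></HEAD>'
          '<BODY BGCOLOR="#000000"> </BODY></HTML>')
doc = minidom.parseString(SAMPLE)
lines = dump(doc)
print("\n".join(lines))
```

As in the real output, a text node consisting only of whitespace shows up as a bare #text line.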