This module listens for msg_network_status messages of
type content and starts listening for
msg_network_data messages as soon as a
text/html, text/xml or text/css
stream has been announced. It converts these ASCII streams into
DOM::Document trees, which are announced
using msg_dom_create_event messages.
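As a rough illustration of the stream selection described above, the
following sketch checks whether an announced content type is one the
xmlhandler would pick up. The function name is_handled_content_type and
the way parameters such as "; charset=..." are stripped are assumptions
made for this example, not Mnemonic's actual code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Mnemonic): returns true if the announced
// content type is one of the three stream types the xmlhandler converts.
bool is_handled_content_type(const std::string& announced)
{
    // Drop optional parameters such as "; charset=iso-8859-1" (assumption).
    std::string type = announced.substr(0, announced.find(';'));

    // Trim trailing whitespace and lower-case the media type.
    while (!type.empty() && std::isspace(static_cast<unsigned char>(type.back())))
        type.pop_back();
    std::transform(type.begin(), type.end(), type.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    static const std::array<const char*, 3> handled = {
        "text/html", "text/xml", "text/css"
    };
    return std::any_of(handled.begin(), handled.end(),
                       [&](const char* h) { return type == h; });
}

// Small usage example; call it from main() or a test runner.
void content_type_examples()
{
    assert(is_handled_content_type("text/html"));
    assert(is_handled_content_type("Text/XML; charset=utf-8"));
    assert(!is_handled_content_type("image/png"));
}
```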
The xmlhandler does not spawn new threads, since the network layer is
expected to have started one thread for each connection already.
As the xmlhandler is building the DOM tree, it sends out
msg_dom_mutation_event messages to notify listeners.
The W3C event model is not very smart: it has a separate registration
point for every node. We could do that too, but it would amount to
having keyed messages, and we don't like that. Instead, you
register for the message TYPE only, not for TYPE and TARGET, and do the
target filtering in the receiver.
In particular, the display module will listen to modification events
and use the pointers stored in them to find the elements of the DOM
that have changed.
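To make the registration scheme concrete, here is a minimal sketch of a
receiver that subscribes by message type only and filters on the target
itself. The Dispatcher, MutationEvent and Node types are hypothetical
stand-ins invented for this example; Mnemonic's real message API may look
quite different.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Mnemonic's types; all names are assumptions.
struct Node { std::string name; };

struct MutationEvent {
    const Node* target;   // pointer to the DOM node that changed
};

// Minimal dispatcher: listeners register per message TYPE only.
class Dispatcher {
public:
    using Listener = std::function<void(const MutationEvent&)>;

    void subscribe(const std::string& type, Listener listener) {
        listeners_[type].push_back(std::move(listener));
    }
    void send(const std::string& type, const MutationEvent& ev) const {
        auto it = listeners_.find(type);
        if (it == listeners_.end()) return;
        for (const auto& listener : it->second) listener(ev);
    }
private:
    std::map<std::string, std::vector<Listener>> listeners_;
};

// Sends three mutation events (two for `watched`, one for `other`) and
// returns how many the receiver kept after filtering on the target.
int count_events_for(const Node& watched, const Node& other)
{
    Dispatcher d;
    int hits = 0;
    d.subscribe("msg_dom_mutation_event", [&](const MutationEvent& ev) {
        if (ev.target != &watched) return;   // target filtering in the receiver
        ++hits;
    });
    d.send("msg_dom_mutation_event", MutationEvent{&watched});
    d.send("msg_dom_mutation_event", MutationEvent{&other});
    d.send("msg_dom_mutation_event", MutationEvent{&watched});
    return hits;
}

// Small usage example; call it from main() or a test runner.
void target_filter_example()
{
    Node body{"BODY"};
    Node paragraph{"P"};
    assert(count_events_for(body, paragraph) == 2);
}
```

The receiver sees every event of the type it registered for, so the cost
of filtering moves from the dispatcher to the listener.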
This is only an initial version (which focuses on making a correct
DOM tree from all that broken HTML out there). It will probably still
fail most of the tests in the OASIS test suite.
Some XML namespace terminology first. In the example
<foo xmlns:bar="http://bla">
<bar:one>
- foo is a local name
- one is a local name
- bar is a namespace prefix
- http://bla is a namespace identifier
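The terminology can be illustrated with a small helper that splits a
name such as bar:one into its namespace prefix and local name. This is
only a sketch; QName and split_qname are names made up for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical value type for this example: a name split at the colon.
struct QName {
    std::string prefix;   // namespace prefix, empty if there is none
    std::string local;    // local name
};

// Splits "bar:one" into prefix "bar" and local name "one".
// A name without a colon, such as "foo", has an empty prefix.
QName split_qname(const std::string& qname)
{
    const std::string::size_type colon = qname.find(':');
    if (colon == std::string::npos)
        return QName{std::string(), qname};
    return QName{qname.substr(0, colon), qname.substr(colon + 1)};
}

// Small usage example; call it from main() or a test runner.
void qname_examples()
{
    assert(split_qname("bar:one").prefix == "bar");
    assert(split_qname("bar:one").local == "one");
    assert(split_qname("foo").prefix.empty());
    assert(split_qname("foo").local == "foo");
}
```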
From the working draft, here's a larger example to illustrate most
of the namespace aspects one will ever use:
<?xml version="1.0"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace
         for some commentary -->
    <p xmlns='urn:w3-org-ns:HTML'>
      This is a <i>funny</i> book!
    </p>
  </notes>
</book>
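The scoping in this example (a default namespace that is overridden
inside the p element and restored afterwards) can be modelled with a
stack of declarations, one entry per open element. The sketch below
assumes exactly that; the NamespaceScopes class is hypothetical and does
not describe how the xmlhandler stores namespaces.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: tracks xmlns declarations per open element.
class NamespaceScopes {
public:
    using Declarations = std::map<std::string, std::string>;

    // Call when an element opens; key "" holds the default namespace.
    void push(const Declarations& declared) { scopes_.push_back(declared); }

    // Call when that element closes.
    void pop() {
        assert(!scopes_.empty());
        scopes_.pop_back();
    }

    // Resolves an element name such as "isbn:number" or "title" to its
    // namespace identifier; returns "" if no declaration is in scope.
    std::string resolve(const std::string& qname) const {
        const std::string::size_type colon = qname.find(':');
        const std::string prefix =
            colon == std::string::npos ? std::string() : qname.substr(0, colon);
        // The innermost declaration wins, so search from the back.
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(prefix);
            if (found != it->end()) return found->second;
        }
        return std::string();
    }
private:
    std::vector<Declarations> scopes_;
};

// Walks through the element structure of the book example.
void namespace_example()
{
    NamespaceScopes ns;

    NamespaceScopes::Declarations book;
    book[""] = "urn:loc.gov:books";
    book["isbn"] = "urn:ISBN:0-395-36341-6";
    ns.push(book);                                  // <book ...>
    assert(ns.resolve("title") == "urn:loc.gov:books");
    assert(ns.resolve("isbn:number") == "urn:ISBN:0-395-36341-6");

    ns.push(NamespaceScopes::Declarations());       // <notes>
    NamespaceScopes::Declarations p;
    p[""] = "urn:w3-org-ns:HTML";
    ns.push(p);                                     // <p xmlns=...>
    assert(ns.resolve("i") == "urn:w3-org-ns:HTML");
    ns.pop();                                       // </p>
    assert(ns.resolve("notes") == "urn:loc.gov:books");
    ns.pop();                                       // </notes>
    ns.pop();                                       // </book>
}
```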
The parser follows the specifications of the HTML DTD, including:
- Automatic insertion of optional start tags. Example: a file not
enclosed in <HTML> ... </HTML> will get those
tags inserted automatically. This only works one level deep, though (e.g. it will
choke if you forget both the HTML and BODY opening tags).
- Automatic insertion of optional end tags (i.e. you can forget about
</LI> tags; they will be added automatically).
All of this is smart parsing, but it is required by the specs. On top of
that, the parser applies various tricks to parse illegal
HTML; see below.
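As an illustration of optional end tags, the sketch below inserts the
missing </LI> tags into a stream of tokens. It is deliberately
simplified: it assumes the input is already split into tokens, that tag
names are upper case without attributes, and that lists are not nested.
The function close_list_items is hypothetical and is not how the real
parser is organised.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Inserts omitted </LI> end tags into a flat (non-nested) list.
// Tokens are either tags such as "<LI>" or plain text.
std::vector<std::string> close_list_items(const std::vector<std::string>& tokens)
{
    std::vector<std::string> out;
    bool li_open = false;
    for (const std::string& tok : tokens) {
        const bool starts_item = (tok == "<LI>");
        const bool ends_list = (tok == "</UL>" || tok == "</OL>");
        // A new item or the end of the list implies the previous item ended.
        if (li_open && (starts_item || ends_list)) {
            out.push_back("</LI>");
            li_open = false;
        }
        if (starts_item) li_open = true;
        if (tok == "</LI>") li_open = false;   // an explicit end tag is kept as-is
        out.push_back(tok);
    }
    return out;
}

// Small usage example; call it from main() or a test runner.
void optional_end_tag_example()
{
    const std::vector<std::string> in = {
        "<UL>", "<LI>", "one", "<LI>", "two", "</UL>"
    };
    const std::vector<std::string> expected = {
        "<UL>", "<LI>", "one", "</LI>", "<LI>", "two", "</LI>", "</UL>"
    };
    assert(close_list_items(in) == expected);
}
```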
Automatic repair of broken HTML
The following corrections are automatically made to broken
HTML streams in order to convert them to valid DOM trees:
- When an element is encountered that is not allowed as a child of the
current node, and if the current node does not have an optional end tag,
the parser will assume that the end tag was forgotten and insert it
automatically. Example:
<A HREF="a">foo
<A HREF="b">bar</A>
will be converted to
<A HREF="a">foo</A>
<A HREF="b">bar</A>
- When a closing tag is encountered that does not match the current
opening tag, additional closing tags are inserted. Example:
<HEAD>
<TITLE>foobar
</HEAD>
will be converted to
<HEAD>
<TITLE>foobar</TITLE>
</HEAD>
This is only done when the closing tag that is encountered
(i.e. </HEAD>) actually matches an opening tag encountered earlier.
If not, the stray closing tag is discarded. Example:
<HEAD>
<TITLE>foobar
</FONT>
will be converted to
<HEAD>
<TITLE>foobar
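The repairs listed above can be sketched with a stack of open elements.
The code below is an illustration under assumptions, not the parser's
real implementation: handle_start_tag, handle_end_tag and allowed_child
are hypothetical names, and the single content rule (an A may not contain
another A) stands in for what the HTML DTD would actually say.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical content rule for this example only:
// an A element may not contain another A element.
bool allowed_child(const std::string& parent, const std::string& child)
{
    return !(parent == "A" && child == "A");
}

// Opens `name`; if it is not allowed inside the current element, the
// current element is closed first. Returns the tags that are emitted.
std::vector<std::string> handle_start_tag(std::vector<std::string>& open,
                                          const std::string& name)
{
    std::vector<std::string> emitted;
    if (!open.empty() && !allowed_child(open.back(), name)) {
        emitted.push_back("</" + open.back() + ">");   // forgotten end tag
        open.pop_back();
    }
    open.push_back(name);
    emitted.push_back("<" + name + ">");
    return emitted;
}

// Handles the end tag `name`. If it matches an element opened earlier,
// everything opened after that element is closed first; otherwise the
// end tag is discarded. Returns the end tags that are emitted.
std::vector<std::string> handle_end_tag(std::vector<std::string>& open,
                                        const std::string& name)
{
    std::vector<std::string> emitted;
    if (std::find(open.rbegin(), open.rend(), name) == open.rend())
        return emitted;                                // stray end tag
    while (!open.empty()) {
        const std::string top = open.back();
        open.pop_back();
        emitted.push_back("</" + top + ">");
        if (top == name) break;
    }
    return emitted;
}

// Replays the HEAD/TITLE examples from the text.
void repair_examples()
{
    std::vector<std::string> open = { "HEAD", "TITLE" };
    assert(handle_end_tag(open, "FONT").empty());      // </FONT> is dropped
    assert(open.size() == 2);
    const std::vector<std::string> closed = handle_end_tag(open, "HEAD");
    assert(closed.size() == 2);
    assert(closed[0] == "</TITLE>" && closed[1] == "</HEAD>");
    assert(open.empty());
}
```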
Explanation of the test_xmlparser output
When you run `test_xmlparser [filename]', the output is a textual representation
of the DOM tree that the parser created from the XML/HTML file.
Some things you may see:
[parse_error]:
A '/' in a tag did not get followed by '>'
#document
  HTML
    HEAD
      TITLE
        #text Slash...
    BODY BGCOLOR="#000000" LINK="#006666"
         TEXT="#333333" VLINK="#000000"
      #text
- [parse_error]: is an exception, thrown because the parser got
seriously confused.
- #document is the document's root node
- HTML, HEAD, TITLE and BODY are what you guessed they are, nested as
indicated.
- #text indicates a text node, with the first few characters of the
text displayed after it (e.g. `Slash...' indicates that this text
node starts with `Slash' but has more text following it, while an
empty `#text' is a whitespace text node).
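For readers who want to produce a similar listing themselves, here is a
minimal sketch of such a tree dump. The Node struct, the two-space
indentation and the preview length of five characters are assumptions
chosen for the example; attributes are left out, and this is not the
source of test_xmlparser.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node.
struct Node {
    std::string name;             // element name, "#document" or "#text"
    std::string text;             // text content (used by text nodes only)
    std::vector<Node> children;
};

// Writes one line per node, indented two spaces per nesting level.
void dump(const Node& n, int depth, std::ostream& out)
{
    out << std::string(static_cast<std::size_t>(depth) * 2, ' ') << n.name;
    if (n.name == "#text") {
        // Whitespace-only text nodes print as a bare "#text".
        const bool blank = std::all_of(n.text.begin(), n.text.end(),
            [](unsigned char c) { return std::isspace(c) != 0; });
        if (!blank) {
            const std::size_t preview = 5;   // assumed preview length
            out << ' ' << n.text.substr(0, preview);
            if (n.text.size() > preview) out << "...";
        }
    }
    out << '\n';
    for (const Node& child : n.children) dump(child, depth + 1, out);
}

// Builds #document > TITLE > #text and returns the dump as a string.
std::string dump_example()
{
    Node text;
    text.name = "#text";
    text.text = "Slashdot";
    Node title;
    title.name = "TITLE";
    title.children.push_back(text);
    Node doc;
    doc.name = "#document";
    doc.children.push_back(title);

    std::ostringstream out;
    dump(doc, 0, out);
    return out.str();
}
```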