This document describes how to perform a set of XML manipulation tasks with the 4Suite XML processing library. These tasks include parsing XML using either DOM-like or SAX-like models, querying XML or XML models using XPath, using XSLT, using XUpdate, and validating documents with RELAX NG.
3.1.1 Quick access to the Domlette reader API
3.1.2 The full Domlette reader API
3.1.3 The importance of base URIs
3.1.4 Parsing XML that's already a Unicode string
3.1.5 NonvalidatingReader
3.1.6 EntityReader Examples
3.1.7 ValidatingReader
3.1.8 NoExtDtdReader
3.1.9 Creating your own reader instance
3.1.10 InputSource objects
3.1.11 Converting from other DOM libraries
3.2.1 What about getElementsByTagName()?
3.3 Serializing Domlette nodes
3.4 Building a DOM from scratch
3.5 XPath query
3.7 Why does Domlette diverge from the DOM specification?
4 SAX
4.1 Validating a document while parsing it using SAX
4.2 Walking a DOM to fire SAX events
4.3 Building a Domlette from SAX events
4.4 Feeding a generator from SAX events
4.5 SAX filters
4.6 Streaming canonicalization
5.2 Type mappings
5.3 Advanced use
5.4 Reusing parsed XPath queries
5.5 Migration from PyXML's XPath
6.3 Example
6.4 Using Domlette objects instead of InputSources
6.6 Using xml-stylesheet processing instructions
6.7 Alternative output destinations
6.9 XSLT patterns
7.1 Extension functions (XPath and XSLT)
7.3.1 Controlling output from XSLT extensions
7.3.2 Creating result tree fragments
7.3.3 Communicating with the external code that invokes XSLT
8.1 Starting with MarkupWriter
8.5 How to insert a complete chunk
8.6 How to insert processing instructions and comments
8.7 Using namespaces
8.9 More examples
8.9.1 Writing XHTML with MarkupWriter
8.9.2 Writing information of directory listing as an XML document
8.9.3 Building a bot
11.1 About XInclude
11.2 XInclude support in 4Suite
11.3 Examples
12.1 About XPointer
12.2 XPointer support in 4Suite
12.3 Examples
13.1 Transforming DocBook using the DocBook XSL stylesheets
14 Resources
4Suite allows users to take advantage of standard XML technologies rapidly and to develop and integrate Web-based applications. It also puts practical technologies for knowledge management projects in the hands of developers. It is implemented in Python with C extensions.
At the core of 4Suite is a library of integrated tools (including convenient command-line tools) for XML processing, implementing open technologies such as DOM, SAX, XSLT, XInclude, XPointer, XLink, XPath, XUpdate, RELAX NG, and XML/SGML Catalogs.
With 4Suite, you can:
Parse a document into an efficient DOM-like structure (Domlette)
Apply XSLT to a document, whether or not it has been separately parsed
And much more. These tasks are covered in this manual.
Please see the UNIX or Windows install documents. Remember that if you are using Cygwin on Windows, you should follow the UNIX instructions.
Domlette is 4Suite's lightweight DOM implementation. It is optimized for XPath operations, speed, and relatively low memory overhead. The Domlette API is accessible through Ft.Xml.Domlette. This section describes how to parse, manipulate, and then serialize XML documents using this API.
Below, we briefly summarize the various elements of the API that form the basic life span of Domlette objects.
The Ft.Xml module contains the function Parse that gets the job done quickly. See “Quick access to the Domlette reader API” for details. For somewhat more advanced parsing, you will need a combination of the reader instances in the Ft.Xml.Domlette module and Ft.Xml.CreateInputSource for constructing InputSource instances. In rare cases you might need lower-level APIs in the Ft.Xml.InputSource module. Read “The full Domlette reader API” if Ft.Xml.Parse isn't enough.
The Domlette API for interacting with XML documents—accessible as methods of the various Domlette objects—is similar to the DOM Level 2 specification. See “Domlette API summary” for more information.
The Ft.Xml.Domlette module provides two functions, Print and PrettyPrint, for writing your XML documents. The Print function writes the XML document precisely as given in the model. On the other hand, the PrettyPrint function adds whitespace nodes to your document to try to indent the resulting output nicely. See “Serializing Domlette nodes” for details.
We begin our discussion of the Domlette API by describing how to obtain a model of your XML documents to manipulate further. Because XML documents offer such rich functionality and exist in such varied environments, there can be a surprising amount of work that you must do to simply load your XML documents. We begin by providing a short-cut for easy access. We will then dive into the full suite of document loading utilities.
For basic document manipulations or to get started quickly, the Ft.Xml module offers a quick way to parse XML documents and directly obtain access to the Domlette interface to those documents. Within this module the function of interest is Parse.
This function will get you started quickly because it specifically chooses some default values for some of the more advanced parsing features. If you are passing in a string or stream, and the material in “The importance of base URIs” applies to your parsing situation, then you will want to use the full-featured API. In brief, if your XML document references external resources, you should not use this convenience function. See “The full Domlette reader API” instead.
This function returns a Domlette Document representing the root of the document from the argument.
Parse(source): The Parse function takes a single argument, which is a byte string (not unicode object), file-like object (stream), file path or URI.
XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""
from Ft.Xml import Parse
doc = Parse(XML)
# If the above XML document were located in the file
# "target.xml", we could have used `Parse("target.xml")`.
print doc.xpath('string(ham//em[1])')
You create Domlette instances by parsing XML documents with the reader system. For general use, the Ft.Xml.Domlette package contains instances of the different reader classes that can be used directly after you import them. These instances include NonvalidatingReader and ValidatingReader, which provide non-validating parsing and validating parsing services, respectively. The validation in this case refers to DTD validation. For RELAX NG validation, see “Validation using RELAX NG”. All the reader classes (and, hence, their bundled instances) are described in later sections. After you have obtained one of these reader instances, you feed your XML document entity's byte stream to the reader. We summarize the available reader methods below.
parseUri(uri): The parseUri method takes a single argument; this uri argument is the absolute URI of the document entity to parse. The URI will be dereferenced by the default resolver.
parseString(st, uri): The parseString method takes two arguments; st is the XML document entity in the form of an encoded Python string (not a Unicode string). See the next section for details on the uri argument.
parseStream(stream, uri): The parseStream method takes two arguments; stream is a Python file-like object that can supply the document entity's bytes via read() calls. See the next section for details on the uri argument.
parse(inputSource): The parse method takes a single argument; inputSource is an Ft.Xml.InputSource.InputSource object, described in “InputSource objects”.
The next two sections cover some of the issues that you should understand before using these functions. Then we start seeing some examples in “NonvalidatingReader”.
In the first 3 methods listed in the previous section, the uri argument is the URI of the document entity that you are feeding to the parser. It is a very important—but often overlooked—concept in document processing.
The URI gives the document entity a unique identifier that can be used to refer to the document as a whole. Also, each Domlette node derived from a particular entity inherits that entity's URI as the node's baseURI property, unless an alternative base URI was indicated, such as with xml:base, or if part of the document was loaded as an external entity or XInclude.
The document's URI is also used as the "base URI" for resolving any relative URI references that may appear within the document itself. Relative URI references may occur in a document in places like:
<!DOCTYPE> or <!ENTITY>, immediately following the keyword SYSTEM
<xsl:import> and <xsl:include>, in the value of the href attribute
<xi:include>, in the value of the href attribute
<exsl:document>, in the value of the href attribute
the arguments to XSLT's document() function
It is a common misconception that relative URI references in a document's content are considered to be relative to the processor's current working directory. They are actually resolved relative to the URI of the document that contains the relative URI reference (more specifically, relative to the URI of the entity in which the reference occurs, keeping in mind that a document may be comprised of multiple entities, i.e., separate files).
In all cases, the document URI that you supply in the reader API must be "absolute", which means that it has a scheme, e.g. "http://spam/eggs.xml", not just "/spam/eggs.xml" or "eggs.xml".
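The resolution rule described above can be tried out with Python's standard library alone. The following is a minimal sketch, assuming Python 3; it does not use 4Suite, and the helper names is_absolute and resolve_reference are hypothetical, introduced only for illustration.

```python
from urllib.parse import urljoin, urlsplit


def is_absolute(uri):
    """Return True if the URI carries a scheme such as http: or urn:."""
    return bool(urlsplit(uri).scheme)


def resolve_reference(base_uri, reference):
    """Resolve a relative URI reference against an absolute base URI."""
    if not is_absolute(base_uri):
        raise ValueError("base URI must be absolute: %r" % (base_uri,))
    return urljoin(base_uri, reference)


# A reference found inside http://foo/test/spam.xml is resolved against
# that document's URI, not against the current working directory.
resolved = resolve_reference("http://foo/test/spam.xml", "eggs.xml")
print(resolved)
```

The check in resolve_reference mirrors the requirement that the document URI be absolute: "/spam/eggs.xml" and "eggs.xml" have no scheme and are rejected as base URIs.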
If you know there are not going to be any relative URI references to resolve during initial parsing or during processing of the Domlette by other tools, then you can safely omit the argument, or, preferably, supply a dummy URI like "urn:dummy" or "http://spam/eggs.xml". If you choose to omit URI arguments from APIs that need them, you may get a Python warning, and a random URI—which is probably not what you want—will be assigned.
If you've understood all this and yet you want to just go ahead and not specify a base URI, you may have to turn off the likely warnings. You can do so with code such as in the following example.
import Ft.Xml.Domlette
import warnings
def disable_warnings(*args): pass
warnings.filterwarnings("ignore", category=Warning)
warnings.showwarning = disable_warnings
XML = "<spam/>"
doc = Ft.Xml.Domlette.NonvalidatingReader.parseString(XML)
Ft.Xml.Domlette.Print(doc)
In such a case you can also use the convenience function Ft.Xml.Parse (see above).
Because 4Suite is trying to provide as thin a wrapper as possible to the underlying parser, and due to complexities in the APIs of these parsers, there is no API in 4Suite for parsing Python's Unicode strings.
If your XML is in the form of a Unicode string, you must encode the string as bytes so that the underlying parser can read it. Once you have an encoded string, you can pass it to the reader's parseString(), or wrap it in an InputSource using Ft.Xml.CreateInputSource, or the fromString() method of an InputSourceFactory. If the string is not UTF-16 or UTF-8 encoded, then you must tell the reader what encoding it actually uses. You can do this either by writing or replacing the XML declaration in the string itself, or (much easier) setting the optional encoding keyword argument in the reader's parseString() method or the InputSourceFactory's fromString() method. For an example, see the Akara article on external encoding declarations.
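The encode-then-parse step can be sketched with the standard library's xml.dom.minidom standing in for the 4Suite reader (an assumption made so the example runs without 4Suite, under Python 3). It shows the first of the two options above, writing an XML declaration into the string; the encoding keyword argument of the 4Suite methods is not exercised here.

```python
import xml.dom.minidom

text = u"<spam>caf\u00e9</spam>"  # XML held as a Unicode string

# UTF-8 needs no declaration: XML parsers assume it by default.
utf8_bytes = text.encode("utf-8")

# Any other encoding must be announced, here by prepending an XML
# declaration before encoding the string to bytes.
declaration = u'<?xml version="1.0" encoding="iso-8859-1"?>'
latin1_bytes = (declaration + text).encode("iso-8859-1")

doc_utf8 = xml.dom.minidom.parseString(utf8_bytes)
doc_latin1 = xml.dom.minidom.parseString(latin1_bytes)

# Both byte strings decode back to the same character data.
content_utf8 = doc_utf8.documentElement.firstChild.data
content_latin1 = doc_latin1.documentElement.firstChild.data
```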
Use NonvalidatingReader for basic parsing. NonvalidatingReader performs its parsing without validating against a DTD.
The following example will parse an XML source taken from the supplied URI, which is treated as a URL by the default resolver.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("http://www.w3.org/2000/08/w3c-synd/home.rss")
The following example also parses an XML source taken from the supplied URI, which is treated as a URL. In this case, the default resolver tries to read the XML source from the filesystem.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("file:///tmp/spam.xml")
The following example parses XML from the filesystem. When given a relative file path in the local OS's format, we must first convert that path to a URI that our reader objects can use.
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Lib import Uri
file_uri = Uri.OsPathToUri('spam.xml')
doc = NonvalidatingReader.parseUri(file_uri)
The following example parses XML from a string. Note that it does not provide a document/base URI.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs</spam>")
In the following example, we are parsing XML from a string in a case where the document does need a base URI to be specified.
from Ft.Xml.Domlette import NonvalidatingReader
s = """<!DOCTYPE spam [ <!ENTITY eggs SYSTEM "eggs.xml"> ]>
<spam>&eggs;</spam>"""
doc = NonvalidatingReader.parseString(s, 'http://foo/test/spam.xml')
# eggs is declared with SYSTEM, so it is an external entity: during parsing,
# the replacement text for &eggs; will be obtained from
# http://foo/test/eggs.xml
In all of the above examples, doc is now a Domlette node object. 4Suite currently offers one Domlette implementation, written in C, called cDomlette.
Sometimes you need to parse a fragment of XML rather than the full document. If operating in non-validating mode is sufficient, Domlette has a reader that can handle this case. When parsing such a fragment, EntityReader returns a Domlette document fragment rather than a document object.
from Ft.Xml.Domlette import EntityReader
s = """
<spam1>eggs</spam1>
<spam2>more eggs</spam2>
"""
docfrag = EntityReader.parseString(s, 'http://foo/test/spam.xml')
The content parsed by EntityReader must be an XML External Parsed Entity. This means that it can't be just any XML document. The main limitation is that it must not have a document type declaration.
If you want to validate a document with a DTD as you parse it, use the ValidatingReader object instead. If ValidatingReader discovers that the document that it is currently parsing is invalid, then it throws a Ft.Xml.ReaderException and does not finish parsing the document. The following example illustrates these concepts.
# ValidatingReader is a global instance
from Ft.Xml.Domlette import ValidatingReader
XML = """<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/></a>"""
doc = ValidatingReader.parseString(XML, "urn:x-example:valid-a")
# And of course, as with other readers, you can use `parse`, `parseUri`, and
# `parseStream` as well.
# The following document, however, is invalid because an `a` element can only
# have two `b` children according to its DTD.
XML = """<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>"""
# This throws a `Ft.Xml.ReaderException` when it encounters invalid structure,
# and does not finish parsing the document into `doc`.
doc = ValidatingReader.parseString(XML, "urn:x-example:invalid-a")
When using NonvalidatingReader to parse a document, that document's DTD is still opened and read to obtain information such as entity declarations and default attribute values. You cannot suppress reading of the internal DTD subset, but you can prevent the external subset from being accessed by using NoExtDtdReader. This won't affect the processing of external parameter entities defined in the internal DTD subset. Use this object as you would use NonvalidatingReader.
In some cases you might not want to use the global reader instances. For instance in multithreaded use, you might want a reader per thread. Or you might want to change some of the parameters on the readers. If so, you can create your own reader instance:
from Ft.Xml.Domlette import NonvalidatingReaderBase
reader = NonvalidatingReaderBase()
doc = reader.parseUri("http://xmlhack.com/read.php?item=1560")
Instead of NonvalidatingReaderBase, you could use NoExtDtdReaderBase or ValidatingReaderBase, depending on your needs. Each of these 3 readers takes an optional inputSourceFactory constructor argument, which you can use to supply a custom URI resolver.
All of the previous examples involve parsing URIs or strings of data. You can also handle InputSource objects. An InputSource is an object that encapsulates a source of encoded text for parsing, and a URI resolver. The advantage to using an InputSource is that it provides a standard API to the text stream, and—perhaps more importantly—allows you to associate a custom URI resolver with the stream.
Normally, you can just get an InputSource by calling the convenience function Ft.Xml.CreateInputSource with a single argument, which is a string (not Unicode object), file-like object (stream), file path or URI. You can then pass the InputSource object to the reader's parse() method, as in the following example.
from Ft.Xml import InputSource, CreateInputSource
from Ft.Xml.Domlette import NonvalidatingReader
#
# Use CreateInputSource to parse a URL:
#
isrc = CreateInputSource("http://xmlhack.com/read.php?item=1560")
doc1 = NonvalidatingReader.parse(isrc)
#
# Or a string:
#
isrc = CreateInputSource("<spam>eggs</spam>", "http://spam.com/base")
doc2 = NonvalidatingReader.parse(isrc)
#
# InputSource is a file-like object, so you can treat it as such:
#
isrc = CreateInputSource("http://xmlhack.com/read.php?item=1560")
raw_text = isrc.read()
#
# The uri/system ID you used for it is maintained
#
print isrc.uri
#
# You can also create other InputSources from URIs relative to this one
#
isrc2 = isrc.resolve("read.php?item=1703")
If you need lower-level control you can use an InputSourceFactory instance, calling the appropriate method: fromUri(uri), fromString(st), or fromStream(stream), much like the reader API described earlier. The following listing is functionally equivalent to the above one.
from Ft.Xml import InputSource
from Ft.Xml.Domlette import NonvalidatingReader
factory = InputSource.DefaultFactory
isrc = factory.fromUri("http://xmlhack.com/read.php?item=1560")
doc1 = NonvalidatingReader.parse(isrc)
#
# The factory is reusable. Here we also parse a string:
#
isrc = factory.fromString("<spam>eggs</spam>", "http://spam.com/base")
doc2 = NonvalidatingReader.parse(isrc)
#
# InputSource is a file-like object, so you can treat it as such:
#
isrc = factory.fromUri("http://xmlhack.com/read.php?item=1560")
raw_text = isrc.read()
#
# The uri/system ID you used for it is maintained
#
print isrc.uri
#
# You can also create other InputSources from URIs relative to this one
#
isrc2 = isrc.resolve("read.php?item=1703")
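To make the idea behind InputSource concrete, here is a toy model of an object that bundles a byte stream, its URI, and a resolver. It is a hypothetical sketch in standard-library Python 3, not the 4Suite implementation: the class SketchInputSource, the FAKE_WEB dictionary, and fake_opener are all invented for illustration.

```python
import io
from urllib.parse import urljoin


class SketchInputSource(object):
    """Toy stand-in: a byte stream, its URI, and a way to open related URIs."""

    def __init__(self, stream, uri, opener):
        self.stream = stream
        self.uri = uri
        self._opener = opener  # callable mapping an absolute URI to a stream

    def read(self, size=-1):
        # Delegating read() is what makes the object file-like.
        return self.stream.read(size)

    def resolve(self, reference):
        # Relative references are resolved against this source's own URI,
        # and the new source inherits the same resolver.
        new_uri = urljoin(self.uri, reference)
        return SketchInputSource(self._opener(new_uri), new_uri, self._opener)


# A fake resolver backed by a dictionary instead of the network.
FAKE_WEB = {
    "http://example.org/docs/spam.xml": b"<spam>&eggs;</spam>",
    "http://example.org/docs/eggs.xml": b"<eggs/>",
}


def fake_opener(uri):
    return io.BytesIO(FAKE_WEB[uri])


isrc = SketchInputSource(fake_opener("http://example.org/docs/spam.xml"),
                         "http://example.org/docs/spam.xml", fake_opener)
isrc2 = isrc.resolve("eggs.xml")
```

Keeping the resolver on the source object is what lets every related entity be fetched through the same custom logic, which is the advantage the text attributes to InputSource.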
You can convert another Python DOM object (e.g. 4DOM or minidom) to a Domlette object using the function ConvertDocument:
from Ft.Xml.Domlette import ConvertDocument
converted_document = ConvertDocument(oldDocument, documentURI=u'http://www.example.org/')
The documentURI parameter provides a base URI for the converted nodes. If it is not specified, the attributes documentURI and then baseURI are checked in the source DOM, as defined in DOM Level 3. If no URI is found in this way, a warning is issued and a UUID URI is generated for the new Domlette object.
You will use a large part of the Domlette API to interact with the model of your XML documents. The implementation of this part of the API is found in the Ft.Xml.cDomlette module. This part of the API allows you to navigate around a document and modify the content of that document. It is very similar to the DOM Level 2 specification and follows some of the DOM Level 3 specification; feel free to refer to those specifications and the 4Suite API documentation for details about the intended behavior of this API. You can find brief descriptions of the methods and attributes provided by this API listed below. This API is also nearly the same as the API for xml.dom, which is bundled with Python. The node type constants are inherited directly from xml.dom.Node.
Many objects that you will work with in the Domlette API are instances of classes derived from the Domlette Node class. Documents, document fragments (of class DocumentFragment), Elements, attributes (class Attr), text (class Text), processing instructions (class ProcessingInstruction), and comments (class Comment) are all nodes; any node operations are defined on objects of these types, as well. Some operations do not make sense on some objects, however. For example, it does not make sense to add children to an attribute node.
In the DOM model of XML documents, there is a Document node which represents the starting point for the other pieces of the document. This node is not the root element of the document; rather, the Document node contains the root element as its only element child. The Document node may have other children, though, such as processing instructions and comments.
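Because the Domlette API is described as nearly the same as Python's bundled xml.dom API, the relationship between the Document node and the root element can be illustrated with xml.dom.minidom. This sketch assumes Python 3 and uses minidom as a stand-in; Domlette objects are not involved.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    "<?xml-stylesheet href='style.xsl' type='text/xsl'?>"
    "<!--a comment before the root element-->"
    "<ham><eggs/></ham>")

# The Document node is the parent of everything at the top level:
# a processing instruction, a comment, and the single root element.
top_level_types = [child.nodeType for child in doc.childNodes]
element_children = [child for child in doc.childNodes
                    if child.nodeType == Node.ELEMENT_NODE]
root = doc.documentElement
```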
You can easily access properties of a node directly. The following properties are available on any node. These properties generally store information about the structure of the document in the near "vicinity" of the target node.
Properties available on every Node object
attributes: This is a Python dictionary containing the attributes defined on the target node. The key for the dictionary is a tuple containing the namespace and local name of the attribute. The value associated with this attribute name tuple is the attribute (of class Attr) itself.
node = Parse("<foo a='1'/>")
print node.childNodes[0].attributes
{(None, u'a'): <Attr at 0x40870ecc: name u'a', value u'1'>}
baseURI: This is the base URI in scope for the target node as a Python unicode string. It is read-only and is computed dynamically according to DOM L3 Core.
childNodes: This is the Python list of all the node children of the target node. Note that in DOM terminology, the attributes of a node are not children of that node.
node = Parse("<foo a='1'/>")
print node.childNodes
[<Element at 0x4086052c: name u'foo', 1 attributes, 0 children>]
firstChild: This is the first child node of the target node. This is equivalent to childNodes[0], and is a useful property for quickly walking the document tree.
node = Parse("<foo a='1'/>")
print node.firstChild
<Element at 0x40860a6c: name u'foo', 1 attributes, 0 children>
lastChild: This is the last child node of the target node. This is equivalent to childNodes[-1].
node = Parse("<foo a='1'/><!--Hi!-->")
print node.lastChild
<Comment at 0x4087caf4: u'Hi!'>
localName: This is the local name of the target node as a Python unicode string.
namespaceURI: This is the namespace URI of the target node as a Python unicode string.
nextSibling: This is the node immediately following the target node, or None if the target node is the last child of its parent (or if the target node is an attribute, as attributes are unordered).
nodeValue: This is the value of the target node as a Python unicode string, if the target node has a string value. If not, this is None. To illustrate some of the possibilities, attributes and text nodes have values, while elements and documents do not.
ownerDocument: This is the Document node in which the target node is contained.
parentNode: This is the parent of the target node. If the target node is a Document node, then this will be None; Document nodes do not have parents.
prefix: This is the namespace prefix of the current node, or None if the current node does not (or cannot) have a namespace prefix.
previousSibling: This is the node immediately preceding the target node, or None if the target node is the first child of its parent (or if the target node is an attribute, as attributes are unordered).
rootNode: This is a synonym for ownerDocument.
xmlBase: This is a synonym for baseURI.
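The navigation properties above can be exercised with xml.dom.minidom, used here as a stand-in under the assumption that its standard DOM properties behave like their Domlette counterparts. Two differences to keep in mind: minidom exposes attributes as a NamedNodeMap rather than a dictionary keyed by (namespace, local name) tuples, and the Domlette-specific synonyms are not available in minidom. The sketch assumes Python 3.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<ham><eggs n='1'/>some text<em>emphasis</em></ham>")
ham = doc.documentElement

# Walk the children of <ham> from first to last via nextSibling.
names = []
node = ham.firstChild
while node is not None:
    names.append(node.nodeName)
    node = node.nextSibling

# The text node sits between <eggs/> and <em>.
text = ham.lastChild.previousSibling
```

Text nodes carry a nodeValue while elements do not, and parentNode climbs from the root element to the Document node, whose own parentNode is None.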
In addition to accessing the structure relative to a node, there is also a set of operations that we can perform on these structures, including a variety of operations for modifying the document. Some of these methods allow you to add new nodes in various places; note that in the DOM, only Document nodes can create new nodes. See “Methods available to Document objects” for details. The following methods are available on any node.
Methods available to every Node object
appendChild(node): This method adds node as the last child of the current instance. This is useful for manually building a document in breadth-first document order.
insertBefore(newChild, refChild): This method adds the node newChild to the current instance immediately before child node refChild.
replaceChild(newChild, oldChild): This method replaces the child node oldChild with the newChild node.
removeChild(oldChild): This method removes the oldChild node as a child of the instance node.
cloneNode(deep): This method returns a new copy of the current instance. If (and only if) deep is true, then we copy deeply: the node's attributes and children are also copied deeply.
isSameNode(otherNode): This method determines whether the instance node and otherNode are the same node based upon object identity.
normalize(): This method merges any adjacent text nodes in the attributes or descendants of the current instance.
hasChildNodes(): This method returns true if and only if the instance node has any child nodes.
xpath(expr, explicitNss): This method evaluates the XPath expression expr with the current instance as the expression context and returns an appropriately-valued result. The explicitNss parameter is optional; it is a Python dictionary mapping namespace prefixes to namespaces for use in the expression. See “XPath queries” for details.
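The tree-editing methods listed above are standard DOM operations, so their effect can be shown with xml.dom.minidom as a stand-in for Domlette (an assumption; the Domlette-specific xpath() method has no minidom equivalent and is left out). The sketch assumes Python 3.

```python
from xml.dom import minidom

doc = minidom.parseString("<list><b/></list>")
root = doc.documentElement
b = root.firstChild

a = doc.createElementNS(None, "a")
c = doc.createElementNS(None, "c")
d = doc.createElementNS(None, "d")

root.insertBefore(a, b)   # <list><a/><b/></list>
root.appendChild(c)       # <list><a/><b/><c/></list>
root.replaceChild(d, b)   # <list><a/><d/><c/></list>
root.removeChild(c)       # <list><a/><d/></list>
after_edits = root.toxml()

# Two adjacent text nodes collapse into one after normalize().
root.appendChild(doc.createTextNode("spam "))
root.appendChild(doc.createTextNode("eggs"))
root.normalize()

# A deep clone copies the whole subtree but is not attached to any parent.
copy = root.cloneNode(True)
```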
In addition to their behavior as nodes, Document nodes are uniquely responsible for a number of tasks. For example, only Document nodes can create other nodes. The following methods are available only to Document nodes.
Methods available to Document objects
createElementNS(namespaceURI, qualifiedName): This method creates and returns a new Element with the given namespace URI and qualified name.
createAttributeNS(namespaceURI, qualifiedName): This method creates and returns a new attribute (Attr object) with the given namespace URI and qualified name.
createTextNode(data): This method creates and returns a new Text node with the string value of data.
createProcessingInstruction(target, data): This method creates and returns a new processing instruction (ProcessingInstruction object) with the given target name and contents taken from data.
createComment(data): This method creates and returns a new Comment with the string value of data.
createDocumentFragment(): This method creates and returns a new, empty document fragment (DocumentFragment object).
importNode(importedNode, deep): Nodes can only belong to one document at a time. This method creates a copy of the node importedNode that belongs to the instance (but which does not yet have a parent). If (and only if) deep is true, then we copy deeply: the node's attributes and children are also copied deeply and imported.
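The two-step pattern implied above (the Document creates a node, then some parent attaches it) and the role of importNode can be sketched with xml.dom.minidom, which offers the same standard method names. This is a stand-in written for Python 3 and does not involve Domlette objects.

```python
from xml.dom import minidom

target = minidom.parseString("<ham/>")
root = target.documentElement

# Only the Document creates nodes; they are attached in a second step.
root.appendChild(target.createComment("made by hand"))
root.appendChild(target.createTextNode("spam"))
root.appendChild(target.createProcessingInstruction("app", "flag='1'"))

# A node from another document is copied into this one with importNode
# before being attached; the original stays in its own document.
source = minidom.parseString("<x><y/></x>")
imported = target.importNode(source.documentElement, True)
root.appendChild(imported)
result = root.toxml()
```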
Document nodes also have a number of properties that are not found on other nodes. These properties are summarized in the following list.
Properties available on Document objects
doctype: This is a DocumentType object that encapsulates info about the document's "type", as described in its DOCTYPE tag. In Domlette, which doesn't use such objects, the value of the doctype property will always be None.
documentElement: This is the root element of the document.
documentURI: This is the URI that identifies the document.
implementation: This is the DOMImplementation that created the document.
publicId: This Domlette-specific property is the public ID of the DTD of this document.
rootNode: This refers to the current instance.
systemId: This Domlette-specific property is the system ID of the DTD of this document.
unparsedEntities: This is the list of unparsed entities in the current document.
Attributes (Attr objects) do not have any special methods, but they do have a few additional properties. These properties are summarized in the following list.
Properties available on Attr objects
name: This is the qualified name of the current instance.
nodeName: This is a synonym for the name property.
ownerElement: This is a synonym for the parentNode property.
specified: You will probably never need this property. It is always 1. DOM says it should be 0 if the attribute is present through defaulting, rather than explicitly specified in the document. This is only possible if the DOM implementation preserves certain details from DTD processing, which 4Suite never does. Therefore the value is always 1.
value: This is a synonym for the nodeValue property.
Since attributes can only be attached to elements, Element objects have a set of special methods for managing which attributes are attached to them. We describe these methods below.
Methods available to Element objects
hasAttributeNS(namespaceURI, localName): This method returns true if the current instance has an attribute with the given namespace URI and local name, and false otherwise.
getAttributeNS(namespaceURI, localName): This method returns the attribute value of the attribute with the given namespace URI and local name, if one exists. If not, this returns None.
getAttributeNodeNS(namespaceURI, localName): This method returns the Attr object of the attribute with the given namespace URI and local name, if one exists. If not, this returns None.
removeAttributeNS(namespaceURI, localName): This method removes the attribute with the given namespace URI and local name from the current instance element.
removeAttributeNode(node): This method removes the attribute node from the current instance element.
setAttributeNS(namespaceURI, qualifiedName, value): This method adds an attribute or replaces an attribute with the specified namespace URI and qualified name and sets the content of that attribute to value.
setAttributeNodeNS(node): This method adds or replaces an attribute using the Attr object node.
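The namespace-aware attribute methods can be tried with xml.dom.minidom, which uses the same method names. Treat this as a stand-in sketch for Python 3 rather than Domlette code, and note one difference: minidom's getAttributeNS returns an empty string for a missing attribute, whereas the description above says Domlette returns None.

```python
from xml.dom import minidom

XLINK_NS = "http://www.w3.org/1999/xlink"

doc = minidom.parseString("<a/>")
element = doc.documentElement

# Setters take a qualified name (prefix included)...
element.setAttributeNS(None, "id", "n1")
element.setAttributeNS(XLINK_NS, "xlink:href", "eggs.xml")

# ...while lookups take the namespace URI and the local name.
has_href = element.hasAttributeNS(XLINK_NS, "href")
href_value = element.getAttributeNS(XLINK_NS, "href")
href_node = element.getAttributeNodeNS(XLINK_NS, "href")

element.removeAttributeNS(None, "id")
has_id = element.hasAttributeNS(None, "id")
missing_node = element.getAttributeNodeNS(None, "id")
```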
Elements also have several properties above and beyond what they get from being Nodes. See the list below for details.
Properties available on Element objects
nodeName: This is the qualified name of the current instance.
tagName: This is a synonym for nodeName.
Both Text and Comment nodes are also more general CharacterData nodes in the DOM. CharacterData nodes have several additional properties and methods for managing the string data that they contain. The individual Text and Comment nodes, however, do not add any functionality to their general CharacterData parent class. You can find descriptions of the properties and methods offered by CharacterData objects below.
Properties available on CharacterData objects
data: This is the string content of the current instance.
length: This is the length of the string content of the current instance.
nodeValue: This is a synonym for data.
Methods available to CharacterData objects
insertData(offset, data): This method inserts the string data into the content of the current instance at the index specified by offset.
appendData(data): This method appends the string data to the end of the value of the current instance.
replaceData(offset, count, data): This method replaces count number of characters found at index offset in the current instance with the string data.
substringData(offset, count): This method retrieves and returns the part of the string value of the current instance that begins at index offset and extends count characters.
deleteData(offset, count): This method deletes the part of the string value of the current instance that begins at index offset and extends count characters.
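The offset and count arguments are easiest to follow on a concrete string. The sketch below uses a Text node from xml.dom.minidom, which implements the same standard CharacterData methods; it is a Python 3 stand-in and not Domlette itself.

```python
from xml.dom import minidom

doc = minidom.parseString("<p>hello world</p>")
text = doc.documentElement.firstChild

text.insertData(5, ",")            # "hello, world"
text.appendData("!")               # "hello, world!"
text.replaceData(7, 5, "there")    # "hello, there!"
greeting = text.substringData(0, 5)
text.deleteData(5, 1)              # "hello there!"

final_data = text.data
final_length = text.length
```

substringData only reads, so the content is unchanged by that call; every other method here edits the node in place.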
A few DOM actions are not "owned" by any individual document. In effect, they are general-purpose operations. They can be found in DOMImplementation objects. One such precreated instance can be conveniently found at and used from Ft.Xml.Domlette.implementation. The general methods that such a DOMImplementation object offers are listed below.
DOMImplementation methods:
createDocument(namespaceURI, qualifiedName, doctype): This standard DOM method creates and returns a Document object associated with the given DocumentType object, and having a single element child with the given QName and namespace. Since Domlette does not use DocumentType objects, the doctype argument must be given as None.
createRootNode(documentURI): This Domlette-specific method creates a Document object with the specified document (base) URI. No document element is created. This method is generally preferred over createDocument(); see the following section, 'Building a DOM from scratch'.
hasFeature(feature, version): This method tests whether the DOM implementation implements a specific feature.
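The standard half of this interface, createDocument and hasFeature, also exists in xml.dom.minidom, so it can be sketched without 4Suite. The Domlette-specific createRootNode has no minidom counterpart and is not shown. The example assumes Python 3 and is a stand-in only.

```python
from xml.dom import minidom

implementation = minidom.getDOMImplementation()

# doctype is passed as None, which is also what Domlette requires.
doc = implementation.createDocument(None, "ham", None)
root = doc.documentElement

# Ask the implementation whether it supports DOM Level 2 Core.
supports_core = implementation.hasFeature("Core", "2.0")
```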
The getElementsByTagName() method isn't supported, because there are better options. In particular, you can just use XPath:
doc.xpath(u"//tagname")
For more possibilities, see getElementsByTagName Alternatives.
Domlette comes with a couple of very fast printer functions which also go to great pains to correctly handle character encoding issues: Print and PrettyPrint. Here are some serialization examples using the Domlette printers, given a node 'node' (it doesn't have to be a document node).
from Ft.Xml.Domlette import Print, PrettyPrint
# basic serialization to sys.stdout
Print(node)
# ... with extra whitespace (indenting)
PrettyPrint(node)
# ... using a single tab, rather than 2 spaces, to indent at each level
PrettyPrint(node, indent='\t')
# serializing to a utf-8 encoded file
f = open('output.xml','w')
Print(node, stream=f)
f.close()
# ... to an iso-8859-1 encoded file
f = open('output.xml','w')
Print(node, stream=f, encoding='iso-8859-1')
f.close()
# ... to an ascii encoded string
import cStringIO
buf = cStringIO.StringIO()
Print(node, stream=buf, encoding='us-ascii')
# read the value before closing the buffer; getvalue() fails on a closed buffer
s = buf.getvalue()
buf.close()
# Normally, output syntax (XML or HTML) is chosen based on the DOM type,
# which is automatically detected. A Domlette or XML DOM can be output in
# HTML syntax if the asHtml=1 argument is given.
PrettyPrint(node, asHtml=1)
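The encoding behaviour described above can be observed without 4Suite by using the standard library's xml.dom.minidom serializer. This is a stand-in chosen for the sketch, not the Domlette printers themselves, and it is written for Python 3. It shows the same three ideas: compact output, indented output, and output re-encoded for a target character set.

```python
from xml.dom import minidom

# A document containing one non-ASCII character (e with acute accent).
doc = minidom.parseString("<spam>caf\u00e9<a/></spam>")
node = doc.documentElement

compact = node.toxml()                  # no added whitespace
pretty = node.toprettyxml(indent="\t")  # one tab per nesting level

# With an explicit encoding, minidom returns bytes and writes an XML declaration.
latin1_bytes = doc.toxml(encoding="iso-8859-1")
# Characters that the target encoding cannot represent become character references.
ascii_bytes = doc.toxml(encoding="us-ascii")
```

The last call illustrates why a serializer has to know the output encoding: the accented character is written as a raw byte in ISO-8859-1 but as a numeric character reference in US-ASCII.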
As an alternative to parsing a preexisting XML document, you can also build a document model, with certain limitations, from the ground up. W3C and Python DOM facilities for doing this are intended mainly for creating a temporary document whose nodes will be imported into an existing document, and while Domlette does offer a more convenient document creation method, it has many of the same limitations. However, for most documents, its capabilities should be sufficient.
The Ft.Xml.Domlette module contains a DOMImplementation instance named implementation which provides a set of methods for initializing new Documents. The implementation.createRootNode method takes a base URI argument and provides a natural approach for creating an XPath model root node. This is similar to the DOM idea of a document node and even closer to a DOM document fragment (multiple element children are allowed). The implementation.createDocument method, on the other hand, is designed to come close to the DOM interface, although its doctype argument must be None.
doc = implementation.createRootNode('file:///article.xml')
is the equivalent of
from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createDocument(EMPTY_NAMESPACE, None, None)
with the added advantage of doc.baseURI being set to 'file:///article.xml', which is not possible to set via standard DOM interfaces (the baseURI attribute is read-only).
Similarly,
from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createRootNode('file:///article.xml')
docelement = doc.createElementNS(EMPTY_NAMESPACE, 'article')
doc.appendChild(docelement)
is the equivalent of
from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createDocument(EMPTY_NAMESPACE, 'article', None)
plus doc.baseURI being set to 'file:///article.xml'.
If you want as much fidelity to the DOM API as Domlette offers, use implementation.createDocument. If you just want to create a document or other such root-level node, and never mind the strange parameters, use implementation.createRootNode.
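For comparison, here is the standard-DOM route (createDocument followed by node creation and appendChild) written against Python's built-in xml.dom.minidom. This is only a sketch under the assumption that minidom is an acceptable stand-in; minidom has no createRootNode method and no settable base URI, which is exactly the gap the Domlette-specific method fills. The element and attribute names are made up for illustration.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

# Standard DOM: the document element is created together with the document.
# The first argument is the namespace URI (None means no namespace),
# and the doctype argument is None, as in the Domlette examples.
doc = impl.createDocument(None, "article", None)
article = doc.documentElement
article.setAttributeNS(None, "id", "a1")

# Build a child element with some text and attach it.
title = doc.createElementNS(None, "title")
title.appendChild(doc.createTextNode("Building a DOM from scratch"))
article.appendChild(title)

xml_text = article.toxml()
```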
You can easily perform XPath queries by using the xpath method of cDomlette nodes as follows:
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
print doc.xpath(u'//a')
print doc.xpath(u'string(/spam)')
Notice: this is nothing like W3C DOM's XPath query module. The emphasis, as usual with Domlette, is on speed, simplicity and pythonic-ness.
The API, in brief:
node.xpath(expr[, explicitNss])
node - will be used as core of the context for evaluating the XPath
expr - XPath expression in string or compiled form
explicitNss - (optional) any additional or overriding namespace mappings in the form of a dictionary that maps prefixes to namespace URIs. The base namespace mappings are taken from in-scope declarations on the given node. This explicit dictionary is superimposed on the base mappings.
For additional details, see “XPath queries”.
For some users, always specifying a base URI feels like an inconvenience. Perhaps they always generate XML sources from text or streams without naturally associated URIs, and they have to figure out schemes to come up with base URIs for the parse. But there is good reason for this pickiness. Just ask one of the users who got bitten by carelessness with base URIs in practice. It's better to always put some amount of thought into base URIs when processing XML, and 4Suite encourages this.
Note that 4Suite only enforces the requirement for base URIs in cases where they are needed to make sense of a requested operation. Your document must have a valid base URI if you use external entities, XInclude, xsl:import, xsl:include, the XSLT document() function, the EXSLT exsl:document element, or any other operations that require access to an external resource. If your main use for URI resolution is XSLT import and includes, you can avoid having to give valid base URIs by using XSLT include paths.
A valid base URI starts with a scheme, such as http:. A simple name, such as "spam" is a valid relative URI reference, but not a valid base URI. Without a base URI, a relative reference is no more useful than an apartment number given without the address of the entire apartment building. Merging a base URI with a relative reference is a string operation that is undertaken in a standard manner, and is generally only useful when the base URI is hierarchical; that is, it is a URL using one of the common schemes that have slashes as path separators (e.g., http:, ftp:, gopher:, and most file: URLs). The built-in 4Suite URI resolver Ft.Lib.Uri.BASIC_RESOLVER knows how to perform such resolution.
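The merging of a base URI with a relative reference can be illustrated with the standard library's urllib.parse, used here in place of Ft.Lib.Uri.BASIC_RESOLVER so that the sketch runs on Python 3 without 4Suite. The helper has_scheme is a hypothetical name introduced for this example, and it is only a rough check, not a full URI validator.

```python
from urllib.parse import urljoin, urlsplit

def has_scheme(uri):
    """Rough check (hypothetical helper): does the string start with a URI scheme?"""
    return bool(urlsplit(uri).scheme)

base = "http://example.org/docs/article.xml"

# A relative reference is resolved against the hierarchical path of the base.
sibling = urljoin(base, "spam")
parent_dir = urljoin(base, "../images/fig1.png")

# Without a base, the relative reference stays relative and cannot be fetched.
unresolved = urljoin("", "spam")
```

Here has_scheme(base) is true while has_scheme("spam") is false, which mirrors the distinction between a usable base URI and a bare relative reference.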
Domlette is not a complete or fully conformant DOM implementation, but it does provide an interface very close to W3C DOM Level 2 and the corresponding Python mapping as laid out in the xml.dom API docs.
The areas of divergence are inconsequential for most users, and generally reflect decisions made in the interest of eliminating redundancy, inefficiency, and, to some degree, un-Pythonic design. Also, one of the important design principles for Domlette is that where DOM and XPath disagree, XPath wins; aside from making things more efficient to implement, this behavior is generally what people want in an XML document model.
It is also worth noting that in the interest of usability, all DOM implementations exhibit some degree of variation from the specs. Coding a completely implementation-agnostic DOM application is difficult and usually unnecessary.
Saxlette is a fast SAX implementation, all written in C. Its API is similar to that of Python's built-in SAX.
from xml import sax
from Ft.Xml import CreateInputSource
class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1
parser = sax.make_parser(['Ft.Xml.Sax'])
handler = element_counter()
parser.setContentHandler(handler)
#'file:ot.xml' or file('ot.xml') or file('ot.xml').read() would work just as well, of course
parser.parse(CreateInputSource('ot.xml'))
print "Elements counted:", handler.ecount
If you don't care about PySax compatibility, you can use the more specialized API, which involves the following lines in place of the equivalents above:
from Ft.Xml import Sax
...
class element_counter:
....
parser = Sax.CreateParser()
The biggest API difference between Saxlette and PySax is that Saxlette only supports SAX 2. For example, feature_namespaces is hard-wired to True and feature_namespace_prefixes to False (which is exactly what SAX 2 requires). Saxlette also combines all adjacent text events, which eliminates one of the pain points of PySax.
The argument to the parse method is a URI, a SAX input source or a 4Suite input source. In the example above a URI was used. The following example shows similar code using 4Suite's Ft.Xml.InputSource.
from Ft.Xml import InputSource, Sax
factory = InputSource.DefaultFactory
isrc = factory.fromUri("file:ot.xml")
class element_counter:
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1
parser = Sax.CreateParser()
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(isrc)
print "Elements counted:", handler.ecount
To enable validation of your documents while otherwise parsing them normally with SAX, set the xml.sax.handler.feature_validation feature to True on your parser using a line similar to parser.setFeature(xml.sax.handler.feature_validation, True). The parser will then throw an xml.sax._exceptions.SAXParseException exception if it determines that the document is invalid, and it will stop parsing the document. Handlers for document components that have been parsed will be called, however. The following example illustrates these concepts.
from Ft.Xml import InputSource, Sax
factory = InputSource.DefaultFactory
XML = """<!DOCTYPE a [
<!ELEMENT a (b, b)>
<!ELEMENT b EMPTY>
]>
<a><b/><b/></a>"""
isrc = factory.fromString(XML, 'urn:x-example:valid-a')
class element_counter:
    def startDocument(self):
        self.scount = 0
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.scount += 1
    def endElementNS(self, name, qname):
        self.ecount += 1
parser = Sax.CreateParser()
handler = element_counter()
parser.setContentHandler(handler)
# And now, to enable validation...
import xml.sax.handler
parser.setFeature(xml.sax.handler.feature_validation, True)
parser.parse(isrc)
print "Saw", handler.scount, "start tags"
print "Saw", handler.ecount, "end tags"
# And now we show what happens on an invalid document:
XML = """<!DOCTYPE a [
<!ELEMENT a (b, b)>
<!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>"""
isrc = factory.fromString(XML, 'urn:x-example:invalid-a')
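# The parser raises an exception when validation fails, so catch it
# in order to inspect the counters afterwards
from xml.sax import SAXParseException
try:
    parser.parse(isrc)
except SAXParseException:
    print "Validation failed; parsing stopped"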
print "Saw", handler.scount, "start tags"
print "Saw", handler.ecount, "end tags"
# The above document is invalid; it has one more `b` element than is
# allowed by the DTD. The handlers have still been called for those
# parts of the document that have been parsed.
Saxlette has the ability to walk a Domlette tree, firing off events to a handler as if from a source document parse. This ability used to be too well hidden, so an API addition, Ft.Xml.Domlette.SaxWalker, now makes it more readily available. The following example shows how easy it is to use:
from Ft.Xml.Domlette import SaxWalker
from Ft.Xml import Parse
XML = "<a><b/><b/></a>"
class element_counter:
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1
#First get a Domlette document node
doc = Parse(XML)
#Then SAX "parse" it
parser = SaxWalker(doc)
handler = element_counter()
parser.setContentHandler(handler)
#You can set any properties or features, or do whatever
#you would to a regular SAX2 parser instance here
parser.parse() #called without any argument
print "Elements counted:", handler.ecount
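To make the idea behind SaxWalker concrete, the following sketch walks a tree recursively and calls the SAX2 handler methods by hand. It is not the SaxWalker implementation: it uses the standard library's xml.dom.minidom instead of Domlette so that it runs on Python 3 without 4Suite, and fire_sax_events is a hypothetical function name. Namespace declarations are passed through as ordinary attributes, which a real walker would handle separately.

```python
from xml.dom import minidom, Node
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesNSImpl

class ElementCounter(ContentHandler):
    def startDocument(self):
        self.ecount = 0
        self.names = []
    def startElementNS(self, name, qname, attrs):
        self.ecount += 1
        self.names.append(name)

def fire_sax_events(node, handler):
    """Recursively walk a minidom node, calling SAX2 handler methods."""
    if node.nodeType == Node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            fire_sax_events(child, handler)
        handler.endDocument()
    elif node.nodeType == Node.ELEMENT_NODE:
        name = (node.namespaceURI, node.localName)
        attrs, qnames = {}, {}
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            key = (attr.namespaceURI, attr.localName)
            attrs[key] = attr.value
            qnames[key] = attr.name
        handler.startElementNS(name, node.tagName, AttributesNSImpl(attrs, qnames))
        for child in node.childNodes:
            fire_sax_events(child, handler)
        handler.endElementNS(name, node.tagName)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        handler.characters(node.data)

doc = minidom.parseString("<a><b/><b/></a>")
handler = ElementCounter()
fire_sax_events(doc, handler)
```

Because the handler only ever sees SAX events, the same handler class can be reused unchanged for a real parse and for a tree walk.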
Saxlette includes a convenience ContentHandler (Ft.Xml.Sax.DomBuilder) which listens for SAX events and constructs Domlette Documents.
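The reverse direction, a content handler that assembles a tree from incoming events, can be sketched in a few lines. The class below is a hypothetical MinidomBuilder written against the standard library (xml.sax and xml.dom.minidom) for Python 3; it is not Ft.Xml.Sax.DomBuilder, and it ignores namespace prefixes, comments and processing instructions to stay short.

```python
import io
import xml.sax
from xml.dom import minidom
from xml.sax.handler import ContentHandler, feature_namespaces

class MinidomBuilder(ContentHandler):
    """Builds a minidom Document from SAX2 namespace-aware events."""
    def startDocument(self):
        self.document = minidom.Document()
        self._stack = [self.document]   # innermost open node is last
    def startElementNS(self, name, qname, attrs):
        uri, local = name
        # The stdlib parser passes qname=None, so fall back to the local name.
        element = self.document.createElementNS(uri, qname or local)
        for attr_name, value in attrs.items():
            attr_uri = attr_name[0]
            element.setAttributeNS(attr_uri, attrs.getQNameByName(attr_name), value)
        self._stack[-1].appendChild(element)
        self._stack.append(element)
    def endElementNS(self, name, qname):
        self._stack.pop()
    def characters(self, content):
        self._stack[-1].appendChild(self.document.createTextNode(content))

parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
builder = MinidomBuilder()
parser.setContentHandler(builder)
parser.parse(io.BytesIO(b'<spam n="1">eggs<a/></spam>'))

rebuilt = builder.document.documentElement.toxml()
```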
Python's generators are special functions that can produce a series of partial results within the course of running. The calling program can start up a generator, which is suspended when a partial result is yielded, and resumed explicitly by the program when the next result is required. This capability is mirrored in the Expat parser that is the basis of Saxlette. Saxlette has a feature, FEATURE_GENERATOR, which you can set on a parser object to enable generator semantics. If this feature is set, the parse() method returns an iterator. This iterator yields results set by the SAX handlers. The handlers specify the partial results by setting the property PROPERTY_YIELD_RESULT with the value to be yielded. As an example, the following code reports the names of all attributes used in the document.
from Ft.Xml import Sax, CreateInputSource
class report_attributes:
    def __init__(self, parser):
        self.parser = parser
        return
    def startElementNS(self, name, qname, attribs):
        self.parser.setProperty(Sax.PROPERTY_YIELD_RESULT, attribs)
        return
parser = Sax.CreateParser()
parser.setFeature(Sax.FEATURE_GENERATOR, True)
handler = report_attributes(parser)
parser.setContentHandler(handler)
attribs_iterator = parser.parse(CreateInputSource('test.xhtml'))
for attribs in attribs_iterator:
    for name in attribs.keys():
        print name
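The standard library's SAX parser has no equivalent of FEATURE_GENERATOR, but a similar suspend-and-resume effect can be approximated by feeding the parser one chunk at a time from inside a Python generator. The sketch below does this on Python 3; AttributeCollector and iter_attribute_names are hypothetical names, and the technique is incremental feeding rather than Saxlette's built-in generator support.

```python
from collections import deque
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

class AttributeCollector(ContentHandler):
    """Queues the local name of every attribute seen."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.pending = deque()
    def startElementNS(self, name, qname, attrs):
        for (uri, local) in attrs.keys():
            self.pending.append(local)

def iter_attribute_names(chunks):
    """Yield attribute names as soon as the chunk containing them is parsed."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = AttributeCollector()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)              # parse only this piece of input
        while handler.pending:
            yield handler.pending.popleft()
    parser.close()                      # flush anything still buffered
    while handler.pending:
        yield handler.pending.popleft()

CHUNKS = [b'<html><body class="x" ', b'id="y"><p lang="en"/></body></html>']
names = list(iter_attribute_names(CHUNKS))
```

The caller pulls results with an ordinary for loop, and parsing only advances as far as needed to produce the next result.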
In SAX processing, the parser passes to the application a stream of events that represents the XML content. An important aspect of SAX is the user's ability to create SAX filters, which accept a stream of SAX events and pass on a modified stream. For example, you might use a SAX filter to look for DocBook sect1, sect2 etc. elements, and rename them to section elements before passing them on for further processing (presumably by a SAX handler that only understands how to deal with the latter form). You can chain SAX filters as well, and the idea behind SAX filters is usually reuse across a broad array of applications, focusing each filter on a single task that can be cleanly separated from upstream and downstream processing. SAX filters can thus be useful building blocks for XML pipelines.
from xml import sax
from xml.sax.saxutils import XMLFilterBase
from Ft.Xml import CreateInputSource, XML_NAMESPACE as XMLNS
from Ft.Xml.Sax import SaxPrinter
XML = """<?xml version="1.0" encoding="utf-8"?>
<menu>
<item id="A" xml:lang="en">Orange juice</item>
<item id="A" xml:lang="es">Jugo de naranja</item>
<item id="B" xml:lang="en">Toast</item>
<item id="B" xml:lang="es">Pan tostada
<note xml:lang="en">Wheat bread only, please</note>
</item>
</menu>
"""
#Define constants for the two states we care about
ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2
class english_only_filter(XMLFilterBase):
    def __init__(self, downstream):
        XMLFilterBase.__init__(self, downstream)
        return
    def startDocument(self):
        #Set the initial state, and set up the stack of states
        self._state_stack = [ALLOW_CONTENT]
        XMLFilterBase.startDocument(self)
        return
    def startElementNS(self, name, qname, attrs):
        #Check if there is any language attribute
        lang = attrs.get((XMLNS, 'lang'))
        if lang:
            #Set the state as appropriate
            if lang[:2] == 'en':
                state = ALLOW_CONTENT
            else:
                state = SUPPRESS_CONTENT
        else:
            #No language attribute: inherit the enclosing state
            state = self._state_stack[-1]
        #Always update the stack with the current state
        #Even if it has not changed
        self._state_stack.append(state)
        #Only forward the event if the state warrants it
        if state == ALLOW_CONTENT:
            XMLFilterBase.startElementNS(self, name, qname, attrs)
        return
    def endElementNS(self, name, qname):
        #Retrieve the state that was in effect for this element
        state = self._state_stack.pop()
        #Only forward the event if the state warrants it
        if state == ALLOW_CONTENT:
            XMLFilterBase.endElementNS(self, name, qname)
        return
    def characters(self, content):
        #Only forward the event if the state warrants it
        if self._state_stack[-1] == ALLOW_CONTENT:
            XMLFilterBase.characters(self, content)
        return
if __name__ == "__main__":
    parser = sax.make_parser(['Ft.Xml.Sax'])
    #SaxPrinter is a special SAX handler that merely writes
    #SAX events back into an XML document
    filtered_parser = english_only_filter(parser)
    handler = SaxPrinter()
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))
Most SAX handlers operate as state machines, meaning they manage some variables based on the stream of events that come in, and change behavior based on these variables. english_only_filter is set up to be in one of two states: one in which content is passed on to the downstream handler, and one in which content is suppressed. This state is marked in the self._state_stack. The state is initially set to ALLOW_CONTENT, and changed to SUPPRESS_CONTENT if the filter encounters an xml:lang attribute that represents a language other than English (which can be done by checking the first two characters of the value, according to the rules of standard language codes). It has to be a stack because XML language specifications are scoped, so that in the example XML at the top of the listing the string "Pan tostada" is within the scope of the element with the attribute xml:lang="es", and so it is marked as being in Spanish. The entire note element, however, is marked as being in English by an overriding xml:lang="en" attribute.
The SAX handler is set to Ft.Xml.Sax.SaxPrinter, which channels the final SAX events to a 4Suite printer that creates a serialized XML document. It's quite easy to chain filters. If you wanted the parser to send events to a filter of class some_other_filter, which then passed events on to english_only_filter, the relevant line would look as follows:
filtered_parser = english_only_filter(some_other_filter(parser))
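The DocBook renaming filter mentioned earlier in this section can be sketched with the same XMLFilterBase class. The version below runs on Python 3 against the standard library parser and uses xml.sax.saxutils.XMLGenerator in place of SaxPrinter; both substitutions are assumptions made so that the example runs without 4Suite, and SectionRenamer is a hypothetical class name.

```python
import io
import re
import xml.sax
from xml.sax.handler import feature_namespaces
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

SECT_RE = re.compile(r"sect[1-5]$")

class SectionRenamer(XMLFilterBase):
    """Renames un-namespaced sect1..sect5 elements to section."""
    def _rename(self, name):
        uri, local = name
        if uri is None and SECT_RE.match(local):
            return (uri, "section")
        return name
    def startElementNS(self, name, qname, attrs):
        XMLFilterBase.startElementNS(self, self._rename(name), None, attrs)
    def endElementNS(self, name, qname):
        XMLFilterBase.endElementNS(self, self._rename(name), None)

SOURCE = (b"<article><sect1><title>One</title>"
          b"<sect2><para>text</para></sect2></sect1></article>")

out = io.StringIO()
filtered_parser = SectionRenamer(xml.sax.make_parser())
filtered_parser.setFeature(feature_namespaces, True)
filtered_parser.setContentHandler(XMLGenerator(out, encoding="utf-8"))
filtered_parser.parse(io.BytesIO(SOURCE))
output = out.getvalue()
```

Since a filter takes its upstream source as a constructor argument, further filters can be stacked by wrapping one inside another, in the same way as the chaining line above.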
The combination of streaming parsing using Saxlette and streaming serialization using Ft.Xml.Lib.CanonicalXmlPrinter allows for very efficient XML canonicalization (c14n).
import sys
from xml import sax
from Ft.Xml import CreateInputSource
from Ft.Xml.Sax import SaxPrinter
from Ft.Xml.Lib.XmlPrinter import CanonicalXmlPrinter
parser = sax.make_parser(['Ft.Xml.Sax'])
handler = SaxPrinter(CanonicalXmlPrinter(sys.stdout))
parser.setContentHandler(handler)
parser.parse(CreateInputSource(' <a><b b="1" a="2"/></a> '))
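To see what canonicalization does to the sample document, the same input can be passed to xml.etree.ElementTree.canonicalize from the Python 3 standard library (version 3.8 or later). This is a stand-in for the Saxlette and CanonicalXmlPrinter pipeline and implements C14N 2.0 rather than whichever version 4Suite produces, so treat it as an illustration of the effect and not of the 4Suite API.

```python
from xml.etree.ElementTree import canonicalize

SOURCE = ' <a><b b="1" a="2"/></a> '

# Attributes are sorted and the empty element is written with
# explicit start and end tags; whitespace outside the root is dropped.
canonical = canonicalize(SOURCE)
```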
4Suite provides an XPath processing engine, compliant with the W3C XPath 1.0 specification. This query engine is accessible through Ft.Xml.XPath.
If you are using Domlette, as described above, the quickest and easiest way to use the XPath facility in 4Suite is the xpath() method, which any Domlette Node supports:
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
doc2 = NonvalidatingReader.parseString("<spam>eggs<eggs n='1'> and ham</eggs></spam>")
print doc.xpath(u'(//a)[1]')
print doc.xpath(u'string(/spam)')
print doc2.xpath(u'string(//eggs/@n)')
The line
print doc.xpath(u'(//a)[1]')
is actually a shortcut for the following more involved construct, which is described in detail in the next section:
from Ft.Xml.XPath import Evaluate
print Evaluate(u'(//a)[1]', contextNode=doc)
This example prints three lines. The first line shows a string representation of a list containing a single element. As we see from this line, an XPath selection of nodes returns a Python list. In this case, it is a list containing a single element—the first element with a local name of a, which has no attributes and no children. The second line shows the correct string value of the selected spam element, and the third line shows the correct string value of the n attribute.
[<Element at 0xb7d10bb4: name u'a', 0 attributes, 0 children>]
eggs
1
4Suite XPath functions return results with Python types that depend on the XPath data model type of the query result. The following list shows how the five XPath result types (string, number, boolean, node-set and object) are mapped to Python types:
XPath string: Python unicode type
XPath number: Python float type (int or long also accepted), or instance of Ft.Lib.number.nan (for NaN) or Ft.Lib.number.inf (for Infinity)
XPath boolean: Ft.Lib.boolean instance
XPath node-set: Python list of Domlette nodes, in document order, with no duplicates
XPath foreign object: any other Python object (you will very rarely encounter this case)
XPath expressions can refer to both variables and qualified names (QNames) that must be defined by the environment that is executing the XPath expression. This section describes how to use these advanced features of XPath using the 4Suite interface.
4Suite's XPath implementation uses a Domlette node as the context node for XPath operations, so the document must be parsed before XPath can be used to access it. The following example parses an XML document and explicitly sets up an XPath context to run an XPath query that extracts content from it.
XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""
from Ft.Xml import Parse
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Evaluate
doc = Parse(XML)
ctx = Context(doc)
nodes = Evaluate(u'//em', ctx)
# The return value, a node set, comes back as a Python list of nodes
# which may be accessed using an iterator
for n in nodes:
    # print dir(n)
    print n.tagName
    print n.firstChild.nodeValue
XPath always requires a context for execution; a common XPath context is the root of the target document, such as we did in the above example. Think about an XPath query being executed from some location in an XML document. This location in the document is a necessary component of using XPath.
There is more to an XPath context than just the context node, but if your needs are as straightforward as that of the above example, there is an abbreviated version of the Evaluate method for this purpose. For example, the following fragment is equivalent to the two lines creating a context and evaluating the expression in the above example.
# No need to create a context object
Evaluate(u'//em', contextNode=doc)
If your source document uses XML Namespaces you will likely need to use QNames in your XPath expressions. For this to work, you'll need to introduce namespace mappings into your XPath context. For example, if the elements of our XML document above are in an XML namespace, then we must set up our context slightly differently.
XML = """<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold'>emphasized Namespaced Text</em> text
</ham>"""
from Ft.Xml import Parse
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Evaluate
NSS = {u'ex': u'http://example.com/ns#'}
doc = Parse(XML)
ctx = Context(doc, processorNss=NSS)
nodes = Evaluate(u'//ex:em', ctx)
for n in nodes:
    # print dir(n)
    print n.tagName
    print n.firstChild.nodeValue
You define XPath namespace prefixes through a Python dictionary (NSS in the above example) which maps these prefixes, such as 'ex' in the above example, to the appropriate namespace URI, such as 'http://example.com/ns#' in the above example. This prefix mapping is added to your XPath context using the processorNss parameter to the Context function.
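The prefix-to-URI dictionary idea is not specific to 4Suite. As a point of comparison, the sketch below runs the same query with the limited XPath support in Python 3's xml.etree.ElementTree, which also takes such a dictionary. ElementTree is a stand-in here, and its findall method accepts only a subset of XPath, so this illustrates the mapping and not the 4Suite Context API.

```python
import xml.etree.ElementTree as ET

XML = """<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold'>emphasized Namespaced Text</em> text
</ham>"""

# The prefix is local to the query; only the namespace URI has to match.
NSS = {"ex": "http://example.com/ns#"}

root = ET.fromstring(XML)
unprefixed = root.findall(".//em")         # finds nothing: em is in a namespace
prefixed = root.findall(".//ex:em", NSS)   # finds the element
texts = [el.text for el in prefixed]
```

Note that the document itself declares the namespace as a default namespace without any prefix; the prefix ex exists only in the query's mapping.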
In a similar way, you can also pass in variable bindings which may be used as values later in your XPath expressions. In this case, however, variables are Python tuples containing the namespace URI and local name of the variable.
ctx = Context(node, varBindings=
{(EMPTY_NAMESPACE, u'date'): u'2003-06-20'})
Evaluate('event[@date = $date]', context=ctx)
This creates a variable named 'date' that is in no namespace, with a value of '2003-06-20'; this is then used for comparison with the date attribute in the XPath expression.
XPath variable names are QNames, so you pass in variable names as namespace/local name tuples. The values can be numbers, unicode objects or boolean objects:
from Ft.Xml.XPath import boolean
ctx = Context(node, varBindings={(EMPTY_NAMESPACE, u'test'): boolean.true})
This sets the variable 'test' to the boolean value true (remember that this is for the XPath environment, not the Python one), and again this may be used as in any XSLT stylesheet.
If you only want a value once, you may of course still use string constants, as in
nodes = Evaluate(u'//ex:em[@type="bold"]', ctx)
Note the quoting: the Python string is delimited by single quotes, so the XPath string literal inside it uses double quotes.
Sometimes you want to re-use an XPath expression and namespace mapping multiple times, for efficiency and convenience. The following example demonstrates this:
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader
DOCS = ["<spam xmlns='http://spam.com'>eggs</spam>",
"<spam xmlns='http://spam.com'>grail</spam>",
"<spam xmlns='http://spam.com'>nicht</spam>",
]
# Pre-compile for efficiency and convenience
expr = Compile(u"/a:spam[contains(., 'i')]")
ctx = Context(None, processorNss={u"a": u"http://spam.com"})
i = 1
for doc in DOCS:
    doc = NonvalidatingReader.parseString(doc.encode('UTF-8'),
                                          "http://spam.com/base")
    retval = Evaluate(expr, doc, ctx)
    if len(retval):
        print "Document", i, "meets our criteria"
    i += 1
Which should display:
Document 2 meets our criteria
Document 3 meets our criteria
There is a usable XPath module in PyXML (warning: PyXML's XSLT implementation is not usable: use 4Suite if you need XSLT), but there are a lot of updates and improvements in the XPath library version in 4Suite.
If you are familiar with PyXML, you may have used a different form of imports to load in XPath and XSLT features. The imports are different under 4Suite.
Usage example:
PyXML usage (do not use with 4Suite):
import xml.xslt
import xml.xpath
4Suite usage (use these imports):
import Ft.Xml.XPath
import Ft.Xml.Xslt
For basic XSLT transform needs, or to get started quickly, the Ft.Xml.Xslt module offers a quick way to apply transforms to XML documents and get back the result as a simple string. Within this module, the function of interest is Transform.
Transform(fname_or_uri, string_stream_fname_uri_isrc, [param], [output]): The Transform function takes two required arguments and two optional ones. The first is the source XML for the transform. The second is the XSLT document. Both are given as a string, an object like an open file, a local file path on your computer, an absolute URI, or an InputSource object. The optional third argument is a dictionary of stylesheet parameters, the keys of which may be given as unicode objects if they have no namespace, or as (uri, localname) tuples if they do. The values are the overridden parameter values. If you do not supply the optional output argument, the return value is a string with the result of the transform. If you do supply it, it must be a file-like object to which the output will be written, and the return value is then None.
XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""
from Ft.Xml.Xslt import Transform
# URL for the identity transform: reproduces the input XML in the result
ID_TRANSFORM = 'http://cvs.4suite.org/viewcvs/*checkout*/4Suite/Ft/Data/identity.xslt'
result = Transform(XML, ID_TRANSFORM)
print result
# If the above XML document were located in the file
# "target.xml", we could have used `Transform("target.xml", ID_TRANSFORM)`.
#It's more efficient to redirect the processor output to an output stream. The following does so:
import sys
result = Transform(XML, ID_TRANSFORM, output=sys.stdout)
# result is None here, because the output was written to sys.stdout
Here is the general procedure for using the Python API for XSLT processing:
Create an Ft.Xml.Xslt.Processor.Processor instance.
Prepare Ft.Xml.InputSource instances (via their factory) for the source XML and stylesheet.
Call the Processor's appendStylesheet method, passing it the stylesheet's InputSource.
Call the Processor's run method, passing it the source document's InputSource.
For input to our transform, we will use the namespaced example as in the last section.
$ cat testNS.xml
<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold' f='2'>emphasized Namespaced Text</em> text
</ham>
For our stylesheet, we will again use one of the simplest useful examples, the identity stylesheet.
$ cat identity.xsl
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The code below follows the processing outline, having converted the input file and stylesheet to the URI format.
from Ft.Xml.Xslt import Processor
# We use the InputSource architecture
from Ft.Xml import InputSource
from Ft.Lib.Uri import OsPathToUri # path to URI conversions
processor = Processor.Processor()
# Prepare an InputSource for the source document
# Convert from local file to uri
srcAsUri = OsPathToUri('testNS.xml')
source = InputSource.DefaultFactory.fromUri(srcAsUri)
# Prepare an InputSource for the stylesheet
# Convert from local file to uri
ssAsUri = OsPathToUri('identity.xsl')
transform = InputSource.DefaultFactory.fromUri(ssAsUri)
processor.appendStylesheet(transform)
result = processor.run(source)
# result is a string with the serialized transform result
print result
You can call run multiple times on different InputSources. When you're done, the processor's reset method can be used to restore a clean slate (at which point you would have to append stylesheets to the processor again).
The following example uses our processor from the previous example to transform a new XML document, this one constructed manually.
XML = """<foo><bar/></foo>"""
source = InputSource.DefaultFactory.fromString(XML, 'http://example.org/foo')
result = processor.run(source)
# result is a string with the serialized transform result
print result
This code continues from the previous example to process the second document, using the same processor and stylesheet. This is a useful form when there is a requirement for server side processing of multiple input documents with a common stylesheet.
In the example below, strings are used as the source of the transform (stylesheet) and source documents, and we are careful to pass in a URI to identify each of them. In the source document, the URI is needed for resolving external entity references and XIncludes. In the stylesheet, the URI is needed for resolving document function calls, xsl:includes and xsl:imports.
If you do not provide a URI and you attempt to use any of these features, you may get an exception.
# The identity transform: duplicates the input to output
TRANSFORM = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>"""
SOURCE = """<spam id="eggs">I don't like spam</spam>"""
# The processor class is the core of the XSLT API
from Ft.Xml.Xslt import Processor
processor = Processor.Processor()
# We use the InputSource architecture
from Ft.Xml import InputSource
# Prepare an InputSource for the transform
transform = InputSource.DefaultFactory.fromString(TRANSFORM,
"http://spam.com/identity.xslt")
# Prepare an InputSource for the source document
source = InputSource.DefaultFactory.fromString(SOURCE,
"http://spam.com/doc.xml")
processor.appendStylesheet(transform)
result = processor.run(source)
# result is a string with the serialized transform result
print result
If your documents are already in the form of Domlette documents, you don't need to create InputSources for them; you can just use the Processor's appendStylesheetNode and runNode methods instead of appendStylesheet and run, respectively.
It is usually slower to read the stylesheet from a Domlette object than to parse a serialized document.
The Domlette documents used in the following example are obtained by parsing existing XML, but this approach can just as easily be used on Domlette documents that are built programmatically (i.e. using the DOM API).
# The identity transform: duplicates the input to output
TRANSFORM = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>"""
SOURCE = """<spam id="eggs">I don't like spam</spam>"""
from Ft.Xml.Xslt import Processor
processor = Processor.Processor()
from Ft.Xml.Domlette import NonvalidatingReader
# Create a DOM for the transform
transform = NonvalidatingReader.parseString(TRANSFORM,
"http://spam.com/identity.xslt")
# Create a DOM for the source document
source = NonvalidatingReader.parseString(SOURCE, "http://spam.com/doc.xml")
processor.appendStylesheetNode(transform, "http://spam.com/identity.xslt")
result = processor.runNode(source, "http://spam.com/doc.xml")
print result
If you have objects from another DOM library, you can first convert them to Domlette objects as shown in “Converting from other DOM libraries”.
You can pass in stylesheet parameters as a Python dictionary. Use the parameter names for keys. Values use the 4Suite XPath library's standard type mappings, which are described in “Type mappings”.
Parameter and variable names in XPath/XSLT are actually expanded names, which we represent as (namespaceURI, localName) tuples. If your parameter name is in a namespace, you have to use such a tuple as the mapping key. Otherwise, you may simply use a Unicode string containing only the local-name part; the namespace then defaults to Ft.Xml.EMPTY_NAMESPACE.
Here is an example in which the program computes a "date" parameter and passes it to the stylesheet:
SRC = """<?xml version="1.0"?><dummy/>"""
STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="date" select="'unknown'"/>
  <xsl:output method="xml" indent="yes" encoding="us-ascii"/>
  <xsl:template match="/">
    <result>
      <xsl:value-of select="$date"/>
    </result>
  </xsl:template>
</xsl:stylesheet>"""
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
import time
src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')
proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
params = {u'date': unicode(time.asctime())}
result = proc.run(src_isrc, topLevelParams=params)
print result
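When a parameter name is in a namespace, the dictionary key is a tuple rather than a plain string. The following is a minimal sketch of such a mapping; the namespace URI, the parameter name "user", and both values are hypothetical and chosen only for illustration. It builds the dictionary with plain Python so that the two key shapes are easy to compare, and it would be handed to the processor through topLevelParams in the same way as in the example above.

```python
# Hypothetical namespace URI, for illustration only. A stylesheet would bind
# it to a prefix (e.g. xmlns:app="http://example.com/ns/app") and declare
# the parameter as <xsl:param name="app:user"/>.
APP_NS = u'http://example.com/ns/app'

params = {
    # Parameter in no namespace: the local name alone is the key
    u'date': u'2006-12-15',
    # Parameter in a namespace: the key is a (namespaceURI, localName) tuple
    (APP_NS, u'user'): u'guest',
}

# Passed to the processor as in the previous example (not executed here):
# result = proc.run(src_isrc, topLevelParams=params)
```

Both kinds of key can be mixed in one dictionary, since each key identifies a single expanded name.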
4Suite honors the W3C Recommendation “Associating Style Sheets with XML documents” and RFC 3023, “XML Media Types”. Instead of, or in addition to, using the processor's explicit APIs to set the stylesheet for a transformation, the source document may contain an xml-stylesheet processing instruction (PI) that refers to a stylesheet via a URI reference.
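As a rough illustration of what such a PI looks like and where it sits in a document, the sketch below parses a small document and collects the xml-stylesheet PIs that appear before the document element. It uses the Python standard library's xml.dom.minidom purely to keep the example self-contained; this is an assumption of the sketch, not part of 4Suite's API, and the helper name stylesheet_pis and the href and type values are hypothetical.

```python
from xml.dom import minidom

# Hypothetical source document carrying an xml-stylesheet PI in its prolog
DOC = """<?xml version="1.0"?>
<?xml-stylesheet href="identity.xslt" type="text/xml"?>
<spam id="eggs">I don't like spam</spam>"""


def stylesheet_pis(document):
    """Return the data of each xml-stylesheet PI found in the prolog."""
    found = []
    for node in document.childNodes:
        # The prolog ends where the document element begins
        if node.nodeType == node.ELEMENT_NODE:
            break
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == 'xml-stylesheet'):
            found.append(node.data)
    return found


pis = stylesheet_pis(minidom.parseString(DOC))
```

Each collected string holds the PI's pseudo-attributes, including the href that carries the URI reference to the stylesheet.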
The xml-stylesheet PI must meet the following criteria:
It must appear in the document prolog.