# sax js

A sax-style parser for XML and HTML.

Designed with [node](http://nodejs.org/) in mind, but should work fine in
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
  docs.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it.  It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
  masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.

## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unmolested.

## Usage

```javascript
var sax = require("./lib/sax"),
  strict = true, // set to false for html-mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text.  t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag.  node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute.  attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage
// takes the same options as the parser
var fs = require("fs");
var options = {}; // same settings object that sax.parser accepts
var saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node
  // event emitter.
  console.error("error!", e)
  // clear the error
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as above
})
// pipe is supported, and it's readable/writable
// same chunks coming in also go out.
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))
```

## Arguments

Pass the following arguments to the parser function.  All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting.  All default to `false`.

Settings supported:

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.
* `strictEntities` - Boolean. If true, only parse [predefined XML
  entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
  (`&amp;`, `&apos;`, `&gt;`, `&lt;`, and `&quot;`)

## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.

## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.

## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed in the this-context of
the parser object. The list of supported events is also in the exported
`EVENTS` array.

When using the stream interface, assign handlers using the EventEmitter
`on` function in the normal fashion.

`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation-dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML,
after all.

`opentagstart` - Emitted immediately when the tag name is available,
but before any attributes are encountered.  Argument: object with a
`name` field and an empty `attributes` set.  Note that this is the
same object that will later be emitted in the `opentag` event.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have a
`local`, `prefix`, and `uri` member.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.
Argument: tag name.

`attribute` - An attribute node.  Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node.  Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.

`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`noscript` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special xml characters.
If you pass `noscript: true` in the options, then this behavior is
suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue.  I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test.  Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have with
sax-js.