# sax js

A sax-style parser for XML and HTML.

Designed with [node](http://nodejs.org/) in mind, but should work fine in
the browser or other CommonJS implementations.

## What This Is

* A very simple tool to parse through an XML string.
* A stepping stone to a streaming HTML parser.
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
  docs.

## What This Is (probably) Not

* An HTML Parser - That's a fine goal, but this isn't it.  It's just
  XML.
* A DOM Builder - You can use it to build an object model out of XML,
  but it doesn't do that out of the box.
* XSLT - No DOM = no querying.
* 100% Compliant with (some other SAX implementation) - Most SAX
  implementations are in Java and do a lot more than this does.
* An XML Validator - It does a little validation when in strict mode, but
  not much.
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
  masochism.
* A DTD-aware Thing - Fetching DTDs is a much bigger job.
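
To make the "DOM Builder" point concrete, here is a minimal sketch of how the `onopentag`, `ontext`, and `onclosetag` handlers could assemble a plain object tree. The `buildTree` helper and the `{ name, attributes, children }` node shape are assumptions made up for this example; neither is part of sax-js.

```javascript
// Hypothetical helper (not part of sax-js): attaches handlers to a parser
// and collects a plain object tree as events arrive.
function buildTree(parser) {
  var root = null
  var stack = [] // currently open elements, innermost last

  parser.onopentag = function (node) {
    var el = { name: node.name, attributes: node.attributes, children: [] }
    if (stack.length) stack[stack.length - 1].children.push(el)
    else root = el
    stack.push(el)
  }
  parser.ontext = function (t) {
    // text outside the root element is ignored in this sketch
    if (stack.length) stack[stack.length - 1].children.push(t)
  }
  parser.onclosetag = function () {
    stack.pop()
  }

  // call this after the document has been fully parsed
  return function getTree() { return root }
}
```

With the library loaded, this would be wired up by creating `parser = sax.parser(true)`, calling `var getTree = buildTree(parser)`, then `parser.write(xml).close()`, and finally reading `getTree()`.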

## Regarding `<!DOCTYPE`s and `<!ENTITY`s

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
entities in XML by putting them in the DTD. This parser doesn't do anything
with that. If you want to listen to the `ondoctype` event, and then fetch
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
be my guest.

Unknown entities will fail in strict mode, and in loose mode, will pass
through unmolested.
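
If you do want to go down that road, the following is a rough sketch of the idea for the simplest case only: internal entities declared inline as `<!ENTITY name "value">`. The helpers `extractEntities` and `registerDoctypeEntities` are hypothetical names invented here, and the regular expression is an assumption that deliberately skips external (`SYSTEM`/`PUBLIC`) and parameter (`%`) entities.

```javascript
// Hypothetical helper (not part of sax-js): pull simple internal entity
// declarations such as <!ENTITY name "value"> out of a doctype string.
// External (SYSTEM/PUBLIC) and parameter (%) entities are not handled.
function extractEntities(doctype) {
  var re = /<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>/g
  var found = {}
  var m
  while ((m = re.exec(doctype)) !== null) {
    found[m[1]] = m[2] !== undefined ? m[2] : m[3]
  }
  return found
}

// Hypothetical wiring: copy declared entities onto parser.ENTITIES
// whenever a doctype event arrives.
function registerDoctypeEntities(parser) {
  parser.ondoctype = function (doctype) {
    var found = extractEntities(doctype)
    Object.keys(found).forEach(function (name) {
      parser.ENTITIES[name] = found[name]
    })
  }
}
```

Calling `registerDoctypeEntities(parser)` before the first `write` should be enough for documents that declare their entities in an internal subset. Fetching an external DTD is still up to you.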

## Usage

```javascript
var sax = require("./lib/sax"),
  strict = true, // set to false for html-mode
  parser = sax.parser(strict);

parser.onerror = function (e) {
  // an error happened.
};
parser.ontext = function (t) {
  // got some text.  t is the string of text.
};
parser.onopentag = function (node) {
  // opened a tag.  node has "name" and "attributes"
};
parser.onattribute = function (attr) {
  // an attribute.  attr has "name" and "value"
};
parser.onend = function () {
  // parser stream is done, and ready to have more stuff written to it.
};

parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();

// stream usage
// takes the same options as the parser
var fs = require("fs"),
  options = {}; // optional settings, see "Arguments"
var saxStream = require("sax").createStream(strict, options)
saxStream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node
  // event emitter.
  console.error("error!", e)
  // clear the error
  this._parser.error = null
  this._parser.resume()
})
saxStream.on("opentag", function (node) {
  // same object as above
})
// pipe is supported, and it's readable/writable
// same chunks coming in also go out.
fs.createReadStream("file.xml")
  .pipe(saxStream)
  .pipe(fs.createWriteStream("file-copy.xml"))
```

## Arguments

Pass the following arguments to the parser function.  All are optional.

`strict` - Boolean. Whether or not to be a jerk. Default: `false`.

`opt` - Object bag of settings regarding string formatting.  All default to `false`.

Settings supported:

* `trim` - Boolean. Whether or not to trim text and comment nodes.
* `normalize` - Boolean. If true, then turn any whitespace into a single
  space.
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
  in loose mode, rather than uppercasing them.
* `xmlns` - Boolean. If true, then namespaces are supported.
* `position` - Boolean. If false, then don't track line/col/position.
* `strictEntities` - Boolean. If true, only parse [predefined XML
  entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
  (`&amp;`, `&apos;`, `&gt;`, `&lt;`, and `&quot;`)
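
As a rough illustration of what `trim` and `normalize` mean for a text node, the sketch below mimics their effect on a plain string. `applyTextOptions` is a hypothetical stand-in written for this example. It approximates the behavior described above and is not the parser's own code.

```javascript
// Hypothetical stand-in (not the parser's code): approximates the effect
// of the `trim` and `normalize` settings on a single text node.
function applyTextOptions(opt, text) {
  if (opt.trim) text = text.trim()                    // drop leading/trailing whitespace
  if (opt.normalize) text = text.replace(/\s+/g, " ") // collapse each whitespace run
  return text
}

var raw = "  Hello,\n   world  "
var trimmed = applyTextOptions({ trim: true }, raw)                // "Hello,\n   world"
var both = applyTextOptions({ trim: true, normalize: true }, raw)  // "Hello, world"
```

The real settings go in the second argument, for example `sax.parser(false, { trim: true, normalize: true })`.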

## Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.

`close` - Close the stream. Once closed, no more data may be written until
it is done processing the buffer, which is signaled by the `end` event.

`resume` - To gracefully handle errors, assign a listener to the `error`
event. Then, when the error is taken care of, you can call `resume` to
continue parsing. Otherwise, the parser will not continue while in an error
state.
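
Since `write` can be called as many times as you like, a document can be fed in pieces. The sketch below shows one way to do that. `writeInChunks` is a hypothetical helper name, and the only thing it assumes about `parser` is that it has the `write` and `close` methods described above.

```javascript
// Hypothetical helper: feed a string to the parser in fixed-size pieces,
// then close it. `parser` is anything with sax-style write() and close().
function writeInChunks(parser, xml, chunkSize) {
  if (!(chunkSize > 0)) throw new RangeError("chunkSize must be positive")
  for (var i = 0; i < xml.length; i += chunkSize) {
    parser.write(xml.slice(i, i + chunkSize))
  }
  parser.close()
}
```

Chunk boundaries can fall anywhere, including in the middle of a tag, so handlers should not assume that one `write` maps to one event.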

## Members

At all times, the parser object will have the following members:

`line`, `column`, `position` - Indications of the position in the XML
document where the parser currently is looking.

`startTagPosition` - Indicates the position where the current tag starts.

`closed` - Boolean indicating whether or not the parser can be written to.
If it's `true`, then wait for the `ready` event to write again.

`strict` - Boolean indicating whether or not the parser is a jerk.

`opt` - Any options passed into the constructor.

`tag` - The current tag being dealt with.

And a bunch of other stuff that you probably shouldn't touch.
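
These members are mostly useful from inside a handler. As a small sketch, the hypothetical `trackTagPositions` helper below snapshots them each time a tag opens. It reads only the members listed above and makes no assumption about whether the values are zero-based or one-based.

```javascript
// Hypothetical helper: record where each tag was opened, using the
// documented `line`, `column`, and `startTagPosition` members.
function trackTagPositions(parser) {
  var seen = []
  parser.onopentag = function (node) {
    seen.push({
      name: node.name,
      line: parser.line,
      column: parser.column,
      start: parser.startTagPosition
    })
  }
  return seen // filled in as parsing proceeds
}
```

If the `position` option is set to `false`, expect these values to be meaningless.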

## Events

All events emit with a single argument. To listen to an event, assign a
function to `on<eventname>`. Functions get executed in the this-context of
the parser object. The list of supported events is also in the exported
`EVENTS` array.

When using the stream interface, assign handlers using the EventEmitter
`on` function in the normal fashion.
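
The two registration styles can be hidden behind one function if you switch between the parser and the stream. This is only a sketch: `attachHandlers` is a hypothetical helper, and it assumes that plain parser objects have no `on` method while streams do.

```javascript
// Hypothetical helper: register the same handlers on either interface.
// Streams are EventEmitters (they have `on`); plain parsers are assumed
// to take `on<eventname>` properties instead.
function attachHandlers(target, handlers) {
  Object.keys(handlers).forEach(function (name) {
    if (typeof target.on === "function") target.on(name, handlers[name])
    else target["on" + name] = handlers[name]
  })
  return target
}
```

For example, `attachHandlers(parser, { text: onText, opentag: onOpenTag })` would set `parser.ontext` and `parser.onopentag`, while the same call on a stream would go through `on("text", ...)` and `on("opentag", ...)`. Note that the this-context differs: it is the parser in the first case and the stream in the second.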

`error` - Indication that something bad happened. The error will be hanging
out on `parser.error`, and must be deleted before parsing can continue. By
listening to this event, you can keep an eye on that kind of stuff. Note:
this happens *much* more in strict mode. Argument: instance of `Error`.
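
A minimal sketch of that clear-and-continue pattern, for cases where you would rather gather every error than stop at the first. `collectErrors` is a hypothetical helper name. The steps inside the handler follow the description above and the stream example in the Usage section.

```javascript
// Hypothetical helper: collect errors instead of stopping at the first one.
function collectErrors(parser) {
  var errors = []
  parser.onerror = function (e) {
    errors.push(e)
    parser.error = null // the stored error must be cleared...
    parser.resume()     // ...before parsing can continue
  }
  return errors // inspect after the `end` event
}
```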

`text` - Text node. Argument: string of text.

`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.

`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
object with `name` and `body` members. Attributes are not parsed, as
processing instructions have implementation dependent semantics.

`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
would trigger this kind of event. This is a weird thing to support, so it
might go away at some point. SAX isn't intended to be used to parse SGML,
after all.

`opentagstart` - Emitted immediately when the tag name is available,
but before any attributes are encountered.  Argument: object with a
`name` field and an empty `attributes` set.  Note that this is the
same object that will later be emitted in the `opentag` event.

`opentag` - An opening tag. Argument: object with `name` and `attributes`.
In non-strict mode, tag names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, then it will contain
namespace binding information on the `ns` member, and will have a
`local`, `prefix`, and `uri` member.

`closetag` - A closing tag. In loose mode, tags are auto-closed if their
parent closes. In strict mode, well-formedness is enforced. Note that
self-closing tags will have `closetag` emitted immediately after `opentag`.
Argument: tag name.

`attribute` - An attribute node.  Argument: object with `name` and `value`.
In non-strict mode, attribute names are uppercased, unless the `lowercase`
option is set.  If the `xmlns` option is set, it will also contain namespace
information.

`comment` - A comment node.  Argument: the string of the comment.

`opencdata` - The opening tag of a `<![CDATA[` block.

`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
quite large, this event may fire multiple times for a single block, if it
is broken up into multiple `write()`s. Argument: the string of random
character data.

`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.
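
Because `cdata` can fire more than once per block, code that wants the whole block has to stitch the pieces together between `opencdata` and `closecdata`. The following sketch shows one way to do it. `collectCdata` and its `onBlock` callback are hypothetical names made up for this example.

```javascript
// Hypothetical helper: join the pieces of each CDATA block and hand the
// complete text to `onBlock` once the block closes.
function collectCdata(parser, onBlock) {
  var parts = null // null means "not inside a CDATA block"
  parser.onopencdata = function () { parts = [] }
  parser.oncdata = function (chunk) { if (parts) parts.push(chunk) }
  parser.onclosecdata = function () {
    if (parts) onBlock(parts.join(""))
    parts = null
  }
}
```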

`opennamespace` - If the `xmlns` option is set, then this event will
signal the start of a new namespace binding.

`closenamespace` - If the `xmlns` option is set, then this event will
signal the end of a namespace binding.

`end` - Indication that the closed stream has ended.

`ready` - Indication that the stream has reset, and is ready to be written
to.

`script` - In non-strict mode, `<script>` tags trigger a `"script"`
event, and their contents are not checked for special xml characters.
If you pass `noscript: true` in the options, then this behavior is suppressed.

## Reporting Problems

It's best to write a failing test if you find an issue.  I will always
accept pull requests with failing tests if they demonstrate intended
behavior, but it is very hard to figure out what issue you're describing
without a test.  Writing a test is also the best way for you yourself
to figure out if you really understand the issue you think you have with
sax-js.