Microcontent Parser
Parsing
To create a new parser, instantiate MicroParser with a set of rules. You can then parse HTML, XHTML and XML content by passing a MicroparserNode object to the parse() method. The method returns an array of the values created during parsing.
You can define rules in one of two ways. With an array of rule objects (see Custom Rules below), you implement your own logic for matching elements and performing actions. Alternatively, you can specify rules using the selectors and actions syntax:
$relTag = array("a[rel~=tag]", "tags[]=text()");
$hCalendar = array(".vevent", "vevent",
    array(".dtstart", "dtstart=abbr@title|text()"),
    array(".dtend", "dtend=abbr@title|text()"),
    array(".summary", "summary=text()"),
    array("a.url[href]", "url=a@href")
);
$parser = new MicroParser($relTag, $hCalendar);
The PHP implementation currently supports only documents generated with Tidy, and provides the convenience method parseTidy. For example:
$tidy = new Tidy();
$tidy->parseFile("test.html");
$tidy->cleanRepair();
$parser = new MicroParser($relTag, $hCalendar);
$values = $parser->parseTidy($tidy);
Selectors
Simple rules can be written in terms of selectors. Selectors use a CSS-like syntax to identify matching elements. For example:
a[rel~=tag]
This selector matches any a element whose rel attribute is a space-separated list of values, one of which is tag.
The following table summarizes the selector syntax:
| Selector | Matches |
| * | Match any element |
| foo | Match any element foo |
| .foo | Match any element with the class foo |
| .foo.bar | Match any element with the class foo and class bar |
| #foo | Match the element with id foo |
| foo.bar | Match any element foo with class bar |
| [foo] | Match any element with attribute foo |
| [foo=bar] | Match any element with attribute foo equal to bar |
| [foo~=bar] | Match any element with attribute foo containing bar as one of its space-separated values |
| [foo|=bar] | Match any element with attribute foo starting with bar |
| foo[bar=baz] | Match any element foo with attribute bar equal to baz |
| [class~=foo] | Equivalent to .foo |
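The attribute operators in the table can be sketched in a few lines of PHP. This is only an illustration of the matching semantics as the table words them, not the parser's own code: the function attributeMatches and the plain name => value attribute array are assumptions made for this example.

```php
<?php
// Hypothetical helper, not part of the parser's API: tests a single
// attribute condition against an element's attributes (name => value).
function attributeMatches(array $attrs, string $name, string $op = '', string $value = ''): bool
{
    if (!array_key_exists($name, $attrs)) {
        return false; // every attribute selector requires the attribute to exist
    }
    $actual = $attrs[$name];
    switch ($op) {
        case '':   // [foo]
            return true;
        case '=':  // [foo=bar]
            return $actual === $value;
        case '~=': // [foo~=bar]: bar is one of the whitespace-separated values
            return in_array($value, preg_split('/\s+/', trim($actual)), true);
        case '|=': // [foo|=bar]: value starts with bar, as worded in the table
            return strncmp($actual, $value, strlen($value)) === 0;
        default:
            return false;
    }
}

// a[rel~=tag] matches rel="nofollow tag" but not rel="tagged".
$matchesList   = attributeMatches(array('rel' => 'nofollow tag'), 'rel', '~=', 'tag');
$matchesPrefix = attributeMatches(array('rel' => 'tagged'), 'rel', '~=', 'tag');
```

Under this reading, .foo and [class~=foo] are equivalent because a class attribute is itself a space-separated list.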
Actions
Actions that extract a value directly from the matched element use one of these forms:
name=extract(|extract)*
name[]=extract(|extract)*
The value is taken from the first applicable extract. To provide fallbacks, list several extracts separated by |; they are tried from left to right. For example:
dtstart=abbr@title|text()
If the matched element is an abbr with a title attribute, this action uses the value of that attribute; otherwise it uses the text value of the matched element. The result is assigned to dtstart.
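The fallback behaviour can be sketched as follows. The function firstApplicableExtract and the array used to represent an element (name, attrs, text) are hypothetical and exist only for this illustration; the sketch handles the @attr, tag@attr and text() extracts and ignores the others.

```php
<?php
// Hypothetical helper: returns the value of the first applicable extract,
// or null if none applies. $element is an assumed shape, not a parser type.
function firstApplicableExtract(array $element, string $extracts): ?string
{
    foreach (explode('|', $extracts) as $extract) {
        if ($extract === 'text()') {
            return $element['text'];
        }
        if (strpos($extract, '@') !== false) {
            // "@attr" yields an empty tag; "tag@attr" restricts the element name.
            list($tag, $attr) = explode('@', $extract, 2);
            $tagOk = ($tag === '' || $tag === $element['name']);
            if ($tagOk && isset($element['attrs'][$attr])) {
                return $element['attrs'][$attr];
            }
        }
    }
    return null;
}

$abbr = array('name' => 'abbr', 'attrs' => array('title' => '2005-10-05'), 'text' => 'October 5');
$span = array('name' => 'span', 'attrs' => array(), 'text' => 'October 5');

$fromTitle = firstApplicableExtract($abbr, 'abbr@title|text()'); // title attribute
$fromText  = firstApplicableExtract($span, 'abbr@title|text()'); // falls back to text
```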
The following table lists the supported extracts:
| Extract | Will extract |
| @foo | Extract the value of the attribute foo from the matched element |
| foo@bar | Extract the value of the attribute bar only if the matched element is foo |
| text() | Extract the text value of the matched element |
| xml() | Extract the XML content of the matched element |
| function() | Call some other function on the matched element |
With foo=extract, the first match of the rule sets foo to the extracted value. With foo[]=extract, each match of the rule appends the extracted value to the array foo. For example:
tags[]=text()
This action extracts the text value of the matched element and adds it to the array tags.
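The difference between the two forms can be sketched as below. The helper storeValue is hypothetical, and the sketch assumes that "the first match sets the value" means later matches of a name= action are ignored.

```php
<?php
// Hypothetical helper: stores one extracted value in the result array.
function storeValue(array &$values, string $name, string $extracted): void
{
    if (substr($name, -2) === '[]') {
        // name[]: every match appends an item to the array
        $values[substr($name, 0, -2)][] = $extracted;
    } elseif (!array_key_exists($name, $values)) {
        // name: only the first match sets the value (assumed reading)
        $values[$name] = $extracted;
    }
}

$values = array();
storeValue($values, 'tags[]', 'php');
storeValue($values, 'tags[]', 'microformats');
storeValue($values, 'summary', 'First');
storeValue($values, 'summary', 'Second');
// $values is now array('tags' => array('php', 'microformats'), 'summary' => 'First')
```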
Actions with Sub-rules
Actions that deal with structures use sub-rules instead of extracts. The action is written in one of two forms:
name
name[]
The form name creates a single value, and name[] creates an array of values. Each value is itself an array, containing one or more values set by the sub-rules. Sub-rules are evaluated against the matching element and all of its descendants.
For example:
$hCalendar = array(".vevent", "vevent",
    array(".dtstart", "dtstart=abbr@title|text()"),
    array(".dtend", "dtend=abbr@title|text()"),
    array(".summary", "summary=text()"),
    array(".contact", "contact", $hCard)
);
The rule $hCalendar looks for a matching element with the class vevent. It then extracts the start time, end time and summary from descendant elements with the matching classes. If it finds an element with the class contact inside the event, it creates an array of contact information using the $hCard rules (not shown here).
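To illustrate the nested structure, the values for a single event might be shaped roughly as follows. The markup and all field values are invented for this sketch, the contact entry is left out because the $hCard rules are not shown, and the parser's actual output may differ in detail.

```php
<?php
// Hypothetical input:
//   <div class="vevent">
//     <abbr class="dtstart" title="2005-10-05">October 5</abbr>
//     <abbr class="dtend" title="2005-10-07">October 7</abbr>
//     <span class="summary">Web 2.0 Conference</span>
//   </div>

// Assumed result: "vevent" (written without []) holds a single value,
// which is itself an array filled in by the sub-rules.
$values = array(
    "vevent" => array(
        "dtstart" => "2005-10-05",          // abbr@title applies
        "dtend"   => "2005-10-07",          // abbr@title applies
        "summary" => "Web 2.0 Conference",  // text()
    ),
);
```

Had the action been written as vevent[], the same inner array would be one item in a list of events.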
Custom Rules
You can write your own rules by creating objects that implement two methods:
- match -- Returns true for every element to which the rule applies.
- perform -- Performs an action on the matching element, using the current state object.
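A custom rule might look like the sketch below. The argument lists of match and perform, the shape of the state object, and the class name ExternalLinkRule are all assumptions made for this example; the stand-in node borrows the name and attribute properties of PHP's tidyNode so that the code runs on its own.

```php
<?php
// Hypothetical custom rule: collects the targets of absolute http(s) links.
class ExternalLinkRule
{
    // Returns true for <a> elements whose href is an absolute http(s) URL.
    public function match($node)
    {
        return $node->name === 'a'
            && isset($node->attribute['href'])
            && preg_match('#^https?://#i', $node->attribute['href']) === 1;
    }

    // Appends the link target to a "links" list kept on the state object.
    // The "values" property is an assumed shape for the state.
    public function perform($node, $state)
    {
        $state->values['links'][] = $node->attribute['href'];
    }
}

// Stand-ins for a parsed node and the parser state.
$node  = (object) array('name' => 'a', 'attribute' => array('href' => 'http://example.com/'));
$state = (object) array('values' => array());

$rule = new ExternalLinkRule();
if ($rule->match($node)) {
    $rule->perform($node, $state);
}
```

Keeping match free of side effects and leaving all changes to perform mirrors the split described in the list above.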
