Ruby Microformat Parser
A Ruby library for creating parsers that can be used to extract microcontent from (X)HTML documents in a variety of microformats.
A MicroformatParser is a class with a set of rules for extracting interesting content from (X)HTML documents. You create your own parser by writing a class and specifying the set of rules for selecting elements and extracting their content. The magic happens in the parse method, which takes an (X)HTML document or element, runs all the rules on it, and returns a new object that holds the extracted values.
Here's a simple example to find all links and all tags in a document:
    require 'uformatparser'

    class MyParser
      include MicroformatParser

      rule :links, "a", "a@href"
      rule :tags, "a[rel~=tag]", "text()"
    end

    content = MyParser.parse(doc)
    # Interpolation avoids the TypeError raised by String + Integer
    puts "Found #{content.links.size} links" if content.links
    puts "Tagged with " + content.tags.join(', ') if content.tags
Get the code
- gem install uformatparser
- RubyGem/source releases
- Directly from SVN
Writing Rules
A rule identifies matching elements in the document, extracts a value from these elements, and assigns that value to an instance variable. The value is then available from the instance variable, or from an attribute accessor with the same name.
The most common way to write a rule is:
    rule name, selector?, extractor?, limit?
    rule_1 name, selector?, extractor?
The name argument indicates which instance variable/attribute accessor to use. Sometimes it's easy to think of it as the rule name, for example, when referring to the tags rule in the example. It is possible to specify multiple rules with the same name, all assigning values to the same variable. The name may be a string or a symbol.
The selector argument identifies all the matching elements. There are several ways to write selectors; the most common form uses CSS-style selectors. For example, a[rel~=tag] matches all a elements whose rel attribute contains the name tag. If absent or nil, the default behavior is to select all elements that have a class with the same name as the rule.
The extractor argument extracts the rule's value from a matched element. There are several ways to write extractors; the simplest form is an expression that extracts specific element and attribute values, or calls functions. For example, a@href selects the value of the href attribute if the selected element is a, while text() selects the text value of the selected element. If absent or nil, the default extractor is abbr@title|a@href|text().
The limit argument specifies the cardinality of the rule's value. If the limit is 0, no values are set for that rule. If the limit is 1, only the first extracted value is set. If the limit is n, up to n values are extracted and set in an array. If the limit is -1, any number of values are extracted and set in an array. The default limit is -1.
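The limit semantics above can be summarized in a few lines of plain Ruby. This is only a sketch of the described behavior, not the library's implementation; apply_limit is a hypothetical helper name that does not exist in uformatparser.

```ruby
# Hypothetical helper (not part of uformatparser): models how a rule's
# limit shapes the value that ends up stored for the rule.
def apply_limit(values, limit)
  case limit
  when 0  then nil                  # nothing is set for the rule
  when 1  then values.first         # a single value, not an array
  when -1 then values.dup           # unbounded: keep every value
  else         values.first(limit)  # at most `limit` values, in an array
  end
end

found = ["ruby", "parser", "microformats"]
apply_limit(found, 1)   # "ruby"
apply_limit(found, 2)   # ["ruby", "parser"]
apply_limit(found, -1)  # ["ruby", "parser", "microformats"]
```

This is why rule_1 values can be used directly, while values from a plain rule are iterated or joined.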
The rule_1 method is identical to rule with a limit of 1.
To use the parser, call the parse method on a parser class. The method takes a document or an element, creates a new object instance and runs all the rules, setting values in that object, finally returning that object.
The parser uses REXML for documents and elements, so (X)HTML content must first be converted to an REXML tree with an HTML parser (see Parsing (X)HTML).
Compound Rules
Some rules deal with structured values that contain multiple fields. CSS selectors can pick specific fields, for example, '.vevent .dtstart' will pick the start date/time of each event. But if the page contains multiple events, there's no way to tell which start time belongs to which event.
Instead, compound rules are parsed by creating parsers for each of the structures and using these classes as extractors. For example:
    class MyParser
      include MicroformatParser

      class Event
        include MicroformatParser

        rule_1 :dtstart   # Events have only one start date/time
        rule_1 :dtend
        rule_1 :summary
      end

      rule :events, ".vevent", Event   # Parse structured events
    end

    content = MyParser.parse(doc)
    content.events.each do |event|
      puts event.summary + " starts at " + event.dtstart
    end
In this example the Event class is defined within MyParser, but this is not a requirement. However, both classes must separately include the MicroformatParser module.
Another example extracts an hCard for the location and contact fields of an event:
    rule_1 :contact, nil, HCard
    rule_1 :location, nil, HCard
Reduction and Recursion
Rules are not recursive. Once a rule has matched an element, it is no longer processed for any children of that element. So, while the events rule will extract all events in the document, it will ignore events contained inside events. In most cases, this is exactly the intended behavior.
If you want a rule to be recursive, simply define that rule to include itself. For example:
    class Div
      include MicroformatParser

      rule :divs, "div", Div
    end
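The reduction behavior described above can be pictured as a tree walk that stops descending once an element matches. The sketch below uses a hypothetical Node struct and collect method as stand-ins; it is not how uformatparser is implemented, only an illustration of why nested matches are skipped.

```ruby
# Stand-in for a document element; the real parser works on REXML elements.
Node = Struct.new(:name, :children)

# Collects every node accepted by `selector`. Once a node matches, its
# subtree is not visited, which mirrors a rule being "reduced".
def collect(node, selector, found = [])
  if selector.call(node)
    found << node
  else
    node.children.each { |child| collect(child, selector, found) }
  end
  found
end

inner  = Node.new("div", [])
outer  = Node.new("div", [inner])
root   = Node.new("body", [outer, Node.new("p", [])])
is_div = ->(node) { node.name == "div" }

collect(root, is_div).size  # 1: the nested div is never visited
```

With the recursive Div parser, the inner div would instead be found when the Div rules run on the contents of the outer div.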
Selectors
Selectors identify matching elements. The most convenient way to write selectors is to use CSS-style selectors. The syntax for CSS-style selectors is:
| * | Match any element |
| foo | Match any element foo |
| #bar | Match the element with ID bar |
| foo#bar | Match element foo if it has ID bar |
| .foo | Match any element with class foo |
| .foo.bar | Match any element with classes foo and bar |
| foo.bar | Match any element foo with class bar |
| [bar=baz] | Match any element with attribute bar equal to baz |
| foo[bar=baz] | Match any element foo with attribute bar equal to baz |
| [bar~=baz] | Match any element with attribute bar containing baz |
| [bar|=baz] | Match any element with attribute bar starting with baz |
| foo, bar | Match any element foo or bar |
| foo>bar | Match any element bar that is a child of an element foo |
| foo bar | Match any element bar that is a descendant of an element foo |
| foo+bar | Match any element bar that follows an element foo |
You can also pre-define selectors using the selector method. This method can be used in two ways:
    selector name, expression
    selector name { |element| block }
The first form takes a CSS-style selector, the second form uses a block that receives the element and returns true if the element matches. After calling selector, a new method is created with the specified name that acts as the selector. You can reference the selector from a rule using a symbol. For example:
    # Parentheses are required: a brace block cannot follow a bare symbol argument
    selector(:alt_no_src_selector) { |element|
      element.attributes['alt'] && !element.attributes['src']
    }
    rule :alt_no_src, :alt_no_src_selector, :xml
In addition, you can always write your own selector as a method or a proc. The method/proc accepts a single argument with the element and returns true if the element matches.
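As an illustration of a hand-written selector, here is a minimal sketch of a lambda that accepts only absolute links. It relies solely on name and attributes, which REXML elements provide; StubElement and external_link are hypothetical names used so the snippet runs on its own.

```ruby
# Stand-in for a REXML element; only `name` and `attributes` are used.
StubElement = Struct.new(:name, :attributes)

# A selector is any callable that takes an element and returns true or false.
external_link = lambda do |element|
  href = element.attributes['href']
  element.name == 'a' && !href.nil? && href.start_with?('http://', 'https://')
end

external_link.call(StubElement.new('a', { 'href' => 'https://example.com/' }))  # true
external_link.call(StubElement.new('a', { 'href' => '/local/page' }))           # false

# Inside a parser class it would presumably go in the selector position:
#   rule :external_links, external_link, "a@href"
```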
Extractors
Extractors extract the rule's value from an element matched by the selector. As a convenience, there's a simple expression language for writing extractors:
| foo | Extracts the text value of an element, if the element is foo |
| @bar | Extracts the value of the attribute bar |
| foo@bar | Extracts the value of the attribute bar, if the element is foo |
| text() | Extracts the text value of the element |
| xml() | Extracts the XML value of the element (returns the element itself) |
| func() | Calls the function func on the element |
| foo|bar | Extracts the value foo, if none found, extracts the value bar |
The default extractor is abbr@title|a@href|text(). If the selected element is abbr with an attribute title, it extracts the title. Otherwise, if the selected element is a with an attribute href, it extracts the URL. Otherwise, it extracts the element's text value.
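The fallback chain of the default extractor can be sketched in plain Ruby. This is an approximation of the described behavior, not the library's code; FakeElement and default_extract are hypothetical names, and the real text() function may treat nested markup differently.

```ruby
# Stand-in for a REXML element, which exposes the same three methods.
FakeElement = Struct.new(:name, :attributes, :text)

# Mirrors abbr@title|a@href|text(): the first alternative that yields
# a value wins, and later alternatives are not consulted.
def default_extract(element)
  alternatives = [
    ->(e) { e.attributes['title'] if e.name == 'abbr' },
    ->(e) { e.attributes['href']  if e.name == 'a' },
    ->(e) { e.text }
  ]
  alternatives.each do |alternative|
    value = alternative.call(element)
    return value unless value.nil?
  end
  nil
end

default_extract(FakeElement.new('abbr', { 'title' => '2006-03-01' }, 'March 1'))  # "2006-03-01"
default_extract(FakeElement.new('a', { 'href' => '/tags/ruby' }, 'ruby'))         # "/tags/ruby"
default_extract(FakeElement.new('span', {}, 'plain text'))                        # "plain text"
```

Note that an abbr without a title attribute falls through to its text, as the description implies.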
You can reference class methods and module functions in the extractor expression. The methods text and xml are defined in all parser classes.
You can also pre-define extractors using the extractor method. This method can be used in two ways:
    extractor name, expression
    extractor name { |element| block }
The first form takes an extractor expression, the second form uses a block that receives the element and returns the extracted value, or nil. After calling extractor, a new method is created with the specified name that acts as the extractor. You can reference the extractor from a rule using a symbol. For example:
    # Time.parse comes from the standard library: require 'time'
    extractor :dt_extractor do |element|
      value = element.attributes['title'] if element.name == 'abbr'
      value = text(element) unless value
      value ? Time.parse(value) : nil
    end

    rule :dtstart, nil, :dt_extractor
    rule :dtend, nil, :dt_extractor
In addition, you can always write your own extractor as a method or a proc. The method/proc accepts a single argument with the element and returns the extracted value or nil.
Manipulating Rules
In addition to using selectors and extractors, you can always write your own rules directly. The simplest way is to call rule with a block, for example:
    # Parentheses are required: a brace block cannot follow a bare symbol argument
    rule(:some_text) { |element| text(element)[0..50] if element.name == 'p' }
As you can see, the block acts as both selector and extractor by deciding which elements to operate on and returning the extracted value.
You can write more complex rules by creating objects that implement the process method. The process method takes two arguments: the element being processed and the context object (the object holding the parsed values). The rule is responsible for extraction, and also for setting values in the context object. In that way it can extract multiple values at the same time, perform validation, override existing values, etc.
The process method returns true if the rule needs to be reduced. If the rule is reduced, it will not be applied to any children of the current element.
To add new rules, use the rules method, which returns an array of all rules defined in the class. You can also use this method to list existing rules, remove rules, and even change rules. Calling this method on the class changes the rules for every parse done with that class. If you want to change rules for a specific object, create an array of rules and pass it to the parse method directly.
Rules are processed in the order in which they are added. For example, you can add validation as the last rule in your parser:
MyParser.rules << validation_rule
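A rule object of the kind described above might look like the following sketch. The only contract taken from the text is process(element, context) returning true when the rule should be reduced; FirstLinkRule, LinkElement, and the instance variable names are hypothetical, and the stand-in element replaces a real REXML element so the snippet runs on its own.

```ruby
# Stand-in for a REXML element; only `name`, `attributes` and `text` are used.
LinkElement = Struct.new(:name, :attributes, :text)

# Hypothetical rule object: it sets two values at once on the context,
# the object that holds the parsed values.
class FirstLinkRule
  # Returns true to request reduction, false otherwise.
  def process(element, context)
    return false unless element.name == 'a' && element.attributes['href']
    return false if context.instance_variable_defined?(:@link_url)

    context.instance_variable_set(:@link_url, element.attributes['href'])
    context.instance_variable_set(:@link_text, element.text.to_s.strip)
    true
  end
end

context = Object.new
rule = FirstLinkRule.new
rule.process(LinkElement.new('p', {}, 'intro'), context)                        # false
rule.process(LinkElement.new('a', { 'href' => '/about' }, ' About '), context)  # true
context.instance_variable_get(:@link_url)                                       # "/about"
```

Such an object would presumably be registered the same way as the validation rule, with MyParser.rules << FirstLinkRule.new.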
Example
    require 'time'            # for Time.parse
    require 'uformatparser'

    class Microformats
      include MicroformatParser

      class HCalendar
        include MicroformatParser

        # Extract ISO date/time
        extractor :dt_extractor do |node|
          value = node.attributes['title'] if node.name == 'abbr'
          value = text(node) unless value
          value ? Time.parse(value) : nil
        end

        rule_1 :dtstart, nil, :dt_extractor
        rule_1 :dtend, nil, :dt_extractor
        rule_1 :summary, nil, :text
        rule_1 :description, nil, :xml
        rule_1 :url, nil, "a@href"
      end

      rule :tags, "a[rel~=tag]", "text()"
      rule :events, ".vevent", HCalendar
    end

    content = Microformats.parse(doc)
    puts content.tags
    puts content.events
Parsing (X)HTML
We recommend using HTree to parse the (X)HTML document and create an REXML tree.
For example:
    require 'htree'

    html = HTree("<p>paragraph</p>").to_rexml
    content = MyParser.parse(html.document)
License
This package is licensed under the MIT license and/or the Creative Commons Attribution-ShareAlike license.
