Ruby Microformat Parser

A Ruby library for creating parsers that can be used to extract microcontent from (X)HTML documents in a variety of microformats.

A MicroformatParser is a class with a set of rules for extracting interesting content from (X)HTML documents. You create your own parser by writing a class and specifying the set of rules for selecting elements and extracting their content. The magic happens in the parse method, which takes an (X)HTML document or element, runs all the rules on it, and returns a new object that holds the extracted values.

Here's a simple example to find all links and all tags in a document:

require 'uformatparser'

class MyParser
  include MicroformatParser
  
  rule :links, "a", "a@href"
  rule :tags, "a[rel~=tag]", "text()"
end

content = MyParser.parse(doc)
puts "Found #{content.links.size} links" if content.links
puts "Tagged with " + content.tags.join(', ') if content.tags

Writing Rules

A rule identifies matching elements in the document, extracts a value from these elements, and assigns that value to an instance variable. The value is then available from the instance variable, or from an attribute accessor with the same name.

The most common way to write a rule is:

rule   name, selector?, extractor?, limit?
rule_1 name, selector?, extractor?

The name argument indicates which instance variable/attribute accessor to use. It is often convenient to think of it as the rule name, for example, when referring to the tags rule in the example. It is possible to specify multiple rules with the same name, all assigning values to the same variable. The name may be a string or a symbol.

The selector argument identifies all the matching elements. There are several ways to write selectors; the most common form uses CSS-style selectors. For example, a[rel~=tag] matches all a elements whose rel attribute contains the name tag. If absent or nil, the default behavior is to select all elements that have a class with the same name as the rule.

The extractor argument extracts the rule's value from a matched element. There are several ways to write extractors; the simplest is a short expression that extracts specific element and attribute values, or calls functions. For example, a@href selects the value of the href attribute if the selected element is a, while text() selects the text value of the selected element. If absent or nil, the default extractor is abbr@title|a@href|text().

The limit argument specifies the cardinality of the rule's value. If the limit is 0, no values are set for that rule. If the limit is 1, only the first extracted value is set. If the limit is n, up to n values are extracted and set in an array. If the limit is -1, any number of values are extracted and set in an array. The default limit is -1.
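
The limit cases can be summarized in plain Ruby. This is only an illustrative sketch of the behavior described above, not the library's implementation: apply_limit is a hypothetical helper name, and treating "no values set" (or nothing found) as nil is an assumption.

```ruby
# Hypothetical helper, not part of uformatparser: models how a rule's
# limit shapes the value stored for that rule.
def apply_limit(values, limit)
  return nil if limit == 0 || values.empty?  # nothing is set (assumed to read as nil)
  return values.first if limit == 1          # a single value, not an array
  return values.dup if limit < 0             # -1: keep every extracted value
  values.first(limit)                        # n: at most n values, in an array
end

found = ["ruby", "parser", "microformats"]
p apply_limit(found, 0)   # nil
p apply_limit(found, 1)   # "ruby"
p apply_limit(found, 2)   # ["ruby", "parser"]
p apply_limit(found, -1)  # ["ruby", "parser", "microformats"]
```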

The rule_1 method is identical to rule with a limit of 1.

To use the parser, call the parse method on a parser class. The method takes a document or an element, creates a new object, runs all the rules to set values in that object, and returns it.

The parser operates on REXML documents and elements. For (X)HTML content, use an HTML parser such as HTree to build the REXML tree (see Parsing (X)HTML).

Compound Rules

Some rules deal with structured values that contain multiple fields. CSS selectors can pick specific fields, for example, '.vevent .dtstart' will pick the start date/time of each event. But if the page contains multiple events, there's no way to tell which start time belongs to which event.

Instead, compound rules are parsed by creating parsers for each of the structures and using these classes as extractors. For example:

class MyParser
  include MicroformatParser

  class Event
    include MicroformatParser

    rule_1 :dtstart # Events have only one start date/time
    rule_1 :dtend
    rule_1 :summary
  end
  
  rule :events, ".vevent", Event # Parse structured events
end

content = MyParser.parse(doc)
content.events.each do |event|
  puts "#{event.summary} starts at #{event.dtstart}"
end

In this example the Event class is defined within MyParser, but this is not a requirement. However, both classes must separately include the MicroformatParser module.

Another example extracts an hCard for the location and contact fields of an event:

rule_1 :contact, nil, HCard
rule_1 :location, nil, HCard

Reduction and Recursion

Rules are not recursive. Once a rule has matched an element, it is no longer processed for any children of that element. So, while the events rule will extract all events in the document, it will ignore events contained inside events. In most cases, this is exactly the intended behavior.
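
The effect of reduction can be pictured with a small tree walk. The sketch below is a simplified model, not the library's code: TreeNode and collect_reduced are hypothetical names, and a plain Struct stands in for the REXML elements the parser actually walks.

```ruby
# Stand-in for a document tree; the real library walks REXML elements.
TreeNode = Struct.new(:label, :css_class, :children)

# Hypothetical walk: collect nodes with the given class, and stop
# descending into a node as soon as it matches (the rule is "reduced").
def collect_reduced(node, css_class, found = [])
  if node.css_class == css_class
    found << node
  else
    node.children.each { |child| collect_reduced(child, css_class, found) }
  end
  found
end

inner  = TreeNode.new("inner", "vevent", [])
outer  = TreeNode.new("outer", "vevent", [inner])
second = TreeNode.new("second", "vevent", [])
body   = TreeNode.new("body", nil, [outer, second])

matches = collect_reduced(body, "vevent")
p matches.map(&:label)  # ["outer", "second"]
```

The nested event is never visited because the walk stops at its matching parent.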

If you want a rule to be recursive, simply define that rule to include itself. For example:

class Div
  include MicroformatParser

  rule :divs, "div", Div
end

Selectors

Selectors identify matching elements. The most convenient way to write selectors is to use CSS-style selectors. The syntax for CSS style selectors is:

*               Match any element
foo             Match any element foo
#bar            Match the element with ID bar
foo#bar         Match element foo if it has ID bar
.foo            Match any element with class foo
.foo.bar        Match any element with classes foo and bar
foo.bar         Match any element foo with class bar
[bar=baz]       Match any element with attribute bar equal to baz
foo[bar=baz]    Match any element foo with attribute bar equal to baz
[bar~=baz]      Match any element with attribute bar containing baz
[bar|=baz]      Match any element with attribute bar starting with baz
foo, bar        Match any element foo or bar
foo>bar         Match any element bar that is a child of an element foo
foo bar         Match any element bar that is a descendant of an element foo
foo+bar         Match any element bar that follows an element foo

You can also pre-define selectors using the selector method. This method can be used in two ways:

selector name, expression
selector(name) { |element| block }

The first form takes a CSS-style selector; the second form uses a block that receives the element and returns true if the element matches. After calling selector, a new method is created with the specified name that acts as the selector. You can reference the selector from a rule using a symbol. For example:

selector(:alt_no_src_selector) { |element| element.attributes['alt'] && !element.attributes['src'] }
rule :alt_no_src, :alt_no_src_selector, :xml

In addition, you can always write your own selector as a method or a proc. The method/proc accepts a single argument with the element and returns true if the element matches.
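
As a rough illustration, a selector written as a lambda might look like the sketch below. Elem is a hypothetical stand-in so the example runs on its own; with the library the argument would be a REXML element, and how the lambda is handed to rule is not shown because the exact call is not spelled out here.

```ruby
# Stand-in element; with the library the argument would be a REXML element,
# whose attributes can likewise be looked up by name with [].
Elem = Struct.new(:name, :attributes)

# A selector is any callable that takes an element and returns true on a match.
# Here: img elements that carry alt text but no src.
alt_no_src = lambda do |element|
  element.name == "img" &&
    !element.attributes["alt"].nil? &&
    element.attributes["src"].nil?
end

elements = [
  Elem.new("img", { "alt" => "logo" }),
  Elem.new("img", { "alt" => "photo", "src" => "photo.png" }),
  Elem.new("p", {})
]

selected = elements.select(&alt_no_src)
puts selected.size  # 1
```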

Extractors

Extractors extract the rule's value from an element matched by the selector. As a convenience, there's a simple expression language for writing extractors:

foo        Extracts the text value of an element, if the element is foo
@bar       Extracts the value of the attribute bar
foo@bar    Extracts the value of the attribute bar, if the element is foo
text()     Extracts the text value of the element
xml()      Extracts the XML value of the element (returns the element itself)
func()     Calls the function func on the element
foo|bar    Extracts the value foo, if none found, extracts the value bar

The default extractor is abbr@title|a@href|text(). If the selected element is abbr with an attribute title, it extracts the title. Otherwise, if the selected element is a with an attribute href, it extracts the URL. Otherwise, it extracts the element's text value.
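
The fallback chain can be modeled in a few lines of Ruby. This is a sketch of the behavior just described, not the library's expression evaluator: Node and default_extract are hypothetical names, and a Struct stands in for a REXML element.

```ruby
# Stand-in for a parsed element; the real library works on REXML elements.
Node = Struct.new(:name, :attributes, :text)

# Hypothetical model of the chain abbr@title|a@href|text():
# try each alternative in order and return the first non-nil value.
def default_extract(node)
  alternatives = [
    ->(n) { n.attributes["title"] if n.name == "abbr" },  # abbr@title
    ->(n) { n.attributes["href"]  if n.name == "a" },     # a@href
    ->(n) { n.text }                                      # text()
  ]
  alternatives.each do |alt|
    value = alt.call(node)
    return value unless value.nil?
  end
  nil
end

abbr = Node.new("abbr", { "title" => "2006-03-01" }, "March 1st")
link = Node.new("a", { "href" => "http://example.com/" }, "Example")
span = Node.new("span", {}, "plain text")

puts default_extract(abbr)  # 2006-03-01
puts default_extract(link)  # http://example.com/
puts default_extract(span)  # plain text
```

An abbr element without a title attribute falls through to its text value, which is the point of chaining alternatives with |.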

You can reference class methods and module functions in the extractor expression. The methods text and xml are defined in all parser classes.

You can also pre-define extractors using the extractor method. This method can be used in two ways:

extractor name, expression
extractor(name) { |element| block }

The first form takes an extractor expression; the second form uses a block that receives the element and returns the extracted value, or nil. After calling extractor, a new method is created with the specified name that acts as the extractor. You can reference the extractor from a rule using a symbol. For example:

# Time.parse comes from Ruby's time library, so the file needs require 'time'
extractor :dt_extractor do |element|
  value = element.attributes['title'] if element.name == 'abbr'
  value = text(element) unless value
  value ? Time.parse(value) : nil
end
rule :dtstart, nil, :dt_extractor
rule :dtend, nil, :dt_extractor

In addition, you can always write your own extractor as a method or a proc. The method/proc accepts a single argument with the element and returns the extracted value or nil.

Manipulating Rules

In addition to using selectors and extractors, you can always write your own rules directly. The simplest way is to call rule with a block, for example:

rule(:some_text) { |element| text(element)[0..50] if element.name == 'p' }

As you can see, the block acts as both selector and extractor: it decides which elements to operate on and returns the extracted value.

You can write more complex rules by creating objects that implement the process method. The process method takes two arguments: the element being processed and the context object (the object holding the parsed values). The rule is responsible for extraction, and also for setting values in the context object. In that way it can extract multiple values at the same time, perform validation, override existing values, etc.

The process method returns true if the rule needs to be reduced. If the rule is reduced, it will not be applied to any children of the current element.
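
A rule object along these lines might look like the following sketch. It assumes only what is stated above (a process method taking the element and the context, returning true when reduced). The names FirstParagraphRule, Para, IntroContext, and intro are all hypothetical, and Structs stand in for the REXML element and the parser's result object.

```ruby
# Stand-ins so the sketch runs on its own; the library would pass a REXML
# element and the object that holds the parsed values.
Para = Struct.new(:name, :text)
IntroContext = Struct.new(:intro)

# Hypothetical rule object: anything that responds to process(element, context).
class FirstParagraphRule
  # Sets context.intro from the first non-empty p element and returns true
  # (reduced) for matching elements so their children are not processed.
  def process(element, context)
    return false unless element.name == "p"
    return false if element.text.strip.empty?
    context.intro ||= element.text.strip[0, 50]  # do not override an existing value
    true
  end
end

context = IntroContext.new
rule = FirstParagraphRule.new
elements = [Para.new("h1", "Title"), Para.new("p", "  Hello world  "), Para.new("p", "Later")]
reduced = elements.map { |element| rule.process(element, context) }

puts context.intro  # Hello world
p reduced           # [false, true, true]
```

Because the rule writes to the context itself, it can decide to keep an earlier value, as it does for the second paragraph here.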

To add new rules, use the rules method, which returns an array of all rules defined in the class. You can also use this method to list, remove, and even change existing rules. Changes made through the class affect every use of that parser. If you want to change rules for a specific object, create an array of rules and pass it to the parse method directly.

Rules are processed in the order in which they are added. For example, you can add validation as the last rule in your parser:

MyParser.rules << validation_rule

Example

require 'time' # provides Time.parse
require 'uformatparser'

class Microformats
  include MicroformatParser

  class HCalendar
    include MicroformatParser

    # Extract ISO date/time
    extractor :dt_extractor do |node|
      value = node.attributes['title'] if node.name == 'abbr'
      value = text(node) unless value
      value ? Time.parse(value) : nil
    end

    rule_1 :dtstart, nil, :dt_extractor
    rule_1 :dtend, nil, :dt_extractor
    rule_1 :summary, nil, :text
    rule_1 :description, nil, :xml
    rule_1 :url, nil, "a@href"
  end

  rule :tags, "a[rel~=tag]", "text()"
  rule :events, ".vevent", HCalendar
end

content = Microformats.parse(doc)
puts content.tags
puts content.events

Parsing (X)HTML

We recommend using HTree to parse the (X)HTML document and create an REXML tree.

For example:

require 'htree'

html = HTree("<p>paragraph</p>").to_rexml
content = MyParser.parse(html.document)

License

This package is licensed under the MIT license and/or the Creative Commons Attribution-ShareAlike license.