Nabu: Writing an Extractor

Abstract

This document shows an example of writing an extractor for identifying and saving parts of documents from within a Nabu server.

Introduction

Nabu offers a framework for the extraction of meaningful portions of documents into structured storage (e.g. database tables). This extraction will generally be customized for each end application and therefore we anticipate that people will write their own extractors.

This document provides an example of that. It aims at demonstrating the simplicity of the task.

Note that we will be storing data with an SQL DBAPI-2.0 interface. You may want to read PEP 249 if you're not familiar with Python DBAPI-2.0.

Example Application: Extracting Books from Field Lists

For our example, we will extract information about book references. Imagine that we want to scatter field lists describing books throughout our documents, in a form that looks more or less like BibTeX entries, for example:

:book:
:title: National Geographic Photography Field Guide 2nd Edition:
        Secrets to Making Great Pictures
:author: Peter Burian, Bob Caputo
:isbn: 079225676X
:review:

  Excellent diversified advice from top photographers.  The book
  manages to pack lots of relevant content in a small format.  It
  contains a nice section on composition, which is what originally
  attracted me to it.  As per usual, I found the digital
  photography section useless, but for the most part, the
  information available in this book is of great value.  This is
  the best general book about photography that I've read.

Since the docutils parser works recursively, those field lists can be located anywhere within the document [1], for example inside the items of a bulleted list:

Recent books read:

- :title: Animal Farm
  :author: George Orwell
  :url: http://www.online-literature.com/orwell/animalfarm/
  :review:

     A classic work.

- :title: Free Culture
  :author: Lawrence Lessig
  :url: http://www.amazon.com/o/ASIN/0143034650/

- :title: A Scanner Darkly
  :author: Philip K. Dick
  :isbn: 0679736654

...

We would like to avoid having to say that the field list is for a book; it would be nice if the parser could just figure that out by itself. For example, the empty :book: field in the first example above should not be mandatory for the book reference to be detected.

Note

One additional detail is that the field lists must be kept separate. In reStructuredText, the fields of a field list may have blank lines between them, so that:

:title: A Scanner Darkly

:author: Philip K. Dick

:isbn: 0679736654
:url: http://www.amazon.com/o/ASIN/0679736654/

is a single field list. This means that you cannot separate field lists with blank lines alone. I'm using a bulleted list in the previous example, but other document structures could be used as well (see the docutils document tree for a list of constructs that get recognized in reStructuredText). This example exhibits a subtlety that may arise when you choose which structures will be parsed: you have to think about how the docutils parser will represent the data in the tree of nodes. For example, the following would not work because it gets parsed as a single field list:

:title: Free Culture
:author: Lawrence Lessig
:url: http://www.amazon.com/o/ASIN/0143034650/

:title: A Scanner Darkly
:author: Philip K. Dick
:isbn: 0679736654

Looking at the Document Structure

Here is a pseudo-XML rendering of how the bulleted-list example above gets parsed by docutils into a tree of nodes. This is the tree that we will have to visit in order to extract the references. You can use docutils' rst2pseudoxml.py tool to generate such a tree, which helps to visualize the document structure and to design and write your extractor/visitor:

<document source="/tmp/example.txt">
    <paragraph>
        Recent books read:
    <bullet_list bullet="-">
        <list_item>
            <field_list>
                <field>
                    <field_name>
                        title
                    <field_body>
                        <paragraph>
                            Animal Farm
                <field>
                    <field_name>
                        author
                    <field_body>
                        <paragraph>
                            George Orwell
                <field>
                    <field_name>
                        url
                    <field_body>
                        <paragraph>
                            <reference refuri="http://www.online-literature.com/orwell/animalfarm/">
                                http://www.online-literature.com/orwell/animalfarm/
                <field>
                    <field_name>
                        review
                    <field_body>
                        <paragraph>
                            A classic work.
        <list_item>
            <field_list>
                <field>
                    <field_name>
                        title
                    <field_body>
                        <paragraph>
                            Free Culture
                <field>
                    <field_name>
                        author
                    <field_body>
                        <paragraph>
                            Lawrence Lessig
                <field>
                    <field_name>
                        url
                    <field_body>
                        <paragraph>
                            <reference refuri="http://www.amazon.com/o/ASIN/0143034650/">
                                http://www.amazon.com/o/ASIN/0143034650/
        <list_item>
            <field_list>
                <field>
                    <field_name>
                        title
                    <field_body>
                        <paragraph>
                            A Scanner Darkly
                <field>
                    <field_name>
                        author
                    <field_body>
                        <paragraph>
                            Philip K. Dick
                <field>
                    <field_name>
                        isbn
                    <field_body>
                        <paragraph>
                            0679736654

The Extractor and the ExtractorStorage

In order to extract information from our documents, we need to provide at least two classes:

  1. an extractor class, which is essentially a class derived from the docutils Transform class, whose role is to visit the parsed document tree and to find the structures it is interested in. In our example, this class will run a visitor that looks for docutils field list nodes and checks the field names to find out whether they describe books;

  2. an extractor storage class, whose role is to put detected book references into whatever storage is desired. It is decoupled from the extractor so that the same extractor algorithm can support different storage mechanisms. This storage object is normally provided by the publisher handler script in its configuration.

    Typically, we would store the references in a database, or convert them into BibTeX format and store them sequentially in a file for later use.
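
The decoupling described above can be illustrated with a small, self-contained sketch. It does not use Nabu or docutils: ListStorage, BibTeXStorage and extract_books are hypothetical stand-ins, and the only thing assumed about the contract is that a storage object exposes a store(unid, data) method, as in the example developed later in this document.

```python
class ListStorage:
    """Hypothetical storage that keeps (unid, fields) pairs in memory."""

    def __init__(self):
        self.rows = []

    def store(self, unid, data):
        self.rows.append((unid, dict(data)))


class BibTeXStorage:
    """Hypothetical storage that renders each entry as a BibTeX-like record."""

    def __init__(self):
        self.entries = []

    def store(self, unid, data):
        body = ',\n'.join('  %s = {%s}' % (k, v) for k, v in sorted(data.items()))
        key = '%s-%d' % (unid, len(self.entries) + 1)
        self.entries.append('@book{%s,\n%s\n}' % (key, body))


def extract_books(unid, field_lists, storage):
    """Stand-in for the extractor: hand every field list with a title to storage."""
    for fields in field_lists:
        if 'title' in fields:
            storage.store(unid, fields)


field_lists = [
    {'title': 'Animal Farm', 'author': 'George Orwell'},
    {'note': 'not a book'},
]

# The same extraction logic runs unchanged against both storage backends.
memory, bibtex = ListStorage(), BibTeXStorage()
for storage in (memory, bibtex):
    extract_books('doc-1', field_lists, storage)

print(len(memory.rows))
print(bibtex.entries[0])
```

The extraction function never needs to know where the data ends up, which is the property the extractor/storage split is meant to provide.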

Example code

All the code for implementing the book algorithm above fits neatly in a short file. Let's name it book.py.

Imports

We start our Python script with a brief description and some imports:

#!/usr/bin/env python
"""
Extract book entries.
"""

# nabu imports
from nabu import extract
from nabu.extractors.flvis import FieldListVisitor

We import the nabu.extract module which contains the base class that we need. We also import a utility class that knows how to visit docutils field lists [2].
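
To give an idea of what such a utility does, here is a minimal sketch of a field list collector. It is not Nabu's FieldListVisitor and it does not depend on docutils: the Node class, the field() helper and collect_field_lists() are hypothetical stand-ins that only mimic the shape of the parsed tree shown earlier.

```python
class Node:
    """Hypothetical stand-in for a docutils node: a tag, children, optional text."""

    def __init__(self, tag, children=(), text=''):
        self.tag = tag
        self.children = list(children)
        self.text = text

    def astext(self):
        # Concatenate this node's text with the text of all its descendants.
        return self.text + ''.join(child.astext() for child in self.children)


def collect_field_lists(node, found=None):
    """Walk the tree depth-first and return one dict per field_list node."""
    if found is None:
        found = []
    if node.tag == 'field_list':
        fields = {}
        for field_node in node.children:
            name, body = field_node.children  # <field_name>, <field_body>
            fields[name.astext().lower()] = body
        found.append(fields)
    for child in node.children:
        collect_field_lists(child, found)
    return found


def field(name, value):
    """Build a <field> node holding a name and a one-paragraph body."""
    return Node('field', [
        Node('field_name', text=name),
        Node('field_body', [Node('paragraph', text=value)]),
    ])


document = Node('document', [
    Node('bullet_list', [
        Node('list_item', [Node('field_list', [
            field('title', 'Animal Farm'),
            field('author', 'George Orwell'),
        ])]),
        Node('list_item', [Node('field_list', [
            field('title', 'A Scanner Darkly'),
            field('isbn', '0679736654'),
        ])]),
    ]),
])

field_lists = collect_field_lists(document)
print([sorted(fields) for fields in field_lists])
```

Each field list becomes a dictionary keyed by the lowercased field name, with the body node as the value, so that the caller can decide later how to convert it to text.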

The Extractor Class

We then define the extractor class, which is a docutils Transform:

class BookExtractor(extract.Extractor):
    """
    Transform that looks at field lists and that heuristically attempts
    to find references to books.  For example,

      :title: National Geographic Photography Field Guide 2nd Edition:
              Secrets to Making Great Pictures
      :author: Peter Burian, Bob Caputo
      :url: http://www.amazon.com/o/ASIN/079225676X/
      :review:

        Excellent diversified advice from top photographers.  The book
        manages to pack lots of relevant content in a small format.  It
        contains a nice section on composition, which is what originally
        attracted me to it.  As per usual, I found the digital
        photography section useless, but for the most part, the
        information available in this book is of great value.  This is
        the best general book about photography that I've read.

    The heuristic looks at an empty :book: field, an :ISBN: field, or
    some list that has an author and a title. See the BookStorage class
    below to find out which fields are stored.
    """

    default_priority = 900

Note that we put the full description in the docstring of the extractor class rather than in the module docstring. The Nabu publisher program can concatenate the docstrings of all the extractors configured in the publish handler CGI script and return them to the client, so that people who publish content to a specific Nabu store can get an idea of what structures are configured on the server (see the --help-transforms option to the Nabu publisher program).

We also set the default priority, which specifies the order in which the extractors get to run. This is important for extractors that modify the document tree. The document tree can be stored and then served on a web server just like extracted content; in fact, a trivial extractor is provided that does just that, and it can easily be leveraged to implement a Wiki or a Blog.

We then implement the apply() method which is called by docutils, after it sets the .document attribute on the extractor (that is part of how transforms are run by docutils):

...
    def apply( self, **kwargs ):
        self.unid, self.storage = kwargs['unid'], kwargs['storage']

        v = FieldListVisitor(self.document)
        v.initialize()
        self.document.walk(v)
        v.finalize()

The keyword arguments are always the unique id for the source document and the extractor storage object that is configured on the publisher handler.

Note that we use a special FieldListVisitor class that we've built to simplify visiting generic field lists. You could use any of the docutils visitors here instead. We run the visitor on the document by calling the walk() method. This special visitor accumulates the field lists found in the document, each as a dictionary mapping field names to field value nodes. We then loop over them and apply our heuristic to find the field lists that match our criteria:

...
        for flist in v.getfieldlists():
            book = 0
            # if there is an empty book field, this is explicitly a book.
            # (field values are nodes, so convert to text before testing)
            if 'book' in flist and not flist['book'].astext().strip():
                book = 1
            # if there is an ISBN number, then it is *definitely* a book.
            elif 'isbn' in flist:
                book = 1
            # if there is an author and a title
            elif 'author' in flist and 'title' in flist:
                book = 1

            if book:
                self.store(flist) # store the book
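
The heuristic can also be tried in isolation, without a Nabu server or docutils. The sketch below restates it as a plain function over dictionaries of strings; looks_like_book() is a hypothetical helper written for this illustration and is not part of Nabu.

```python
def looks_like_book(fields):
    """Return True if a mapping of field names to text resembles a book reference."""
    # An explicit, empty :book: marker field.
    if 'book' in fields and not fields['book'].strip():
        return True
    # An ISBN only makes sense for a book.
    if 'isbn' in fields:
        return True
    # Otherwise fall back on the author + title combination.
    return 'author' in fields and 'title' in fields


samples = {
    'explicit marker': {'book': '', 'title': 'Free Culture'},
    'isbn only': {'isbn': '0679736654'},
    'author and title': {'author': 'George Orwell', 'title': 'Animal Farm'},
    'title only': {'title': 'Untitled notes'},
    'unrelated': {'date': '2005-01-01', 'location': 'Montreal'},
}

for label, fields in samples.items():
    print(label, looks_like_book(fields))
```

Keeping the decision in a small function of this kind makes it easy to check the rules against sample field lists before wiring them into the extractor.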

Next, we implement a store method that converts the field value nodes returned by the FieldListVisitor into Unicode text and calls our associated storage object to actually put the data somewhere:

...
    def store( self, flist ):
        emap = {}
        for k, v in flist.items():  # items() works on Python 2 and 3
            emap[k] = v.astext()

        self.storage.store(self.unid, emap)

That's it for the extractor class!

The ExtractorStorage Class

This class is in charge of storing the information identified by the extractor and is called by it when it finds a new instance to store. Different storage classes can be created for different backends.

Note that the book table declared in the storage class below has a unid field that is required not to be null. This makes it possible to remove the objects that came from a source document when that document is reprocessed: all the old objects are removed first (this is done automatically), and the reprocessing then adds the new set back into the database. All information extracted from a document must be associated with its source document through this unique id field, unid.
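
The effect of the unid field on reprocessing can be sketched with the standard sqlite3 module. This is only an illustration of the remove-then-add cycle, not Nabu's implementation: the reprocess() function and the reduced two-column table are assumptions made for the example, and Nabu performs the removal step for you.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE book (unid TEXT NOT NULL, title TEXT)')


def reprocess(connection, unid, titles):
    """Replace every row extracted from document `unid` with a fresh set."""
    with connection:  # commits on success, rolls back on error
        connection.execute('DELETE FROM book WHERE unid = ?', (unid,))
        connection.executemany(
            'INSERT INTO book (unid, title) VALUES (?, ?)',
            [(unid, title) for title in titles])


reprocess(connection, 'doc-1', ['Animal Farm', 'Free Culture'])
reprocess(connection, 'doc-2', ['A Scanner Darkly'])
# Publishing doc-1 again replaces its two old rows and leaves doc-2 alone.
reprocess(connection, 'doc-1', ['Animal Farm'])

rows = connection.execute(
    'SELECT unid, title FROM book ORDER BY unid, title').fetchall()
print(rows)
```

Without the unid column there would be no way to tell which rows belong to the document being republished, and stale entries would accumulate.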

Next, we write the extractor storage class:

class BookStorage(extract.SQLExtractorStorage):
    """
    Book storage.
    """
    sql_tables = { 'book': '''

        CREATE TABLE book
        (
           unid TEXT NOT NULL,
           title TEXT,
           author TEXT,
           year TEXT,
           url TEXT,
           review TEXT
        )

        '''
        }

    def store( self, unid, *args ):
        data, = args

        cols = ('unid', 'title', 'author', 'year', 'url', 'review')
        values = [unid]
        for n in cols[1:]:
            values.append( data.get(n, '') )

        cursor = self.connection.cursor()
        cursor.execute("""
          INSERT INTO book (%s) VALUES (%%s, %%s, %%s, %%s, %%s, %%s)
          """ % ', '.join(cols), values)

        self.connection.commit()

Here we derive it from the SQLExtractorStorage class that Nabu provides for storing data in databases using SQL. Knowing which tables the storage class uses, this base class implements the protocol to initialize the database connection, to clear the objects extracted from a specific document, and to reset the tables (see the source code if desired, it is very simple). The expected tables are specified in the class attribute sql_tables, which maps each table name to its CREATE TABLE statement.

The store() method simply creates a new book reference entry by inserting a row into the book table. It fills in empty default values for the fields that have not been found by the extractor (we could have implemented this in the extractor itself; this is a choice of contract between the extractor and the storage object).
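
The same storing logic can be exercised outside of Nabu with the standard sqlite3 module, as in the sketch below. The store_book() function is a hypothetical stand-in for the store() method. Note one assumption that differs from the code above: sqlite3 uses the qmark parameter style of PEP 249 (? placeholders), whereas the %s placeholders used above correspond to the format style, so the placeholders must match the DBAPI module you configure.

```python
import sqlite3

COLUMNS = ('unid', 'title', 'author', 'year', 'url', 'review')


def store_book(connection, unid, data):
    """Insert one book row, defaulting missing fields to the empty string."""
    values = [unid] + [data.get(name, '') for name in COLUMNS[1:]]
    placeholders = ', '.join('?' for _ in COLUMNS)
    connection.execute(
        'INSERT INTO book (%s) VALUES (%s)' % (', '.join(COLUMNS), placeholders),
        values)
    connection.commit()


connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE book (unid TEXT NOT NULL, title TEXT, author TEXT, '
    'year TEXT, url TEXT, review TEXT)')

# Fields that are not table columns (here: isbn) are silently ignored.
store_book(connection, 'doc-1', {
    'title': 'Free Culture',
    'author': 'Lawrence Lessig',
    'isbn': '0143034650',
})
row = connection.execute('SELECT * FROM book').fetchone()
print(row)
```

Only the column names are interpolated into the SQL text; the values always travel as query parameters, which keeps the statement safe from quoting problems in titles or reviews.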

This completes the source code for our example.

Testing Your Extractor

Before setting up your Nabu publisher handler with the new extractor, you can test it using the nabu-test-extractor code provided with Nabu (installed under nabu/lib/python/nabu/testextr.py).

Run it like this, on a test document in reStructuredText format reading.txt, which presumably contains book references (create your own test document with your favourite books):

$ ./testextr.py book.py ~/reading.txt

Publisher Handler Configuration Code

The only part that remains to be done in order to feed a database with book references is to configure your Nabu publisher handler with the new extractor. The Nabu publisher handler is the script that you should install on your web server to receive the source documents and run the extractors on them. There is an example publisher handler under nabu/cgi-bin/nabu-publish-handler.cgi in the source distribution.

First import your extractor code at the top of the file:

...
import book
...

Then add the new extractor to the list of transforms that the server will be configured with:

transforms = (
    ...
    (book.BookExtractor, book.BookStorage(dbmodule, dbconn)),
    )

Complete Source Code

The complete source code for the book example is reproduced here for convenience:

#!/usr/bin/env python
"""
Extract book entries.
"""

# nabu imports
from nabu import extract
from nabu.extractors.flvis import FieldListVisitor


class BookExtractor(extract.Extractor):
    """
    Transform that looks at field lists and that heuristically attempts
    to find references to books.  For example,

      :title: National Geographic Photography Field Guide 2nd Edition:
              Secrets to Making Great Pictures
      :author: Peter Burian, Bob Caputo
      :url: http://www.amazon.com/o/ASIN/079225676X/
      :review:

        Excellent diversified advice from top photographers.  The book
        manages to pack lots of relevant content in a small format.  It
        contains a nice section on composition, which is what originally
        attracted me to it.  As per usual, I found the digital
        photography section useless, but for the most part, the
        information available in this book is of great value.  This is
        the best general book about photography that I've read.

    The heuristic looks at an empty :book: field, an :ISBN: field, or
    some list that has an author and a title. See the BookStorage class
    below to find out which fields are stored.
    """

    default_priority = 900

    def apply( self, **kwargs ):
        self.unid, self.storage = kwargs['unid'], kwargs['storage']

        v = FieldListVisitor(self.document)
        v.initialize()
        self.document.walk(v)
        v.finalize()

        for flist in v.getfieldlists():
            book = 0
            # if there is an empty book field, this is explicitly a book.
            # (field values are nodes, so convert to text before testing)
            if 'book' in flist and not flist['book'].astext().strip():
                book = 1
            # if there is an ISBN number, then it is *definitely* a book.
            elif 'isbn' in flist:
                book = 1
            # if there is an author and a title
            elif 'author' in flist and 'title' in flist:
                book = 1

            if book:
                self.store(flist) # store the book

    def store( self, flist ):
        emap = {}
        for k, v in flist.items():  # items() works on Python 2 and 3
            emap[k] = v.astext()

        self.storage.store(self.unid, emap)


class BookStorage(extract.SQLExtractorStorage):
    """
    Book storage.
    """
    sql_tables = { 'book': '''

        CREATE TABLE book
        (
           unid TEXT NOT NULL,
           title TEXT,
           author TEXT,
           year TEXT,
           url TEXT,
           review TEXT
        )

        '''
        }

    def store( self, unid, *args ):
        data, = args

        cols = ('unid', 'title', 'author', 'year', 'url', 'review')
        values = [unid]
        for n in cols[1:]:
            values.append( data.get(n, '') )

        cursor = self.connection.cursor()
        cursor.execute("""
          INSERT INTO book (%s) VALUES (%%s, %%s, %%s, %%s, %%s, %%s)
          """ % ', '.join(cols), values)

        self.connection.commit()

Modifying the Document Tree

Your extractors do not have to modify the document tree. Their goal is to find information in the document tree and save it somewhere else (e.g. in the database), leaving the document tree alone (the information in the document tree still gets rendered if you then render the document itself).

However, if you like, you can make modifications to the document tree. For example, if your extractor discovers some pattern of nodes for storage, you may also want to add a custom class to the parent node so that you can later render those parts of the document specially marked with CSS rules. You can also remove nodes from the tree if you like.

When Nabu processes the extractors on the server, the document tree gets stored after your extractors have run. Your modified tree will be stored in the sources table in the database.

Final Notes

Certainly, some improvements could be applied to the example we provide. The extractor could use a more sophisticated heuristic to determine whether a field list represents a book. It could parse the various formats of the ISBN value and normalize it, so that it can be used to automatically look up details from a URL on some external website. We could convert and store the extracted values in BibTeX format for later reuse, add more fields to the schema, or parse not just generic field lists but also the bibliographic fields of documents, so that those can provide a book entry as well. None of this would be difficult.
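
As an illustration of one such improvement, here is a minimal sketch of ISBN normalization. The normalize_isbn10() function is hypothetical and not part of Nabu; it only handles 10-character ISBNs (not ISBN-13), strips hyphens and spaces, and validates the standard ISBN-10 check digit.

```python
DIGITS = '0123456789'


def normalize_isbn10(raw):
    """Return the ISBN-10 as 10 characters without separators, or None if invalid."""
    isbn = raw.replace('-', '').replace(' ', '').upper()
    if len(isbn) != 10:
        return None
    # The first nine characters must be digits; the last may also be 'X' (ten).
    if any(c not in DIGITS for c in isbn[:9]) or isbn[9] not in DIGITS + 'X':
        return None
    values = [10 if c == 'X' else int(c) for c in isbn]
    # Weighted checksum: weights 10 down to 1, total must be divisible by 11.
    total = sum(weight * value for weight, value in zip(range(10, 0, -1), values))
    return isbn if total % 11 == 0 else None


for candidate in ('0-679-73665-4', '079225676x', '0679736655', 'not an isbn'):
    print(candidate, normalize_isbn10(candidate))
```

A normalized value of this kind could be stored in an extra isbn column, provided that the table schema and the list of columns in the storage class are extended accordingly.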

As you can see, writing extractors is very easy. You can create as many extractors as you want. Nabu comes with a small library of extractors which will grow as I leverage it more and more for my own applications, and with contribution from other people if they care to send them to me.

You can find other extractors--including this one--in the Nabu source distribution under nabu/lib/python/nabu/extractors.

If you have found any part of this document unclear and you still do not understand the nature of Nabu after reading this example, please feel free to contact the author with questions.

[1] If the field list appears at the top of the document, it will have been parsed by the bibliographic fields parser, and the :author: field will thus have been converted into an author node. A little more work would be needed to look into those fields if we want a book reference to be specifiable at the top of the document as bibliographic fields.
[2] The code for this class is very simple and you could write similar utilities to enhance code reuse for detecting different generic patterns of document structure. See the docutils document tree document for information about what kinds of nodes the document tree is made of.