Document filter inside out (part 1): the fundamentals

Document filters are a mechanism introduced in the Google connector framework 3.x for manipulating documents during connector traversal. They are mainly supported at the Connector Manager (CM) level. Because the Google connector framework is open source, we can examine closely how document filters are defined and implemented.
com.google.enterprise.connector.spi.Document
Document is an interface defined by Connector Manager. According to its Javadoc, a document is a map of String property names to Property objects. Map-like functionality is provided through the findProperty(String) method. In addition, a method provides the caller with the Set of all property names, which it can use to iterate over all properties.
The interface defines two methods:

    public Property findProperty(String name) throws RepositoryException;
  • Finds a Property by name. If the current document has a property with that name, that property is returned.
    public Set<String> getPropertyNames() throws RepositoryException;
  • Gets the set of names of all Properties in this Document.

When implementing a custom document filter, most of the time we will only deal with these two methods.
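To make the two methods concrete, here is a minimal, self-contained sketch of iterating over a document's properties. The Property and Document types below are simplified stand-ins written for this example, not the real SPI: the real methods throw RepositoryException, and the real Property exposes Value objects rather than a plain String. The class name PropertyDump and the helpers of() and describe() are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class PropertyDump {
  // Simplified stand-ins, NOT the real SPI: the real methods throw
  // RepositoryException, and the real Property exposes Value objects.
  interface Property { String value(); }

  interface Document {
    Property findProperty(String name);
    Set<String> getPropertyNames();
  }

  /** Hypothetical in-memory Document backed by a map of names to values. */
  static Document of(Map<String, String> values) {
    return new Document() {
      @Override public Property findProperty(String name) {
        String value = values.get(name);
        if (value == null) {
          return null;  // assumption: a missing property is reported as null
        }
        return () -> value;
      }
      @Override public Set<String> getPropertyNames() {
        return values.keySet();
      }
    };
  }

  /** Iterates over all property names and looks each one up by name. */
  static String describe(Document doc) {
    StringBuilder out = new StringBuilder();
    for (String name : doc.getPropertyNames()) {
      out.append(name).append('=')
         .append(doc.findProperty(name).value()).append('\n');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Map<String, String> values = new LinkedHashMap<>();
    values.put("title", "Annual report");
    values.put("author", "jdoe");
    System.out.print(describe(of(values)));
  }
}
```

The point of the sketch is only the access pattern: getPropertyNames() supplies the keys and findProperty() does the lookup, which is all a filter needs in order to read its source.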
com.google.enterprise.connector.util.filter.DocumentFilterFactory
It is an interface for factories that create document filters.
The interface defines one method:

    public Document newDocumentFilter(Document source) throws RepositoryException;

It returns a new Document that acts as a filter for the supplied source Document.
It is interesting that a method named for creating a new document filter simply returns an object of type Document. This shows that, after all, a document filter is just another Document object.
Another noteworthy point is that the new document filter (that is, the new Document) is created from a given source document. This makes the filter a kind of wrapper around the original document: it transforms the information retrieved from its source document into something else.
Multiple document filters may be chained together to form a document processing pipeline.
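The wrapper idea can be sketched in a few lines. The example below is an illustration under stated assumptions, not Connector Manager code: Property, Document and DocumentFilterFactory are simplified stand-ins (no RepositoryException, String values instead of Value objects), and UpperCaseTitleFactory and single() are hypothetical names invented for this sketch.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

class WrapperSketch {
  // Simplified stand-ins for the SPI types (assumptions, see the text above).
  interface Property { String value(); }

  interface Document {
    Property findProperty(String name);
    Set<String> getPropertyNames();
  }

  interface DocumentFilterFactory {
    Document newDocumentFilter(Document source);
  }

  /** Hypothetical factory; the "filter" it returns is a Document wrapping source. */
  static class UpperCaseTitleFactory implements DocumentFilterFactory {
    @Override public Document newDocumentFilter(Document source) {
      return new Document() {
        @Override public Property findProperty(String name) {
          Property original = source.findProperty(name);
          if (original == null || !"title".equals(name)) {
            return original;  // everything else passes through untouched
          }
          String upper = original.value().toUpperCase(Locale.ROOT);
          return () -> upper;
        }
        @Override public Set<String> getPropertyNames() {
          return source.getPropertyNames();  // same names as the source
        }
      };
    }
  }

  /** Hypothetical helper: a Document holding exactly one property. */
  static Document single(String name, String value) {
    return new Document() {
      @Override public Property findProperty(String requested) {
        if (!name.equals(requested)) {
          return null;
        }
        return () -> value;
      }
      @Override public Set<String> getPropertyNames() {
        return Collections.singleton(name);
      }
    };
  }

  public static void main(String[] args) {
    Document source = single("title", "annual report");
    Document filtered = new UpperCaseTitleFactory().newDocumentFilter(source);
    System.out.println(filtered.findProperty("title").value());
  }
}
```

Note that the source document is never modified. The transformation happens lazily, each time a caller asks the wrapping Document for a property.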
Document Filter chain
DocumentFilterChain implements the DocumentFilterFactory interface as well. Its method for creating a new document filter (that is, a new Document object, as we now know) simply loops through all the DocumentFilterFactory objects in the chain, each one wrapping the Document produced by the previous one, starting from the given source Document.

  @Override
  public Document newDocumentFilter(Document source)
      throws RepositoryException {
    Preconditions.checkNotNull(source);
    for (DocumentFilterFactory factory : factories) {
      source = factory.newDocumentFilter(source);
    }
    return source;
  }

An explanation in the file documentFilters.xml states: DocumentFilterChain constructs a series of document filters. The filters are constructed from a list of DocumentFilterFactory beans, and linked together like pop-beads, each using the previous as its source document.
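The order in which the pop-beads are linked matters: the first factory in the list ends up innermost, so its transformation is applied first. The following sketch demonstrates this with simplified stand-in types (no RepositoryException, String values). The names ChainSketch, appending(), chain() and single() are hypothetical, and chain() merely repeats the wrapping loop shown above without the null check.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class ChainSketch {
  // Simplified stand-ins for the SPI types (assumptions, see the text above).
  interface Property { String value(); }

  interface Document {
    Property findProperty(String name);
    Set<String> getPropertyNames();
  }

  interface DocumentFilterFactory {
    Document newDocumentFilter(Document source);
  }

  /** Hypothetical factory whose filter appends a suffix to every value. */
  static DocumentFilterFactory appending(String suffix) {
    return source -> new Document() {
      @Override public Property findProperty(String name) {
        Property original = source.findProperty(name);
        if (original == null) {
          return null;
        }
        String changed = original.value() + suffix;
        return () -> changed;
      }
      @Override public Set<String> getPropertyNames() {
        return source.getPropertyNames();
      }
    };
  }

  /** Each factory wraps the Document produced by the previous one. */
  static Document chain(List<DocumentFilterFactory> factories, Document source) {
    for (DocumentFilterFactory factory : factories) {
      source = factory.newDocumentFilter(source);
    }
    return source;
  }

  /** Hypothetical helper: a Document holding exactly one property. */
  static Document single(String name, String value) {
    return new Document() {
      @Override public Property findProperty(String requested) {
        if (!name.equals(requested)) {
          return null;
        }
        return () -> value;
      }
      @Override public Set<String> getPropertyNames() {
        return Collections.singleton(name);
      }
    };
  }

  public static void main(String[] args) {
    Document doc = chain(
        Arrays.asList(appending("-a"), appending("-b")), single("id", "x"));
    System.out.println(doc.findProperty("id").value());  // prints x-a-b
  }
}
```

With the factories listed as "-a" then "-b", the outermost Document belongs to "-b", which asks the "-a" Document, which asks the original source. The value therefore comes back as x-a-b, not x-b-a.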
com.google.enterprise.connector.util.filter.AbstractDocumentFilter
It is a base class defined by the Connector Manager framework to make implementing other document filters easier. Several aspects of this class are interesting and important.
– First of all, it implements the DocumentFilterFactory interface;
– All the predefined document filters are subclasses of AbstractDocumentFilter;
– It defines a private class DocumentFilter (we will discuss it later);
– It defines two methods, which are likely to be overridden by subclasses:

  public Property findProperty(Document source, String name)
      throws RepositoryException {
    return source.findProperty(name);
  }

  public Set<String> getPropertyNames(Document source)
      throws RepositoryException {
    return source.getPropertyNames();
  }

As discussed earlier, these two methods are the essential part of a filter implementation.
com.google.enterprise.connector.util.filter.AbstractDocumentFilter.DocumentFilter
Connector Manager never defines a public DocumentFilter type. However, within the abstract class AbstractDocumentFilter, a private inner class DocumentFilter is defined. Its definition clearly shows that a document filter is just a Document object.

  /**
   * A {@link Document} implementation that calls back to the outer class
   * {@code findProperty()} and {@code getPropertyNames()} methods, which
   * are likely to be overridden by subclasses.
   */
  private class DocumentFilter implements Document {
    /**
     * The  {@link Document}  that acts as the source for this filter.
     * @uml.property  name="source"
     * @uml.associationEnd
     */
    protected Document source;
    /**
     * Constructs a {@link DocumentFilter} with the supplied {@code source}
     * Document.
     *
     * @param source the source {@link Document} for this filter
     */
    public DocumentFilter(Document source) {
      this.source = source;
    }
    @Override
    public Property findProperty(String name) throws RepositoryException {
      return AbstractDocumentFilter.this.findProperty(source, name);
    }
    @Override
    public Set<String> getPropertyNames() throws RepositoryException {
      return AbstractDocumentFilter.this.getPropertyNames(source);
    }
  }

Because DocumentFilter implements these two methods by calling back to the outer class with its source document, and DocumentFilterChain hands each filter's output to the next factory as its source, the filters produce the chain effect.
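To see the callback pattern end to end, here is a minimal sketch of a subclass that hides one property. Everything in it is an assumption made for illustration: AbstractFilter is a simplified stand-in that mimics the shape of AbstractDocumentFilter (pass-through defaults plus an inner wrapper that calls back to the outer class), the methods do not throw RepositoryException, values are plain Strings, and HidePropertyFilter and of() are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class SubclassSketch {
  // Simplified stand-ins for the SPI types (assumptions, see the text above).
  interface Property { String value(); }

  interface Document {
    Property findProperty(String name);
    Set<String> getPropertyNames();
  }

  /** Stand-in base class: pass-through defaults plus the inner wrapper. */
  abstract static class AbstractFilter {
    public Property findProperty(Document source, String name) {
      return source.findProperty(name);
    }

    public Set<String> getPropertyNames(Document source) {
      return source.getPropertyNames();
    }

    public Document newDocumentFilter(Document source) {
      return new Document() {
        @Override public Property findProperty(String name) {
          return AbstractFilter.this.findProperty(source, name);
        }
        @Override public Set<String> getPropertyNames() {
          return AbstractFilter.this.getPropertyNames(source);
        }
      };
    }
  }

  /** Hypothetical subclass that hides one property from downstream callers. */
  static class HidePropertyFilter extends AbstractFilter {
    private final String hidden;

    HidePropertyFilter(String hidden) {
      this.hidden = hidden;
    }

    @Override public Property findProperty(Document source, String name) {
      return hidden.equals(name) ? null : super.findProperty(source, name);
    }

    @Override public Set<String> getPropertyNames(Document source) {
      // Copy first so the source's own name set is left untouched.
      Set<String> names = new LinkedHashSet<>(super.getPropertyNames(source));
      names.remove(hidden);
      return names;
    }
  }

  /** Hypothetical in-memory Document backed by a map of names to values. */
  static Document of(Map<String, String> values) {
    return new Document() {
      @Override public Property findProperty(String name) {
        String value = values.get(name);
        if (value == null) {
          return null;
        }
        return () -> value;
      }
      @Override public Set<String> getPropertyNames() {
        return values.keySet();
      }
    };
  }

  public static void main(String[] args) {
    Map<String, String> values = new LinkedHashMap<>();
    values.put("title", "Annual report");
    values.put("secret", "internal only");
    Document filtered = new HidePropertyFilter("secret").newDocumentFilter(of(values));
    System.out.println(filtered.getPropertyNames());
  }
}
```

The subclass overrides both methods so that they stay consistent with each other: a property that findProperty() no longer returns should not be advertised by getPropertyNames() either.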
These are the basic building blocks of document filters. You may check the Google Javadoc and the Connector Manager source code to deepen your understanding. Next time, we will discuss how document filters are used within the connector framework.

Thoughts on “Document filter inside out (part 1): the fundamentals”

  1. Just want an idea how to crawl thousands of XML files and their element as metadata and values in google search appliance.
