Document filter inside out (p4): work with Lister/Retriever Model

In the last installment, we carefully examined how document filters are used when creating feed XML. The feed protocol is a push mechanism that lets the content source send information to the GSA.
Starting with GSA connector framework 3.0, the GSA introduced a Lister/Retriever model, which was first implemented in the File System Connector. That connector no longer uses the Diffing Connector model, in which the connector compared what had changed between crawls to gather new information for indexing. Under the Lister/Retriever model, the connector sends links (lists) to the GSA, which the GSA can then crawl using standard HTTP.
Conceptually, the Lister/Retriever model is a mixed push/pull mechanism. During the Lister stage, the connector pushes information to the GSA on behalf of the content source system, following the feed protocol. Later, during the Retriever stage, the GSA pulls the information it needs from the content source via the connector.
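To make the push/pull split concrete, here is a minimal, self-contained Java sketch. It is only a toy model: the Appliance and Connector classes, the example URL prefix, and the in-memory repository are all invented for illustration and are not part of the connector framework.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the mixed push/pull flow; none of these types come from the connector framework. */
class ListerRetrieverSketch {

  /** Hypothetical stand-in for a connector that supports both stages. */
  static class Connector {
    static final String PREFIX = "http://connector.example/doc?id=";
    private final Map<String, String> repository; // document id -> content

    Connector(Map<String, String> repository) {
      this.repository = repository;
    }

    // Lister stage: produce links to the documents, not the documents themselves.
    List<String> list() {
      List<String> urls = new ArrayList<>();
      for (String docid : repository.keySet()) {
        urls.add(PREFIX + docid);
      }
      return urls;
    }

    // Retriever stage: serve one document when asked for it by URL.
    String retrieve(String url) {
      return repository.get(url.substring(PREFIX.length()));
    }
  }

  /** Hypothetical stand-in for the search appliance. */
  static class Appliance {
    final List<String> knownUrls = new ArrayList<>();
    final Map<String, String> index = new LinkedHashMap<>();

    // Push: the connector feeds links to the appliance.
    void acceptFeed(List<String> urls) {
      knownUrls.addAll(urls);
    }

    // Pull: the appliance crawls each link back through the connector.
    void crawl(Connector connector) {
      for (String url : knownUrls) {
        index.put(url, connector.retrieve(url));
      }
    }
  }

  static Map<String, String> run() {
    Map<String, String> repository = new LinkedHashMap<>();
    repository.put("a.txt", "alpha");
    repository.put("b.txt", "beta");
    Connector connector = new Connector(repository);
    Appliance appliance = new Appliance();
    appliance.acceptFeed(connector.list()); // Lister stage (push)
    appliance.crawl(connector);             // Retriever stage (pull)
    return appliance.index;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

The point of the sketch is the ordering: the appliance learns only the links during the push, and the content arrives later through a separate pull.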
In this installment, we will discuss the role document filters play in the Lister/Retriever model.
As we discussed in the second part of this series, DocumentFilterFactoryFactoryImpl::getDocumentFilterFactory(String connectorName) links the connector-level filter with the CM-level filter factory. For that link to take effect, the method has to be called somewhere.
You may recall the hint we gave in the second part of this series: the bean DocumentFilterFactoryFactory is referenced by two other beans, PusherFactory and Manager.
The PusherFactory is where the Lister activities happen.
The action flow for the Lister stage
The call happens within DocPusherFactory::newPusher(String dataSource). The method creates a new Pusher instance appropriate for the supplied dataSource, where dataSource, typically the name of a connector instance, is a data source for a feed.
Here is an excerpt from DocPusherFactory:

  @Override
  public Pusher newPusher(String dataSource) {
    return new DocPusher(feedConnection, dataSource, fileSizeLimit,
        documentFilterFactoryFactory.getDocumentFilterFactory(dataSource));
  }

The method DocPusherFactory::newPusher() is in turn called by DocumentAcceptorImpl::take(Document document). Here is the take() method from DocumentAcceptorImpl:

  /**
   * Takes an spi Document and pushes it along, presumably to the GSA Feed.
   *
   * @param document A Document
   * @throws RepositoryException if transient error accessing the Repository
   * @throws RepositoryDocumentException if fatal error accessing the Document
   * @throws DocumentAcceptorException if a transient error occurs in the
   *         DocumentAcceptor
   */
  public synchronized void take(Document document)
      throws DocumentAcceptorException, RepositoryException {
    try {
      if (pusher.take(document) != PusherStatus.OK) {
        waitForOkStatus();
      }
    } catch (NullPointerException e) {
      // Ugly, but avoids checking for null Pusher on every call to take.
      if (pusher == null) {
        // Opps. We need to get a new Pusher.
        try {
          pusher = pusherFactory.newPusher(connectorName);
          this.take(document);
        } catch (PushException pe) {
          LOGGER.log(Level.SEVERE, "DocumentAcceptor failed to get Pusher", e);
          throw new DocumentAcceptorException("Failed to get Pusher", e);
        }
      } else {
        throw e;
      }
    } catch (PushException e) {
... ...
      // Woke from sleep. Just return.
    }
  }
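
The flow above boils down to two ideas: the pusher is created lazily when the first document arrives, and the factory hands the new pusher a filter factory looked up by connector name. The sketch below illustrates just those two ideas under loud assumptions: every type in it is a simplified stand-in invented for this example, and it uses an explicit null check where the real take() catches a NullPointerException.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of lazy pusher creation; all types here are invented stand-ins. */
class LazyPusherSketch {

  /** Hypothetical stand-in for a per-connector document filter factory. */
  interface FilterFactory {
    String filter(String document);
  }

  /** Hypothetical pusher: filters each document, then records it as "fed". */
  static class Pusher {
    private final FilterFactory filterFactory;
    final List<String> fed = new ArrayList<>();

    Pusher(FilterFactory filterFactory) {
      this.filterFactory = filterFactory;
    }

    void take(String document) {
      fed.add(filterFactory.filter(document));
    }
  }

  /** Hypothetical pusher factory: resolves the filter by connector name. */
  static class PusherFactory {
    int created = 0;

    Pusher newPusher(String connectorName) {
      created++;
      // Made-up filter: tag every document with the connector it came from.
      return new Pusher(document -> connectorName + ":" + document);
    }
  }

  /** Hypothetical document acceptor. */
  static class Acceptor {
    private final PusherFactory pusherFactory;
    private final String connectorName;
    Pusher pusher; // stays null until the first document arrives

    Acceptor(PusherFactory pusherFactory, String connectorName) {
      this.pusherFactory = pusherFactory;
      this.connectorName = connectorName;
    }

    synchronized void take(String document) {
      if (pusher == null) {
        // First document: this is where the filter factory gets looked up.
        pusher = pusherFactory.newPusher(connectorName);
      }
      pusher.take(document);
    }
  }

  public static void main(String[] args) {
    PusherFactory factory = new PusherFactory();
    Acceptor acceptor = new Acceptor(factory, "fs-connector");
    acceptor.take("doc1");
    acceptor.take("doc2");
    System.out.println(acceptor.pusher.fed + " pushers created: " + factory.created);
  }
}
```

However many documents the acceptor takes, the factory is asked for a pusher only once, so the per-connector filter lookup also happens once per pusher.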

For a connector implemented with the Lister/Retriever model, the Manager bean is where document filters come into play for the Retriever.
The action flow for the Retriever stage
Here is the getDocumentMetaData method defined by ProductionManager:

210  @Override
211  public Document getDocumentMetaData(String connectorName, String docid)
212      throws ConnectorNotFoundException, InstantiatorException,
213             RepositoryException {
214    if (LOGGER.isLoggable(Level.FINER)) {
215      LOGGER.finer("RETRIEVER: Retrieving metadata from connector "
216                   + connectorName + " for document " + docid);
217    }
218    Retriever retriever = instantiator.getRetriever(connectorName);
219    if (retriever == null) {
220      // We are borked here.  This should not happen.
221      LOGGER.warning("GetDocumentMetaData request for connector "
222                     + connectorName
223                     + " that does not support the Retriever interface.");
224      return null;
225    }
226    Document metaDoc = retriever.getMetaData(docid);
227    if (metaDoc == null) {
228      LOGGER.finer("RETRIEVER: Document has no metadata.");
229      // TODO: Create empty Document?
230    } else {
231      if (documentFilterFactoryFactory != null) {
232        DocumentFilterFactory documentFilterFactory = 
233          documentFilterFactoryFactory.getDocumentFilterFactory(connectorName);
234        metaDoc = documentFilterFactory.newDocumentFilter(metaDoc);
235      }
... ...
256    return metaDoc;
257  }

At line 218, the Retriever object is obtained; it is then used at line 226 to get the metadata for the document. The call to documentFilterFactoryFactory.getDocumentFilterFactory(connectorName) at lines 232-233 is what triggers the association of the connector-level filter with the CM-level filter.
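The wrapping step can be pictured as a decorator: the raw metadata document goes in, and a filtered view of it comes out. Here is a minimal sketch of that idea. The Doc and DocFilterFactory interfaces and the addProperty filter are hypothetical simplifications made up for this example; they are not the connector SPI.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of wrapping retrieved metadata in a filter; not the connector SPI. */
class MetadataFilterSketch {

  /** Hypothetical, much simplified document: named string properties. */
  interface Doc {
    Set<String> propertyNames();
    String property(String name);
  }

  /** Hypothetical filter factory: wraps a source document in a filtered view. */
  interface DocFilterFactory {
    Doc newFilter(Doc source);
  }

  static Doc fromMap(Map<String, String> properties) {
    return new Doc() {
      @Override public Set<String> propertyNames() {
        return new TreeSet<>(properties.keySet());
      }
      @Override public String property(String name) {
        return properties.get(name);
      }
    };
  }

  /** Made-up filter that adds one constant property to whatever it wraps. */
  static DocFilterFactory addProperty(String extraName, String extraValue) {
    return source -> new Doc() {
      @Override public Set<String> propertyNames() {
        Set<String> names = new TreeSet<>(source.propertyNames());
        names.add(extraName);
        return names;
      }
      @Override public String property(String name) {
        return extraName.equals(name) ? extraValue : source.property(name);
      }
    };
  }

  /** Applies the filter only when there is both a document and a factory. */
  static Doc getMetadata(Doc raw, DocFilterFactory factory) {
    if (raw == null || factory == null) {
      return raw;
    }
    return factory.newFilter(raw);
  }

  public static void main(String[] args) {
    Map<String, String> properties = new LinkedHashMap<>();
    properties.put("title", "Report");
    Doc filtered = getMetadata(fromMap(properties), addProperty("department", "finance"));
    System.out.println(filtered.propertyNames());
  }
}
```

The caller never needs to know whether it holds the raw document or a filtered one, which is why the same filter chain can serve both stages.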
Communication protocol used for the Retriever stage
As we discussed before, the feed protocol is used during the Lister stage for the connector to send content to the GSA. During the Retriever stage, the communication protocol is HTTP: the GSA sends a request to the connector to query information about a particular document that the connector sent to it earlier during the Lister stage. On the connector side, the gateway that makes this happen is the class GetDocumentContent.
Not surprisingly, GetDocumentContent is a servlet, which accepts requests from the GSA to carry out Retriever activities.
In GetDocumentContent, the method getMetadataHeader() shows a similar two-step process for obtaining the metadata of a given document during the Retriever stage. At line 290, the whole list of available properties is checked; then, in lines 299-312, each property is checked and its value is added to the HTTP header of the response to a Retriever request.
In the comments at line 284, Google engineers also point out the similarity between the Lister and Retriever processes.
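As a rough illustration of that two-step process, the sketch below first walks the list of property names and then turns each non-empty value into part of a single header value. The header name, the name=value layout, the comma separator, and the use of URL encoding are all assumptions made for this example; the post does not show the format GetDocumentContent really emits.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of building a metadata header; the format is assumed, not the real one. */
class MetadataHeaderSketch {

  // Hypothetical header name, used only in this example.
  static final String HEADER_NAME = "X-Example-Metadata";

  static String encode(String text) {
    try {
      return URLEncoder.encode(text, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a Java platform.
      throw new IllegalStateException(e);
    }
  }

  static String buildHeaderValue(Map<String, String> document) {
    StringBuilder value = new StringBuilder();
    // Step 1: obtain the full list of property names.
    for (String name : document.keySet()) {
      // Step 2: look up each property and skip the ones without a value.
      String propertyValue = document.get(name);
      if (propertyValue == null || propertyValue.isEmpty()) {
        continue;
      }
      if (value.length() > 0) {
        value.append(',');
      }
      value.append(encode(name)).append('=').append(encode(propertyValue));
    }
    return value.toString();
  }

  public static void main(String[] args) {
    Map<String, String> document = new TreeMap<>();
    document.put("author", "Jane Doe");
    document.put("comment", "");
    document.put("title", "Report");
    // prints: X-Example-Metadata: author=Jane+Doe,title=Report
    System.out.println(HEADER_NAME + ": " + buildHeaderValue(document));
  }
}
```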
How does the Spring Framework link them together?
As we discussed before, the class GetDocumentContent handles Retriever requests from the GSA. Its doGet() method first obtains the ‘Manager’ from the Context object at line 145.

135  /**
136   * Retrieves the content of a document from a connector instance.
137   *
138   * @param req
139   * @param res
140   * @throws IOException
141   */
142  @Override
143  protected void doGet(HttpServletRequest req, HttpServletResponse res)
144      throws IOException {
145    doGet(req, res, Context.getInstance().getManager());
146  }

The Context class simply relies on the Spring Framework to instantiate the Manager bean.

  /**
   * Gets the singleton {@link Manager}.
   *
   * @return the Manager
   */
  public Manager getManager() {
    if (manager != null) {
      return manager;
    }
    manager = (Manager) getRequiredBean("Manager", Manager.class);
    return manager;
  }

Of course, we have already seen the bean definition for ‘Manager’ in applicationContext.xml:

  <bean id="Manager"
        class="com.google.enterprise.connector.manager.ProductionManager">
    <property name="instantiator" ref="Instantiator"/>
    <property name="feedConnection" ref="FeedConnection"/>
    <property name="documentFilterFactoryFactory" ref="DocumentFilterFactoryFactory"/>
  </bean>
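
For readers less familiar with Spring, the sketch below shows in plain Java roughly what this kind of wiring amounts to: a named bean is created once, its property is set from another named bean, and a context object caches the result. The Container and Context classes here are tiny hand-rolled stand-ins written for this example. They are not Spring and not the connector manager's Context, and only the documentFilterFactoryFactory property is modelled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy model of setter injection plus a cached singleton lookup; this is not Spring. */
class WiringSketch {

  /** Hypothetical stand-in for the filter factory factory bean. */
  static class FilterFactoryFactory {}

  /** Hypothetical stand-in for the manager bean, with one injectable property. */
  static class Manager {
    private FilterFactoryFactory documentFilterFactoryFactory;

    void setDocumentFilterFactoryFactory(FilterFactoryFactory factory) {
      this.documentFilterFactoryFactory = factory;
    }

    FilterFactoryFactory getDocumentFilterFactoryFactory() {
      return documentFilterFactoryFactory;
    }
  }

  /** Hand-rolled bean container: bean id -> lazily created singleton. */
  static class Container {
    private final Map<String, Supplier<Object>> definitions = new HashMap<>();
    private final Map<String, Object> singletons = new HashMap<>();

    void define(String id, Supplier<Object> definition) {
      definitions.put(id, definition);
    }

    Object getBean(String id) {
      Object bean = singletons.get(id);
      if (bean == null) {
        Supplier<Object> definition = definitions.get(id);
        if (definition == null) {
          throw new IllegalArgumentException("No bean named " + id);
        }
        bean = definition.get();
        singletons.put(id, bean);
      }
      return bean;
    }
  }

  /** Caches the manager after the first lookup, like the getManager() excerpt above. */
  static class Context {
    private final Container container;
    private Manager manager;

    Context(Container container) {
      this.container = container;
    }

    Manager getManager() {
      if (manager == null) {
        manager = (Manager) container.getBean("Manager");
      }
      return manager;
    }
  }

  /** Plain-Java counterpart of a bean definition with one property reference. */
  static Container newContainer() {
    Container container = new Container();
    container.define("DocumentFilterFactoryFactory", FilterFactoryFactory::new);
    container.define("Manager", () -> {
      Manager manager = new Manager();
      manager.setDocumentFilterFactoryFactory(
          (FilterFactoryFactory) container.getBean("DocumentFilterFactoryFactory"));
      return manager;
    });
    return container;
  }

  public static void main(String[] args) {
    Context context = new Context(newContainer());
    System.out.println(context.getManager().getDocumentFilterFactoryFactory() != null);
  }
}
```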

So far we have spent quite a lot of time examining how document filters apply to the Lister/Retriever model. The best way to trace the caller/callee relationships among the various classes is to set up an Eclipse project with the Google connector source code. The more you stare at the code, the better you will understand it.
Read the full series on our Google Technologies blog.
