In the last installment, we carefully examined how document filters are used in creating feed XML. The feed protocol is a push mechanism by which the content source sends information to the GSA.
Since Connector Framework 3.0, the GSA has supported a Lister/Retriever model, which was first implemented in the File System Connector. That connector no longer uses the Diffing Connector model, in which the connector compared what had changed between crawls to gather new information for indexing. Under the Lister/Retriever model, the connector sends links (lists) to the GSA, which the GSA can then crawl using standard HTTP.
Conceptually, the Lister/Retriever model is a mixed push/pull mechanism. During the Lister stage, the connector pushes information to the GSA on behalf of the content source system, following the feed protocol. Later, during the Retriever stage, the GSA pulls the information it needs from the content source via the connector.
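To make the push/pull split concrete, here is a minimal sketch in plain Java. FakeGsa and FakeConnector are hypothetical stand-ins invented for illustration; they are not classes from the connector framework, and the real Lister stage sends feed XML rather than bare document IDs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the GSA: records pushed links, later pulls content. */
class FakeGsa {
  final List<String> knownDocIds = new ArrayList<>();

  // Lister stage (push): the connector hands over a document link.
  void acceptFeed(String docId) {
    knownDocIds.add(docId);
  }

  // Retriever stage (pull): the GSA asks the connector for each listed document.
  Map<String, String> crawl(FakeConnector connector) {
    Map<String, String> index = new LinkedHashMap<>();
    for (String docId : knownDocIds) {
      index.put(docId, connector.retrieve(docId));
    }
    return index;
  }
}

/** Hypothetical stand-in for a connector in front of a content source. */
class FakeConnector {
  private final Map<String, String> repository = new LinkedHashMap<>();

  FakeConnector() {
    repository.put("doc-1", "first document body");
    repository.put("doc-2", "second document body");
  }

  // Lister: push every document ID to the GSA.
  void list(FakeGsa gsa) {
    for (String docId : repository.keySet()) {
      gsa.acceptFeed(docId);
    }
  }

  // Retriever: answer a pull request for a single document.
  String retrieve(String docId) {
    return repository.get(docId);
  }
}
```

Calling connector.list(gsa) followed by gsa.crawl(connector) plays both stages in order: the connector decides what exists, and the GSA decides when to fetch it.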
In this installment, we will further discuss the role document filters play in the Lister/Retriever model.
As we discussed in the second part of this series, DocumentFilterFactoryFactoryImpl::getDocumentFilterFactory(String connectorName) links the connector-level filter with the CM-level filter factory. For that link to take effect, the method has to be called somewhere.
Recall the hint we gave in the second part of this series: the bean DocumentFilterFactoryFactory is referenced by two other beans, PusherFactory and Manager.
The PusherFactory is where the Lister activities happen.
The action flow for the Lister stage
The call happens within DocPusherFactory::newPusher(String dataSource). This method creates a new Pusher instance appropriate for the supplied dataSource, where dataSource, typically the name of a connector instance, is the data source for a feed.
Here is an excerpt from DocPusherFactory:
    @Override
    public Pusher newPusher(String dataSource) {
      return new DocPusher(feedConnection, dataSource, fileSizeLimit,
          documentFilterFactoryFactory.getDocumentFilterFactory(dataSource));
    }
The method DocPusherFactory::newPusher() is in turn called by DocumentAcceptorImpl::take(Document document). Here is the take() method from DocumentAcceptorImpl:
    /**
     * Takes an spi Document and pushes it along, presumably to the GSA Feed.
     *
     * @param document A Document
     * @throws RepositoryException if transient error accessing the Repository
     * @throws RepositoryDocumentException if fatal error accessing the Document
     * @throws DocumentAcceptorException if a transient error occurs in the
     *         DocumentAcceptor
     */
    public synchronized void take(Document document)
        throws DocumentAcceptorException, RepositoryException {
      try {
        if (pusher.take(document) != PusherStatus.OK) {
          waitForOkStatus();
        }
      } catch (NullPointerException e) {
        // Ugly, but avoids checking for null Pusher on every call to take.
        if (pusher == null) {
          // Opps. We need to get a new Pusher.
          try {
            pusher = pusherFactory.newPusher(connectorName);
            this.take(document);
          } catch (PushException pe) {
            LOGGER.log(Level.SEVERE, "DocumentAcceptor failed to get Pusher", e);
            throw new DocumentAcceptorException("Failed to get Pusher", e);
          }
        } else {
          throw e;
        }
      } catch (PushException e) {
        ... ...
        // Woke from sleep. Just return.
      }
    }
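The flow above boils down to two ideas: the pusher is created lazily on the first document, and the filter for a connector is resolved at the moment the pusher is created. The following sketch illustrates both with hypothetical stand-in classes (SketchPusher, SketchPusherFactory, SketchAcceptor). Note that it uses an explicit null check rather than the NullPointerException catch of the original, and a simple string-tagging function in place of a real document filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical pusher: applies a per-connector filter, then records the result. */
class SketchPusher {
  final List<String> pushed = new ArrayList<>();
  private final UnaryOperator<String> filter;

  SketchPusher(UnaryOperator<String> filter) {
    this.filter = filter;
  }

  void take(String document) {
    pushed.add(filter.apply(document));
  }
}

/** Hypothetical factory: resolves the filter for a data source when a pusher is created. */
class SketchPusherFactory {
  int created = 0;

  SketchPusher newPusher(final String dataSource) {
    created++;
    // Stand-in for documentFilterFactoryFactory.getDocumentFilterFactory(dataSource).
    return new SketchPusher(doc -> "[" + dataSource + "] " + doc);
  }
}

/** Hypothetical acceptor: creates its pusher lazily on the first document. */
class SketchAcceptor {
  private final SketchPusherFactory factory;
  private final String connectorName;
  SketchPusher pusher; // stays null until the first take()

  SketchAcceptor(SketchPusherFactory factory, String connectorName) {
    this.factory = factory;
    this.connectorName = connectorName;
  }

  synchronized void take(String document) {
    if (pusher == null) {
      // First document: this is where the filter lookup is triggered.
      pusher = factory.newPusher(connectorName);
    }
    pusher.take(document);
  }
}
```

The explicit check costs one comparison per call; the original trades that for an exception handler that only runs once, which is the choice its "Ugly, but..." comment describes.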
For a connector implemented on the Lister/Retriever model, the Manager bean is where document filters get involved in the Retriever stage.
The action flow for the Retriever stage
Here is the getDocumentMetaData method defined by ProductionManager:
    210   @Override
    211   public Document getDocumentMetaData(String connectorName, String docid)
    212       throws ConnectorNotFoundException, InstantiatorException,
    213              RepositoryException {
    214     if (LOGGER.isLoggable(Level.FINER)) {
    215       LOGGER.finer("RETRIEVER: Retrieving metadata from connector "
    216                    + connectorName + " for document " + docid);
    217     }
    218     Retriever retriever = instantiator.getRetriever(connectorName);
    219     if (retriever == null) {
    220       // We are borked here. This should not happen.
    221       LOGGER.warning("GetDocumentMetaData request for connector "
    222                      + connectorName
    223                      + " that does not support the Retriever interface.");
    224       return null;
    225     }
    226     Document metaDoc = retriever.getMetaData(docid);
    227     if (metaDoc == null) {
    228       LOGGER.finer("RETRIEVER: Document has no metadata.");
    229       // TODO: Create empty Document?
    230     } else {
    231       if (documentFilterFactoryFactory != null) {
    232         DocumentFilterFactory documentFilterFactory =
    233             documentFilterFactoryFactory.getDocumentFilterFactory(connectorName);
    234         metaDoc = documentFilterFactory.newDocumentFilter(metaDoc);
    235       }
    236 ... ...
    256     return metaDoc;
    257   }
At line 218, the Retriever object is obtained; it is then used at line 226 to get the metadata for the document. The call to documentFilterFactoryFactory.getDocumentFilterFactory(connectorName) at lines 232~233 is what triggers the association of the connector-level filter with the CM-level filter.
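The key step is that the filter factory wraps the metadata document and returns something that still looks like a document. The sketch below shows that wrapping pattern with deliberately simplified, hypothetical types (MetaDoc, MetaDocFilterFactory, and so on); the real Document and DocumentFilterFactory interfaces in the connector SPI are richer, so treat this only as an illustration of the idea.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified document: named string properties only. */
interface MetaDoc {
  Set<String> getPropertyNames();
  String findProperty(String name);
}

/** Hypothetical, simplified filter factory: wraps a source document. */
interface MetaDocFilterFactory {
  MetaDoc newDocumentFilter(MetaDoc source);
}

/** A document backed by a plain map. */
class MapMetaDoc implements MetaDoc {
  private final Map<String, String> props;

  MapMetaDoc(Map<String, String> props) {
    this.props = new LinkedHashMap<>(props);
  }

  public Set<String> getPropertyNames() {
    return props.keySet();
  }

  public String findProperty(String name) {
    return props.get(name);
  }
}

/** Adds one constant property on top of whatever the source document exposes. */
class AddPropertyFilterFactory implements MetaDocFilterFactory {
  private final String name;
  private final String value;

  AddPropertyFilterFactory(String name, String value) {
    this.name = name;
    this.value = value;
  }

  public MetaDoc newDocumentFilter(final MetaDoc source) {
    return new MetaDoc() {
      public Set<String> getPropertyNames() {
        Set<String> names = new LinkedHashSet<>(source.getPropertyNames());
        names.add(name);
        return names;
      }

      public String findProperty(String requested) {
        return name.equals(requested) ? value : source.findProperty(requested);
      }
    };
  }
}

/** Mirrors the null checks in the excerpt: filter only when both sides exist. */
class RetrieverSketch {
  static MetaDoc applyFilter(MetaDoc metaDoc, MetaDocFilterFactory factory) {
    if (metaDoc != null && factory != null) {
      metaDoc = factory.newDocumentFilter(metaDoc);
    }
    return metaDoc;
  }
}
```

Because the filtered object implements the same interface as the source, the caller can go on reading properties without knowing whether a filter was applied.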
Communication protocol used for the Retriever stage
As mentioned earlier, the feed protocol is used during the Lister stage for the connector to send content to the GSA. During the Retriever stage, the communication protocol is HTTP: the GSA sends requests to the connector to query information about a particular document that the connector sent to it earlier, during the Lister stage. The gateway that makes this possible on the connector side is the class GetDocumentContent.
Not surprisingly, the class GetDocumentContent is actually a servlet, which accepts requests from the GSA to carry out the Retriever activities.
Here is an excerpt from GetDocumentContent. The method getMetadataHeader() shows a similar two-step process for obtaining the metadata of a given document during the Retriever stage. At line 290, the whole list of available properties is examined; then, in lines 299~312, each property is checked and its value is appended to form the HTTP header for the response to a Retriever request.
In the comments at line 284, the Google engineers also point out the similarity between the Lister and Retriever processes.
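The two-step pattern (enumerate the property names, then read each property's values into a header string) can be sketched as follows. This is not the code of getMetadataHeader(); the class name MetadataHeaderSketch and the comma-separated, URL-encoded name=value layout are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that flattens document properties into one header value. */
class MetadataHeaderSketch {

  static String buildHeaderValue(Map<String, List<String>> properties) {
    StringBuilder header = new StringBuilder();
    // Step 1: enumerate the available property names.
    for (String name : properties.keySet()) {
      // Step 2: read each property's values and append them to the header.
      for (String value : properties.get(name)) {
        if (header.length() > 0) {
          header.append(',');
        }
        header.append(encode(name)).append('=').append(encode(value));
      }
    }
    return header.toString();
  }

  // Form-style encoding: spaces become '+', reserved characters become %XX.
  private static String encode(String text) {
    try {
      return URLEncoder.encode(text, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }
}
```

Encoding both names and values keeps the separators unambiguous, so a value that itself contains a comma or an equals sign cannot be mistaken for a new pair.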
How does the Spring Framework link them together?
As discussed above, the class GetDocumentContent handles Retriever requests from the GSA. Its doGet() method first obtains the ‘Manager’ from the Context object at line 145.
    135   /**
    136    * Retrieves the content of a document from a connector instance.
    137    *
    138    * @param req
    139    * @param res
    140    * @throws IOException
    141    */
    142   @Override
    143   protected void doGet(HttpServletRequest req, HttpServletResponse res)
    144       throws IOException {
    145     doGet(req, res, Context.getInstance().getManager());
    146   }
The class Context simply relies on the Spring Framework to instantiate the Manager bean.
    /**
     * Gets the singleton {@link Manager}.
     *
     * @return the Manager
     */
    public Manager getManager() {
      if (manager != null) {
        return manager;
      }
      manager = (Manager) getRequiredBean("Manager", Manager.class);
      return manager;
    }
Of course, we have already seen the bean definition for ‘Manager’ in applicationContext.xml:
    <bean id="Manager"
          class="com.google.enterprise.connector.manager.ProductionManager">
      <property name="instantiator" ref="Instantiator"/>
      <property name="feedConnection" ref="FeedConnection"/>
      <property name="documentFilterFactoryFactory" ref="DocumentFilterFactoryFactory"/>
    </bean>
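For readers less familiar with Spring, the following sketch shows by hand roughly what the container does with such a bean definition: create the objects, call the setter named by each property element, and hand the result back by name. All classes here (SketchFilterFactoryFactory, SketchManager, SketchContext) are hypothetical stand-ins, and the map-based registry is a simplification, not how Spring is implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the DocumentFilterFactoryFactory bean. */
class SketchFilterFactoryFactory {
}

/** Hypothetical stand-in for the Manager bean, configured by setter injection. */
class SketchManager {
  private SketchFilterFactoryFactory documentFilterFactoryFactory;

  // Corresponds to <property name="documentFilterFactoryFactory" ref="..."/>.
  public void setDocumentFilterFactoryFactory(SketchFilterFactoryFactory factory) {
    this.documentFilterFactoryFactory = factory;
  }

  public SketchFilterFactoryFactory getDocumentFilterFactoryFactory() {
    return documentFilterFactoryFactory;
  }
}

/** Hypothetical stand-in for Context: a tiny bean registry plus a cached lookup. */
class SketchContext {
  private final Map<String, Object> beans = new HashMap<>();
  private SketchManager manager;

  SketchContext() {
    // Hand-written equivalent of two bean definitions and the ref between them.
    SketchFilterFactoryFactory filterFactoryFactory = new SketchFilterFactoryFactory();
    SketchManager managerBean = new SketchManager();
    managerBean.setDocumentFilterFactoryFactory(filterFactoryFactory);
    beans.put("DocumentFilterFactoryFactory", filterFactoryFactory);
    beans.put("Manager", managerBean);
  }

  <T> T getRequiredBean(String name, Class<T> type) {
    Object bean = beans.get(name);
    if (bean == null) {
      throw new IllegalStateException("No bean named " + name);
    }
    return type.cast(bean);
  }

  // Same shape as getManager() above: look the bean up once, then reuse it.
  SketchManager getManager() {
    if (manager == null) {
      manager = getRequiredBean("Manager", SketchManager.class);
    }
    return manager;
  }
}
```

The point of the indirection is that the servlet never constructs a Manager itself; it asks the context, and the context returns an instance that already has its filter factory wired in.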
So far we have spent quite a lot of time examining how document filters apply to the Lister/Retriever model. The best way to trace the caller/callee relationships among the various classes is to set up an Eclipse project with the Google connector source code. The more you stare at the code, the better you will understand it.
Read the full series on our Google Technologies blog.