Last time, we discussed the basics of document filters. In this installment, we will look at their practical aspects.
How to configure document filters?
Google had a document explaining the usage of document filters.
There are two ways to configure document filters. The first is at the Connector Manager level, specifically in <Tomcat>/webapps/connector-manager/WEB-INF/documentFilters.xml. The filters specified in this file apply to every connector configured under this Connector Manager.
The other is on the connector configuration page in the GSA admin console. For connectors such as the SharePoint connector and the File System connector, the advanced properties section provides a place to configure document filters.
To tell these two types of document filters apart later, we will call the first type the Connector Manager level filter (also CM level filter, or global document filter). The second type will be called the connector level filter, or connector filter.
The Google connector JavaDoc also provides many configuration samples for the document filters implemented in Connector Manager. You have to click the link for each filter implementation on that page to get to the samples.
How does Connector Manager pick up the configuration of document filter factories?
Connector level document filter definitions are processed by the private method com.google.enterprise.connector.instantiator.InstanceInfo::getDocumentFilterFactory().
The code is reproduced here:
/**
 * Looks for {@link DocumentFilterFactory} beans in the connector's
 * bean factory.
 *
 * @param beanFactory DefaultListableBeanFactory used to create the connector.
 * @return {@link DocumentFilterFactory} for the connector, or {@code null}
 *         if the connector does not define a DocumentFilterFactory.
 */
private static DocumentFilterFactory getDocumentFilterFactory(
    DefaultListableBeanFactory beanFactory) throws BeansException {
  @SuppressWarnings("unchecked") Collection<DocumentFilterFactory> filters =
      beanFactory.getBeansOfType(DocumentFilterFactory.class).values();
  if (filters == null || filters.size() == 0) {
    // No filters defined.
    return null;
  } else if (filters.size() == 1) {
    // If there is just one, return it.
    return filters.iterator().next();
  }

  // More than one filter is defined. Look for a single DocumentFilterChain,
  // which hopefully encapsulates the rest.
  @SuppressWarnings("unchecked") Collection<DocumentFilterChain> chains =
      beanFactory.getBeansOfType(DocumentFilterChain.class).values();
  if (chains == null || chains.size() == 0) {
    // No chains defined, so I'll make one. But the order of the filters
    // should be considered random.
    return new DocumentFilterChain(Lists.newArrayList(filters));
  } else if (chains.size() == 1) {
    // If there is just one, return it.
    return chains.iterator().next();
  } else {
    // More than one filter chain is defined??? I will allow it, but...
    return new DocumentFilterChain(Lists.newArrayList(chains));
  }
}
As the comments and the implementation show, the code handles several configuration variants: no filter factory at all, a single filter factory, multiple filter factories without a chain, a single document filter chain, and multiple document filter chains.
The CM level configuration is read through the Spring Framework. Here is an excerpt of the related bean definitions from the CM level applicationContext.xml (the complete definition can be found on the Google site):
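To make the branches easier to follow, here is a minimal, self-contained sketch of the same selection logic. It does not use Spring or the Connector Manager classes: SketchFilterFactory, NamedFilter, SketchFilterChain, and FilterSelectionSketch are hypothetical stand-ins invented for this illustration, and a plain list plays the role of the bean factory. Unlike the real method, where the order of unchained filters should be considered random, the sketch simply keeps list order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for DocumentFilterFactory; not the real interface. */
interface SketchFilterFactory {
  String describe();
}

/** A named leaf filter factory, used only for illustration. */
final class NamedFilter implements SketchFilterFactory {
  private final String name;

  NamedFilter(String name) {
    this.name = name;
  }

  @Override
  public String describe() {
    return name;
  }
}

/** Hypothetical stand-in for DocumentFilterChain: an ordered list of factories. */
final class SketchFilterChain implements SketchFilterFactory {
  private final List<SketchFilterFactory> members;

  SketchFilterChain(List<? extends SketchFilterFactory> members) {
    this.members = new ArrayList<>(members);
  }

  @Override
  public String describe() {
    List<String> parts = new ArrayList<>();
    for (SketchFilterFactory member : members) {
      parts.add(member.describe());
    }
    return "[" + String.join(", ", parts) + "]";
  }
}

final class FilterSelectionSketch {
  /** Follows the branch structure of the method above over a plain list of beans. */
  static SketchFilterFactory select(List<SketchFilterFactory> beans) {
    if (beans.isEmpty()) {
      return null;                        // no filters defined
    }
    if (beans.size() == 1) {
      return beans.get(0);                // exactly one: use it as is
    }
    // More than one bean: look for chains among them.
    List<SketchFilterChain> chains = new ArrayList<>();
    for (SketchFilterFactory bean : beans) {
      if (bean instanceof SketchFilterChain) {
        chains.add((SketchFilterChain) bean);
      }
    }
    if (chains.isEmpty()) {
      return new SketchFilterChain(beans);  // wrap all filters in a new chain
    }
    if (chains.size() == 1) {
      return chains.get(0);                 // the single chain wins
    }
    return new SketchFilterChain(chains);   // several chains: chain the chains
  }

  public static void main(String[] args) {
    SketchFilterFactory a = new NamedFilter("a");
    SketchFilterFactory b = new NamedFilter("b");
    System.out.println(select(Arrays.asList(a, b)).describe());
  }
}
```

One thing the sketch makes visible: once a chain is present, only chains are returned, so a filter declared outside every chain is left out of the result. The "hopefully encapsulates the rest" comment in the original code hints at the same caveat.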
… …
5   <import resource="documentFilters.xml"/>
7   <bean id="ApplicationContextProperties" class="java.lang.String">
8     <constructor-arg value="/WEB-INF/applicationContext.properties"/>
9   </bean>
413 <bean id="DocumentFilterFactoryFactory"
414       class="com.google.enterprise.connector.instantiator.DocumentFilterFactoryFactoryImpl">
415   <constructor-arg index="0" ref="DocumentFilters"/>
416   <constructor-arg index="1" ref="ConnectorCoordinatorMap"/>
416 </bean>
419 <bean id="PusherFactory"
420       class="com.google.enterprise.connector.pusher.DocPusherFactory">
421   <constructor-arg index="0" ref="FeedConnection" />
422   <constructor-arg index="1" ref="FileSizeLimitInfo"/>
423   <constructor-arg index="2" ref="DocumentFilterFactoryFactory"/>
424 </bean>
479 <bean id="Manager"
480       class="com.google.enterprise.connector.manager.ProductionManager">
481   <property name="instantiator" ref="Instantiator"/>
482   <property name="feedConnection" ref="FeedConnection"/>
483   <property name="documentFilterFactoryFactory" ref="DocumentFilterFactoryFactory"/>
484 </bean>
The bean DocumentFilters is defined in documentFilters.xml, which is imported at line 5 of applicationContext.xml. Lines 413 to 416 define the bean DocumentFilterFactoryFactory, which is in effect the global filter factory (i.e. the CM level filter factory).
Interestingly, the bean DocumentFilterFactoryFactory is referenced by two other beans, PusherFactory and Manager. Later we will discuss the purpose of these two beans and how each of them uses the global filter factory.
So far we have seen how the two types of document filters are defined and processed. The next question is how they are associated with each other.
How are the two types of filter factories linked together?
The source code of the method DocumentFilterFactoryFactoryImpl::getDocumentFilterFactory(String connectorName) first checks whether the connector instance (identified by connector name) has a document filter factory defined. If it does, the method then checks whether the global filter factory is defined. When filter factories are defined at both the connector level and the CM level, a new filter chain is created to link the two together. Note that the connector level filter factory is placed in front of the global filter factory. If the connector has no filter factory of its own, or the connector cannot be found, only the global filter factory is returned.
/** {@inheritDoc} */
public DocumentFilterFactory getDocumentFilterFactory(String connectorName) {
  ConnectorCoordinator coordinator =
      (coordinatorMap == null) ? null : coordinatorMap.get(connectorName);
  if (coordinator != null) {
    try {
      DocumentFilterFactory connectorFilterFactory =
          coordinator.getDocumentFilterFactory();
      if (connectorFilterFactory != null) {
        if (globalFilterFactory == null) {
          return connectorFilterFactory;
        } else {
          // Put the connector's filters before the global filters.
          return new DocumentFilterChain(Lists.newArrayList(
              connectorFilterFactory, globalFilterFactory));
        }
      }
    } catch (ConnectorNotFoundException e) {
      LOGGER.log(Level.FINE, "Connector not found: {0}", connectorName);
    }
  }
  // No connector instance, return just the globalFilterFactory.
  return getDocumentFilterFactory();
}
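The linking rule can be illustrated with a small, self-contained sketch that tracks only filter names. Everything in it is hypothetical: the FilterLinkingSketch class, its register and filtersFor methods, and names such as "connA" and "globalA" are made up for this example. A plain map stands in for the coordinator map, and lists of strings stand in for filter factories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FilterLinkingSketch {
  // Hypothetical registry: connector name -> connector level filter names.
  private final Map<String, List<String>> connectorFilters = new HashMap<>();

  // Global (CM level) filter names; null means no global filters are defined.
  private final List<String> globalFilters;

  FilterLinkingSketch(List<String> globalFilters) {
    this.globalFilters = globalFilters;
  }

  void register(String connectorName, List<String> filters) {
    connectorFilters.put(connectorName, filters);
  }

  /** Returns the filter names for a connector, connector level first. */
  List<String> filtersFor(String connectorName) {
    List<String> own = connectorFilters.get(connectorName);
    if (own != null) {
      if (globalFilters == null) {
        return own;                       // only connector level filters exist
      }
      // Put the connector's filters before the global filters.
      List<String> linked = new ArrayList<>(own);
      linked.addAll(globalFilters);
      return linked;
    }
    // Unknown connector, or no connector level filters: global filters only.
    return globalFilters;
  }

  public static void main(String[] args) {
    FilterLinkingSketch sketch = new FilterLinkingSketch(Arrays.asList("globalA"));
    sketch.register("sharepoint", Arrays.asList("connA", "connB"));
    System.out.println(sketch.filtersFor("sharepoint"));
    System.out.println(sketch.filtersFor("unknown"));
  }
}
```

In this sketch a connector with its own filters gets them listed ahead of the global ones, while any other connector name falls back to the global list alone, which may itself be null when nothing is configured at the CM level.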
Now we know more about how document filters are configured and processed by Connector Manager. Next time we will discuss how document filters are used internally by Connector Manager and how they manipulate document metadata.