

Document filter inside out (part 3): the anatomy of working with feeds

In this installment, we will discuss how Connector Manager uses document filters when it builds feeds.
Sometimes the easiest way to show what’s happening is with the code itself. I have reproduced many sections from the Google site and provide links to the original source. Since Google engineers keep updating their implementation, the observations and conclusions stated here could be invalidated by any later update or improvement.




How are document filters used in creating feed XML?
A feed is an important vehicle for sending document content (along with permission information, if applicable) to the GSA. The use of document filters is closely tied to the creation of feed XML files that follow the feed protocol.
Connector Manager defines a class, DocPusher, which generates an XML feed for a set of documents and sends it to the GSA. Its take(Document document) method handles a single document.
[Figure: docfilter-docpusher]
Here is a snippet from DocPusher.java:

 187 @Override
 188 public PusherStatus take(Document document)
 189     throws PushException, FeedException, RepositoryException {
 190   if (feedSender.isShutdown()) {
 191     return PusherStatus.DISABLED;
 192   }
 193   checkSubmissions();
 194
 195   // Apply any configured Document filters to the document.
 196   document = documentFilterFactory.newDocumentFilter(document);
 197
 198   FeedType feedType;
 199   try {
 200     feedType = DocUtils.getFeedType(document);
 201   } catch (RuntimeException e) {
 202     LOGGER.log(Level.WARNING,
 203         "Rethrowing RuntimeException as RepositoryDocumentException", e);
 204     throw new RepositoryDocumentException(e);
 205   }
…
 245    // Add this document to the feed.
 246     xmlFeed.addRecord(document);

Please pay close attention to lines 195~196. That’s where the document filter factory creates new filters for each given document. As we discussed in the last post, by the time the code reaches line 196, documentFilterFactory is already the result of DocumentFilterFactoryFactoryImpl::getDocumentFilterFactory(String connectorName), i.e. the merge of the CM level filters and the connector level filters.
Another very important point is that, for each document object, one instance is created of every configured filter. For instance, if we have 3 filters configured, 1 at CM level and 2 at connector level, then 3 filter instances are created for every document.
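To make the per-document instantiation concrete, here is a minimal sketch. The types Doc, DocFilterFactory, CountingFactory and ChainFactory are hypothetical stand-ins written for this illustration; they are not the Connector Manager classes, and the real chaining logic may differ in detail. The sketch only shows the idea that a chain wraps each document once per configured filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-ins for the SPI types (assumption, not the real API).
interface Doc {
  String findProperty(String name);
}

interface DocFilterFactory {
  Doc newDocumentFilter(Doc source);
}

// A factory that counts how many filter instances it has handed out.
class CountingFactory implements DocFilterFactory {
  final AtomicInteger created = new AtomicInteger();

  @Override
  public Doc newDocumentFilter(Doc source) {
    created.incrementAndGet();
    // The filter is a wrapper that delegates to the document it wraps.
    return name -> source.findProperty(name);
  }
}

// A chain wraps the document once per configured factory, in order.
class ChainFactory implements DocFilterFactory {
  private final List<DocFilterFactory> factories;

  ChainFactory(List<DocFilterFactory> factories) {
    this.factories = factories;
  }

  @Override
  public Doc newDocumentFilter(Doc source) {
    Doc current = source;
    for (DocFilterFactory factory : factories) {
      current = factory.newDocumentFilter(current);
    }
    return current;
  }
}

class FilterChainSketch {
  // Pushes the given number of documents through 3 configured filters
  // (think 1 CM level + 2 connector level) and returns the total
  // number of filter instances created.
  static int instancesCreated(int documents) {
    List<CountingFactory> counters = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      counters.add(new CountingFactory());
    }
    DocFilterFactory chain =
        new ChainFactory(new ArrayList<DocFilterFactory>(counters));
    for (int i = 0; i < documents; i++) {
      final String id = "doc" + i;
      Doc raw = name -> id + ":" + name;
      chain.newDocumentFilter(raw);
    }
    int total = 0;
    for (CountingFactory counter : counters) {
      total += counter.created.get();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(instancesCreated(1)); // 3
    System.out.println(instancesCreated(5)); // 15
  }
}
```

The practical consequence is that a filter should be cheap to construct, since the cost is paid once per document per configured filter.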
Within the method, the given document is first wrapped by documentFilterFactory. The document is then added to xmlFeed via xmlFeed.addRecord(document), as shown in the sequence diagram below.
[Figure: docfilter-addRecord (sequence diagram)]
Tracing xmlFeed.addRecord(document) shows that it in turn calls xmlWrapRecord(document), where the different record types (i.e. document record, or ACL record) are processed. Interestingly, in many places Connector Manager explicitly uses ACL related filters to give the document special treatment.
[Figure: docfilter-xmlWrapRecord (sequence diagram)]
The last call in the diagram above is to XmlFeed::xmlWrapDocumentRecord(), which does the heavy lifting of creating the XML record.
Dissection of XmlFeed::xmlWrapDocumentRecord()
The full source code can be found here. I will split it into small sections and describe what each one does as necessary.

417  /*
418   * Generate the record tag for the xml data.
419   *
420   * @throws IOException only from Appendable, and that can't really
421   *         happen when using StringBuilder.
422   */
423  private void xmlWrapDocumentRecord(Document document)
424      throws RepositoryException, IOException {
425    boolean aclRecordAllowed = supportsInheritedAcls;
426    boolean metadataAllowed = (feedType != FeedType.CONTENTURL);
427    boolean contentAllowed = (feedType == FeedType.CONTENT);
428

Lines 425~427 define variables that control how the method processes the document later on. Please note how their values are set and how they are used further down.
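As a quick illustration of those flags, the sketch below evaluates the same two comparisons for each feed type. FeedKind is a hypothetical local enum created for this example; only CONTENT and CONTENTURL appear in the snippet above, and the third value, WEB, is an assumption added to show a type that is neither of the two.

```java
// Hypothetical local enum for illustration; WEB is an assumed third value.
enum FeedKind { CONTENT, WEB, CONTENTURL }

class FeedFlagsSketch {
  // Mirrors line 426: metadata is written unless the feed is CONTENTURL.
  static boolean metadataAllowed(FeedKind type) {
    return type != FeedKind.CONTENTURL;
  }

  // Mirrors line 427: content is written only for CONTENT feeds.
  static boolean contentAllowed(FeedKind type) {
    return type == FeedKind.CONTENT;
  }

  static String describe(FeedKind type) {
    return type + ": metadata=" + metadataAllowed(type)
        + ", content=" + contentAllowed(type);
  }

  public static void main(String[] args) {
    for (FeedKind type : FeedKind.values()) {
      System.out.println(describe(type));
    }
  }
}
```

Going by the snippet as quoted, a CONTENTURL feed never reaches the metadata step, so property changes made by document filters would not show up as metadata in that kind of feed.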
Lines 459~520 (please check the source on the Google site) add several configuration related attributes to the XML record, such as lock, crawlImmediately and crawlOnce, as well as pageRank, mimetype, LastModified and authMethod.
The Feeds Protocol Developer’s Guide (pages 9~10) describes the meaning of these attributes for a feed record.

530    if (metadataAllowed) {
531      xmlWrapMetadata(prefix, document);
532    }
533

Lines 530~532 add the document’s metadata to the XML record. This is controlled by the variable metadataAllowed; see line 426 for how its value is set.
The two-step dance of getting metadata for a document
When metadataAllowed is true at line 530, xmlWrapMetadata() is called to populate the metadata for the given document.

747  /**
748   * Wrap the metadata and append it to the string buffer. Empty metadata
749   * properties are not appended.
750   *
751   * @param buf string buffer
752   * @param document Document
753   * @throws RepositoryException if error reading Property from Document
754   * @throws IOException only from Appendable, and that can't really
755   *         happen when using StringBuilder.
756   */
757  private void xmlWrapMetadata(StringBuilder buf, Document document)
758      throws RepositoryException, IOException {
759    boolean overwriteAcls = DocUtils.getOptionalBoolean(document,
760        SpiConstants.PROPNAME_OVERWRITEACLS, true);
761    buf.append('<').append(XML_METADATA);
762    if (!overwriteAcls) {
763      XmlUtils.xmlAppendAttr(XML_OVERWRITEACLS,
764          Value.getBooleanValue(false).toString(), buf);
765    }
766    buf.append(">\n");
767
768    // Add all the metadata supplied by the Connector.
769    Set<String> propertyNames = document.getPropertyNames();
770    if ((propertyNames == null) || propertyNames.isEmpty()) {
771      LOGGER.log(Level.WARNING, "Property names set is empty");
772    } else {
773      // Sort property names so that metadata is written in a canonical form.
774      // The GSA's metadata change detection logic depends on the metadata to be
775      // in the same order each time in order to prevent reindexing.
776      propertyNames = new TreeSet<String>(propertyNames);
777      for (String name : propertyNames) {
778        if (propertySkipSet.contains(name)) {
779          if (LOGGER.isLoggable(Level.FINEST)) {
780            logOneProperty(document, name);
781          }
782          continue;
783        }
784        Property property = document.findProperty(name);
785        if (property != null) {
786          wrapOneProperty(buf, name, property);
787        }
788      }
789    }
790    XmlUtils.xmlAppendEndTag(XML_METADATA, buf);
791  }

The first step happens at lines 768~769, where getPropertyNames() is called. This is where several default document filters from Connector Manager (i.e. AddPropertyFilter, CopyPropertyFilter, and DeletePropertyFilter) get the chance to quietly manipulate the existing property name list and do what they were designed to do. It’s no surprise that lines 778~783 skip certain properties: if you check the definition of propertySkipSet, you will notice that many of those properties were already processed at lines 459~520.
The second step happens at lines 777~788. It loops over each surviving property and retrieves its value. Each retrieval passes through the chain of document filters, calling findProperty() on each filter along the way.
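The two steps can be sketched with a toy filter. Everything below is a simplified illustration: SimpleDoc, MapDoc and AddOneProperty are hypothetical types (property values are plain strings rather than SPI Property objects), and AddOneProperty only imitates the spirit of AddPropertyFilter. The property names and values are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for the SPI Document (assumption).
interface SimpleDoc {
  Set<String> getPropertyNames();
  String findProperty(String name);
}

// A plain document backed by a map.
class MapDoc implements SimpleDoc {
  private final Map<String, String> props;

  MapDoc(Map<String, String> props) {
    this.props = props;
  }

  @Override
  public Set<String> getPropertyNames() {
    return props.keySet();
  }

  @Override
  public String findProperty(String name) {
    return props.get(name);
  }
}

// A filter that adds one property on top of the wrapped document.
class AddOneProperty implements SimpleDoc {
  private final SimpleDoc source;
  private final String name;
  private final String value;

  AddOneProperty(SimpleDoc source, String name, String value) {
    this.source = source;
    this.name = name;
    this.value = value;
  }

  // Step 1 hook: the filter changes the list of names.
  @Override
  public Set<String> getPropertyNames() {
    Set<String> names = new HashSet<String>(source.getPropertyNames());
    names.add(name);
    return names;
  }

  // Step 2 hook: the filter answers for its own name, else delegates.
  @Override
  public String findProperty(String requested) {
    return name.equals(requested) ? value : source.findProperty(requested);
  }
}

class TwoStepSketch {
  static List<String> wrapMetadata(SimpleDoc doc, Set<String> skip) {
    List<String> out = new ArrayList<>();
    // Step 1: ask for the names, then sort them into a canonical order.
    for (String name : new TreeSet<String>(doc.getPropertyNames())) {
      if (skip.contains(name)) {
        continue; // already handled elsewhere, like propertySkipSet
      }
      // Step 2: retrieve the value through the filter chain.
      String value = doc.findProperty(name);
      if (value != null) {
        out.add(name + "=" + value);
      }
    }
    return out;
  }

  static List<String> demo() {
    Map<String, String> props = new HashMap<>();
    props.put("title", "Q3 report");
    props.put("google:mimetype", "text/html");
    SimpleDoc filtered =
        new AddOneProperty(new MapDoc(props), "department", "finance");
    return wrapMetadata(filtered, Collections.singleton("google:mimetype"));
  }

  public static void main(String[] args) {
    System.out.println(demo()); // [department=finance, title=Q3 report]
  }
}
```

Note that a filter has to keep both hooks consistent: a name that is added in getPropertyNames() but not answered in findProperty() would simply be dropped by the null check.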
Overall, the GSA is pretty much a black box to us. Document filters are one of the few places where customization can be applied to influence the GSA’s behavior. Hopefully the discussion here helps you configure and apply document filters.
Read part one of this series here and part two here.
