In this installment, we will discuss how Connector Manager uses document filters to do its work.
Sometimes it is easiest to show what is happening with the code itself. I have duplicated many sections from the Google site and also provide links to the original source. Since Google engineers keep updating their implementation, the observations and conclusions stated here could be invalidated by any of their updates and improvements.
How are document filters used in creating feed XML?
Feeds are an important vehicle for sending document content (along with permission information if applicable) to GSA. The use of document filters is closely tied to the creation of feed XML files that follow the feed protocol.
Connector Manager defines a class DocPusher, which generates an XML feed for a set of documents and sends it to GSA. Its method take(Document document) handles one document.
Here is a snippet from DocPusher.java:
187   @Override
188   public PusherStatus take(Document document)
189       throws PushException, FeedException, RepositoryException {
190     if (feedSender.isShutdown()) {
191       return PusherStatus.DISABLED;
192     }
193     checkSubmissions();
194
195     // Apply any configured Document filters to the document.
196     document = documentFilterFactory.newDocumentFilter(document);
197
198     FeedType feedType;
199     try {
200       feedType = DocUtils.getFeedType(document);
201     } catch (RuntimeException e) {
202       LOGGER.log(Level.WARNING,
203           "Rethrowing RuntimeException as RepositoryDocumentException", e);
204       throw new RepositoryDocumentException(e);
205     }
…
245     // Add this document to the feed.
246     xmlFeed.addRecord(document);
…
Please pay close attention to lines 195~196. This is where the document filter factory creates new filters for each given document. As we discussed in the last post, by the time the code reaches line 196, documentFilterFactory is already the result of the method DocumentFilterFactoryFactoryImpl::getDocumentFilterFactory(String connectorName), i.e. the merger of the CM-level filters and the connector-level filters.
Another very important point is that, for each document object, one new instance is created of every configured filter. For instance, if we have 3 filters configured, 1 at CM level and 2 at connector level, then 3 filter instances are created for every document.
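To make the one-instance-per-filter-per-document behavior concrete, here is a minimal sketch. It does not use the real Connector Manager classes: Doc, DocFilterFactory, NamedFactory and ChainFactory are hypothetical stand-ins I made up for illustration, and the wrapping order (CM-level filter innermost) is an assumption, not something confirmed by the source quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-ins for illustration only; not the Connector Manager SPI. */
final class FilterChainSketch {

  /** Hypothetical minimal document that can describe how it has been wrapped. */
  interface Doc {
    String describe();
  }

  /** Hypothetical factory: returns a new filter instance wrapping the source. */
  interface DocFilterFactory {
    Doc newDocumentFilter(Doc source);
  }

  /** An unfiltered document, as a connector might hand it over. */
  static Doc plainDoc(final String id) {
    return new Doc() {
      @Override public String describe() { return id; }
    };
  }

  /** Factory whose filters only tag the document, so the wrapping is visible. */
  static final class NamedFactory implements DocFilterFactory {
    private final String name;
    int instancesCreated = 0;  // counts the filter instances handed out

    NamedFactory(String name) { this.name = name; }

    @Override public Doc newDocumentFilter(final Doc source) {
      instancesCreated++;
      return new Doc() {
        @Override public String describe() {
          return name + "(" + source.describe() + ")";
        }
      };
    }
  }

  /** Merged factory: each member wraps the result of the previous one. */
  static final class ChainFactory implements DocFilterFactory {
    private final List<DocFilterFactory> factories;

    ChainFactory(List<? extends DocFilterFactory> factories) {
      this.factories = new ArrayList<DocFilterFactory>(factories);
    }

    @Override public Doc newDocumentFilter(Doc source) {
      Doc current = source;
      for (DocFilterFactory factory : factories) {
        current = factory.newDocumentFilter(current);
      }
      return current;
    }
  }

  public static void main(String[] args) {
    NamedFactory cm = new NamedFactory("cmFilter");
    NamedFactory connA = new NamedFactory("connFilterA");
    NamedFactory connB = new NamedFactory("connFilterB");
    DocFilterFactory merged = new ChainFactory(Arrays.asList(cm, connA, connB));

    // Two documents are pushed; every factory creates one filter per document.
    System.out.println(merged.newDocumentFilter(plainDoc("doc1")).describe());
    System.out.println(merged.newDocumentFilter(plainDoc("doc2")).describe());
    System.out.println(cm.instancesCreated + connA.instancesCreated
        + connB.instancesCreated);
  }
}
```

With 3 factories and 2 documents, the counters add up to 6 filter instances, which is the point made above: filters are cheap wrappers created per document, not shared singletons.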
Within the method, the given document is first processed by the documentFilterFactory. Then the document is added to xmlFeed via xmlFeed.addRecord(document) as shown in the sequence diagram below.
Tracing into xmlFeed.addRecord(document), we find that it in turn calls xmlWrapRecord(document), where the different record types (i.e. document record or ACL record) are processed. Interestingly, in many places Connector Manager explicitly uses ACL-related filters to give the document special handling.
The last method shown in the diagram above is a call to XmlFeed::xmlWrapDocumentRecord(), which does the heavy lifting of creating the XML record.
Dissection of XmlFeed::xmlWrapDocumentRecord()
The full source code can be found here. I will split it into small sections and describe what each one does as necessary.
417   /*
418    * Generate the record tag for the xml data.
419    *
420    * @throws IOException only from Appendable, and that can't really
421    *         happen when using StringBuilder.
422    */
423   private void xmlWrapDocumentRecord(Document document)
424       throws RepositoryException, IOException {
425     boolean aclRecordAllowed = supportsInheritedAcls;
426     boolean metadataAllowed = (feedType != FeedType.CONTENTURL);
427     boolean contentAllowed = (feedType == FeedType.CONTENT);
428
Lines 425~427 define variables that control how the method processes the document later. Please note how the values of these variables are set and how they are used later.
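The two feed-type flags are easier to read as a small truth table. The sketch below copies the expressions from lines 426~427 into a standalone class. The FeedType enum here is a local stand-in rather than the Connector Manager one; only CONTENT and CONTENTURL appear in the quoted snippet, and the third value, WEB, is my assumption, added to show a type that is neither of the two.

```java
/** Local stand-in that mirrors the two flag expressions from lines 426~427. */
final class FeedFlagsSketch {

  /** Stand-in enum; WEB is an assumed third value, not taken from the snippet. */
  enum FeedType { CONTENT, WEB, CONTENTURL }

  /** Same expression as line 426: metadata is written unless CONTENTURL. */
  static boolean metadataAllowed(FeedType feedType) {
    return feedType != FeedType.CONTENTURL;
  }

  /** Same expression as line 427: content is written only for CONTENT. */
  static boolean contentAllowed(FeedType feedType) {
    return feedType == FeedType.CONTENT;
  }

  public static void main(String[] args) {
    // Print one row per feed type.
    for (FeedType type : FeedType.values()) {
      System.out.println(type + ": metadata=" + metadataAllowed(type)
          + ", content=" + contentAllowed(type));
    }
  }
}
```

Read this way, a CONTENTURL feed record carries neither metadata nor content from this method, so property-manipulating filters have no metadata section to influence for that feed type.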
Lines 459~520 (please check the source on the Google site) add several configuration-related attributes to the XML record, such as lock, crawlImmediately, crawlOnce, as well as pageRank, mimetype, LastModified and authMethod.
The Feeds Protocol Developer’s Guide (pages 9~10) describes the meaning of these attributes for a feed record.
530     if (metadataAllowed) {
531       xmlWrapMetadata(prefix, document);
532     }
533
Lines 530~532 add the document's metadata to the XML record. This is controlled by the variable metadataAllowed; please refer to line 426 to see how its value is set.
The two-step dance of getting metadata for a document
When metadataAllowed is true at line 530, the method xmlWrapMetadata() is called to populate the metadata for the given document.
747   /**
748    * Wrap the metadata and append it to the string buffer. Empty metadata
749    * properties are not appended.
750    *
751    * @param buf string buffer
752    * @param document Document
753    * @throws RepositoryException if error reading Property from Document
754    * @throws IOException only from Appendable, and that can't really
755    *         happen when using StringBuilder.
756    */
757   private void xmlWrapMetadata(StringBuilder buf, Document document)
758       throws RepositoryException, IOException {
759     boolean overwriteAcls = DocUtils.getOptionalBoolean(document,
760         SpiConstants.PROPNAME_OVERWRITEACLS, true);
761     buf.append('<').append(XML_METADATA);
762     if (!overwriteAcls) {
763       XmlUtils.xmlAppendAttr(XML_OVERWRITEACLS,
764           Value.getBooleanValue(false).toString(), buf);
765     }
766     buf.append(">\n");
767
768     // Add all the metadata supplied by the Connector.
769     Set<String> propertyNames = document.getPropertyNames();
770     if ((propertyNames == null) || propertyNames.isEmpty()) {
771       LOGGER.log(Level.WARNING, "Property names set is empty");
772     } else {
773       // Sort property names so that metadata is written in a canonical form.
774       // The GSA's metadata change detection logic depends on the metadata to be
775       // in the same order each time in order to prevent reindexing.
776       propertyNames = new TreeSet<String>(propertyNames);
777       for (String name : propertyNames) {
778         if (propertySkipSet.contains(name)) {
779           if (LOGGER.isLoggable(Level.FINEST)) {
780             logOneProperty(document, name);
781           }
782           continue;
783         }
784         Property property = document.findProperty(name);
785         if (property != null) {
786           wrapOneProperty(buf, name, property);
787         }
788       }
789     }
790     XmlUtils.xmlAppendEndTag(XML_METADATA, buf);
791   }
The first step happens at lines 768~769, where getPropertyNames() is called. This gives several of Connector Manager's default document filters (i.e. AddPropertyFilter, CopyPropertyFilter, and DeletePropertyFilter) the chance to quietly manipulate the existing property name list and achieve what they were designed to do. It is no surprise that certain properties are skipped at lines 778~783. If you check the definition of propertySkipSet, you will notice that many of those properties were already processed in lines 459~520.
The second step happens at lines 777~788. It loops over each surviving property and retrieves its value. The retrieval goes through the chain of document filters, calling the findProperty() method of each filter.
Overall, GSA is pretty much a black box to us. Document filters are one of the few places where customization can be applied to influence the behavior of GSA. Hopefully the discussion here can help you with the configuration and application of document filters.
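The two steps can be sketched in a few lines. This is only an illustration under simplifying assumptions: Doc, MapDoc, AddFilter and DeleteFilter are hypothetical stand-ins, not the real AddPropertyFilter or DeletePropertyFilter, and property values are plain strings rather than Property objects. The collectMetadata() method follows the same shape as the loop at lines 776~788: sort the names, skip the ones in the skip set, then look up each value through the filter chain.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Simplified stand-ins for illustration only; not the Connector Manager SPI. */
final class TwoStepMetadataSketch {

  /** Hypothetical document: property values are plain strings here. */
  interface Doc {
    Set<String> getPropertyNames();
    String findProperty(String name);  // null when the property is absent
  }

  /** Document as supplied by a connector, backed by a map. */
  static final class MapDoc implements Doc {
    private final Map<String, String> props;
    MapDoc(Map<String, String> props) { this.props = new HashMap<String, String>(props); }
    @Override public Set<String> getPropertyNames() { return new HashSet<String>(props.keySet()); }
    @Override public String findProperty(String name) { return props.get(name); }
  }

  /** Filter that injects one extra property into both steps. */
  static final class AddFilter implements Doc {
    private final Doc source;
    private final String name;
    private final String value;
    AddFilter(Doc source, String name, String value) {
      this.source = source; this.name = name; this.value = value;
    }
    @Override public Set<String> getPropertyNames() {
      Set<String> names = new HashSet<String>(source.getPropertyNames());
      names.add(name);  // step 1: the new name shows up in the list
      return names;
    }
    @Override public String findProperty(String n) {
      return name.equals(n) ? value : source.findProperty(n);  // step 2
    }
  }

  /** Filter that hides one property from both steps. */
  static final class DeleteFilter implements Doc {
    private final Doc source;
    private final String name;
    DeleteFilter(Doc source, String name) { this.source = source; this.name = name; }
    @Override public Set<String> getPropertyNames() {
      Set<String> names = new HashSet<String>(source.getPropertyNames());
      names.remove(name);
      return names;
    }
    @Override public String findProperty(String n) {
      return name.equals(n) ? null : source.findProperty(n);
    }
  }

  /** Sort names, skip the skip set, then fetch each value via the chain. */
  static Map<String, String> collectMetadata(Doc doc, Set<String> skipSet) {
    Map<String, String> out = new LinkedHashMap<String, String>();
    for (String name : new TreeSet<String>(doc.getPropertyNames())) {
      if (skipSet.contains(name)) {
        continue;
      }
      String value = doc.findProperty(name);
      if (value != null) {
        out.put(name, value);
      }
    }
    return out;
  }
}
```

For example, wrapping a document that has title, author and mimetype in an AddFilter for a department property and a DeleteFilter for author, with mimetype in the skip set, leaves only department and title in the collected metadata, in sorted order.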
Read part one of this series here and part two here.