UPDATE 2014-07-02: Adaptors are not really evil. We are currently developing both Connectors and Adaptors. The point of these articles is to help you draw your own conclusions. : )
Previously I discussed the Plexi Adaptor framework for the Google Search Appliance. Adaptors can provide a simple and elegant way to index a content repository. An Adaptor sits in front of a repository, making it behave like a web site from the GSA’s perspective. This off-loads the state-management and queue-management to the GSA’s built-in web crawler, simplifying the implementation.
But I mentioned that I was leery of Adaptors. I have written dozens of connectors and I like the control that the Connector Manager framework affords. I have written Connectors that do not seem transferable to the Adaptor methodology. I would like to discuss a few reasons that I am not completely letting go of Connectors.
Change Detection
When I implement connectors that have complex hierarchical ACLs or that require joining multiple database tables, I often have to track the state of multiple objects to do change detection. For example, the ACLs in our Atlassian JIRA Connector take into consideration various objects, including the Project, Permission Schemes, Issue Schemes and Custom Attributes, to compute the ACL for a single Issue. A change in any of these objects can ripple through the entire repository. Implementing stateless change detection with a Retriever would be very difficult because these permission objects, and the complex interactions between them, do not have timestamps to reveal modifications. Instead, we store a snapshot of the permissions for each object in our Connector, which allows us to quickly check for even subtle changes to the permissions. Nothing prevents an Adaptor from storing this kind of state information, but it goes against the recommended design.
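To make the snapshot idea concrete, here is a minimal sketch in Python. It is not the code from our JIRA Connector. The function names, the dictionary-based snapshot, and the shape of the ACL data are all assumptions made for illustration. The idea is simply to hash each computed ACL and compare it to the hash stored on the previous pass, so a change is detected even when nothing carries a timestamp.

```python
import hashlib
import json


def acl_fingerprint(acl):
    """Return a stable hash of a computed ACL (a JSON-serializable dict).

    Assumption: any lists inside the ACL are already in a consistent order,
    otherwise a reordering would look like a change.
    """
    canonical = json.dumps(acl, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_acl_changes(snapshot, current_acls):
    """Compare freshly computed ACLs against the stored snapshot.

    snapshot:     dict of item id -> fingerprint from the previous pass
                  (hypothetical storage; updated in place).
    current_acls: dict of item id -> ACL computed on this pass.
    Returns (changed_or_new_ids, removed_ids).
    """
    changed = []
    for item_id, acl in current_acls.items():
        digest = acl_fingerprint(acl)
        if snapshot.get(item_id) != digest:
            changed.append(item_id)
            snapshot[item_id] = digest

    # Items in the snapshot that no longer exist in the repository.
    removed = [item_id for item_id in snapshot if item_id not in current_acls]
    for item_id in removed:
        del snapshot[item_id]
    return changed, removed
```

On the first pass every item is reported as changed. On later passes only the items whose computed ACL differs from the snapshot are reported, no matter which underlying permission object caused the difference.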
Sequential Iteration
Some APIs, like the Seedlist API used in IBM WCM, Quickr and Connections, are designed to be iterated sequentially. There is no way to jump back to an arbitrary item after it has been retrieved. While the Seedlist API would make the Lister interface easy to implement, it would make the Retriever interface much more difficult. Entirely different APIs would have to be used to retrieve data for a specific item on demand, which is wasteful, since all the data was present in the initial Seedlist responses.
I have run across several applications that have this sort of firehose API, where all incremental changes are pushed to you, or pulled by you, as they occur. This is a natural fit for a traditional Connector, where metadata, content, and ACLs are all pushed to the GSA simultaneously. Our IBM Connectors retrieve the items from each incremental Seedlist response, and then deliver them in their entirety to the GSA according to the batch size dictated by the Connector Manager. There is a little added complexity here, because the Seedlist pagination does not always match the traversal cycle and batch size, but we developed a buffer mechanism that makes this a non-issue.
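A buffer of this kind can be sketched in a few lines of Python. This is an illustration of the general technique, not our actual implementation. The name rebatch and the representation of Seedlist pages as plain lists are assumptions. Source pages of any size go in, and batches of exactly the requested size come out, with any remainder delivered at the end.

```python
def rebatch(pages, batch_size):
    """Regroup source pages of arbitrary size into fixed-size batches.

    pages:      an iterable of lists, one list per page of the source API
                (hypothetical stand-in for sequential Seedlist responses).
    batch_size: the number of items the consumer wants per batch.
    Yields lists of batch_size items, then one final shorter list if
    items are left over.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    buffer = []
    for page in pages:
        buffer.extend(page)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]

    # Flush whatever is left once the source is exhausted.
    if buffer:
        yield buffer
```

Because rebatch is a generator, the source is only read as far as needed to fill the next batch, which suits an API that can only be iterated forward.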
API Quotas
Some APIs, like those of Salesforce.com or JIVE, have quotas that can be exceeded or that incur additional cost. As with sequential iteration, being able to fetch all of the information at one time can reduce the number of API calls to the application (note: the same problem can affect database connectors, where running one large query competes with running many small queries). With a Lister/Retriever, we have to make one API call to retrieve the list of records, and then at least one additional call for every record when the Retriever is invoked. And if the Lister does not use the crawl-once feature so it can push its own incremental updates, then additional API calls will be used for every subsequent recrawl of every single item, at the GSA’s discretion.
With a traditional connector, like our Salesforce.com Connector, we can run a query that will retrieve all of the metadata and content for hundreds of items at a time in a single API call. The same is true for incremental updates. (Note: I am ignoring the fact that for large numbers of items or updates, we might have to make multiple API calls to fetch them all.)
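The difference is easy to estimate with some back-of-the-envelope arithmetic. The sketch below is only a rough model: the function names and page sizes are hypothetical, it assumes exactly one Retriever call per item (the real number may be higher), and it ignores recrawls.

```python
import math


def lister_retriever_calls(items, list_page_size):
    """Estimated API calls for an initial Lister/Retriever crawl.

    Assumes the listing is paged and that each item costs exactly one
    additional call when the Retriever is invoked.
    """
    list_calls = math.ceil(items / list_page_size)
    return list_calls + items


def bulk_calls(items, bulk_page_size):
    """Estimated API calls when each call returns complete items in bulk."""
    return math.ceil(items / bulk_page_size)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 records, 200 records per call.
    print(lister_retriever_calls(1000, 200))
    print(bulk_calls(1000, 200))
```

Under these assumptions the per-item approach grows with the number of records, while the bulk approach grows with the number of pages, which matters a great deal when every call counts against a quota.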
Many of our connectors push the limits of what can be retrieved in a single API call or a single SQL query, but the performance benefits are huge — for both traversal and change detection.