
Google Search Appliance Connector Development: Discovery Phase

Sorry… this post is not going to teach you how to build a connector.  Maybe next time.
Instead, I am going back a little earlier in the process.  I want to describe the process we use to qualify our ability to build a new connector in the first place.  How do we estimate the level of effort?  How do we ensure that we will have sufficient access to the content, or that the necessary APIs exist?
First, not all integrations or connections are alike – there can be wide variations in complexity or level of effort.  When we run through this qualification process, we always evaluate our options in order, from easiest to most difficult.  For each option, there is a set of criteria that must be met in order to be successful.  If any criteria cannot be met, we move on to the next-most-complex option.
Here is the decision tree that we use when scoping a new connector.  We start at the top and work our way to the bottom until we find a viable approach.

  • Can the system be accessed using one of Google’s existing connectors, or a partner-sourced connector?  For example, Documentum, SharePoint, FileNet, OpenText, File Shares, Relational Database, SalesForce, IBM WCM, IBM Connections, JIVE, Confluence, JIRA, etc.?
  • Is the system web-crawlable?
    • Can you crawl all the pages, and each page only once, through a spidering approach?  Every page must have a unique, single URL that is reachable from the starting point.
    • What security mechanism is protecting the web content (Cookie?  SSO?  NTLM?  HTTP Basic?).  Can the GSA crawler handle this type of security?
    • Does the content need to be security trimmed at search-time?  Or can all authenticated users view the content?
    • Is late-binding / head-request trimming acceptable?
    • If early-binding is required, is it possible to construct an ACL (users/groups/roles) ahead of time for each item?
      • Can an ACL be described in the metadata, either in page meta tags or in the HTTP Headers?
      • Does the namespace of the users and groups match the Authentication mechanism configured on the GSA?
  • Is the system backed by a relational database?
    • Can you write a query that will return all of the content to be indexed, or use the database to generate a distinct list of URLs to crawl?
    • Does the database preserve and flag deleted items, or are they purged from the database immediately upon deletion?
    • Does the database contain all metadata about each item?
    • Does the database contain the contents of any binary files, or a URL/path to the binary file?
    • Does the content need to be security trimmed at search-time, or can all users view the content?
    • If early-binding security is required, can an ACL (users/groups/roles) for each document be constructed using information from the database?
      • What is the namespace of the users and groups?  AD/LDAP?  Local Application-specific?  Other?
        • If local groups or roles are used in the ACL, can a list of group or role-memberships be queried upon demand for an arbitrary user?
  • Does the system store its content in a file system?
    • Can you traverse the file system and find all the items that need to be indexed?
    • Is all metadata present in the file (like an XML file) or in an adjacent file (sidecar)?
    • Does the content need to be security trimmed at search-time, or can all users view the content?
    • If early-binding security is required, can an ACL (users/groups/roles) for each document be constructed using information from the file system?
      • What is the namespace of the users and groups?  AD/LDAP?  Local Application-specific?  Other?
  • Does the system have a content-retrieval API?
    • Does the API provide a full retrieval so that the connector can populate the index initially?
    • Does the API allow incremental retrieval thereafter, including updated *and* deleted items?
    • Is the incremental API push (event-driven) or pull (batched by date or sequence id)?
    • Does the API provide all metadata about each item?
    • Does the API provide the contents of any binary files, or a URL/path to the binary file?
    • Does the content need to be security trimmed at search-time?  Or can all users view the content?
    • If early-binding security is required, can an ACL (users/groups/roles) for each document be constructed using information from the API?
      • What is the namespace of the users and groups?  AD/LDAP?  Local Application-specific?  Other?
        • If local groups or roles are used in the ACL, can a list of group or role-memberships be queried upon demand for an arbitrary user?
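The decision tree above can be sketched in code.  In the hypothetical sketch below, the dictionary keys (`existing_connector`, `web_crawlable`, and so on) stand in for the manual evaluation of each option's criteria described in the bullets; they are illustrative assumptions, not part of any real connector API.

```python
# Hypothetical sketch of the connector-qualification decision tree.
# Each key stands in for "all criteria for this option were met".

def choose_approach(system):
    """Return the simplest viable indexing approach for a content system."""
    if system.get("existing_connector"):   # Documentum, SharePoint, FileNet, etc.
        return "existing connector"
    if system.get("web_crawlable"):        # unique URLs, supported security
        return "web crawl"
    if system.get("relational_database"):  # queryable content, metadata, deletes
        return "database connector"
    if system.get("file_system"):          # traversable files plus metadata
        return "file system connector"
    if system.get("content_api"):          # full and incremental retrieval
        return "custom API connector"
    return "no viable approach"

# Example: no packaged connector exists, but the web UI is fully crawlable.
print(choose_approach({"web_crawlable": True}))  # -> web crawl
```

Because the options are checked in order, a system that satisfies several of them still resolves to the least complex one, which mirrors how we walk the tree from top to bottom.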

Based on the answers to these questions, we can find the easiest, most efficient route to index a new content system.  The order of the options is partially based on the ability to reuse or modify existing connector source code.  For example, systems backed by a relational database can likely use Google’s or Perficient’s database connector as a starting point.  Systems based on a file system can use Google’s File System connector as a starting point (with modifications as necessary).  For the API approach, we might be able to start from an existing connector that uses a similar type of API (REST vs. JSON vs. Seedlist, etc.).  So, even within a given approach, the level of effort can vary.  The more we understand about the available integration approaches, the more likely we are to find an optimal solution.
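For the API approach, the pull-style incremental retrieval mentioned earlier (batched by sequence id) tends to follow the same shape regardless of the source system.  The sketch below is a minimal illustration under assumed names: `fetch_changes` and the change-record format stand in for whatever the real content API provides, and `apply_change` stands in for feeding updates and deletes to the index.

```python
# Hypothetical pull-based incremental sync loop, batched by sequence id.
# fetch_changes and the change-record shape are assumptions for illustration.

def sync(fetch_changes, last_seen_id, apply_change):
    """Drain a pull-style change feed until no new changes remain."""
    while True:
        batch = fetch_changes(since_id=last_seen_id, limit=100)
        if not batch:
            return last_seen_id  # caller persists this as the next checkpoint
        for change in batch:     # e.g. {"id": 2, "action": "update"|"delete"}
            apply_change(change)           # update or delete the indexed item
            last_seen_id = max(last_seen_id, change["id"])

# Example with a canned change feed:
feed = [{"id": 1, "action": "update"}, {"id": 2, "action": "delete"}]

def fetch(since_id, limit):
    return [c for c in feed if c["id"] > since_id][:limit]

applied = []
print(sync(fetch, 0, applied.append))  # -> 2
```

The key property to look for when qualifying the API is exactly what this loop assumes: a stable, monotonically increasing cursor, and deletes reported as explicit records rather than silently disappearing.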


Chad Johnson

Chad is a Principal of Search and Knowledge Discovery at Perficient. He was previously the Director of Perficient's national Google for Work practice.
