If you read Chad Johnson’s recent post, Why do I need metadata?, you may be thinking: “Metadata sounds great in theory but the level of effort is too high!”
I won’t deny the value of metadata entered carefully by hand, but doing this for a large body of content can be very labor-intensive.
Fortunately, with Google Search Appliance (GSA) there are effective options to automatically identify search metadata that may already exist implicitly or explicitly in your knowledge repositories. Read on to learn about these options:
Metadata from Existing Repositories
Document management or collaboration platforms such as SharePoint, Lotus Notes, Documentum, Livelink, and FileNet already have their own metadata capabilities. Your organization may already have good metadata in these platforms, but how do you capture that metadata for searching? The answer is the GSA Connectors. Connectors are available for indexing these and other common applications, and each one can bring metadata from its application into your GSA index.
Entity Recognition
While your GSA is indexing content, it can be on the lookout for words and phrases to turn into metadata. This feature, Entity Recognition, was introduced in GSA version 7.0 and enhanced in 7.2.
The simplest configuration for Entity Recognition is to name a metadata field and provide a list of words that will populate the field. For example, you might add a field named “Country” and list terms like “United States”, “India”, “Brazil”, and “Italy”. Then, whenever GSA identifies one of these words or phrases in a document, it will add the “Country” metadata field with the appropriate term or terms.
Your documents may use different variations on a term – for example “USA”, “United States of America”, or “US”. You can provide variations and roll them up into a single value by using an XML Format Dictionary. There are more options – you can use regular expressions to create complex wildcard matches. If your URL patterns indicate important information about the documents, you can use entity recognition against the URLs themselves.
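To make the idea concrete, here is a minimal Python sketch of dictionary-based entity recognition with variant roll-up and a URL pattern. It is only an illustration of the technique, not GSA's implementation or its dictionary format; the variant lists, the function names, and the `/departments/` URL layout are all assumptions made for the example.

```python
import re

# Hypothetical dictionary: canonical value -> variants that may appear in text.
COUNTRY_VARIANTS = {
    "United States": ["United States of America", "United States", "USA", "US"],
    "India": ["India"],
    "Brazil": ["Brazil", "Brasil"],
    "Italy": ["Italy"],
}


def build_pattern(variants):
    """Compile one case-sensitive, whole-word pattern for a list of variants."""
    # Longest variants first so the most specific phrase is tried first.
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


PATTERNS = {name: build_pattern(v) for name, v in COUNTRY_VARIANTS.items()}

# Hypothetical URL layout: http://host/departments/<name>/...
URL_DEPARTMENT = re.compile(r"^https?://[^/]+/departments/([a-z0-9-]+)/")


def extract_countries(text):
    """Return the sorted canonical country values mentioned in the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def department_from_url(url):
    """Return the department segment of the URL, or None if it does not match."""
    match = URL_DEPARTMENT.match(url)
    return match.group(1) if match else None
```

Matching is case-sensitive on purpose, so that the variant “US” does not fire on the ordinary word “us”.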
Using Composite Metadata to Mix and Match
What if different content sources use different field names for similar metadata? What if you need to put several fields together to create the new field you ultimately need? In these situations, use Composite Entities to tell GSA how to logically combine separate metadata.
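The logic behind a composite field can be sketched in a few lines of Python. This is a conceptual illustration only and does not show how Composite Entities are configured in GSA; the field names (`Author`, `dc.creator`, `CreatedBy`, `City`, `Country`) and the output fields are hypothetical.

```python
# Hypothetical source field names that all mean "author" in different systems.
AUTHOR_ALIASES = ["Author", "dc.creator", "CreatedBy"]


def first_present(metadata, names):
    """Return the first non-empty value found under any of the given names."""
    for name in names:
        value = metadata.get(name)
        if value:
            return value
    return None


def add_composites(metadata):
    """Return a copy of the metadata with unified and combined fields added."""
    result = dict(metadata)

    # Case 1: different sources use different names for the same thing.
    author = first_present(metadata, AUTHOR_ALIASES)
    if author:
        result["DocumentAuthor"] = author

    # Case 2: several fields are put together to form one new field.
    city, country = metadata.get("City"), metadata.get("Country")
    if city and country:
        result["Location"] = f"{city}, {country}"

    return result
```

For example, a record with `dc.creator`, `City`, and `Country` would gain a `DocumentAuthor` field and a combined `Location` field, while the original fields stay untouched.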
Using Filters to Change Metadata
Sometimes metadata from other applications doesn’t flow into GSA the way you want it to. If you are using a connector, you can intercept the metadata fields and change them. GSA Connectors have a “Filter” interface to manipulate metadata. You can use existing filters from Google, or you can write Java code against the interface to perform any processing of your choice.
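The actual connector filter interface is Java, and its classes and method names are not reproduced here. The Python sketch below only illustrates the intercept-and-rewrite idea, using hypothetical helper names and field names: each filter takes a document's metadata and returns a changed copy, and filters are applied in order.

```python
def rename_field(old, new):
    """Build a filter that moves a value from one field name to another."""
    def _filter(metadata):
        out = dict(metadata)
        if old in out:
            out[new] = out.pop(old)
        return out
    return _filter


def map_values(field, mapping):
    """Build a filter that replaces known values of a field; others pass through."""
    def _filter(metadata):
        out = dict(metadata)
        if field in out:
            out[field] = mapping.get(out[field], out[field])
        return out
    return _filter


def apply_filters(metadata, filters):
    """Run the metadata through each filter in order and return the result."""
    for metadata_filter in filters:
        metadata = metadata_filter(metadata)
    return metadata


# Hypothetical chain: normalize a source-specific field name, then its values.
FILTERS = [
    rename_field("doc_type", "DocumentType"),
    map_values("DocumentType", {"PPT": "Presentation", "XLS": "Spreadsheet"}),
]
```

Keeping each filter small and chaining them makes it easier to reuse the same step across several content sources.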
Integrating Third-Party Metadata Tools
While GSA has a strong entity recognition capability, you may wish to use another off-the-shelf or custom application. For example, we have integrated with an application that uses semantic technologies to extract metadata. This was achieved using the Filter technology for connector-fed information, and a custom proxy server for GSA’s crawling of web sites. The filter interface allows for editing metadata, and the proxy server adds an “X-GSA-External-Metadata” header, which the GSA recognizes and converts directly to metadata. Together these techniques allow you to send your document information through any standard or custom process that will identify relevant metadata for your organization.
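As a rough sketch of the proxy side, the Python function below builds the header from metadata that an external tool has already extracted. It assumes the header value is a comma-separated list of percent-encoded name=value pairs; that format is my understanding rather than something stated above, so please check it against the GSA documentation for your version. The function name and the sample fields are hypothetical.

```python
from urllib.parse import quote


def external_metadata_header(metadata):
    """Build an (header name, header value) pair from extracted metadata.

    `metadata` maps each field name to a list of values. The value format
    (comma-separated, percent-encoded name=value pairs) is an assumption.
    """
    pairs = []
    for name, values in metadata.items():
        for value in values:
            # Encode everything that is not an unreserved character, so that
            # commas and equals signs inside values cannot break the list.
            pairs.append(f"{quote(name, safe='')}={quote(value, safe='')}")
    return "X-GSA-External-Metadata", ",".join(pairs)
```

A proxy would add this pair to the response it returns to the crawler, alongside the unchanged document body.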
What are your organization’s goals or challenges for search metadata? What techniques do you use? Please share your comments below to continue the conversation.