In my last post, I discussed several ways that metadata can be used in a Google Search Appliance solution. But where does metadata come from?
Let’s hold on to that question for just a second, and start with an even simpler one.
What is metadata?
Every record in a Google Search Appliance (such as a document or a web page) can have an arbitrary set of name / value pairs called metadata. The name and value can be pretty much any text, and you can include as many different metadata fields as you want. You do not have to define any schema ahead of time. Metadata values are interpreted as strings, but the GSA is also smart enough to recognize dates and numbers automatically. A metadata field can hold multiple values, either by repeating the same field several times or by using a multi-value separator, like a comma or pipe.
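To make the two multi-value styles concrete, here is a minimal Python sketch of the idea. It is only an illustration of the data model, not GSA code: the function name `collect_metadata` and the sample field names are made up for this example.

```python
from collections import defaultdict


def collect_metadata(pairs, separator=None):
    """Group (name, value) pairs into a name -> list-of-values mapping.

    Hypothetical helper for illustration only. If `separator` is given,
    each value is split on it, mimicking a multi-value separator.
    """
    fields = defaultdict(list)
    for name, value in pairs:
        parts = value.split(separator) if separator else [value]
        for part in parts:
            part = part.strip()
            if part:
                fields[name].append(part)
    return dict(fields)


# Style 1: repeat the same field several times.
repeated = collect_metadata([("author", "Ann"), ("author", "Bob")])

# Style 2: one field whose value uses a separator (a comma here).
separated = collect_metadata([("author", "Ann, Bob")], separator=",")

print(repeated)
print(separated)
```

Both calls end up with the same two values for the author field, which is the point: the two styles are different ways of writing the same multi-valued metadata.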
So now back to the question of where metadata comes from. The following list represents all the different ways metadata can be loaded into records in the Google Search Appliance:
Where does metadata come from?
- Meta tags: If you are crawling a website, <meta> tags in the HTML pages are loaded as metadata in the record.
- HTTP Headers: The X-GSA-External-Metadata HTTP response header can be used to load metadata during a web crawl without modifying the HTML source code. This response header can also be supplied when serving non-HTML files that don’t support <meta> tags. The syntax can be found here: http://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/metadata/metadata.html#1075058
- Document Properties: For non-HTML files, like PDF or Office documents, the GSA extracts metadata from the file’s native properties, like author, company, or creator. The exact properties vary from format to format. Google does not publish the complete list of document properties that are extracted as metadata, but you can index a file experimentally and see the results. The list of supported file formats can be found here: http://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/file_formats/file_formats.html#1073282
- XML Feeds: You can add metadata to new or existing records in the index by uploading an XML Feed to the GSA. There are two types of XML Feeds: Web/Metadata-and-URL Feeds and Content Feeds. Metadata can be specified in either type of feed. See the documentation for details: http://www.google.com/support/enterprise/static/gsa/docs/admin/72/gsa_doc_set/feedsguide/feedsguide.html
- Entity Recognition: The GSA can be configured to look for words, phrases, or regular expression patterns in the content of every document or web page as it is being indexed. If a match occurs, either the matching text, or a specified label, will be added to the record as metadata. Entity Recognition can be used to detect and add metadata such as locations, ID numbers, products or names.
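As a rough illustration of the XML Feeds option, the Python sketch below builds the body of a Metadata-and-URL feed for a single record. The element and attribute names (gsafeed, header, datasource, feedtype, group, record, metadata, meta) follow my reading of the Feeds guide linked above, so treat them as assumptions and check them against the documentation for your GSA version. The function name, datasource, URL, and field names are made up, and the XML declaration and DOCTYPE that a real feed file needs are left out.

```python
import xml.etree.ElementTree as ET


def build_metadata_feed(datasource, url, metadata):
    """Return a Metadata-and-URL feed body as a string (illustrative only).

    Element names are assumed from the GSA Feeds guide; verify them
    before using this against a real appliance.
    """
    gsafeed = ET.Element("gsafeed")

    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"

    group = ET.SubElement(gsafeed, "group")
    record = ET.SubElement(group, "record", url=url, mimetype="text/html")

    # One <meta> element per name / value pair.
    meta_parent = ET.SubElement(record, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta_parent, "meta", name=name, content=value)

    return ET.tostring(gsafeed, encoding="unicode")


feed = build_metadata_feed(
    "sample",                       # hypothetical datasource name
    "http://example.com/a.html",    # hypothetical record URL
    {"department": "sales", "region": "emea"},
)
print(feed)
```

The resulting document would then be uploaded to the appliance as described in the Feeds guide. The upload step is not shown here.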
Extra Credit
In general, you can think of each record in the GSA as having a single, indivisible “set” of metadata (name / value pairs). But actually there are several different buckets of metadata on each record. First, there is a set of metadata that is parsed from the HTML meta tags or from the document properties. Second, there is metadata calculated by Entity Recognition. And third, there is metadata loaded through an XML feed.
Why is this important? The implication of these multiple buckets is that you can use Entity Recognition or an XML Feed to supplement a record with additional metadata without overwriting the metadata from the first bucket. For example, submitting an XML Feed after a web page has been crawled and indexed adds metadata to the record without destroying or overwriting anything that was parsed from the meta tags or document properties. This is a common way to enrich records that have already been crawled.
However, within a given bucket, the metadata can only be overwritten as a whole, not modified or merged. For example, when the GSA recrawls a web page, all metadata from the old meta tags is replaced with the data from the new meta tags. The same is true for Entity Recognition and XML Feed metadata. A new XML Feed can replace a record’s metadata from a previous XML Feed, but it completely overwrites it, as opposed to merging or updating. This means that an XML Feed must resend all metadata fields for a record even if it just needs to add or update a single value.
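The bucket behavior can be modeled in a few lines of Python. This is a toy model of the rules described above, not how the GSA is implemented: the class `Record`, the bucket names, and the sample fields are all made up for illustration.

```python
class Record:
    """Toy model of a record with three independent metadata buckets."""

    # Hypothetical bucket names: parsed meta tags / document properties,
    # Entity Recognition, and XML Feed metadata.
    BUCKETS = ("parsed", "entity", "feed")

    def __init__(self):
        self.buckets = {name: {} for name in self.BUCKETS}

    def replace_bucket(self, bucket, metadata):
        """Overwrite one bucket as a whole; other buckets are untouched."""
        if bucket not in self.buckets:
            raise KeyError(f"unknown bucket: {bucket}")
        self.buckets[bucket] = dict(metadata)

    def all_metadata(self):
        """Combined view across buckets: field name -> list of values."""
        combined = {}
        for name in self.BUCKETS:
            for field, value in self.buckets[name].items():
                combined.setdefault(field, []).append(value)
        return combined


record = Record()
record.replace_bucket("parsed", {"title": "Q3 Report"})
record.replace_bucket("feed", {"department": "sales", "region": "emea"})

# A second feed that sends only the changed field drops "department".
record.replace_bucket("feed", {"region": "apac"})
after_partial_feed = record.all_metadata()

# To keep "department", the feed has to resend every field.
record.replace_bucket("feed", {"department": "sales", "region": "apac"})
after_full_feed = record.all_metadata()

print(after_partial_feed)
print(after_full_feed)
```

In both cases the title from the first bucket survives, because a feed never touches the parsed metadata. Only the full resend keeps the department field.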