FAST Search Server 2010 for SharePoint: Creating a Custom Document Processing Step

FAST Search Server 2010 for SharePoint includes the capability to extend the document processing pipeline by adding additional steps. These steps are implemented as executable assemblies that accept, at minimum, two command line arguments: the path to an input file and the path to an output file. The input file is created by the document processing pipeline prior to calling the custom executable. It contains the crawled properties of the document currently being processed but is limited to those properties that are specified in the Input section of the Pipeline Extensibility configuration file. Using the input properties, the custom step executes whatever logic is necessary to generate the desired output and then writes the output to a file. The output file must contain the crawled properties that are specified in the Output section of the configuration file.
The executable runs in a sandboxed execution environment and, therefore, is limited in what resources it can access. For example, the executable cannot access any file system locations other than those passed in as the input and output paths. In my example, I will demonstrate that it is possible to connect to a SQL Server database using SQL Server authentication. However, I was not able to access the database using Windows authentication. In addition to accessing a database, the documentation from Microsoft indicates that it is possible to make web service calls from a custom step. I have not attempted to confirm this.
One challenge you will come across when trying to connect to a database or web service is how to pass in the connection information. One approach would be to hard-code this information into your assembly, but that is problematic for obvious reasons. A better approach is to use additional command line arguments and then specify the connection information in the configuration file. I will show this in my example.
Another challenge is knowing the identity under which the code will execute. I was able to confirm that the executable runs under the identity of the FAST service account. However, as I stated above, this knowledge did not do me much good since I was not able to use this identity to connect to a database or write to a file. Perhaps this identity will be more helpful in connecting to a web service.
The biggest challenge I encountered was debugging. My first thought was to use the Visual Studio debugger, but I was not able to find a way to attach it to the executing process. My next thought was to write logging information to a file, but the limitations of the sandbox appear to rule this out as well. I ultimately settled on writing logging information to a database table.
The only guidance I could find from Microsoft related to debugging mentioned two options. First, debug the executable outside of the document processing pipeline by creating an input file that matches the format specification. This is a great suggestion for ensuring that your logic does what you think it is supposed to do; however, it doesn’t help with observing what happens inside your code when it is running in the pipeline. The second suggestion is to set an exit code other than zero and write to the standard error stream. The message you write will be visible in the SharePoint Crawl Log.
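Here is a minimal sketch of what the second option looks like. This is separate from the example code later in this post, and the structure and messages are illustrative only:

using System;

class Program
{
    static int Main(string[] args)
    {
        try
        {
            // ... normal processing of args[0] (input file) and args[1] (output file) ...
            return 0;
        }
        catch (Exception ex)
        {
            // The pipeline captures standard error, and a nonzero exit code
            // flags the item so the message shows up in the SharePoint Crawl Log.
            Console.Error.WriteLine("Custom step failed: " + ex.Message);
            return 1;
        }
    }
}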
In the example below I will walk through the steps to create, configure, and deploy the custom processing step as well as to utilize the managed property populated by the custom step as a search result refiner. My example showcases another feature of FAST for SharePoint called entity extraction. By default, the FAST document processing pipeline will attempt to extract entities that represent company names and place names. I am going to extend this functionality by mapping the extracted company names to the location of the company’s world headquarters. The company location will be stored in a new crawled property called CompanyLocation which is mapped to a new managed property also called CompanyLocation. I will use this managed property as a custom refiner on my search page.
Here are the steps to make this happen.
1) Create an executable that takes two arguments: an input file path and an output file path
I divided my code into three classes. The main Program class contains most of the logic; the DBLogger class helps with writing logging info to the database; and the CompanyLookup class gets the location value from the database. The code of all three classes is fairly straightforward, so I will not explain it in detail. One thing I do want to point out, however, is the use of the \u2029 character (the Unicode paragraph separator) to delimit multi-valued crawled properties. Since the companies property can have multiple companies extracted to it, I need to split on this character when reading the companies and then use it as the delimiter when writing the new crawled property.
using System.Linq;
using System.Text;
using System.Xml.Linq;

namespace FASTCompanyLookupStep
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = args[0];      // input file path supplied by the pipeline
            string output = args[1];     // output file path supplied by the pipeline
            string connection = args[2]; // connection string from pipelineextensibility.xml

            DBLogger logger = new DBLogger(connection);
            logger.Write("This is a test to prove that I can write to a database table.");

            // Read the extracted company names; multiple values are delimited by \u2029.
            XDocument inDoc = XDocument.Load(input);
            string[] companies = inDoc.Descendants("CrawledProperty")
                .First(e => e.Attribute("propertyName").Value == "companies")
                .Value.Split('\u2029');

            // Build the output document containing the new CompanyLocation crawled property.
            XDocument outDoc = new XDocument();
            XElement docElement = new XElement("Document");
            outDoc.Add(docElement);
            XElement cpElement = new XElement("CrawledProperty");
            docElement.Add(cpElement);
            cpElement.Add(new XAttribute("propertySet", "1104E7BE-46CA-4b31-975F-4F37FB8303BE"));
            cpElement.Add(new XAttribute("varType", "31"));
            cpElement.Add(new XAttribute("propertyName", "CompanyLocation"));

            // Look up each company's headquarters location and build a
            // \u2029-delimited value for the new multi-valued crawled property.
            StringBuilder sb = new StringBuilder();
            CompanyLookup lookup = new CompanyLookup(connection);
            foreach (string company in companies.Select(c => c.Trim().TrimEnd('\u2029')))
            {
                if (sb.Length > 0)
                {
                    sb.Append('\u2029');
                }
                sb.Append(lookup.GetLocation(company));
            }
            cpElement.Value = sb.ToString();
            outDoc.Save(output);
        }
    }
}
using System.Data.SqlClient;

namespace FASTCompanyLookupStep
{
    public class CompanyLookup
    {
        private string ConnectionString { get; set; }

        public CompanyLookup(string connectionString)
        {
            ConnectionString = connectionString;
        }

        // Returns the headquarters location for a company name, or null if
        // the company is not in the lookup table.
        public string GetLocation(string companyName)
        {
            using (SqlConnection conn = new SqlConnection(ConnectionString))
            {
                conn.Open();
                // Parameterized to avoid SQL injection from extracted entity values.
                SqlCommand cmd = new SqlCommand(
                    "SELECT Location FROM CompanyLocation WHERE CompanyName = @name", conn);
                cmd.Parameters.AddWithValue("@name", companyName);
                return cmd.ExecuteScalar() as string;
            }
        }
    }
}
using System;
using System.Data.SqlClient;

namespace FASTCompanyLookupStep
{
    public class DBLogger
    {
        private string ConnectionString { get; set; }

        public DBLogger(string connectionString)
        {
            ConnectionString = connectionString;
        }

        // Writes a message to the Logging table along with the process name
        // and the identity the pipeline step is running under.
        public void Write(string message)
        {
            using (SqlConnection conn = new SqlConnection(ConnectionString))
            {
                conn.Open();
                SqlCommand cmd = new SqlCommand(
                    "INSERT INTO Logging (ProcessName, LoggedTime, [Message], UserIdentity) " +
                    "VALUES (@process, @time, @message, @identity)", conn);
                cmd.Parameters.AddWithValue("@process", System.Diagnostics.Process.GetCurrentProcess().ProcessName);
                cmd.Parameters.AddWithValue("@time", DateTime.Now);
                cmd.Parameters.AddWithValue("@message", message);
                cmd.Parameters.AddWithValue("@identity", Environment.UserName);
                cmd.ExecuteNonQuery();
            }
        }
    }
}
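For reference, the lookup and logging tables that this code queries are not created by FAST; something like the following schemas (a sketch matching the queries above, not a requirement) will work:

-- Hypothetical table definitions; adjust names and sizes to suit your environment.
CREATE TABLE CompanyLocation (
    CompanyName NVARCHAR(255) NOT NULL, -- extracted company name, e.g. Microsoft
    Location NVARCHAR(255) NOT NULL     -- headquarters location, e.g. Redmond, WA
);

CREATE TABLE Logging (
    ProcessName  NVARCHAR(255),
    LoggedTime   DATETIME,
    [Message]    NVARCHAR(MAX),
    UserIdentity NVARCHAR(255)
);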
After compiling the assembly, deploy it to the C:\FASTSearch\bin folder on all indexing servers.
The executable will work with an input file in the format below. The file will contain the crawled properties that are specified in the configuration file. If a crawled property contains multiple values, the values will be separated by the \u2029 character; because this is a non-printing character, multiple values may appear to run together, as in the sample below. After performing the processing logic of the step, the executable will generate an output file in the same format as the input file, containing the crawled properties specified in the Output section of the configuration file.

<?xml version="1.0" encoding="utf-8"?> 
<Document> 
<CrawledProperty propertySet="48385c54-cdfc-4e84-8117-c95b3cf8911c" varType="31" propertyName="companies">test1test2test3</CrawledProperty> 
</Document> 

The executable runs before crawled properties are mapped to managed properties.
2) Update C:\FASTSearch\etc\pipelineextensibility.xml to add your custom step to the pipeline
The next step is to update the pipeline extensibility configuration file. This needs to be done on all indexing servers. The configuration file contains a <Run> element for each custom step that is added to the pipeline. This element includes the command line to be used to call the executable as well as the input and output crawled properties. In my example, the output crawled property will be created in the next step.
The first two command line arguments are placeholders for the input and output file paths that will be provided by the pipeline. The third argument is the database connection string that will be used for logging and lookup.

<Run command="fastcompanylookupstep.exe %(input)s %(output)s Server=(local);Database=FASTCompanyLocation;uid=Test;pwd=Test;"> 
<Input> 
<CrawledProperty propertySet="48385c54-cdfc-4e84-8117-c95b3cf8911c" varType="31" propertyName="companies"/> 
</Input> 
<Output> 
<CrawledProperty propertySet="1104e7be-46ca-4b31-975f-4f37fb8303be" varType="31" propertyName="CompanyLocation"/> 
</Output> 
</Run> 

After updating the file, you will have to run

psctrl reset

from the command line to reset the document processors. Otherwise, the configuration changes will not take effect.
3) Run a PowerShell script from the FAST admin prompt to add new crawled properties and managed properties to the FAST configuration
The next step is to create the crawled and managed properties that will store the company locations for the document. The PowerShell script below will take care of this for you.

# Property set GUID for the new crawled property; must match the GUID used in
# the executable and in pipelineextensibility.xml
$guid = "1104E7BE-46CA-4b31-975F-4F37FB8303BE"
# Create a category to hold the custom pipeline crawled properties
$cat = New-FASTSearchMetadataCategory -name "Custom Pipeline Properties" -propset $guid
# Create the CompanyLocation crawled property (variant type 31 = string)
$cp = New-FASTSearchMetadataCrawledProperty -name CompanyLocation -varianttype 31 -propset $guid
# Create the CompanyLocation managed property (type 1 = text) and enable it for refinement
$mp = New-FASTSearchMetadataManagedProperty -name CompanyLocation -type 1
$mp.RefinementEnabled = $true
$mp.MergeCrawledProperties = $true
$mp.Update()
# Map the crawled property to the managed property
New-FASTSearchMetadataCrawledPropertyMapping -CrawledProperty $cp -ManagedProperty $mp

4) Create a new FAST Search Center site
Next, create a FAST Search Center site in a site collection that belongs to a web application that has been configured to use FAST as its search provider. Creating the search center will give you the ability to edit the web part properties on the search results page.
5) Update the Refinement Panel web part on the search results page to include your new managed property
The last step is to update the Refinement Panel web part to include your new managed property. Adding the <Category> element shown below will include the CompanyLocation property as a refiner. To do this, you need to update the Filter Category Definition property under Refinement. Be sure also to uncheck the Use Default Configuration property; otherwise, your changes will not take effect.

<FilterCategories> 
<Category Title="Company Location" Description="The location of the company" Type="Microsoft.Office.Server.Search.WebControls.ManagedPropertyFilterGenerator" MetadataThreshold="1" NumberOfFiltersToDisplay="4" MaxNumberOfFilters="20" ShowMoreLink="True" MappedProperty="companylocation" MoreLinkText="show more" LessLinkText="show fewer" ShowCounts="Count" />
...
</FilterCategories> 

To finish up my example, I created three Word documents in the Shared Documents library to represent case studies about three different companies (Ford, Microsoft, and Starbucks). These companies also happen to be the same ones whose headquarters locations I have mapped in my database table. After running a crawl, I can execute a search on the phrase “Case Study”, which is found in the titles of all three documents, and my search results will include the Company Location refiner populated with the three headquarters locations.


One last thing to note: your new managed property and refiner are available via both the search UI and the search web service. If you execute the following search against the web service using the QueryEx method, you will see the company location information come back as a refiner.

<QueryPacket> 
<Query> 
<Context> 
<QueryText>Case Study</QueryText> 
</Context> 
<IncludeRefinementResults> 
<Refiners> 
<Refiner>companylocation</Refiner> 
</Refiners> 
</IncludeRefinementResults> 
</Query> 
</QueryPacket> 
The results XML for the refiner will look something like the sketch below. The element names follow the RefinementResults table that QueryEx returns; the exact values and refinement tokens will vary by environment.
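<RefinementResults>
  <RefinementResult>
    <FilterName>companylocation</FilterName>
    <FilterDisplayName>Company Location</FilterDisplayName>
    <RefinementName>Redmond, WA</RefinementName>
    <RefinementValue>Redmond, WA</RefinementValue>
    <RefinementCount>1</RefinementCount>
    <RefinementToken>...</RefinementToken>
  </RefinementResult>
  <!-- one RefinementResult per refiner value, e.g. Dearborn, MI and Seattle, WA -->
</RefinementResults>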
