ESP comes with standard synonym dictionaries for several languages, which can be used to provide synonym support during search. However, these dictionaries, while pretty extensive, are fairly generic and only contain terms that are common in a specific language. Often there is a need to support synonyms that are specific to a subject domain like medicine, real estate or construction, and in that case you will need to handle custom synonyms during search. ESP handles synonymy through something called synonym expansion, and there are two ways to achieve this in ESP: document (index) side synonym expansion and query side synonym expansion.
Document Side Synonym Expansion
Document side synonym expansion adds synonyms to the document as it is being processed (fed), so that the indexed document contains the original terms and all of their synonyms. The synonym expansion is enabled as a custom stage in the document processing pipeline. The custom stage points to the document side synonym dictionary that is to be used for synonym expansion.
For example, with document side synonym expansion enabled, when someone searches for “heart attack”, they will also get hits for “myocardial infarction” and vice versa. This happens because during document processing, ESP added “myocardial infarction” to the document that contained “heart attack” and added “heart attack” to the document that contained “myocardial infarction”. This is called a two-way expansion, meaning that a term gets expanded to include its synonyms and a synonym gets expanded to include its term.
Document side synonym expansion slows the feeding and indexing rate and also creates a larger index, so it requires more disk space on the search servers. Also, if the document side synonym dictionary is modified, all existing documents have to be re-fed and re-indexed so that they carry the latest version of the synonyms.
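The two-way expansion can be illustrated with a small Python sketch. This is not how ESP implements its pipeline stage. The function names, the in-memory synonym table and the naive substring matching are all assumptions made purely to show the idea.

```python
# Conceptual sketch only: ESP performs this inside a document processing
# stage. Every name below is hypothetical and exists only for illustration.

def build_two_way_table(groups):
    """Map every phrase in a synonym group to the other phrases in that group."""
    table = {}
    for group in groups:
        for phrase in group:
            others = (p for p in group if p != phrase)
            table.setdefault(phrase, set()).update(others)
    return table


def expand_document(text, table):
    """Return the synonyms that would be indexed alongside the original text."""
    lowered = text.lower()
    added = set()
    for phrase, synonyms in table.items():
        if phrase in lowered:  # naive substring match, no tokenization
            added.update(s for s in synonyms if s not in lowered)
    return sorted(added)


table = build_two_way_table([["heart attack", "myocardial infarction"]])
print(expand_document("Patient admitted after a heart attack.", table))
print(expand_document("History of myocardial infarction.", table))
```

Each document picks up the phrase it did not originally contain, which is why a search for either term finds both documents.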
Query Side Synonym Expansion
Query side synonym expansion is a lot more flexible because the expansion is applied to the search terms at query time and does not involve injecting the synonyms into the document at document processing time. Think of it as runtime synonym expansion. When the query server gets a request with query side synonym expansion enabled, it uses the associated synonym dictionary to expand the search terms that the user entered.
For example, if “myocardial infarction” is a synonym for “heart attack” and the user entered “heart attack” as the search terms, the query server will transform the query to “heart attack myocardial infarction”. This way, to the search server, it appears that the user typed in both sets of terms, and the search is performed for any document that has “heart attack” or “myocardial infarction”.
Once the custom query side synonym dictionary has been compiled, it has to be “registered” in the qrserver’s config file (qtf_config.xml). You can add the dictionary to the default synonym query pipeline stage named synonym, or you can create your own custom synonym query pipeline stage. Applying the changes to the qtf_config.xml file requires a short (less than a minute) qrserver downtime while the config file is deployed, because the qrserver has to be restarted. Once the qrserver config is active, the qtf_synonym:querysynonyms=true query parameter must be supplied with the query to turn on query side synonym expansion.
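The query transformation can be sketched in a few lines of Python. This only mimics the visible effect on the query string; it is not the qrserver's actual logic. The function name `expand_query` and the dictionary layout are assumptions, and the sketch only matches the whole query against the dictionary.

```python
# Conceptual sketch only: the real expansion happens inside the qrserver's
# query pipeline. expand_query and the dict layout are hypothetical.

def expand_query(query, synonyms):
    """Append the synonyms of the query phrase, keeping the original terms."""
    extra = synonyms.get(query.lower().strip(), [])
    return " ".join([query.strip()] + list(extra))


synonyms = {"heart attack": ["myocardial infarction"]}
print(expand_query("heart attack", synonyms))
print(expand_query("broken arm", synonyms))  # no entry, query is unchanged
```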
Below is an example of how to add your custom query side synonym dictionary to the default synonym query pipeline stage.
<instance name="synonym" type="external" resource="qt_synonym">
<parameter name="enable" value="1"/>
<parameter name="synonymdict1" value="resources/dictionaries/synonyms/qt/short_spellvars.aut"/>
<parameter name="synonymdict2" value="resources/dictionaries/synonyms/qt/short_wordnet.aut"/>
<parameter name="synonymdict3" value="resources/dictionaries/synonyms/qt/my_custom_syn.aut"/>
</instance>
Below is an example of how to add your custom query side synonym dictionary to a custom query synonym stage. In the example below, two different dictionaries are being employed by this stage and therefore synonyms from both dictionaries will be used during query time synonym expansion.
<instance name="customsynonym" type="external" resource="qt_synonym">
<parameter name="enable" value="1"/>
<parameter name="synonymdict1" value="resources/dictionaries/synonyms/qt/my_custom_syn_1.aut"/>
<parameter name="synonymdict2" value="resources/dictionaries/synonyms/qt/my_custom_syn_2.aut"/>
</instance>
If you’ve defined a custom query synonym stage in the qtf_config.xml file, then the query parameter to use it at query time would look like this: qtf_<YOURSTAGENAME>:querysynonyms=true. Using the above example, the query parameter would be qtf_customsynonym:querysynonyms=true.
The advantage to query side expansion is that you can specify which synonym dictionary is to be used at query time and also you can change your synonyms (via a new or updated dictionary) on the fly without having to re-index all the documents.
The downside to query side expansion is that it can have a very significant impact on QPS and on the performance of the query server and the search server in terms of search latency. While working on a project for one of our clients, we initially started with query side synonym expansion. However, we quickly found out that our query performance was so bad that we couldn’t get more than 2 QPS out of the qrserver on the production server. We subsequently abandoned query side expansion and went with document side expansion. There were probably several factors behind the horrible query performance with query side synonym expansion. Our query synonym dictionary was quite large (370,000 terms, each with an average of 12 synonyms), which required the qrserver to spend a considerable amount of time transforming the query. We also had a very large and complicated XRANK query in play, which added to the problem by pushing the search latency higher. Lastly, I don’t think we had the right hardware infrastructure to handle the large synonym dictionary and the XRANK; we might have needed several qrservers and more search rows.
While query side synonym expansion gives you the flexibility of not “hard coding” the synonyms into your documents, you have to consider the performance implications. If you are talking about a very small dictionary (say 1000 terms, each with 6 synonyms), then you might be okay. The only way to know is to benchmark query and search performance with query side synonym expansion enabled and disabled.
Custom synonym dictionaries have to be compiled using the dictman or dictcompile command line tools or the Linguistics Studio tool. Instructions on how to use these tools are available in the ESP Advanced Linguistics Guide. If you are using the dictman or dictcompile command line tools, an import file containing the list of synonyms is required to compile the dictionaries. The Linguistics Studio tool is a Java application with a UI to create and maintain different types of dictionaries, but I’ve found using an import file with the dictman or dictcompile tools to be more flexible. Also, dictionaries are language specific, so if you are dealing with multiple languages, you’ll need a dictionary for each language and therefore a distinct import file for each language.
The query and document side dictionaries have different import file formats. The query side import format has a lot of options for things like whether the original term should be replaced with the synonyms (rewrite), whether the synonym expansion goes in both directions (symmetric) and the weight of the synonym. The query side format shown below is for no-rewrite, expansion in both directions and using the default weight of 100. If you are interested in all the options, refer to the ESP Advanced Linguistics Guide, which lists all the options and what they mean.
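For scale, the large dictionary described earlier works out to roughly 370,000 × 12 = 4,440,000 synonym entries, against about 1,000 × 6 = 6,000 for the small one. A comparison between expansion enabled and disabled could start from a timing harness like the Python sketch below. It is deliberately generic. `measure_qps` and `fake_search` are hypothetical names, and `fake_search` is a stand-in that you would replace with a call to your own search front end, once with the synonym query parameter and once without it. It runs queries one at a time, so it measures sequential throughput rather than load under concurrency.

```python
import time

# Generic timing sketch. Nothing here talks to ESP; fake_search is a
# placeholder for whatever function issues a real query in your environment.

def measure_qps(run_query, queries):
    """Run each query once, sequentially; return (QPS, mean latency in seconds)."""
    if not queries:
        raise ValueError("need at least one query to measure")
    start = time.perf_counter()
    for q in queries:
        run_query(q)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed, elapsed / len(queries)


def fake_search(query):
    time.sleep(0.001)  # pretend the search takes about a millisecond


qps, latency = measure_qps(fake_search, ["heart attack"] * 20)
print(f"{qps:.1f} QPS, {latency * 1000:.2f} ms mean latency")
```

Running the same query list through both configurations and comparing the two results gives a first indication of how much the expansion costs.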
NOTE: The formats below are specific to import files used with the dictman or dictcompile command line tools. Linguistics Studio allows import files in CSV, Excel, and tab separated formats. It also allows you to import a dictman/dictcompile import file by choosing the Tabulator option in the Import Wizard. Please refer to the Linguistics Studio documentation regarding its import file formats.
Query Side Import Format
Import Syntax: term<TAB CHARACTER>[[<rewrite flag>,[synonym 1,<weight>,<symmetric flag>],[synonym 2,<weight>,<symmetric flag>],……[synonym N,<weight>,<symmetric flag>]], ]
Example: heart attack [[false,[myocardial infarction,100,true],[infarction of heart,100,true],[cardiac infarction,100,true]],]
Document Side Import Format
Import Syntax: term<TAB CHARACTER>[synonym 1,synonym 2,……synonym N]
Example: heart attack [myocardial infarction,infarction of heart,cardiac infarction]
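If your synonyms live in a database or spreadsheet, the import lines can be generated rather than typed. The Python sketch below follows the bracketed layout of the two examples above, with a tab between the term and the synonym list. The function names are hypothetical, and the defaults mirror the no-rewrite, weight 100, symmetric settings used in this post. Check the output against the ESP Advanced Linguistics Guide before compiling.

```python
# Sketch of generating import lines in the layout shown in the examples.
# Function names are hypothetical; verify the format against the ESP docs.

def _flag(value):
    """Render a Python bool as the lowercase true/false used in the file."""
    return "true" if value else "false"


def query_side_line(term, synonyms, rewrite=False, weight=100, symmetric=True):
    """Build one query side import line: term<TAB>[[rewrite,[syn,weight,sym],...],]"""
    entries = ",".join(f"[{s},{weight},{_flag(symmetric)}]" for s in synonyms)
    return f"{term}\t[[{_flag(rewrite)},{entries}],]"


def document_side_line(term, synonyms):
    """Build one document side import line: term<TAB>[syn1,syn2,...]"""
    return term + "\t[" + ",".join(synonyms) + "]"


syns = ["myocardial infarction", "infarction of heart", "cardiac infarction"]
print(query_side_line("heart attack", syns))
print(document_side_line("heart attack", syns))
```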
Keep in mind that you will have to scrub the values of the original terms and their synonyms for commas and open and close square brackets. Having a comma or either of the square brackets in one of the values will result in an error during the import process. Below are examples of invalid entries for the query and document side import files.
Invalid Query Side Import Item: heart [attack] [[false,[myocardial, infarction,100,true],[infarction [of] heart,100,true],[cardiac infarction,100,true]],]
Invalid Document Side Import Item: heart [attack] [myocardial, infarction,infarction [of] heart,cardiac infarction]
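A scrubbing pass along these lines can be run over every term and synonym before the import file is written. This is a minimal Python sketch. The function names are hypothetical, and replacing the offending characters with spaces is just one possible policy; you may prefer to reject or log such values instead.

```python
# Sketch of a scrub step for terms and synonyms. Replacing forbidden
# characters with a space is an assumption, not an ESP requirement.

FORBIDDEN = ",[]"


def find_forbidden(value):
    """Return the forbidden characters that occur in value."""
    return [ch for ch in FORBIDDEN if ch in value]


def scrub(value):
    """Replace commas and square brackets with spaces, then tidy whitespace."""
    cleaned = "".join(" " if ch in FORBIDDEN else ch for ch in value)
    return " ".join(cleaned.split())


for raw in ["heart [attack]", "myocardial, infarction", "infarction [of] heart"]:
    print(repr(raw), find_forbidden(raw), "->", repr(scrub(raw)))
```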
In a nutshell, it’s pretty straightforward to enable custom synonymy in FSIS/ESP. It involves building a dictionary and using either document side synonym expansion or query side synonym expansion. Both methods have advantages and drawbacks, so the choice depends on which method will work best in your environment. I would start with query side synonym expansion and, if query and search performance becomes an issue, switch to document side synonym expansion.
If you have any questions regarding this blog post, please feel free to email me at email@example.com. I welcome feedback on this content and also greatly appreciate suggestions for grammatical and/or spelling errors.