Choosing a Search Engine for the Enterprise
In July 2006, ei Magazine published Choosing a Search Engine for the Enterprise (vol 2, iss. 12). The text of this article has been reproduced below.
What are the factors to consider when choosing and configuring an enterprise search system?
Search technologies have matured in recent years to offer advanced features and robust scalability. Knowledge of these capabilities informs the creation of appropriate functional specifications for business-specific requirements, a crucial step in the process of deciding which system to adopt. This article therefore aims to explore the landscape of search technology, including a discussion of basic, advanced and XML-specific features. In addition, we focus on the important things to consider when choosing and implementing a search engine.
What Is Search?
Search engines provide the means for a specific document within a large document collection to be found by someone who does not know its title, identifier code or location. Instead, potentially relevant documents are found by comparing the content of the documents in the collection against terms entered into a search query.
The most basic feature that any search engine will provide is exact match searching. For example, a search for "cafe" will find all documents that contain at least one instance of that word. It is also typical for search engines to tolerate minor variants, such as "Cafe" (ignoring letter-case), "cafes" (allowing for plurals and other variants) and "café" (ignoring the presence or absence of character accents). In addition, exact phrases such as "French Café" may be found, as well as words in close proximity such as "French AND Café", which would find a document containing "The Café is French".
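The variant tolerance described above can be illustrated with a minimal sketch. It assumes a very crude plural rule (dropping a trailing "s") in place of the proper linguistic analysis a real engine would apply; the function names `normalise` and `matches` are hypothetical.

```python
import unicodedata


def normalise(word: str) -> str:
    """Fold case, strip accents and drop a trailing 's' (a crude stand-in for stemming)."""
    decomposed = unicodedata.normalize("NFKD", word.casefold())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if len(stripped) > 3 and stripped.endswith("s"):
        stripped = stripped[:-1]
    return stripped


def matches(query: str, document: str) -> bool:
    """Return True if any word in the document equals the query after normalisation."""
    target = normalise(query)
    return any(
        normalise(token.strip('.,;:!?"')) == target for token in document.split()
    )
```

With these assumptions, a query for "cafe" would match documents containing "Cafe", "cafes" or "café".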
A search system actually performs a number of tasks. It retrieves and accepts new content, analyses and modifies content to detect its language and text encoding scheme (and perhaps to specify its relevancy ranking), indexes content to facilitate fast searching, accepts search queries (and maybe modifies them, such as to correct spelling errors), performs fast searches using the index, and (of course) presents a results list that shows the documents that most closely match the query. It may also rank or group search results according to pre-defined algorithms or business rules.
Search engines can be given content to index in several ways. Those that offer a file traversing feature may be able to detect new, updated and deleted files. Similarly, those that offer a web-crawling feature will index content on designated Web servers by following hypertext links between pages, periodically monitoring the sites to detect new, modified and deleted pages. Advanced search engines also offer modules for connecting to other content sources, including databases and content management systems. Finally, these systems also offer an API to allow third-party applications to push content into the system.
A search query may be analysed and modified before it is used to find documents. For example, if the query is "contetn", this might be corrected to "content" without the user being aware of it (or the prompt "did you mean 'content'" might be shown to the user instead). Automatic correction might also include de-phrasing, which involves removal of redundant phrases, such as "what is the ...". Other automatic corrections to the query might include additions to catch documents that include slight variants of the query terms, such as "boats" or "boating" instead of just "boat", as well as synonyms of the terms, such as "ship". Queries might also include an instruction to limit searching to specified fields of the content. For example, the title of an HTML document will be in a separate field, and documents that contain the search terms within their titles are obviously very likely to be relevant.
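As a rough illustration of this kind of query preparation, the sketch below chains de-phrasing, spelling correction and variant expansion. The vocabulary, anti-phrase list and variant table are all hypothetical, and the fuzzy matching from Python's standard `difflib` module merely stands in for whatever correction technique a given product uses.

```python
import difflib

# Hypothetical data: a real engine would derive these from its index and dictionaries.
VOCABULARY = ["content", "contact", "search", "engine", "boat"]
ANTI_PHRASES = ["what is the", "what is a", "how do i"]
VARIANTS = {"boat": ["boats", "boating", "ship"]}


def dephrase(query: str) -> str:
    """Remove a redundant leading phrase such as 'what is the'."""
    lowered = query.lower().strip()
    for phrase in ANTI_PHRASES:
        if lowered.startswith(phrase + " "):
            return lowered[len(phrase) + 1:]
    return lowered


def correct(term: str) -> str:
    """Replace a term with its closest vocabulary entry, if one is similar enough."""
    if term in VOCABULARY:
        return term
    close = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else term


def expand(term: str) -> list:
    """Add known variants and synonyms to a term."""
    return [term] + VARIANTS.get(term, [])


def prepare(query: str) -> list:
    """Return one list of alternatives per query term."""
    return [expand(correct(term)) for term in dephrase(query).split()]
```

Whether the correction is applied silently or offered as a "did you mean" prompt is a presentation choice that sits outside this sketch.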
When documents are split into a number of fields, it should be possible to specify which of the fields are to be shown, and whether or not to include an automatically generated summary of the content of one or more of them. Dynamic summaries are representative fragments of the document that include the term being searched for (and are useful for helping users to decide if the term is being used in a relevant context to their needs).
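A dynamic summary can be sketched very simply as a window of text around the first occurrence of the search term. The function name and the fixed character window are assumptions made for illustration; real products are likely to respect sentence and word boundaries.

```python
def dynamic_summary(text: str, term: str, width: int = 40) -> str:
    """Return a fragment of text centred on the first occurrence of term."""
    position = text.lower().find(term.lower())
    if position == -1:
        # Term not present: fall back to the opening of the document.
        return text[: 2 * width].strip()
    start = max(0, position - width)
    end = min(len(text), position + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix
```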
A query may be too vague to isolate just a few documents. When hundreds or thousands of documents appear in the results list, the only hope is that the documents most likely to be relevant will be at the top of the list. A number of factors may combine to produce a ranking score for each matching document, and the list of documents can then be sorted in descending order of their scores. Ranking factors might include:
- the age of the document - assuming more recent documents are the most relevant
- its authority - unlike the internet search engines, this is generally not determined by how many other documents link to it, but by other factors
- its quality - perhaps determined from specific metadata values
- the proximity of query terms - if close together, they are more likely to have been used in the relevant context
- the position of query terms - the earlier in a document, the more significant they are likely to be
- zone-specific matching - see the XML features later for more on this topic.
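To show how several such factors might combine into a single ranking score, here is a minimal sketch covering age, position and proximity only. The `Match` fields, the formulas and the weights are all assumptions chosen for illustration; each product has its own (usually configurable) scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Match:
    title: str
    age_days: int        # days since the document was last modified
    first_position: int  # word offset of the first query term in the document
    term_span: int       # number of words between the first and last query term


def score(match: Match) -> float:
    """Combine the factors into one score; every factor lies between 0 and 1."""
    recency = 1.0 / (1.0 + match.age_days / 365.0)
    position = 1.0 / (1.0 + match.first_position / 100.0)
    proximity = 1.0 / (1.0 + match.term_span)
    # Hypothetical weights; a real system would expose these for tuning.
    return 0.4 * recency + 0.3 * position + 0.3 * proximity


def rank(matches: list) -> list:
    """Sort matching documents in descending order of score."""
    return sorted(matches, key=score, reverse=True)
```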
While most of the features described above will be familiar to almost any casual user of the Web, advanced search engines have other, less obvious features that can greatly improve the effectiveness of search technology within the enterprise.
Content Retrieval Features
Database connectors might include the ability to use SQL queries to allow for flexible definitions of what a document within the search system will actually contain. For example, there could be separate documents for each table in the database, or tables could be merged to create larger documents that are easy to find and read (at the cost of some data duplication).
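The idea of merging tables into larger documents can be sketched with an in-memory SQLite database. The table names, columns and sample rows are hypothetical, and a commercial connector would be configured rather than hand-coded, but the SQL join conveys the principle.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
INSERT INTO customer VALUES (1, 'Acme Ltd');
INSERT INTO orders VALUES (10, 1, 'widgets'), (11, 1, 'gears');
""")


def merged_documents(connection):
    """Build one searchable document per customer by joining two tables."""
    rows = connection.execute("""
        SELECT c.id, c.name, o.item
        FROM customer AS c
        JOIN orders AS o ON o.customer_id = c.id
        ORDER BY c.id, o.id
    """)
    documents = {}
    for customer_id, name, item in rows:
        doc = documents.setdefault(customer_id, {"name": name, "items": []})
        doc["items"].append(item)
    return documents
```

Here the customer name is fetched once per order row, which is the data duplication mentioned above.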
Information that is currently unsearchable because it is locked within bespoke applications may be search-enabled through the use of available APIs. This allows an organisation to find all of the information relating to a particular customer or issue. This is particularly useful when dealing with compliance issues.
Improving Query Effectiveness
Advanced search technologies include sophisticated techniques for improving the effectiveness of queries, including:
- synonym and variant spelling detection
- proper name and phrase detection
- wildcard searches
- duplicate document removal
If queries are expected to find documents that use synonyms or variant spellings of a word, then dictionaries containing the synonyms and the variant spellings may be created for this purpose.
A list of proper names and phrases is useful for avoiding inadvertent matches. For example, it might not be desirable that a search for "Guardian" (the UK newspaper) would find a document containing "guard" or "guardians". Similarly, standard phrases such as "management consultant", when entered as a query, should not find documents that contain variant words such as "manage" or "consulting". In addition, identified names and phrases should be easily detected when used in a query, and simple spelling errors corrected automatically.
The ability to use "*" to represent zero or more characters, and "?" to represent one unknown character, can be useful in special circumstances. While they might conceivably be used to avoid the need to key complete words, such as "int*ion" to find "internationalisation", they might also make unintended matches like "interruption". This feature is really intended for special circumstances, such as to find all product codes that begin or end with a particular sequence of characters.
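The behaviour of these two symbols can be demonstrated with Python's standard `fnmatch` module, which uses the same "*" and "?" conventions (it also treats square brackets specially, which a search engine may not). The function name and the sample terms are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern: str, terms: list) -> list:
    """Return the terms that match the pattern, ignoring letter-case."""
    return [t for t in terms if fnmatchcase(t.lower(), pattern.lower())]
```

For example, `wildcard_search("int*ion", ["internationalisation", "interruption", "integer"])` returns both "internationalisation" and the unintended "interruption", while `wildcard_search("ab-*", ["AB-100", "CD-200"])` picks out only the product code beginning with "AB-".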
Duplication removal is a useful feature, especially for web site content, where the same page content may appear on multiple HTML pages, just with different menu items selected, or to detect when the same page is accessed by different redirected URLs.
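One simple way to detect duplicates is to compare fingerprints of the page text, as in the sketch below. It assumes that navigation menus have already been stripped from the text, and it only catches pages that are identical after case and whitespace are normalised; detecting near-duplicates needs more sophisticated techniques. The function names are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after normalising case and whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def remove_duplicates(pages: list) -> list:
    """Keep the first (url, text) pair seen for each distinct fingerprint."""
    seen = set()
    unique = []
    for url, text in pages:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```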
Result List Navigation
The most obvious way to navigate the results list is simply to scroll- and page-down the sorted or ranked list, starting with the first and hopefully most relevant documents. However, advanced systems will offer alternative strategies, including:
- taxonomy navigation
- navigator navigation
- cluster navigation
- result promotion
A taxonomy is a classification scheme that groups documents into categories. It may be a simple list, or it may have a hierarchical structure that bundles sub-categories under higher level concepts. Taxonomies allow users to quickly select a sub-set of the results list that falls within a category of interest. Some systems may also include an automated, on-the-fly taxonomy generation feature that utilises analysis of significant words in the documents to determine the categories (an approach that is less precise than fixed taxonomies, but involves far less setup effort).
A navigator is a list of possible values, or ranges of values, that helps users to build relevantly targeted secondary queries. This feature is of most relevance to highly structured data, such as content derived from database records, and provides users with a technique that complements the free-text searching approach central to search engines. For example, if one field in the documents holds the name of a currency, then a navigator could be built and shown to the user that lists all of the currencies. Numeric values are usefully grouped, such as "less than 5 / 5 to 30 / more than 30", with the value boundaries set manually or by automated analysis of the actual distribution of values within the documents.
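A navigator is essentially a count of documents per value or per range. The sketch below assumes that each result is a simple dictionary of fields and that the range boundaries are set manually; the function and field names are hypothetical.

```python
from collections import Counter


def value_navigator(documents: list, field: str) -> Counter:
    """Count how many documents hold each distinct value of a field."""
    return Counter(doc[field] for doc in documents if field in doc)


def range_navigator(documents: list, field: str, low: float = 5, high: float = 30) -> Counter:
    """Group a numeric field into three ranges with manually set boundaries."""
    buckets = Counter()
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        if value < low:
            buckets[f"less than {low}"] += 1
        elif value <= high:
            buckets[f"{low} to {high}"] += 1
        else:
            buckets[f"more than {high}"] += 1
    return buckets
```

Selecting an entry in either navigator would then issue a secondary query restricted to that value or range.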
A cluster is a group of documents that are deemed by the search engine to be similar to each other. Result sets can be sorted into such groups, with automatically assigned group names. The significant aspect of this feature is that the clusters are determined automatically by the search engine. The documents in a cluster do not necessarily contain a similar set of terms, so a query may return just one document from a cluster, but a user interested in that document could then benefit from a "find similar documents" feature to retrieve the related documents.
For certain queries, some of the relevant documents may be considered sufficiently important that they should always be promoted higher in the results list than they would naturally appear. It might be possible to specify that the document must always appear first, or always within the top ten.
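A promotion rule of the "always appear first" kind might be sketched as follows. The rule table, the query text and the document identifiers are hypothetical; real systems typically manage such rules through an administration interface.

```python
# Hypothetical promotion rules: query text mapped to document identifiers.
PROMOTIONS = {"holiday policy": ["doc-hr-001"]}


def apply_promotions(query: str, ranked_ids: list) -> list:
    """Move any promoted documents that matched the query to the top of the list."""
    promoted = [d for d in PROMOTIONS.get(query.lower(), []) if d in ranked_ids]
    return promoted + [d for d in ranked_ids if d not in promoted]
```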
XML Features
Search engines that recognise the XML data format should be able to "hide" the XML tags, so that a search for "square" does not find documents that contain an element tag called "<square>" or an attribute of the same name. Another question to ask of such products is whether they can (or must) include attribute values in the index, which can be either useful or distracting, depending on the requirements.
The content of specified XML elements might be flagged as sort fields. The alphabetic or numeric content of these elements is then used to decide results list ordering.
Some search engines allow searches to be confined to specified portions, or "zones", of a document. This is used to filter out documents that contain the word or phrase of interest but in the wrong context. A search for "Wellington" or "Sandwich" might be intended to find documents that refer to these historical characters, but a zone search within an XML element called <PersonName> would prevent the search from also finding documents that mention particular kinds of footwear or snacks. If hierarchical zoning is available, some further subtleties need to be explored. For example, it might be necessary to include all levels of the hierarchy, or it may be possible to use wildcards at some levels.
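Both tag hiding and zone searching can be illustrated with Python's standard XML parser. This is only a sketch of the principle on single documents, with hypothetical function names; a search engine would apply the same distinctions when building its index rather than at query time.

```python
import xml.etree.ElementTree as ET


def text_only(xml_source: str) -> str:
    """Return the text content of a document, with tags and attributes hidden."""
    root = ET.fromstring(xml_source)
    return " ".join(" ".join(root.itertext()).split())


def zone_search(xml_source: str, zone: str, term: str) -> bool:
    """Return True if term occurs inside any element with the given name."""
    root = ET.fromstring(xml_source)
    return any(
        term.lower() in "".join(element.itertext()).lower()
        for element in root.iter(zone)
    )
```

A search for "Wellington" confined to the PersonName zone would then match a document containing `<PersonName>Duke of Wellington</PersonName>`, but not one that only mentions wellington boots in its body text.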
Things To Think About
Having looked at some of the features of advanced search engines, it is time to think about the things that need to be considered when choosing a system, including:
- data format support
- language support
- database content integration
- Web content integration
The binary data formats used by the majority of word-processors, spreadsheets, and other software applications that support potentially searchable text, are not easy to interrogate in order to extract the text to be indexed. Candidate search engines must support all of the software applications, maybe including older versions, that hold the content of interest.
Similarly, does the search engine support the languages you use, especially if they include such languages as Arabic, Japanese, Chinese or Russian? Such languages require special handling, so support for them cannot just be assumed.
While databases include their own query tools, there is undoubted benefit to providing a single results list, within a single user interface, when searching both documents and database records. If you have databases with relevant content, does the candidate system include connectors to extract data from them? Is it more suitable to use XML as an intermediate format? Or is it more appropriate to develop tools to update changes to the database records instantly using the API? There are pros and cons to each approach.
Similarly, use of a single system and query to find both Web and local content is of great benefit to users, though Web content must be restricted to relevant web sites (indexing the entire Web should be left to Google et al).
Performance issues need careful consideration if systems are not to fail and users are not to be frustrated. Significant issues include:
- distributing the system across multiple computers
- catering for search term variants
- catering for the wildcard feature
For large-scale installations, with millions of documents or a very high query frequency, it is often necessary to distribute several components of the search engine, and several data stores, across a number of computers. Load balancing may be necessary both for indexing and for querying of the index.
Finding documents that contain variants of the search terms can be done either by adding the variants to the index, or by expanding each query. There is, however, a conflict between index size, query speed and relevancy determination. If variants of each word are generated for the index, this improves query performance at the expense of index size. A trade-off between the monetary cost of data storage and the frustration cost of waiting for results may need to be made.
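The trade-off can be seen in a toy inverted index. In this sketch, the variant table and the three sample documents are hypothetical; expanding at index time produces a larger index that answers a single-term lookup, while expanding at query time keeps the index small but performs one lookup per variant.

```python
from collections import defaultdict

# Hypothetical variant table and its reverse mapping.
VARIANTS = {"boat": ["boat", "boats", "boating"]}
STEMS = {variant: stem for stem, variants in VARIANTS.items() for variant in variants}

DOCS = {1: "boats for sale", 2: "boating holidays", 3: "car hire"}


def build_index(docs: dict, expand_at_index_time: bool) -> dict:
    """Map each word (and optionally each of its variants) to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
            if expand_at_index_time:
                for variant in VARIANTS.get(STEMS.get(token, token), []):
                    index[variant].add(doc_id)
    return index


def search(index: dict, term: str, expand_at_query_time: bool) -> set:
    """Look up a term, optionally expanding it into its variants first."""
    if expand_at_query_time:
        terms = VARIANTS.get(STEMS.get(term, term), [term])
    else:
        terms = [term]
    hits = set()
    for t in terms:
        hits |= index.get(t, set())
    return hits
```

Both strategies find documents 1 and 2 for the query "boat"; they differ only in where the cost is paid.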
The wildcard feature (the "*" and "?" stand-in symbols) raises performance issues because of the additional work required at query time, and overcoming them requires extra configuration. As such, it is generally enabled only for small, selected fields of a document.
Implementing Search Technology
Choosing and installing a search system is not the final step. There remain some considerable implementation issues, including:
- access control
- system configuration
- testing and refining
- post-implementation review
When implementing a search system, a decision needs to be made about the documents to index, and which of the indexed documents to display when specific users carry out a query. While it may be tempting to index all of the documents available, some of these may contain sensitive information (such as salary information or strategy discussions). One approach is to declare "off-limits" files or URLs from certain locations. In addition, or alternatively, results may be filtered to exclude those documents that the identified user does not have the rights to access.
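The two approaches can be sketched as a pair of simple checks: one applied before indexing and one applied to each results list. The path prefixes, the group names and the `allowed_groups` field are hypothetical; in practice the access rights would normally be read from the source system's own security model.

```python
# Hypothetical locations that must never be indexed.
OFF_LIMITS_PREFIXES = ("/hr/salaries/", "/board/strategy/")


def is_indexable(path: str) -> bool:
    """Exclude files from off-limits locations before they reach the index."""
    return not path.startswith(OFF_LIMITS_PREFIXES)


def filter_results(results: list, user_groups: set) -> list:
    """Keep only the results that at least one of the user's groups may access."""
    return [r for r in results if r["allowed_groups"] & user_groups]
```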
Customising is a major implementation task. Most search engine systems will allow some or all of the following customisations to be made in order to tailor the system to particular requirements. These take time and careful thought to perform correctly:
- specification of the arrangement of search engine modules and resources, possibly across multiple servers
- identification of the content that is going to be indexed
- definition of multiple field definitions for different content types, and a superset definition to allow for combined result lists
- configuration of database extractors to build normalised or non-normalised documents from multiple tables, and map data types to the nearest equivalent search repository data types
- modification of the spell-checking dictionaries (and the override list of words that are not to be corrected)
- modification of the proper names and phrases list
- modification of the synonym and spelling variation dictionary
- creation of an anti-phrase dictionary
- creation of a taxonomy structure
- enabling or disabling removal of accents from accented characters
- identifying fields for the navigator feature
- specifying result entry promotion rules
In addition, any out-of-the-box GUI may be deemed inappropriate for any of a variety of reasons, such as requiring local site branding or support for specialised workflow requirements, and a new front-end may be required that interacts with the search engine through the API. Additionally, it may be desirable to integrate the search system with applications already in use.
Large-scale testing of the system after content loading and system customisation often reveals unexpected problems that can be fixed by further fine-tuning of some of the configuration files mentioned above. Sourcing representative content and queries is vital to the success of such testing.
The single biggest error that most organisations make when implementing a search engine is failure to carry out regular post-implementation reviews. The content landscape within most organisations constantly changes. Without a regular review of the content being loaded:
- new sources will not be indexed
- duplicate content will begin to appear (through documents being stored in different locations or web-pages containing the same content in different frames)
- sources of content may no longer be relevant
These factors lead to a reduction of the value of the results being returned, which can then lead to users no longer trusting (or even using) the system.
Choosing and implementing an enterprise search engine is not a trivial exercise. You need to audit your content sources, establish your functional requirements, interrogate search engine suppliers or system integrators to determine whether their offerings are fit for purpose, manage the installation, configure the system, test and refine the configuration, and finally train users in how to use it effectively. But the rewards can be enormous. Search technologies are now mature and sophisticated - when approached and treated seriously, they can be a major asset to any large business.