...
- Use a web analytics tool to generate reports of search queries. XperienCentral usually generates search URLs in which the search terms (keywords) are included, and most web analytics tools provide a way to search for specific URLs. Search for URLs containing the string ‘&keyword=’ and generate reports for the last month and for the last year (an example URL is shown after this list). Example: in Google Analytics this can be done by navigating to ‘Content > Top Content’ and entering ‘&keyword=’ in the ‘Search for URLs containing’ box.
- Listen to your visitors. If you have the chance, talk to your website visitors. If they complain about the search engine, call or email them and ask the two main questions: “What were you looking for?” and “What did you expect?”. Also ask which queries they used.
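A hypothetical example of such a search results URL (the path and the other parameters depend entirely on the site setup; only the &keyword= part matters here):

http://www.example.com/web/search.htm?id=456&keyword=opening+hours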
...
- Which fields are actually indexed and what information is stored in the fields? This is by far the most important question. One way to describe it is "garbage in = garbage out": if the structure, fields and/or contents of the indexed documents are incorrect, the search index will never be of high quality. Use a good tool to analyze the index.
- What is the ratio between HTML documents, Word/PDF documents and other content? The reason to check this is that large numbers of Word documents can drastically lower the relevance of normal HTML pages.
- What is the average size, and the standard deviation of the size, of all these documents? Large documents can mean lower relevance.
Analyze the Search Index
One of the basic tasks of working with the search engine is knowing what is indexed and how content is indexed. Therefore it can be helpful to have full access to the search index. There are several ways to do this:
Local machines
On local machines or developer PCs, a tool like Luke can be used to analyze the search index. You can download Luke at http://www.getopt.org/luke/. Luke is a Java program distributed as a .jar file; it can be started by opening it in a Java runtime environment or by right-clicking the .jar file and choosing Open with > Java™ Platform SE binary. When it opens, select the search index directory and browse through all the fields.
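A minimal way to start it from the command line, assuming the downloaded file is named luke.jar:

java -jar luke.jar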
Servers
On remote servers there are two ways to analyze the index. The first is to (g)zip or tar the index directory, copy it to your own computer and use the method described in the previous paragraph. The second is to use a client connection to the search engine. On Windows servers the batch file georgeclient.bat can be used. On Unix servers a connection can be set up with the following command from the search engine directory:
java -Djava.security.policy=conf/grantall_policy -jar lib/webmanager-searchengine-client-10.12.0.jar rmi://localhost:1099/indexer-wm
The version number in the jar file name (10.12.0 in this example) is the XperienCentral version, and rmi://localhost:1099/indexer-wm is the RMI connection string, which is configured in the properties.txt file. Once a connection has been made to the RMI client, commands can be entered. To see a list of all commands, enter ?<Enter>. Two useful commands are:
...
Analysis
Once you have information about the queries that people use, it is time to sit down with people who have in-depth knowledge of the content on the website. It’s important to realize that analyzing search queries and search behavior is not a technical process: it’s all about linking website content to website visitors. Therefore the best way to analyze the results of your research is to sit down with content owners, editors, domain experts or anyone else within the organization who can assist in this process.
One way to do this is to organize a session where someone presents the top search queries and summarizes the feedback from search engine users. In this session an attempt can be made to link the top queries to the most relevant pages. Make sure that everyone is aware that it’s not only about what visitors want to find, but also about what the organization wants them to find. This is important not only for companies that want to sell products or services, but for any organization. It’s all about conversion, and conversion is not only about leading visitors to order the most expensive product; it is also about answering questions from citizens (for governments), helping visitors find the self-service page, and so on. Write down the links between top queries and pages, plus any other suggestions from the content experts.
In the same session try to get answers to fundamental questions such as:
- Do visitors always want the most recent documents first, or is the relevance more important than the document date? Is this for all documents, or only some documents?
- Do visitors expect a) answers, b) links or c) direct information? This is different from asking “do we want to provide our visitors with answers/links/direct information?”. Choosing a), for example, can have a lot of implications for your information structure and search engine, but if that is what visitors expect, then this should be the goal.
- Do we really need to include the 10,000 documents from our document management system in the search index? Do they add anything extra, or do they just lower the relevance of our normal web pages?
Implement a Category Search
Large websites can contain hundreds of thousands of pages and documents, so when it comes to searching it helps a lot if visitors can search within parts of a website. There are several ways to split a website into parts, or ‘categories’, to use the generic term.
Method 1: Top Level Categories
With this method visitors get an extra dropdown list next to the normal search field. This dropdown list contains the pages directly below the homepage, in other words the top-level pages. If the site has the following structure:
- Home
  - Tour
  - Examples
  - Developerweb
  - About
Then users have the option to search in ‘Tour’, ‘Examples’, ‘Developerweb’ or ‘About’. When the search engine indexes the pages of a website, it also indexes the location and path of every page. These page names and page paths are stored in separate fields in the index. For example, a page located below the ‘Developerweb’ page contains the following fields:
Pagepath_00_name: Home
Pagepath_00_url: http://127.0.0.1:8080/web/Home.htm
Pagepath_01_name: Developerweb
Pagepath_01_url: http://127.0.0.1:8080/web/DeveloperWeb.htm
Pagepath_02_name: Documentation
Pagepath_02_url: http://127.0.0.1:8080/web/DeveloperWeb/Documentation.htm
The normal query is extended with an extra filter on the Pagepath_01_name field, for example: (keywords) AND Pagepath_01_name:Developerweb. The advancedSearch.jspf file that is included in the default GX WebManager presentation (version 9.6 and newer) contains example code to implement this.
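As a rough sketch of the idea (a hypothetical helper, not the actual code from advancedSearch.jspf; the field name is taken from the index example above):

public final class CategorySearch {
    /** Appends a top-level category filter to the user's query (hypothetical helper). */
    public static String buildCategoryQuery(String keywords, String category) {
        if (category == null || category.trim().isEmpty()) {
            return keywords; // no category selected: search the whole site
        }
        // e.g. buildCategoryQuery("download", "Developerweb") returns
        // "(download) AND Pagepath_01_name:Developerweb"
        return "(" + keywords + ") AND Pagepath_01_name:" + category;
    }
}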
Method 2: URL Filters
The search engine contains a configuration file called meta.txt. This file can contain extra parsers for URLs, which result in extra fields and information in the index. For example, enter the following lines in the meta.txt file:
.*/Examples/.* sitepart examples
.*/Forum/.* sitepart forum
When the crawler encounters a URL that contains "Examples", it adds a sitepart field to the index with the value "examples". The same goes for "Forum": this results in the value "forum" in the sitepart field. Using the sitepart field is similar to the previous method: extend the query with an additional filter, for example: (keywords) AND sitepart:examples.
Index an External Website
Indexing an external website involves two steps:
...
index http://localhost:8080/web/webmanager/id=39016 1 127.0.0.1,localhost [5 0 * * *]
index http://www.gxsoftware.com/ 2 www.gx.nl [0 2 * * *]
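The part between square brackets appears to be a schedule in the standard cron format (minute, hour, day of month, month, day of week). Assuming that interpretation, [5 0 * * *] starts a crawl every day at 00:05 and [0 2 * * *] every day at 02:00.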
...
http://www.gxsoftware.com/.* webid 26098
http://www.gxsoftware.com/.* langid 42
Documents indexed from www.gxsoftware.com will then have a valid webid and langid and will therefore be returned in the search results.
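As an illustration: if the front end filters results on these fields, a query restricted to this external website could look as follows (an assumption based on the meta.txt entries above, not a documented query):

(keywords) AND webid:26098 AND langid:42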
Implement a "Best Bets" Search
A "best bets" search is an addition to an existing search engine where results are returned on top of the normal search results. These results are handpicked by editors and are usually more relevant because they are handpicked.
Implementing a best bets algorithm in XperienCentral can be done by using the keywords fields. There is one important precondition: keywords must not be used for other things than the best bets, or otherwise the results cannot be predicted.
An example: assume that the top 10 queries overview shows that 10% of the visitors search for ‘download’ and the actual download page is at place 7 in the search results. This means the search term ‘download’ needs to be boosted to get a higher relevance. In order to do this the following steps have to be taken:
...
<xsl:if test="contains($authorization, '1')">
<dt>
<!-- Entries with a score above this threshold are treated as best bets -->
<xsl:variable name="bestBetsMinimumScore" select="150"/>
<xsl:variable name="nrOfBestbets" select="count(/root/system/searchresults/entry/score[text() > $bestBetsMinimumScore])"/>
<!-- Print the "Recommended" header before the first best bet -->
<xsl:if test="(score > $bestBetsMinimumScore) and (position()=1)">
Recommended:<br/>
</xsl:if>
<!-- Print the "Other search results" header after the last best bet -->
<xsl:if test="(position() = $nrOfBestbets + 1) and ($nrOfBestbets > 0)">
<br/><br/>Other search results:<br/>
</xsl:if>
<!-- Show position -->
<xsl:if test="$showordernumbers">
...
Excluding Pages from the Search Index
There are two ways to exclude pages or other content types from the index:
- Clear the "Include in search index" option property for pages which will result in an extra meta tag
<meta name="robots" content="noindex" />
in the HTML of the page. - Create a
robots.txt
file.
The search engine looks for a robots.txt file before indexing a website. More information about robots.txt files can be found at http://www.robotstxt.org/robotstxt.html. The robots.txt file must be stored in the root of the website, in the statics folder, so that it is accessible at the standard URL (http://<your-site>/robots.txt). Make sure that the robots.txt file does not block normal search engines like Google when the site goes live; use the User-agent parameter to prevent this.
Some examples of robots.txt files:
Don’t allow any search engine to index the website:
User-agent: *
Disallow: /
Don’t allow the XperienCentral search engine to index the website:
User-agent: george
Disallow: /
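The two can also be combined, so that only the XperienCentral crawler is blocked while normal search engines can still index the site (standard robots.txt semantics):

User-agent: george
Disallow: /

User-agent: *
Disallow: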
Don’t allow the XperienCentral search engine to index the pages with URL */web/Examples/* or the login page:
...
Improvements
After this session it’s time for some homework again, because now it’s important to find out why a certain top 10 query leads to page X and not to page Y, the page that is best according to the content experts. The Setup tool can be used to get more information about the relevance scores for these queries.
There are two main reasons why a page is less relevant than another page:
- the number of matching words on the page, weighted by the index factor of the fields they appear in, is lower
- the page is larger; smaller pages tend to be more relevant than long documents
The first is by far the most important factor. Luckily there are several ways to get higher scores. First of all, make sure the page is indexed properly: check that the content, langid, keyword(s), summary, title and webid fields are filled correctly. Use a tool to inspect the index (see Analyze the Search Index above).
The index factor settings can also lead to wrong results. The index factors for the fields are specified in the properties.txt file. When they are not specified, the default settings are:
factor.title=10
factor.description=5
factor.keyword=10
factor.location=1
These factors tell the search engine how important each field is; by default the title is 10 times more important than the location. If the keyword field is filled with irrelevant values, or not filled at all, it might be smart to clean up all keyword settings or set the keyword factor to a lower value (5 or 1). Tip: don’t use the default meta keywords settings; leave them empty.
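For example, to make keywords count half as much as the title, the factors in properties.txt could be changed as follows (illustrative values only):

factor.title=10
factor.description=5
factor.keyword=5
factor.location=1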
The same goes for the title field. In most (default) presentations the title field is prefixed with the same value on every page, for example the company name. If it’s important that the company name leads to one specific page, this is almost impossible unless you remove the prefix from the title or lower the factor.title setting.
These checks help to improve the relevance of certain pages by analyzing what is actually indexed and then removing unnecessary information or adjusting the index factors. After changing index factors or changing the presentation it is necessary to fully re-index the website: documents are not indexed in isolation but in relation to the other documents, so a full re-index is needed to get accurate results.
Improvement Suggestions
Besides carefully analyzing what is indexed and tuning the fields and index factors, there are many other improvements that can be implemented. The best approach is to first try to improve the search results by analyzing the fields as described in the previous paragraph. Beyond that, several other improvements are known to work well for the search experience. Here is a top 5:
- Divide your website into several parts, categories or other logical units. For example, if you have a forum or a document database on your website, index these sources with additional metadata that can be used to create additional filters (see ‘Implement a Category Search’ above).
- Offer advanced search and filtering options, such as searching within site categories, filtering by document type or date range, and sorting by date instead of relevance. Optionally create both a simple and an advanced search interface.
- Provide search tips to website visitors. Show example queries, preferably taken from the query top 10. Explain how to use the advanced search options, if they are implemented.
- Remove pages or documents from the index. By removing documents the total number of documents is lower, so the relevance of the remaining documents automatically increases. Usually there are groups or types of documents that may be nice to index but are not relevant at all for the average user. This can be a bit tricky, so be careful not to remove too many documents.
- Implement a "best bets" search.