The default configuration of the XperienCentral search engine provides a basic search implementation for generic websites; however, every website is different. Differences in content, content types, structure, visitors and tasks can completely change how website visitors perceive the search engine. This topic provides more information for content owners, webmasters and developers about how to measure, optimize and improve the quality of the search engine.
Measuring
Search Behavior
Before radically changing the search engine, it is essential to know as much as possible about the visitors of your website and their search behavior. There are two main questions that have to be answered:
- What are people searching for?
- What do they expect?
Here are some steps to gain more insight into the search behavior of your visitors:
- Use a web analytics tool to generate reports of search queries. XperienCentral usually generates search requests in which the search terms (keywords) are included in the URL, and most web analytics tools provide a way to search for specific URLs. Search for URLs containing the string ‘&keyword=’ and generate reports, for example for the last month and for the last year (see also the sketch after this list). Example: in Google Analytics this can be done by navigating to ‘Content > Top Content’ and then entering ‘&keyword=’ in the ‘Search for URLs containing’ box.
- Listen to your visitors. If you have the chance, talk to your website visitors. If they complain about the search engine, call or email them and ask the two main questions: “What were you looking for?” and “What did you expect?”. Also ask which queries they used.
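If no web analytics tool is available, a similar report can be generated from the web server access logs, because the search terms appear in the request URL as the keyword parameter. The following is a minimal, hypothetical sketch (the log file name and log format are assumptions) that counts the most frequent search terms in an access log:

import java.io.IOException;
import java.net.URLDecoder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchTermReport {
    public static void main(String[] args) throws IOException {
        // Hypothetical access log location; adjust to your web server setup.
        String logFile = args.length > 0 ? args[0] : "access.log";
        // Matches the keyword parameter in request URLs, e.g. ...&keyword=downloads
        Pattern keyword = Pattern.compile("[?&]keyword=([^&\\s\"]+)");
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(logFile))) {
            Matcher m = keyword.matcher(line);
            while (m.find()) {
                String term = URLDecoder.decode(m.group(1), "UTF-8").toLowerCase(Locale.ROOT);
                counts.merge(term, 1, Integer::sum);
            }
        }
        // Print the 25 most frequent search terms with their counts.
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(25)
                .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}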
Index Quality and Index Characteristics
Keeping your search index up to date is very important. This means not only indexing new documents as quickly as possible (at least within 24 hours), but also removing deleted documents to avoid dead links. Check periodically whether the index is up to date, or even fully re-index the site.
Knowing what is actually indexed is also important, especially when analyzing and tuning the search engine. Relevant questions are:
- Which fields are actually indexed and what information is stored in the fields? This is by far the most important question. One way to describe it is "garbage in = garbage out": if the structure, fields and/or contents of the indexed documents are incorrect, the search index will never be of high quality. Use a good tool to analyze the index.
- What is the ratio between HTML documents, Word/PDF documents and other content? The reason to check this is that large numbers of Word documents can drastically lower the relevance of normal HTML pages.
- What is the average size and standard deviation of the size of all these documents? Large documents could mean lower relevance. (See the sketch after this list.)
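To make the last two checks concrete, the following is a minimal, hypothetical sketch that computes the content-type ratio and the average size and standard deviation from a small sample of content types and document sizes; in practice these values would be extracted from the index, for example with Luke (see the next section):

import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class IndexStatistics {
    public static void main(String[] args) {
        // Hypothetical sample: content type and size in bytes of each indexed document.
        String[] types = {"text/html", "application/pdf", "text/html", "application/msword"};
        long[] sizes = {12_000, 450_000, 9_500, 230_000};

        // Ratio between HTML documents, Word/PDF documents and other content.
        Map<String, Integer> byType = new TreeMap<>();
        for (String type : types) byType.merge(type, 1, Integer::sum);
        byType.forEach((type, count) ->
                System.out.printf("%-22s %d (%.1f%%)%n", type, count, 100.0 * count / types.length));

        // Average size and standard deviation of the document sizes.
        double mean = Arrays.stream(sizes).average().orElse(0);
        double variance = Arrays.stream(sizes)
                .mapToDouble(size -> (size - mean) * (size - mean))
                .average().orElse(0);
        System.out.printf("Average size: %.0f bytes, standard deviation: %.0f bytes%n",
                mean, Math.sqrt(variance));
    }
}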
Analyze the Search Index
One of the basic tasks of working with the search engine is knowing what is indexed and how content is indexed. Therefore it can be helpful to have full access to the search index. There are several ways to do this:
Local machines
On local machines or developer PCs a tool like Luke can be used to analyze the search index. You can download Luke at http://www.getopt.org/luke/. Luke is a Java program that can be started as a .jar file by opening it in a Java runtime environment, or by right-clicking the .jar file and choosing Open with > Java™ Platform SE binary. When it opens, select the search index directory and browse through all the fields.
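As an alternative to Luke, the index can also be inspected programmatically with the Lucene library. The following is a minimal, hypothetical sketch that prints the number of documents and the field names of each index segment; it assumes a recent Lucene version on the classpath and an index that this Lucene version can open (an old index may require the Lucene release that wrote it), and the index path is a placeholder:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class InspectIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder: point this at the search engine's index directory.
        String indexPath = args.length > 0 ? args[0] : "/path/to/searchengine/index";
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
            System.out.println("Indexed documents: " + reader.numDocs());
            // Print the field names present in each segment of the index.
            for (LeafReaderContext leaf : reader.leaves()) {
                for (FieldInfo field : leaf.reader().getFieldInfos()) {
                    System.out.println("segment " + leaf.ord + ": " + field.name);
                }
            }
        }
    }
}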
Servers
On remote servers there are two ways to analyze the index. The first is to (g)zip or tar the index directory, copy it to your computer and use the method described in the previous paragraph. The second is to set up a client connection to the search engine. On Windows servers the batch file georgeclient.bat can be used. On Unix servers a connection can be set up with the following command from the search engine directory:

java -Djava.security.policy=conf/grantall_policy -jar lib/webmanager-searchengine-client-10.12.0.jar rmi://localhost:1099/indexer-wm

The version number in the name of the client jar (10.12.0 in this example) is the XperienCentral version. The last argument is the RMI connection string, which is configured in the properties.txt file. Once a connection has been made to the RMI client, commands can be entered. To see a list of all commands, enter ?<Enter>. Two useful commands are:
- Search - enter a search string and see the results. More fields are displayed than with a regular query, but not all fields are displayed.
- Urls - enter a pattern such as *News* or * to see indexed URLs
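Before starting the client it can be useful to check that the search engine's RMI endpoint is reachable at all. The following is a minimal sketch using only the standard java.rmi API; the host and port are taken from the rmi://localhost:1099/indexer-wm connection string above and should be adjusted to your own configuration:

import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

public class CheckIndexerEndpoint {
    public static void main(String[] args) throws Exception {
        // Host and port from the RMI connection string in properties.txt.
        Registry registry = LocateRegistry.getRegistry("localhost", 1099);
        for (String name : registry.list()) {
            System.out.println(name);   // expect a binding such as "indexer-wm"
        }
    }
}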
Implement a Category Search
Large websites can contain hundreds of thousands of pages and documents, so it can help a lot if visitors can search within parts of a website. There are several ways to split a website into parts, generically referred to as ‘categories’.
Method 1: Top Level Categories
With this method visitors get an extra dropdown list next to the normal search field. This dropdown list contains the pages directly below the homepage, in other words the top-level pages. If the site has the following structure:
- Home
- Tour
- Examples
- Developerweb
- About
Then users have the option to search in ‘Tour’, ‘Examples’, ‘Developerweb’ or ‘About’. When the search engine indexes the pages of a website, it also indexes the location and path of every page. These page names and page paths are stored in separate fields in the index. For example, a page located below the ‘Developerweb’ page contains the following fields:
Pagepath_00_name: Home
Pagepath_00_url: http://127.0.0.1:8080/web/Home.htm
Pagepath_01_name: Developerweb
Pagepath_01_url: http://127.0.0.1:8080/web/DeveloperWeb.htm
Pagepath_02_name: Documentation
Pagepath_02_url: http://127.0.0.1:8080/web/DeveloperWeb/Documentation.htm
The normal query is extended with an extra filter on the pagepath_01_name field, for example: (keywords) AND pagepath_01_name:Developerweb. The advancedSearch.jspf that is included in the default GX WebManager presentation (version 9.6 and newer) contains example code to implement this.
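The extension of the query can be done with a small helper such as the hypothetical sketch below. The field name pagepath_01_name is taken from the indexed page path fields shown above; verify the exact field name (and its case) in your own index, for example with Luke:

public class CategoryQuery {

    // Hypothetical helper: appends the top-level category filter to the
    // visitor's keywords as described above. Returns the plain keywords
    // when no category was selected in the dropdown list.
    public static String build(String keywords, String topLevelCategory) {
        if (topLevelCategory == null || topLevelCategory.trim().isEmpty()) {
            return keywords;
        }
        return "(" + keywords + ") AND pagepath_01_name:" + topLevelCategory;
    }

    public static void main(String[] args) {
        // Prints: (installation manual) AND pagepath_01_name:Developerweb
        System.out.println(build("installation manual", "Developerweb"));
    }
}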
Method 2: URL Filters
The search engine contains a configuration file called meta.txt. This file can contain extra parsers for URLs, which result in extra fields and information in the index. For example, enter the following lines in the meta.txt file:

.*/Examples/.* sitepart examples
.*/Forum/.* sitepart forum
When the crawler encounters a URL that contains ‘Examples’, it adds a sitepart field to the index with the value "examples". The same applies to ‘Forum’, which results in the value "forum" in the sitepart field. Using this sitepart field is similar to the previous method: extend the query with an additional filter, for example: (keywords) AND sitepart:examples.
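The following hypothetical sketch only illustrates the effect of such a rule (it is not the crawler's own code): a URL matching the regular expression of the first meta.txt line above ends up with a sitepart field set to "examples":

import java.util.regex.Pattern;

public class SitepartRuleIllustration {
    public static void main(String[] args) {
        // Illustration of the meta.txt rule: .*/Examples/.* sitepart examples
        Pattern rule = Pattern.compile(".*/Examples/.*");
        String crawledUrl = "http://127.0.0.1:8080/web/Examples/Forms.htm";   // hypothetical URL
        if (rule.matcher(crawledUrl).matches()) {
            // The crawler would add this field and value to the indexed document.
            System.out.println("sitepart = examples");
        }
    }
}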
Index an External Website
Indexing an external website involves two steps:
Create a new entry in the search engine cronjob (for more information on cronjobs, see chapter 4.2). The crontab.txt file can be extended to index one or more external websites. Each task can have its own time schedule, and the crawl depth and the valid hosts can be specified. The following example is a cronjob that indexes a local website at 5 past midnight and also indexes the www.gxsoftware.com website at 2 AM; the homepage will be indexed plus all linked pages up to a maximum depth of 2.
index http://localhost:8080/web/webmanager/id=39016 1 127.0.0.1,localhost [5 0 * * *]
index http://www.gxsoftware.com/ 2 www.gx.nl [0 2 * * *]
As an alternative to a cronjob, the external website can be indexed manually on the Search Tools tab in the Setup Tool.
Change the meta.txt file to map the external website to the right search index. The queries that are executed from a normal search element filter on webid and langid. Therefore, to include the external pages in the search results of a certain website, that website's webid and langid have to be included during indexing. This can be done by extending the meta.txt file:

http://www.gxsoftware.com/.* webid 26098
http://www.gxsoftware.com/.* langid 42

Documents indexed from www.gxsoftware.com will then have a valid webid and langid and will therefore be returned in the search results.
Implement a "Best Bets" Search
A "best bets" search is an addition to an existing search engine where results are returned on top of the normal search results. These results are handpicked by editors and are usually more relevant because they are handpicked.
Implementing a best bets algorithm in XperienCentral can be done by using the keywords fields. There is one important precondition: keywords must not be used for other things than the best bets, or otherwise the results cannot be predicted.
An example: assume that the top 10 queries overview shows that 10% of the visitors search for ‘download’ and the actual download page is at place 7 in the search results. This means the search term ‘download’ needs to be boosted to get a higher relevance. In order to do this the following steps have to be taken:
- Find the current score for the query ‘download’ by entering the query ‘download’ in the Search Tools tab in the Setup Tool. The score is shown between brackets, between the position and the date, for example ‘(30)’.
- Go to Configure > Web initiative configuration > [General] and make sure the field ‘Default meta keywords’ is empty.
- Navigate to the ‘Download’ page in the edit environment
- Choose File > Properties > Meta keywords
- Enter the keywords ‘Download’ and ‘Downloads’ in the keywords field and click [Apply]
- Change the search engine configuration file properties.txt in the /conf directory and add a new property: factor.keyword=500, or if this parameter already exists change the current value to 500.
- Restart the search engine
- For the best results, re-index the entire website or, if the website is very large, re-index only the ‘Download’ page by entering its URL in the Setup Tool.
- Navigate to the Setup tool, go to the [Search tools] tab and search for ‘download’ again. The score should now be considerably higher.
Depending on your requirements, you can change the search results presentation to reflect the score. A simple script that separates the ‘best bet’ search result(s) from the normal search results is:
<xsl:if test="contains($authorization, '1')"> <dt> <xsl:variable name="bestBetsMinimumScore" select="150"/> <xsl:variablename="nrOfBestbets" select="count(/root/system/searchresults/entry/score[text() > $bestBetsMinimumScore])"/> <xsl:if test="(score > $bestBetsMinimumScore) and (position()=1)" > Recommended:<br/> </xsl:if> <xsl:if test="(position() = $nrOfBestbets + 1) and ($nrOfBestbets > 0)"> <br/><br/>Other search results:<br/> </xsl:if> <!-- Show position --> <xsl:if test="$showordernumbers" >
Part of this code is existing code, starting from the <dt> tag in the file website/xslStyleSheetSearchResults.jspf.
Excluding Pages from the Search Index
There are two ways to exclude pages or other content types from the index:
- Clear the "Include in search index" option property for pages which will result in an extra meta tag
<meta name="robots" content="noindex" />
in the HTML of the page. - Create a
robots.txt
file.
The search engine looks for a robots.txt file before indexing a website. More information about robots.txt files can be found here: http://www.robotstxt.org/robotstxt.html. The robots.txt file must be stored in the statics folder in the root of the website so that it is accessible at the URL /robots.txt. Make sure that the robots.txt file does not block normal search engines like Google when the site goes live; use the User-agent parameter to prevent this.
Some examples of robots.txt files:
Don’t allow any search engine to index the website:
User-agent: *
Disallow: /
Don’t allow the XperienCentral search engine to index the website:
User-agent: george
Disallow: /
Don’t allow the XperienCentral search engine to index the pages with URL */web/Examples/* or the login page:
User-agent: george
Disallow: /web/Examples/
Disallow: /web/Login.html