...
- Which fields are actually indexed and what information is stored in the fields? This is by far the most important question. One way to describe it is "garbage in = garbage out": if the structure, fields and/or contents of the indexed documents are incorrect, the search index will never be of high quality. Use a good tool to analyze the index.
- What is the ratio between HTML documents, Word/PDF documents and other content? The reason to check this is that large numbers of Word documents can drastically lower the relevance of normal HTML pages.
- What is the average size and standard deviation in size of all these documents? Large documents could mean lower relevance.
Analyze the Search Index
One of the basic tasks of working with the search engine is knowing what is indexed and how content is indexed. Therefore it can be helpful to have full access to the search index. There are several ways to do this:
Local machines
On local machines or developer PCs a tool like Luke can be used to analyze the search index. You can download Luke at http://www.getopt.org/luke/. Luke is a Java program distributed as a .jar file; start it by opening it in a Java runtime environment or by right-clicking the .jar file and choosing Open with > Java™ Platform SE binary. When it opens, select the search index directory and browse through all the fields.
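If Luke is not at hand, the same kind of inspection can be done programmatically. The sketch below is a minimal example under a few assumptions: the search index is a standard Lucene index (which is what Luke reads), a reasonably recent Lucene library (5.x or later) is on the classpath, and the index path is a placeholder you need to adjust. The field names you will see depend on your installation.

```
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.FSDirectory;

// Minimal sketch: open the search index read-only and dump the stored fields
// of the first few documents. Useful for checking which fields are indexed
// and whether they contain sensible values ("garbage in = garbage out").
public class IndexInspector {

    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at the index directory of your installation
        String indexPath = "/path/to/searchengine/index";

        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)))) {
            System.out.println("Documents in index: " + reader.maxDoc());

            int max = Math.min(5, reader.maxDoc());
            for (int i = 0; i < max; i++) {
                System.out.println("--- document " + i + " ---");
                Document doc = reader.document(i);
                for (IndexableField field : doc.getFields()) {
                    // Only stored fields show a value here; indexed-only fields
                    // (without a stored value) will print null
                    System.out.println(field.name() + " = " + field.stringValue());
                }
            }
        }
    }
}
```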
Servers
On remote servers there are two ways to analyze the index. The first is to (g)zip or tar the index directory, copy it to your own computer and use the method described in the previous paragraph. The second is to use a client connection to the search engine. On Windows servers the batch file georgeclient.bat can be used. On Unix servers a connection can be set up with the following command, executed from the search engine directory:

```
java -Djava.security.policy=conf/grantall_policy -jar ... lib/webmanager-searchengine-client-10.12.0.jar ... rmi://localhost:1099/indexer-wm
```
The first marked string is the XperienCentral version. The second marked string is the RMI connection string, which is configured in the properties.txt file. Once a connection has been made to the RMI client, commands can be entered. To see a list of all commands, enter ?<Enter>. Two useful commands are:
- Search - enter a search string and see the results. More fields are displayed than with a regular query, but not all fields are displayed.
- Urls - enter a pattern such as *News* or * to see indexed URLs
Implement a Category Search
Large websites can contain hundreds of thousands of pages and documents, so when it comes to searching it helps a lot if visitors can search within parts of a website. There are several ways to split a website up into parts, or ‘categories’ to use the generic term.
Method 1: Top Level Categories
With this method visitors get an extra dropdown list next to the normal search field. This dropdown list contains the pages directly below the homepage, in other words the top-level pages. If the site has the following structure:
- Home
- Tour
- Examples
- Developerweb
- About
Then users have the option to search in ‘Tour’, ‘Examples’, ‘Developerweb’ or ‘About’. When the search engine indexes the pages of a website, it also indexes the location and the path of every page. These page names and page paths are stored in separate fields in the index. For example, a page located below the ‘Developerweb’ page contains the following fields:
- Pagepath_00_name: Home
- Pagepath_00_url: http://127.0.0.1:8080/web/Home.htm
- Pagepath_01_name: Developerweb
- Pagepath_01_url: http://127.0.0.1:8080/web/DeveloperWeb.htm
- Pagepath_02_name: Documentation
- Pagepath_02_url: http://127.0.0.1:8080/web/DeveloperWeb/Documentation.htm
The normal query is extended with an extra filter on the Pagepath_01_name field, for example: (keywords) AND Pagepath_01_name:Developerweb. The advancedSearch.jspf that is included in the default GX WebManager presentation (version 9.6 and newer) contains example code to implement this.
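As an illustration, a small hypothetical helper (not part of advancedSearch.jspf) could append the selected category as a filter to the visitor's keywords before the query is sent to the search engine. The field name follows the Pagepath_01_name example above; adjust it to the field names actually present in your index.

```
// Hypothetical helper: extend the visitor's query with a top-level category filter.
// The field name Pagepath_01_name follows the index fields shown above.
public final class CategoryQueryBuilder {

    private CategoryQueryBuilder() {
    }

    public static String buildQuery(String keywords, String category) {
        if (category == null || category.trim().isEmpty()) {
            // No category selected: search the whole site
            return keywords;
        }
        // Quoting the category value also handles page names containing spaces
        return "(" + keywords + ") AND Pagepath_01_name:\"" + category + "\"";
    }
}
```

For example, buildQuery("release notes", "Developerweb") yields (release notes) AND Pagepath_01_name:"Developerweb", which restricts the results to pages below the ‘Developerweb’ page.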
Method 2: URL Filters
The search engine contains a configuration file called meta.txt. This file can contain extra parsers for URLs, which result in extra fields and information in the index. For example, enter the following lines in the meta.txt file:

```
.*/Examples/.* sitepart examples
.*/Forum/.* sitepart forum
```
When the crawler encounters a URL that contains "Examples", it adds a field sitepart to the index with the value "examples". The same goes for "Forum": this results in the value "forum" in the sitepart field. Using the sitepart field is similar to the previous method: extend the query with an additional string containing the filter: (keywords) AND sitepart:examples.
Index an External Website
Indexing an external website involves two steps:
...
```
index http://localhost:8080/web/webmanager/id=39016 1 127.0.0.1,localhost [5 0 * * *]
index http://www.gxsoftware.com/ 2 www.gx.nl [0 2 * * *]
```
...
```
http://www.gxsoftware.com/.* webid 26098
http://www.gxsoftware.com/.* langid 42
```
Documents indexed from www.gxsoftware.com will then have a valid webid and langid and will therefore be returned in the search results.
Implement a "Best Bets" Search
A "best bets" search is an addition to an existing search engine where results are returned on top of the normal search results. These results are handpicked by editors and are usually more relevant because they are handpicked.
Implementing a best bets algorithm in XperienCentral can be done by using the keywords fields. There is one important precondition: keywords must not be used for other things than the best bets, or otherwise the results cannot be predicted.
An example: assume that the top 10 queries overview shows that 10% of the visitors search for ‘download’ and the actual download page is at place 7 in the search results. This means the search term ‘download’ needs to be boosted to get a higher relevance. In order to do this the following steps have to be taken:
...
```
<xsl:if test="contains($authorization, '1')">
  <dt>
    <xsl:variable name="bestBetsMinimumScore" select="150"/>
    <xsl:variable name="nrOfBestbets" select="count(/root/system/searchresults/entry/score[text() > $bestBetsMinimumScore])"/>
    <xsl:if test="(score > $bestBetsMinimumScore) and (position()=1)">
      Recommended:<br/>
    </xsl:if>
    <xsl:if test="(position() = $nrOfBestbets + 1) and ($nrOfBestbets > 0)">
      <br/><br/>Other search results:<br/>
    </xsl:if>
    <!-- Show position -->
    <xsl:if test="$showordernumbers">
```
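In outline, this fragment defines a minimum score for best bets (150 in this example), counts how many results score above that threshold, prints a "Recommended:" label before the first result when it qualifies as a best bet, and prints an "Other search results:" label before the first result that falls below the threshold. The threshold value is something to tune against your own index and keyword boosting.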
...
Excluding Pages from the Search Index
There are two ways to exclude pages or other content types from the index:
- Clear the "Include in search index" option property for pages which will result in an extra meta tag
<meta name="robots" content="noindex" />
in the HTML of the page. - Create a
robots.txt
file.
The search engine looks for a robots.txt file before indexing a website. More information about robots.txt files can be found here: http://www.robotstxt.org/robotstxt.html. The robots.txt file must be stored in the root of the website, in the statics folder, so that it is accessible at the URL /robots.txt. Make sure that the robots.txt file does not block normal search engines like Google when a site goes live; use the User-agent parameter to prevent this.
Some examples of robots.txt files:
Don’t allow any search engine to index the website:
User-agent: *
Disallow: /
Don’t allow the XperienCentral search engine to index the website:
User-agent: george
Disallow: /
Don’t allow the XperienCentral search engine to index the pages with URL */web/Examples/* or the login page:
...
Analysis
Once you have information about the queries that people use, it's time to sit down with some people who have in-depth knowledge of the content on the website. It's important to realize that analyzing search queries and search behavior is not a technical process; it's all about linking website content to the website visitors. Therefore the best way to analyze the results of your research is to sit down with several content owners, editors, domain experts or any other role within the organization that can assist in this process.
One way to do this is to organize a session where someone presents the top search queries and summarizes the feedback from search engine users. In this session an attempt can be made to link the top queries to the most relevant pages. Make sure that everyone is aware that it's not only about what visitors want to find, but also about what the organization wants them to find! This is not only important for companies that want to sell products or services, but for any organization. It's all about conversion, and conversion is not only about leading visitors to ordering the most expensive product, but also about answering citizens' questions (for governments), helping visitors find the self-service page, and so on. Write down the links between top queries and pages, plus other suggestions from the content experts.
In the same session try to get answers to fundamental questions such as:
- Do visitors always want the most recent documents first, or is the relevance more important than the document date? Is this for all documents, or only some documents?
- Do visitors expect a) answers, b) links or c) direct information? This is different from asking “do we want to provide our visitors with answers/links/direct information?”. Choosing a) for example can have a lot of implications for your information structure and search engine, but if that is what visitors expect then this should be the goal.
- Do we really need to include the 10,000 documents from our document management system in the search index? Does it bring anything extra, or does it just lower the relevance of our normal web pages?
...
Improve
After this session it’s time for some homework again, because it’s important to find out why a certain top 10 query leads to page X and not to page Y, which is the best page according to the content experts. The Setup tool can be used to get more information about the relevance score for the queries.
There are two main reasons why a page is less relevant than another page:
- the number of matching words on a page, combined with the index factor of the fields they appear in, is lower
- the size of the page is larger, i.e. smaller pages tend to be more relevant than long documents
The first is by far the most important factor. Luckily there are several ways to get higher scores. First of all, make sure the page is indexed properly: check that the fields content, langid, keyword(s), summary, title and webid are all filled correctly. Use a tool to inspect the index (see chapter 5.1).
The index factor settings can lead to wrong results. The index factors for fields are specified in the properties.txt file. When not specified, the default settings are:

```
factor.title=10
factor.description=5
factor.keyword=10
factor.location=1
```
These factors tell the search engine how important each field is. By default the title is 10 times more important than the location. If the ‘keyword’ field is filled with irrelevant values, or not filled at all, then it might be smart to clean up all keyword settings or set the keyword factor to a lower value (5 or 1). Tip: don’t use ‘default meta keywords settings’ in Config > Web initiative configuration > [General]; leave it empty.
The same goes for the ‘title’ field. In most (default) presentations the title field is prefixed with the same value on every page, for example the company name. If it’s important that a search for the company name leads to one specific page, this is almost impossible unless you remove the company name from the title or lower the factor.title setting. For more information about using the properties.txt file see chapter 4.1.
These checks will help improve the relevance of certain pages by analyzing what’s actually indexed and then removing unnecessary information or changing the index factors. After changing index factors or changing the presentation it’s necessary to fully re-index the website, because documents are not indexed in isolation but in relation to the other documents; a full re-index is needed to get accurate results.