Search Engine How Tos
This topic describes a number of frequently performed search engine improvements and customizations. Most of these must be carried out by a developer and/or a system administrator.
Analyze the Search Index
One of the basic tasks of working with the search engine is knowing what is indexed and how content is indexed. Therefore it can be helpful to have full access to the search index. There are several ways to do this:
Local machines
On local machines or developer PCs, a tool like Luke can be used to analyze the search index. You can download Luke at http://www.getopt.org/luke/. Luke is a Java program that can be started as a .jar file by opening it in a Java runtime environment or by right-clicking the .jar file and choosing "Open" from the context menu. Once it is open, select the search index directory and browse through all the fields.
Servers
On remote servers there are two ways to analyze the index. The first is to (g)zip or tar the index directory, copy it to your computer and use the method described in the previous paragraph. The second is to use a client connection to the search engine. On Windows servers a batch file can be used. On Unix servers a connection can be set up with the following command from the search engine directory:
java -Djava.security.policy=conf\grantall_policy -jar ... lib/webmanager-searchengine-client-10.12.0.jar ... rmi://localhost:1099/indexer-wm
The first marked string is the XperienCentral version and the second marked string is the RMI connection string, which is configured in the properties.txt file. Once a connection has been made to the RMI client, commands can be entered. To see a list of all commands, enter ?<Enter>. Two useful commands are:
- Search - enter a search string and see the results. More fields are displayed than with a regular query, but not all fields are displayed.
- Urls - enter a pattern such as *News* or * to see indexed URLs
Implement a Category Search
Large websites can contain hundreds of thousands of pages and documents, so when it comes to searching it can help a lot if visitors can search within parts of a website. There are several ways to split a website into parts, or "categories", to use the generic term.
Method 1: Top Level Categories
With this method visitors get an extra drop-down list next to the normal search field. This drop-down list contains the pages directly below the homepage, in other words the top-level pages. If the site has the following structure:
- Home
- Tour
- Examples
- Contact
- About
Then users have the option to search in "Tour", "Examples", "Contact" or "About". When the search engine indexes the pages of a website, it also indexes the location and the path of every page. These page names and page paths are stored in separate fields in the index. For example, a page located below the "Contact" page contains the following fields:
pagepath_00_name: Home
pagepath_00_url: http://127.0.0.1:8080/web/Home.htm
pagepath_01_name: Contact
pagepath_01_url: http://127.0.0.1:8080/web/Contact.htm
pagepath_02_name: Documentation
pagepath_02_url: http://127.0.0.1:8080/web/Contact/Documentation.htm
The normal query is then extended with an extra filter on the pagepath_01_name field, for example: (keywords) AND pagepath_01_name:Contact.
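Combining the visitor's keywords with this filter can be done in any server-side presentation code. The following minimal Java sketch only illustrates the string handling; the class and method names are hypothetical and are not part of an XperienCentral API:

/**
 * Minimal sketch of building a category-filtered query.
 * SearchQueryFilter is a hypothetical helper, not an XperienCentral class;
 * it only shows how an extra field filter is appended to the keywords.
 */
public final class SearchQueryFilter {

    /**
     * @param keywords the search terms entered by the visitor
     * @param field    the index field to filter on, for example pagepath_01_name
     * @param value    the value the field must have, for example Contact
     * @return the query string to pass to the search engine
     */
    public static String append(String keywords, String field, String value) {
        if (value == null || value.trim().isEmpty()) {
            // No category selected: search the whole website.
            return keywords;
        }
        return "(" + keywords + ") AND " + field + ":" + value;
    }

    public static void main(String[] args) {
        // A visitor searching for "manual" within the "Contact" category:
        System.out.println(append("manual", "pagepath_01_name", "Contact"));
        // Prints: (manual) AND pagepath_01_name:Contact
    }
}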
Method 2: URL Filters
The search engine contains a configuration file called meta.txt. This file can contain extra parsers for URLs, which result in extra fields and information in the index. For example, enter the following code in the meta.txt file:

.*/Examples/.* sitepart examples
.*/Forum/.* sitepart forum
When the crawler encounters a URL that contains "Examples", it adds a sitepart field to the index with the value "examples". The same happens for "Forum": this results in the value "forum" in the sitepart field. Using this sitepart field is similar to the previous method: extend the query with an additional string containing the filter: (keywords) AND sitepart:examples.
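The hypothetical SearchQueryFilter helper from the sketch under Method 1 can be reused here unchanged; only the field name differs:

// Reusing the hypothetical SearchQueryFilter helper from the Method 1 sketch:
String query = SearchQueryFilter.append("download manual", "sitepart", "examples");
// query is now: (download manual) AND sitepart:examples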
Index an External Website
Indexing an external website involves two steps:
- Create a new entry in the search engine cronjob. The crontab.txt file can be extended to index one or more external websites. The task can use its own time schedule, and the depth and the valid hosts can be specified. The following cronjob indexes a local website at 5 past midnight and also the www.gxsoftware.com website at 2 AM; for the latter, the homepage is indexed plus all linked pages up to a maximum depth of 2:

index http://localhost:8080/web/webmanager/id=39016 1 127.0.0.1,localhost [5 0 * * *]
index http://www.gxsoftware.com/ 2 www.gx.nl [0 2 * * *]

As an alternative to a cronjob, the external website can be indexed manually on the Search Tools tab in the Setup Tool.
- Change the meta.txt file to map the external website to the right search index. The queries that are executed from a normal search element filter on webid and langid. Therefore, to include the external content in the search results of a certain website, that website's webid and langid have to be added during indexing. This can be done by extending the meta.txt file:

http://www.gxsoftware.com/.* webid 26098
http://www.gxsoftware.com/.* langid 42
Documents indexed from www.gxsoftware.com will then have a valid webid and langid and will therefore be returned in the search results.
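Since both the URL parsers from Method 2 above and these webid/langid mappings are configured in meta.txt, a single file would contain both kinds of lines. A sketch using only the example values shown earlier:

.*/Examples/.* sitepart examples
.*/Forum/.* sitepart forum
http://www.gxsoftware.com/.* webid 26098
http://www.gxsoftware.com/.* langid 42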
Implement a "Best Bets" Search
A "best bets" search is an addition to an existing search engine where results are returned on top of the normal search results. These results are handpicked by editors and are usually more relevant because they are handpicked.
Implementing a best bets algorithm in XperienCentral can be done by using the keywords fields. There is one important precondition: keywords must not be used for other things than the best bets, or otherwise the results cannot be predicted.
An example: assume that the top 10 queries overview shows that 10% of the visitors search for "download" and that the actual download page appears at position 7 in the search results. This means the search term "download" needs to be boosted to get a higher relevance. To do this, perform the following steps:
- Find the current score for the query "download" by entering the query "download" on the Search Tools tab in the Setup Tool. The score is shown between brackets, between the position and the date, for example "(30)".
- Navigate to Configuration > Channel Configuration > [General] and make sure the field "Default meta keywords" is empty.
- Navigate to the "Download" page in the Workspace.
- Click [Edit] in the Properties widget and select the SEO tab.
- Enter the keywords "Download" and "Downloads" in the keywords field and click [Apply]
- Change the search engine configuration file properties.txt in the /conf directory and add a new property factor.keyword=500, or if this parameter already exists, change the current value to 500.
- Restart the search engine.
- For best results, re-index the entire website, or, if the website is really large, re-index just this page by entering its URL in the Setup Tool.
- Navigate to the Setup Tool, go to the [Search Tools] tab and search for "download" again. The score should now be considerably higher.
Depending on your wishes, you can change the search results presentation to reflect the score. A simple script that separates the "best bet" search result(s) from the normal search results is:
<xsl:template match="//wm-searchresults-show">
  <xsl:variable name="normal" select="@normal" />
  <xsl:variable name="header" select="@header" />
  <xsl:variable name="showordernumbers" select="@showordernumbers = 'true'" />
  <xsl:variable name="showpath" select="@showpath = 'true'" />
  <xsl:variable name="showlead" select="@showlead = 'true'" />
  <xsl:variable name="showquery" select="@showquery" />
  <xsl:variable name="showtype" select="@showtype" />
  <xsl:variable name="searchid" select="@searchid" />
  <xsl:variable name="baseUrl" select="@baseUrl" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="orgkeyword" select="translate(/root/system/requestparameters/parameter[name='orgkeyword']/value, $uppercase, $lowercase)" />
  <xsl:if test="count(/root/system/searchresults) > 0">
    <xsl:choose>
      <xsl:when test="/root/system/searchresults/totalcount = 0">${helpText}</xsl:when>
      <xsl:otherwise>
        <div class="searchresults">
          <p>
            <xsl:if test="$header != ''">
              <xsl:attribute name="class"><xsl:value-of select="$header" /></xsl:attribute>
            </xsl:if>
            <xsl:text disable-output-escaping="yes">${wmfn:escapeToHTML(showText)} </xsl:text>
            <xsl:value-of select="(/root/system/searchresults/from + 1)" />
            <xsl:text>-</xsl:text>
            <xsl:choose>
              <xsl:when test="(/root/system/searchresults/totalcount) &lt; (/root/system/searchresults/to)">
                <xsl:value-of select="/root/system/searchresults/totalcount" />
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="/root/system/searchresults/to" />
              </xsl:otherwise>
            </xsl:choose>
            <xsl:text disable-output-escaping="yes"> (${wmfn:escapeToHTML(foundText)} </xsl:text>
            <xsl:value-of select="/root/system/searchresults/totalcount" />
            <xsl:text> ${wmfn:escapeToHTML(entriesText)})</xsl:text>
          </p>
          <p>
            <xsl:text>${wmfn:escapeToHTML(searchOnText)} "</xsl:text>
            <xsl:choose>
              <xsl:when test="$showquery != ''">
                <xsl:value-of select="$showquery" />
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="/root/system/searchresults/query" />
              </xsl:otherwise>
            </xsl:choose>
            <xsl:text>"</xsl:text>
          </p>
          <!-- Show navigation -->
          <xsl:call-template name="shownav">
            <xsl:with-param name="index">0</xsl:with-param>
            <xsl:with-param name="max">100</xsl:with-param>
            <xsl:with-param name="totalcount"><xsl:value-of select="/root/system/searchresults/totalcount" /></xsl:with-param>
            <xsl:with-param name="currentfrom"><xsl:value-of select="/root/system/searchresults/from" /></xsl:with-param>
            <xsl:with-param name="class"><xsl:value-of select="$normal" /></xsl:with-param>
            <xsl:with-param name="searchid"><xsl:value-of select="$searchid" /></xsl:with-param>
            <xsl:with-param name="baseUrl"><xsl:value-of select="$baseUrl" /></xsl:with-param>
          </xsl:call-template>
          <dl>
            <xsl:for-each select="/root/system/searchresults/entry">
              <xsl:variable name="authorization">
                <xsl:call-template name="check_searchresults_readaccess">
                  <xsl:with-param name="authorizedgroups">
                    <xsl:for-each select="meta">
                      <xsl:if test="name = 'webusergroups'"><xsl:value-of select="value" /></xsl:if>
                    </xsl:for-each>
                  </xsl:with-param>
                  <xsl:with-param name="loginrequired">
                    <xsl:value-of select="meta[name = 'loginrequired']/value" />
                  </xsl:with-param>
                </xsl:call-template>
              </xsl:variable>
              <xsl:if test="contains($authorization, '1')">
                <xsl:if test="count(meta[name='keyword' and translate(value, $uppercase, $lowercase) = $orgkeyword])">
                  Recommended:<br/>
                </xsl:if>
Excluding Pages from the Search Index
There are two ways to exclude pages or other content types from the index:
- Clear the "Include in search index" option property for pages which will result in an extra meta tag
<meta name="robots" content="noindex" />
in the HTML of the page. - Create a
robots.txt
file.
The search engine looks for a robots.txt file before indexing a website. More information about robots.txt files can be found here: http://www.robotstxt.org/robotstxt.html. The robots.txt file must be stored in the root of the website, in the statics folder, so that it is accessible at the URL /robots.txt. Make sure that the robots.txt file does not block normal search engines like Google when the site goes live; use the User-agent parameter to prevent this.
Some examples of robots.txt files:
Don’t allow any search engine to index the website:
User-agent: *
Disallow: /
Don’t allow the XperienCentral search engine to index the website:
User-agent: george
Disallow: /
Don’t allow the XperienCentral search engine to index the pages with URL */web/Examples/* or the login page:
User-agent: george
Disallow: /web/Examples/
Disallow: /web/Login.html
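A related pattern, sketched here under the assumption that the XperienCentral crawler identifies itself with the user agent "george" as in the examples above, is a staging environment where the internal search engine may index everything while all other search engines are kept out:

User-agent: george
Disallow:

User-agent: *
Disallow: /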