
Search Engine How Tos


This topic describes several frequently requested search engine improvements and changes. Most of them must be performed by a developer and/or a system administrator.




Analyze the Search Index

One of the basic tasks of working with the search engine is knowing what is indexed and how content is indexed. Therefore it can be helpful to have full access to the search index. There are several ways to do this:

Local machines

On local machines or developer PCs a tool like Luke can be used to analyze the search index. You can download Luke at http://www.getopt.org/luke/. This is a Java program that can be started as a .jar file by opening it in a Java runtime environment or by right-clicking on the .jar file and then choosing the open option from the Context menu. When it opens you can select the search index directory and browse through all the fields.


Servers

On remote servers there are two ways to analyze the index. The first is to (g)zip or tar the index directory, copy it to your computer and use the method described in the previous paragraph. The second is to use a client connection to the search engine. On Windows servers a batch file can be used. On Unix servers a connection can be set up with the following command from the search engine directory:


java -Djava.security.policy=conf\grantall_policy -jar ... lib/webmanager-searchengine-client-10.12.0.jar ... rmi://localhost:1099/indexer-wm


The version number in the .jar file name (10.12.0 in this example) is the XperienCentral version. The last argument (rmi://localhost:1099/indexer-wm) is the RMI connection string, which is configured in the properties.txt file. Once a connection has been made to the RMI client, commands can be entered. To see a list of all commands, type ? and press Enter. Two useful commands are:

  • Search - enter a search string and see the results. More fields are displayed than with a regular query, but not all fields are displayed.
  • Urls - enter a pattern such as *News* or * to see the indexed URLs.
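
The first server approach described above (archiving the index directory so it can be copied to a local machine) could be scripted as in the following minimal sketch. The function name and the paths in the usage note are hypothetical and not part of XperienCentral.

```python
import tarfile
from pathlib import Path


def archive_index(index_dir: str, archive_path: str) -> str:
    """Pack index_dir into a gzip-compressed tar archive and return its path."""
    source = Path(index_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"not a directory: {index_dir}")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores only the directory name, not the full server path.
        archive.add(str(source), arcname=source.name)
    return archive_path
```

For example, archive_index("/path/to/searchengine/index", "/tmp/index.tar.gz") would produce an archive that can be copied to a developer PC, unpacked and opened in Luke. Both paths are placeholders; use the index directory of your own installation.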





Implement a Category Search

Large websites can contain hundreds of thousands of pages and documents, so it helps a lot if visitors can restrict a search to part of a website. There are several ways to split a website into parts, or "categories", to use the generic term.

Method 1: Top Level Categories

With this method visitors get an extra drop-down list next to the normal search field. This drop-down list contains the pages right below the homepage or in other words, the top level pages. If the site has the following structure:

  • Home
    • Tour
    • Examples
    • Contact
    • About

Then users have the option to search in "Tour", "Examples", "Contact" or "About". When the search engine indexes the pages of a website, it also indexes the location and path of every page. These page names and page paths are stored in separate fields in the index. For example, a page located below the "Contact" page contains the following fields:

Pagepath_00_name: Home
Pagepath_00_url: http://127.0.0.1:8080/web/Home.htm
Pagepath_01_name: Contact
Pagepath_01_url: http://127.0.0.1:8080/web/Contact.htm
Pagepath_02_name: Documentation
Pagepath_02_url: http://127.0.0.1:8080/web/Contact/Documentation.htm


The normal query is extended with an extra filter on the Pagepath_01_name field, for example: (keywords) AND Pagepath_01_name:Contact.
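
As an illustration, the extended query could be assembled as in the sketch below. The function is hypothetical and not part of XperienCentral. The field name is passed in as a parameter because it must match the name exactly as it appears in your index, and the phrase quoting assumes Lucene-style query syntax.

```python
def build_category_query(keywords: str, category: str, field: str) -> str:
    """Return '(keywords) AND field:category' for a top-level category filter."""
    keywords = keywords.strip()
    category = category.strip()
    if not keywords or not category:
        raise ValueError("keywords and category must be non-empty")
    # Assumption: multi-word page names are quoted so they match as one phrase.
    if " " in category:
        category = f'"{category}"'
    return f"({keywords}) AND {field}:{category}"


# Example with the level-1 field name from the field listing above.
query = build_category_query("opening hours", "Contact", "Pagepath_01_name")
```

The value selected in the drop-down list would be passed as the category, and the text from the normal search field as the keywords.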

Method 2: URL Filters

The search engine contains a configuration file called meta.txt. This file can contain extra parsers for URLs, which result in extra fields and information in the index. For example, add the following lines to the meta.txt file:


.*/Examples/.*   sitepart   examples
.*/Forum/.*      sitepart   forum


When the crawler encounters a URL that contains "Examples", it adds a sitepart field to the index with the value "examples". The same happens for "Forum", which results in the value "forum" in the sitepart field. Using the sitepart field is similar to the previous method: extend the query with an additional filter string, for example (keywords) AND sitepart:examples.
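
Before a full crawl, it can be useful to check which URLs a pattern would tag. The following sketch only simulates the rules from the example above; it is not the search engine's own parser, and it assumes that each pattern is a regular expression that must match the entire URL. The host www.example.com is a placeholder.

```python
import re

# Rules mirroring the meta.txt example: (URL pattern, field name, field value).
RULES = [
    (r".*/Examples/.*", "sitepart", "examples"),
    (r".*/Forum/.*", "sitepart", "forum"),
]


def extra_fields(url: str, rules=RULES) -> dict:
    """Return the extra index fields whose pattern matches the whole URL."""
    fields = {}
    for pattern, name, value in rules:
        # Assumption: a pattern has to match the complete URL (re.fullmatch).
        if re.fullmatch(pattern, url):
            fields[name] = value
    return fields


examples_page = extra_fields("http://www.example.com/web/Examples/Demo.htm")
home_page = extra_fields("http://www.example.com/web/Home.htm")
```

Here examples_page contains the sitepart field with the value "examples", while home_page is empty because no pattern matches.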





Index an External Website

Indexing an external website involves two steps:

  1. Create a new entry in the search engine cronjob. The crontab.txt file can be extended to index one or more external websites. Each task can have its own time schedule, crawl depth and list of valid hosts. The following example indexes a local website at five past midnight and the www.gxsoftware.com website at 2 AM. For the external website, the homepage is indexed plus all linked pages up to a maximum depth of 2.


    index http://localhost:8080/web/webmanager/id=39016 1   127.0.0.1,localhost [5 0 * * *]
    index http://www.gxsoftware.com/ 2 www.gxsoftware.com [0 2 * * *]
    


    As an alternative to a cronjob, the external website can be indexed manually on the Search Tools tab in the Setup Tool.

  2. Change the meta.txt file to map the external website to the right search index. The queries executed from a normal search element filter on webid and langid. Therefore, to include the external pages in the search results of a particular website, that website's webid and langid must be added during indexing. This can be done by extending the meta.txt file:


    http://www.gxsoftware.com/.* webid   26098
    http://www.gxsoftware.com/.* langid  42
    


    Documents indexed from www.gxsoftware.com will then have a valid webid and langid and will therefore be returned in the search results.
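
To make the layout of a crontab.txt index task explicit, the sketch below splits such a line into its parts. The field order (start URL, depth, comma-separated valid hosts, schedule in square brackets) is inferred from the example in step 1, and the class and function names are hypothetical; this is not the search engine's own parser.

```python
import re
from typing import NamedTuple


class IndexTask(NamedTuple):
    url: str        # start URL of the crawl
    depth: int      # maximum link depth
    hosts: list     # hosts the crawler may visit
    schedule: str   # cron expression: minute hour day month weekday


# index <url> <depth> <host[,host...]> [<cron expression>]
LINE = re.compile(r"^\s*index\s+(\S+)\s+(\d+)\s+(\S+)\s+\[([^\]]+)\]\s*$")


def parse_index_task(line: str) -> IndexTask:
    """Split one 'index' line from crontab.txt into its four parts."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"not an index task: {line!r}")
    url, depth, hosts, schedule = match.groups()
    return IndexTask(url, int(depth), hosts.split(","), schedule.strip())


task = parse_index_task(
    "index http://localhost:8080/web/webmanager/id=39016 1   127.0.0.1,localhost [5 0 * * *]"
)
```

For the local website in the example, task.depth is 1, task.hosts holds the two valid hosts, and task.schedule is "5 0 * * *", which corresponds to five past midnight every day.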






Implement a "Best Bets" Search

A "best bets" search is an addition to an existing search engine in which handpicked results are shown above the normal search results. Because editors select these results by hand, they are usually more relevant.

A best bets algorithm can be implemented in XperienCentral by using the keywords field. There is one important precondition: keywords must not be used for anything other than best bets; otherwise the results are unpredictable.

An example: assume that the top 10 queries overview shows that 10% of the visitors search for "download" and the actual download page is at place 7 in the search results. This means the search term "download" needs to be boosted to get a higher relevance. In order to do this the following steps have to be performed:

  1. Find the current score for the query "download" by entering it in the Search Tools tab in the Setup Tool. The score is shown in parentheses between the position and the date, for example "(30)".
  2. Navigate to Configuration > Channel Configuration > [General] and make sure the field "Default meta keywords" is empty.
  3. Navigate to the "Download" page in the Workspace.
  4. Click [Edit] in the Properties widget and select the SEO tab.
  5. Enter the keywords "Download" and "Downloads" in the keywords field and click [Apply].
  6. Change the search engine configuration file properties.txt in the /conf directory and add a new property: factor.keyword=500, or if this parameter already exists change the current value to 500.
  7. Restart the search engine.
  8. For best results, re-index the entire website or, if the website is very large, re-index only the page by entering its URL in the Setup Tool.
  9. Navigate to the Setup Tool, go to the [Search Tools] tab and search for "download" again. The score should now be considerably higher.
  10. If desired, change the search results presentation to reflect the score. A simple script that separates the "best bet" search result(s) from the normal search results is:


    <xsl:template match="//wm-searchresults-show">
       <xsl:variable name="normal" select="@normal" />
       <xsl:variable name="header" select="@header" />
       <xsl:variable name="showordernumbers" select="@showordernumbers = 'true'" />
       <xsl:variable name="showpath" select="@showpath = 'true'" />
       <xsl:variable name="showlead" select="@showlead = 'true'" />
       <xsl:variable name="showquery" select="@showquery" />
       <xsl:variable name="showtype" select="@showtype" />
       <xsl:variable name="searchid" select="@searchid" />
       <xsl:variable name="baseUrl" select="@baseUrl" />
       <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
       <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
       <xsl:variable name="orgkeyword" select="translate(/root/system/requestparameters/parameter[name='orgkeyword']/value, $uppercase, $lowercase)" />
       <xsl:if test="count(/root/system/searchresults) > 0">
          <xsl:choose>
             <xsl:when test="/root/system/searchresults/totalcount = 0">${helpText}</xsl:when>
             <xsl:otherwise>
                <div class="searchresults">
                   <p>
                      <xsl:if test="$header != ''">
                         <xsl:attribute name="class"><xsl:value-of select="$header" /></xsl:attribute>
                      </xsl:if>
                      <xsl:text disable-output-escaping="yes">${wmfn:escapeToHTML(showText)}&nbsp;</xsl:text>
                      <xsl:value-of select="(/root/system/searchresults/from + 1)" />
                      <xsl:text>-</xsl:text>
                      <xsl:choose>
                         <xsl:when test="(/root/system/searchresults/totalcount) &lt; (/root/system/searchresults/to)">
                            <xsl:value-of select="/root/system/searchresults/totalcount" />
                         </xsl:when>
                         <xsl:otherwise>
                            <xsl:value-of select="/root/system/searchresults/to" />
                         </xsl:otherwise>
                      </xsl:choose>
                      <xsl:text disable-output-escaping="yes"> (${wmfn:escapeToHTML(foundText)}&nbsp;</xsl:text>
                      <xsl:value-of select="/root/system/searchresults/totalcount" />
                      <xsl:text> ${wmfn:escapeToHTML(entriesText)})</xsl:text>
                   </p>
                   <p>
                      <xsl:text>${wmfn:escapeToHTML(searchOnText)} "</xsl:text>
                      <xsl:choose>
                         <xsl:when test="$showquery != ''">
                            <xsl:value-of select="$showquery" />
                         </xsl:when>
                         <xsl:otherwise>
                            <xsl:value-of select="/root/system/searchresults/query" />
                         </xsl:otherwise>
                      </xsl:choose>
                      <xsl:text>"</xsl:text>
                   </p>
                   <!-- Show navigation -->
                   <xsl:call-template name="shownav">
                      <xsl:with-param name="index">0</xsl:with-param>
                      <xsl:with-param name="max">100</xsl:with-param>
                      <xsl:with-param name="totalcount"><xsl:value-of select="/root/system/searchresults/totalcount" /></xsl:with-param>
                      <xsl:with-param name="currentfrom"><xsl:value-of select="/root/system/searchresults/from" /></xsl:with-param>
                      <xsl:with-param name="class"><xsl:value-of select="$normal" /></xsl:with-param>
                      <xsl:with-param name="searchid"><xsl:value-of select="$searchid" /></xsl:with-param>
                      <xsl:with-param name="baseUrl"><xsl:value-of select="$baseUrl" /></xsl:with-param>
                   </xsl:call-template>
                   <dl>
                      <xsl:for-each select="/root/system/searchresults/entry">
                         <xsl:variable name="authorization">
                            <xsl:call-template name="check_searchresults_readaccess">
                               <xsl:with-param name="authorizedgroups">
                                  <xsl:for-each select="meta">
                                     <xsl:if test="name = 'webusergroups'"><xsl:value-of select="value" /></xsl:if>
                                  </xsl:for-each>
                               </xsl:with-param>
                               <xsl:with-param name="loginrequired">
                                  <xsl:value-of select="meta[name = 'loginrequired']/value" />
                               </xsl:with-param>
                            </xsl:call-template>
                         </xsl:variable>
                         <xsl:if test="contains($authorization, '1')">
                            <xsl:if test="count(meta[name='keyword' and translate(value, $uppercase, $lowercase)=$orgkeyword])">
                               Recommended:<br/>
                            </xsl:if>
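
The matching rule used in the XSL fragment above (a result counts as a best bet when one of its keywords equals the query, ignoring case) can also be expressed outside XSLT. The following is a minimal sketch of that idea in Python. The dictionary keys "url" and "keywords" and the sample URLs are assumptions for illustration and do not reflect the actual search result format.

```python
def split_best_bets(results, query):
    """Split results into (best_bets, normal) by keyword match with the query."""
    wanted = query.strip().lower()
    best_bets, normal = [], []
    for result in results:
        # Compare case-insensitively, like the translate() calls in the XSL.
        keywords = {k.strip().lower() for k in result.get("keywords", [])}
        if wanted and wanted in keywords:
            best_bets.append(result)
        else:
            normal.append(result)
    return best_bets, normal


results = [
    {"url": "/web/News.htm", "keywords": []},
    {"url": "/web/Download.htm", "keywords": ["Download", "Downloads"]},
]
best, rest = split_best_bets(results, "download")
```

With this input, the Download page ends up in best and the News page in rest, so the presentation layer can print a "Recommended" heading above the first group only.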
    






Exclude Pages from the Search Index

There are two ways to exclude pages or other content types from the index:

  1. Clear the "Include in search index" property for a page. This adds the extra meta tag <meta name="robots" content="noindex" /> to the HTML of the page.
  2. Create a robots.txt file.

The search engine looks for a robots.txt file before indexing a website. More information about robots.txt files can be found at http://www.robotstxt.org/robotstxt.html. The robots.txt file must be stored in the root of the website in the statics folder so that it is accessible at the root URL of the website. Make sure that the robots.txt file does not block regular search engines like Google when a site goes live. Use the User-agent directive to restrict rules to specific crawlers.

Some examples of robots.txt files:


Don’t allow any search engine to index the website:

User-agent: *
Disallow: /


Don’t allow the XperienCentral search engine to index the website:

User-agent: george
Disallow: /


Don’t allow the XperienCentral search engine to index the pages with URL */web/Examples/* or the login page:

User-agent: george
Disallow: /web/Examples/
Disallow: /web/Login.html
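
Rules like these can be checked before a site goes live. The sketch below uses the robots.txt parser from the Python standard library to test the last example. Note that this is Python's interpretation of the rules; the XperienCentral crawler and other search engines may match paths slightly differently. The host www.example.com is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: george
Disallow: /web/Examples/
Disallow: /web/Login.html
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


base = "http://www.example.com"
george_examples = is_allowed(ROBOTS_TXT, "george", base + "/web/Examples/Demo.htm")
george_home = is_allowed(ROBOTS_TXT, "george", base + "/web/Home.htm")
google_examples = is_allowed(ROBOTS_TXT, "Googlebot", base + "/web/Examples/Demo.htm")
```

Under these rules the user agent george is refused for the Examples pages but may fetch the home page, while Googlebot is not affected at all, which is the behavior you want when only the internal search engine should skip those pages.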

