Search Engine Configuration
The XperienCentral search engine uses several configuration files, which are stored in the <searchengine-directory>/conf directory. The most important file is properties.txt, which contains the initial settings for the search engine. The other configuration files contain settings for meta information, credentials, and parser mappings.
General Configuration: properties.txt
When the search engine starts, some basic configuration parameters have to be available. These basic settings are stored in the file properties.txt. The general format in the properties.txt file is [config parameter]=[config value]. The names of the configuration parameters are case sensitive. Comments can be added by putting a # at the beginning of a line. Additional explanation is given in the comments in the properties.txt file.
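The exact parameters differ per installation. Purely as an illustration of the format (the parameter names below are hypothetical, not actual XperienCentral settings), a properties.txt file could look like this:

# lines starting with # are comments
# format: [config parameter]=[config value], parameter names are case sensitive
indexdir=/opt/searchengine/index
loglevel=info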
Task Configuration: crontab.txt
On production environments, indexing the website is a recurring task that is usually executed every 24 hours. The indexing schedule is configured in the crontab.txt file. This file is read frequently, so it is not necessary to restart the search service when changes have been made. The crontab.txt file uses the standard Unix format for cron jobs. The XperienCentral configuration on Windows servers and desktops does not use crontabs but instead uses standard Windows Services; for more information, see the Windows documentation.
The crontab.txt file contains one or more lines. Each line corresponds to one task and consists of three parts:
- Command - Example: fullindex, index, check
  - fullindex will first erase the index and then index the website.
  - index will index the website.
  - check will check all the URLs on the website and remove pages that no longer exist.
- Arguments - Example: http://edit.mywebsite.com/web/webmanager/id=39016
- Time interval - Format: [minutes hours day_of_month month weekday]
  - Example 1: [0 0 * * *] (run this task at 0:00, midnight)
  - Example 2: [30 2 * * *] (run this task at 2:30 AM)
  - Example 3: [55 * * * *] (run this task every hour at the 55th minute)
For more information and examples, see Crontab. The fullindex and index commands have three arguments:
- The URL of the index page.
- The depth of the indexing process.
- The list of host names that may be indexed. The host names in the list are comma-separated, and the list should at minimum contain all externally used host names, supplemented with the host name of the index page URL (see the first argument).
An example of a crontab.txt file (two lines, the first starting with index, the second with check):
index http://localhost:8080/web/webmanager/id=39016 1 127.0.0.1,localhost [5 0 * * *]
check * [0 2 * * *]
This configuration specifies that a local website has to be indexed with depth=1 at five minutes past midnight every day. At 2:00 AM the index is checked for all URLs (*) and non-existent URLs are removed.
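A fullindex task follows the same three-argument format. For example (the host names and schedule below are illustrative only):

fullindex http://edit.mywebsite.com/web/webmanager/id=39016 2 edit.mywebsite.com,www.mywebsite.com [0 3 * * 0]

This line would erase and rebuild the index every Sunday at 3:00 AM with depth=2, allowing the hosts edit.mywebsite.com and www.mywebsite.com to be indexed.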
Parser Configuration: parser.txt
The relevant properties of documents are retrieved by using parsers. The mapping between document type and parser type can be configured in the file parser.txt. This file contains one or more lines. Each line consists of three parts:
- A regular expression matched against the document URL. Example: .*\.pdf
- A regular expression matched against the document content type retrieved from the HTTP header. Example: application/pdf
- The full class name of the parser. Example: nl.gx.webmanager.searchengine.parser.XmlParser
Example of parser.txt:
.* .* nl.gx.webmanager.searchengine.parser.CentralContentParser
.*\.htm .* nl.gx.webmanager.searchengine.parser.HtmlParser
.*\.html .* nl.gx.webmanager.searchengine.parser.HtmlParser
.* text/html.* nl.gx.webmanager.searchengine.parser.HtmlParser
.*\.txt .* nl.gx.webmanager.searchengine.parser.TextParser
.* text/plain.* nl.gx.webmanager.searchengine.parser.TextParser
.*\.xml .* nl.gx.webmanager.searchengine.parser.XmlParser
.* text/xml.* nl.gx.webmanager.searchengine.parser.XmlParser
.*\.pdf .* nl.gx.webmanager.searchengine.parser.PdfParser
.* application/pdf.* nl.gx.webmanager.searchengine.parser.PdfParser
.*\.doc .* nl.gx.webmanager.searchengine.parser.AntiwordParser
The parser.txt file is read every minute, so it is not necessary to restart the search service when the contents are edited. Every document is matched top-down and from left to right, and the document is sent to the parser of every line that matches. When no valid parser is found, the document will not be indexed. The same applies to the special parser name '-', which means that the matching document type will not be indexed.
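For example, a line like the following (the .zip pattern is only an illustration) would exclude ZIP files from the index:

.*\.zip .* -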
Credentials Configuration: credentials.xml
Even though the search engine indexes the website through the frontend, a basic form of authentication is required to retrieve the indexer page and the meta information of documents. The authentication for the search engine is configured in the file credentials.xml. Besides basic authentication, credentials.xml can also contain advanced authentication for secure websites and documents.
Basic Authentication
In a default installation, the configuration is limited to creating a special search engine user and password, for example "gxsearch" with password "Search987", and entering this information in the credentials.xml file. For example:
<credentials> <credential pattern=".*localhost.*" type="postform" username="gxsearch" password="Search987" /> ... </credentials>
Advanced Authentication
XperienCentral supports three types of secure indexing: NTLM, basic authentication, and postform authentication. This can be set up by creating a credential pattern for the website (or part of it) and mapping this credential to the required login attributes of the authentication. All authentication types require at least a username and password; for NTLM authentication, a host and a domain attribute must also be specified.
NTLM Example
<credentials> <credential pattern="http://www.gx.nl/docs/.*" type="ntlm" username="Administrator" password="secretpa$$word" host="wmhost" domain="GX" /> <!—- other credentials here --> </credentials>
Basic Authentication Example
<credentials> <credential pattern="http://localhost/secret.*pdf" type="basic" username="admin" password="secretpa$$word" /> <!—- other credentials here --> </credentials>
Postform Authentication Example
Postform authentication relies on the cookie that is returned after the login form is submitted; this cookie contains the session ID required to index the protected URLs.
<credentials> <credential pattern="http://www.gxsoftware.com/web/show/.*" type="postform" username="gxsearch" password="Search987"> <!-- indicate which input parameters in the login form correspond to the user and password --> <param name="userparam" value="f48305" /> <param name="passwordparam" value="f48306" /> <!-- the action url george needs to post the user/password to --> <param name="actionurl" value=" http://www.gx.nl/web/formhandler?source=form" /> <!-- include all input parameters in the form --> <formparam name="id" value="29347" /> <formparam name="pageid" value="47952" /> <formparam name="handle" value="form" /> <formparam name="ff" value="47954" /> <formparam name="form" value="48067" /> <formparam name="formelement" value="47954" /> <formparam name="originalurl" value=" http://www.gx.nl/web /show/id=40945/cfe=47954/ff=47954" /> <formparam name="errorurl" value=" http://www.gx.nl/web /show/id=40945/cfe=47954/ff=47954/formerror=47954" /> <formparam name="f48305" value="" /> <formparam name="formpartcode" value="f48305" /> <formparam name="f48306" value="" /> <formparam name="formpartcode" value="f48306" /> </credential> <!—- other credentials here --> </credentials>
Additional Meta Data: meta.txt
Additional metadata can be provided during the indexing process by using the configuration file meta.txt. This file can be used to fill metadata fields in the index with values based on specific URLs. An example of a meta.txt file is:
.*/javadoc/.* pagetype javadoc
http://www.gxsoftware.com/.* owner gx
In this example, documents with URLs that contain the string "javadoc" get an additional field pagetype with value "javadoc". The second line creates a field owner with value "gx" for all documents from the website www.gxsoftware.com.
The format of the meta.txt file is <URL pattern><tab><index field><tab><index value>. The URL pattern is a regular expression. The separator has to be a tab, not one or more spaces; some IDEs (such as Eclipse) can be configured to automatically convert tabs to spaces, which can lead to unwanted behavior. The meta.txt file is read every minute, so it is not necessary to restart the search engine after the file has been changed.
The reason for setting these properties is that they can be used for filtering search results. For example, based on the meta.txt above, it is easy to filter out all items that have "gx" as the value of the owner property.