...
During startup, the search engine needs some basic configuration parameters to be available. These basic settings are stored in the file properties.txt. The general format in the properties.txt file is [config parameter]=[config value]. The names of the configuration parameters are case-sensitive. Comments can be added by putting a # at the start of the line. Additional explanation is given in the *** comments of the properties.txt file.
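The format described above can be illustrated with a small Python sketch. This is not the search engine's own loader; it only mirrors the stated rules (one [config parameter]=[config value] pair per line, case-sensitive names, # at the start of a line for comments). The parameter names in the sample are hypothetical, and trimming whitespace around names and values, as well as ignoring lines without an = sign, are assumptions.

```python
def parse_properties(text):
    """Parse [config parameter]=[config value] lines into a dict."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines that start with '#'.
        if not line or line.startswith("#"):
            continue
        # Assumption: lines without '=' are ignored rather than rejected.
        if "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, value = line.split("=", 1)
        # Keys are stored as written: names are case-sensitive.
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical parameter names, for illustration only.
sample = """\
# this line is a comment
index.dir=/var/search/index
Index.Dir=/tmp/other
"""

settings = parse_properties(sample)
print(settings)
```

Because the names are case-sensitive, index.dir and Index.Dir in the sample end up as two separate entries.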
...
Task Configuration: crontab.txt
...
For more information and examples see Crontab. The fullindex and index commands have three arguments:
...
This configuration specifies that a local website is indexed with depth=1 at five past midnight every day. At 2 AM the index is checked for all URLs (*) and non-existent URLs are removed.
...
...
The parser.txt file is read every minute, so it is not required to restart the search service when the contents are edited. Every document is matched top-down and from left to right. The document will be sent to the parser of every line that matches. When no valid parser is found, the document will not be indexed. This also applies to the special parser name '-', which means that the document type will not be indexed.
...
Even though the search engine indexes the website through the frontend, a basic form of authentication is required to retrieve the indexer page and the meta information of documents. The authentication for the search engine is configured in the file credentials.xml. Besides basic authentication, credentials.xml can also contain advanced authentication for secure websites and documents.
...
```xml
<credentials>
  <credential pattern="http://www.gxsoftware.com/web/show/.*" type="postform" username="gxsearch" password="Search987">
    <!-- indicate which input parameters in the login form correspond to the user and password -->
    <param name="userparam" value="f48305" />
    <param name="passwordparam" value="f48306" />
    <!-- the action url george needs to post the user/password to -->
    <param name="actionurl" value="http://www.gx.nl/web/formhandler?source=form" />
    <!-- include all input parameters in the form -->
    <formparam name="id" value="29347" />
    <formparam name="pageid" value="47952" />
    <formparam name="handle" value="form" />
    <formparam name="ff" value="47954" />
    <formparam name="form" value="48067" />
    <formparam name="formelement" value="47954" />
    <formparam name="originalurl" value="http://www.gx.nl/web/show/id=40945/cfe=47954/ff=47954" />
    <formparam name="errorurl" value="http://www.gx.nl/web/show/id=40945/cfe=47954/ff=47954/formerror=47954" />
    <formparam name="f48305" value="" />
    <formparam name="formpartcode" value="f48305" />
    <formparam name="f48306" value="" />
    <formparam name="formpartcode" value="f48306" />
  </credential>
  <!-- other credentials here -->
</credentials>
```
...
Additional Meta Data: meta.txt
...
In this example, documents with URLs that contain the string "javadoc" will get an additional field pagetype with value "javadoc". The second line creates a field owner with value "gx" for all documents from the website www.gxsoftware.com. The format for the meta.txt file is <URL pattern><tab><index field><tab><index value>. The URL pattern is a regular expression. The separator has to be a single tab character, not several spaces. Some IDEs (such as Eclipse) can be configured to automatically convert tabs to spaces, which can lead to unwanted behavior. The meta.txt file is read every minute, so it is not required to restart the search engine after the file has been changed. The reason for setting these properties is that they can be used for filtering the search results. For example, based on the above meta.txt, it is very easy to filter out all the items that have "gx" as value for the property owner.
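To make the meta.txt format more concrete, the following Python sketch reads lines of the form <URL pattern><tab><index field><tab><index value> and computes the additional fields for a URL. It is an illustration only, not the search engine's implementation. The two patterns are assumed stand-ins for the rules described above, and requiring the regular expression to match the whole URL is also an assumption.

```python
import re


def load_meta_rules(text):
    """Parse meta.txt-style lines: <URL pattern><tab><index field><tab><index value>."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        # A line whose tabs were converted to spaces does not split into
        # three columns, so it is rejected here instead of silently ignored.
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated columns: {line!r}")
        pattern, field, value = parts
        rules.append((re.compile(pattern), field, value))
    return rules


def extra_fields(url, rules):
    """Return the additional index fields for a URL."""
    fields = {}
    for pattern, field, value in rules:
        # Assumption: the pattern has to match the entire URL.
        if pattern.fullmatch(url):
            fields[field] = value
    return fields


# Assumed patterns that mirror the two rules described in the text.
META_TXT = (
    ".*javadoc.*\tpagetype\tjavadoc\n"
    "http://www\\.gxsoftware\\.com/.*\towner\tgx\n"
)

rules = load_meta_rules(META_TXT)
fields = extra_fields("http://www.gxsoftware.com/javadoc/index.html", rules)
print(fields)
```

A URL that matches both lines receives both fields, which is what makes it possible to filter search results on pagetype or owner afterwards.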