...
Filling the index with documents is done in three steps: crawling (1)-(2), parsing (3), and indexing (4). Search results are retrieved from the index by a client such as a page in XperienCentral with a search element (5), by the Setup Tool, or by a command line client such as george-control (Linux) or george-client.bat (Windows).
...
Retrieving the documents starts with retrieving the URLs of all documents and pages. XperienCentral provides links to all items that should be indexed on one page: the indexer page. This page contains references to:
- Pages of all web initiatives
- Media Content Repository items created in the last 5 days
- Documents (uploaded to a Download element)
- Special content types
The URL of the indexer page is configured in the properties.txt file (metaurl parameter) and is the default start page in the Setup Tool. Note that the setting in the properties.txt file is leading. The search engine crawler retrieves all the URLs from the indexer page and creates a request for each document. Requests are always sent to the front-end in order to benefit from existing caching and to take authorization and personalization into account. With the default configuration, the crawler only indexes links found on the indexer page; it does not follow links in the retrieved documents. This means the default index depth is 1. An index depth of 1 ensures that the indexing process is efficient and that no outbound links (links outside the web initiative) are found and indexed.
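Conceptually, a depth-1 crawl collects the links on the indexer page and requests each of them once, without following any links inside the retrieved documents. The following is a minimal Python sketch of that idea, not the actual crawler implementation; the HTML structure and the `fetch` callback are assumptions for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on the indexer page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def depth_one_crawl(indexer_html, fetch):
    """Index depth 1: request every link listed on the indexer page,
    but do not recurse into links inside the retrieved documents."""
    collector = LinkCollector()
    collector.feed(indexer_html)
    return {url: fetch(url) for url in collector.links}
```

Because the crawl stops after one level, the set of indexed documents is exactly the set of links the indexer page exposes, which is what keeps outbound links out of the index.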
A request for a document is not only a direct request for the document itself - additional meta information is requested as well. This extra meta information is provided by the indexer page: the crawler requests the indexer page again with an additional document= parameter. For example, to index the homepage of a local XperienCentral installation, the crawler requests the URL http://localhost:8080/web/webmanager?id=39016&document=http%3A%2F%2F127.0.0.1%3A8080%2Fweb%2Fshow%2Fid%3D26111. The value of the document parameter is the URL-encoded URL of the homepage. When this URL is requested, a small XML result is returned, similar to the following:
```xml
<document>
  <langid>42</langid>
  <contenttype>page</contenttype>
  <date>2007-12-24</date>
  <webid>26098</webid>
  <pagepath>p26111</pagepath>
  <pagepath_00_name>Home</pagepath_00_name>
  <pagepath_00_url>http://127.0.0.1:8080/web/Home.htm</pagepath_00_url>
</document>
```
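Building such a meta-information request amounts to URL-encoding the document URL and appending it as the document parameter. A short Python sketch (the indexer-page id 39016 is taken from the example above; the helper name is made up for illustration):

```python
from urllib.parse import quote


def meta_url(indexer_page, document_url):
    """Append the URL-encoded document URL as the document= parameter.

    safe='' forces '/' and ':' to be percent-encoded as well,
    matching the form the crawler uses."""
    return f"{indexer_page}&document={quote(document_url, safe='')}"


url = meta_url("http://localhost:8080/web/webmanager?id=39016",
               "http://127.0.0.1:8080/web/show/id=26111")
# url now carries the percent-encoded homepage URL as its document parameter
```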
...
All retrieved documents are parsed before they are stored in the index: every document is converted to a plain text format. Office (Word, Excel) and PDF documents are converted with external programs. These programs are usually executable files and are configured in <searchengine-directory>/conf/properties.txt. The mapping between document type and converter is configured in <searchengine-directory>/conf/parser.txt. The mapping can be based on both the file extension and the content-type, which is retrieved from the HTTP header. For more information about the parser.txt file, see chapter 4.3, Parser Configuration.
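The selection logic can be pictured as follows. This is a hedged Python sketch of the idea, not the actual parser.txt syntax; the converter names, the example mappings, and the extension-before-content-type precedence are all assumptions for illustration:

```python
# Hypothetical mappings in the spirit of parser.txt: both the file
# extension and the HTTP content-type can resolve to an external
# converter program (names here are placeholders).
BY_EXTENSION = {
    ".pdf": "pdftotext",
    ".doc": "word2text",
    ".xls": "excel2text",
}
BY_CONTENT_TYPE = {
    "application/pdf": "pdftotext",
    "text/html": None,  # HTML is parsed internally, no external program
}


def pick_converter(url, content_type):
    """Choose a converter: try the file extension first, then fall
    back to the content-type from the HTTP header (ordering assumed)."""
    for ext, converter in BY_EXTENSION.items():
        if url.lower().endswith(ext):
            return converter
    return BY_CONTENT_TYPE.get(content_type)
```

A URL such as report.pdf would match by extension, while an extensionless URL served as application/pdf would still be routed to the PDF converter via its content-type.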
...
Indexing
The index is the storage for the indexed information. In essence it is a small database tuned for fast information retrieval. Like a database, it contains a set of documents (one indexed document corresponds to one record in the database) and every document has a set of fields, comparable to database tables and their columns. Before documents are stored in the index, they go through three processes. The first is the normalization process, in which characters are converted to their plain, lowercase equivalents: for example, "ç" is converted to "c" and "C" is converted to "c". As a result, a user searching for "barcelona" will also find documents containing "Barçelona".
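The effect of the normalization step can be reproduced in a few lines of Python; this is a sketch of the idea using Unicode decomposition, not the engine's actual code:

```python
import unicodedata


def normalize(text):
    """Decompose accented characters (NFKD), drop the combining
    marks, and lowercase the result: "Barçelona" becomes "barcelona"."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()
```

Applying the same function to both the indexed text and the query terms is what makes accent- and case-insensitive matching possible.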
...