...

Indexing

The index stores the indexed information. In effect, it is a small database tuned for fast information retrieval. Like a database, it contains a set of documents (one indexed document corresponds to one document in the database), and every document has a set of fields, much like tables and table fields. Before documents are stored in the index, they go through three processes. The first is normalization, in which characters are converted to their plain, lowercase equivalents. For example, "ç" is converted to "c", and "C" is converted to "c". As a result, a user who searches for "barcelona" also gets results for documents that contain "Barçelona".
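The normalization step can be illustrated with a minimal Python sketch. It is not the XperienCentral implementation; it only assumes that normalization means removing diacritics and lowercasing, as described above. The function name `normalize` is hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Strip diacritics and lowercase the text (illustrative only)."""
    # NFKD decomposition splits "ç" into "c" plus a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base characters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize("Barçelona"))  # barcelona
```

If the same function is applied to both the indexed text and the search query, "barcelona" and "Barçelona" end up as the same string, which is why the search matches.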

The second process, tokenizing, breaks sentences up into so-called tokens. These tokens are counted, and the number of occurrences of each token inside and outside the document is stored. The token count is one of the important factors for relevance. For example, when a document contains the token "car" 9 times and the rest of the website contains only one other "car" token, this document is highly relevant when searching for "car" or "cars".
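The counting described above can be sketched as follows. This is a simplified illustration, not the actual indexing code: the tokenizer is a plain regular expression, the helper names `tokenize` and `token_counts` are hypothetical, and the sketch matches exact tokens only (it does not relate "cars" to "car").

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase the text and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def token_counts(doc_id: str, corpus: dict[str, str], token: str) -> tuple[int, int]:
    """Return (occurrences inside the document, occurrences in all other documents)."""
    inside = Counter(tokenize(corpus[doc_id]))[token]
    outside = sum(
        Counter(tokenize(text))[token]
        for other_id, text in corpus.items()
        if other_id != doc_id
    )
    return inside, outside


# A made-up three-page website mirroring the "car" example.
corpus = {
    "garage": "car " * 9,
    "about": "We once sold a car.",
    "contact": "Call us.",
}
inside, outside = token_counts("garage", corpus, "car")
print(inside, outside)  # 9 1
```

A token that is frequent in one document and rare elsewhere, as here, is the situation in which that document ranks highly for the token.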

...

Once a website has been successfully indexed, you can perform tests on the search index. The XperienCentral search engine is a customized version of the popular open source search engine Lucene, and its query syntax is almost the same as Lucene's. The entire syntax is not repeated here because it is described in detail online: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html. For a detailed overview of the scoring algorithm of the Lucene search engine, see this page: https://lucene.apache.org/core/3_6_0/scoring.html
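As a rough illustration of Lucene-style query strings, the following Python sketch builds a fielded query of the form `term AND field:value`. The helpers `escape`, `field_clause` and `build_query` are hypothetical and not part of XperienCentral or Lucene. The field names come from the field list on this page, but whether the index expects exactly this spelling and casing is an assumption, as is any XperienCentral-specific deviation from standard Lucene syntax.

```python
import re

# Characters with a special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')


def escape(value: str) -> str:
    # Prefix every special character with a backslash.
    return _SPECIAL.sub(r"\\\1", value)


def field_clause(field: str, value: str) -> str:
    escaped = escape(value)
    # Values containing whitespace are written as a quoted phrase.
    if re.search(r"\s", value):
        return f'{field}:"{escaped}"'
    return f"{field}:{escaped}"


def build_query(text: str, **fields: str) -> str:
    """Combine a free-text term with field restrictions using AND."""
    clauses = [escape(text)]
    clauses += [field_clause(name, value) for name, value in fields.items()]
    return " AND ".join(clauses)


print(build_query("car", Contenttype="page", Langid="43"))
# car AND Contenttype:page AND Langid:43
```

The resulting string is only the query text; how it is submitted to the search engine is outside the scope of this sketch.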

...

The XperienCentral search engine has a variable set of fields, depending on the type of documents that have been indexed. Below is a list of the most important fields that are always part of the search index:

...

Field Name	Description	Example Values
`Children`	Contains URLs to the child pages of this document	http://127.0.0.1/web/show/id=26111/langid=43/dbid=2/typeofpage=75501 127.0.0.1:9000/web/show/id=26111/langid=43/channel=pdf
`Contenttype`	The content type.	Possible values include: `page, element_holder, image, flash, product, jellyfishdownload, jellyfishdocument`
`Description`	The description of the document taken from the HTML `description` meta tag.	This combination enables continuous web innovation
`Hostname`	The hostname of the document.	127.0.0.1
`Keyword`	A keyword.	WebManager
`Keywords`	Meta keywords taken from the HTML `keywords` meta tag.	WebManager
`Langid`	The language ID of the document.	43 (=Dutch), 42 (=English)
`Location`	The URL of the document.	http://127.0.0.1:8080/web/News/import-wcm.swf.htm
`Longdate`	The date the document was created (only relevant for Word documents or PDFs, less for HTML pages).	20080922102830396
`Modified`	The date the document was last modified.	20080922102830396
`Indexed`	The date the document was first indexed.	20080922102830396
`Pagepath`	The combination of web IDs that lead to the document.	p26111p70532 (the document is below the homepage (id=26111) and a subpage (id=70532) below the homepage)
`Pagepath_00_name`	The name of the root page of the document.	Home
`Pagepath_00_url`	The URL of the root page of the document.	http://127.0.0.1:8080/web/Home.htm
`Pagepath_xx_name`	The name of the level xx page that leads to the document. The range of xx is between 00 and the depth of the website.
`Pagepath_xx_url`	The URL of the level xx page that leads to the document.	http://127.0.0.1:8080/web/DeveloperWeb.htm
`WebID`	The ID of the web initiative to which the document belongs.	26111
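The example values in the table can be decoded with a small Python sketch. Two assumptions are made here that the table does not confirm: the 17-digit date values are read as year, month, day, hour, minute, second and milliseconds (`yyyyMMddHHmmssSSS`), and a `Pagepath` value is read as a sequence of numeric IDs, each prefixed with `p`. The function names are hypothetical.

```python
from datetime import datetime


def parse_index_timestamp(value: str) -> datetime:
    """Parse a 17-digit date value, assuming the layout yyyyMMddHHmmssSSS."""
    if len(value) != 17 or not value.isdigit():
        raise ValueError(f"expected 17 digits, got {value!r}")
    return datetime(
        int(value[0:4]),            # year
        int(value[4:6]),            # month
        int(value[6:8]),            # day
        int(value[8:10]),           # hour
        int(value[10:12]),          # minute
        int(value[12:14]),          # second
        int(value[14:17]) * 1000,   # milliseconds converted to microseconds
    )


def parse_pagepath(value: str) -> list[int]:
    """Split a Pagepath value such as "p26111p70532" into its page IDs."""
    return [int(part) for part in value.split("p") if part]


print(parse_index_timestamp("20080922102830396"))  # 2008-09-22 10:28:30.396000
print(parse_pagepath("p26111p70532"))              # [26111, 70532]
```

Read this way, the `Pagepath` example lists the homepage (26111) first and the subpage (70532) second, which matches the description in the table.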

...

To view the contents of a search engine index, see chapter 5.1 Analyze the Search Index.
