...

...

Indexing

The index stores the indexed information. In effect, it is a small database that is tuned for fast information retrieval. Like a database, it contains a set of documents (one indexed document is one document in the database), and every document has a set of fields, much as a table has table fields. Before documents are stored in the index, they go through three processes. The first is the normalizing process, in which characters are converted to their plain, lowercase equivalents. For example, "ç" is converted to "c", and "C" is converted to "c". As a result, a user who searches for "barcelona" also gets results for documents that contain "Barçelona".
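The normalizing step can be illustrated with a minimal Python sketch. It is not the XperienCentral implementation; the function name `normalize` is hypothetical, and it assumes that stripping diacritics through Unicode decomposition followed by lowercasing approximates the behavior described above.

```python
import unicodedata


def normalize(text: str) -> str:
    """Strip diacritics and lowercase the text (hypothetical helper)."""
    # NFKD splits "ç" into the base letter "c" plus a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so that only the base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


print(normalize("Barçelona"))  # barcelona
```

Because the same normalization would be applied to both the indexed text and the search terms, "barcelona" and "Barçelona" end up as the same string.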

The second process, tokenizing, breaks up words and sentences into so-called tokens. These tokens are counted, and the number of occurrences of each token inside and outside the document is stored. The token count is one of the important factors for relevance. For example, when a document contains the token "car" 9 times and the rest of the website contains only one other "car" token, this document is highly relevant when searching for "car" or "cars".
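The counting in the "car" example can be sketched as follows. This is an illustration only: the helpers `tokenize` and `token_counts` are hypothetical, the tokenizer simply splits on non-alphanumeric characters, and the sketch does not reproduce the scoring that the search engine actually performs. Matching "cars" to "car" would need an additional step that is not shown here.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split lowercased text into alphanumeric tokens (simplified assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_counts(document: str, other_documents: list[str], token: str) -> tuple[int, int]:
    """Return how often `token` occurs inside the document and outside it."""
    inside = Counter(tokenize(document))[token]
    outside = sum(Counter(tokenize(other))[token] for other in other_documents)
    return inside, outside


document = "car " * 9                      # the token "car" appears 9 times
rest_of_site = ["one car here", "no vehicles on this page"]
inside, outside = token_counts(document, rest_of_site, "car")
print(inside, outside)  # 9 1
```

A token that is frequent inside one document but rare elsewhere on the website, as here, is what makes that document stand out for the query.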

...

Once a website has been successfully indexed, you can perform tests on the search index. The XperienCentral search engine is a customized version of the popular open source search engine Lucene. The query syntax is almost the same as that of Lucene. The entire syntax won’t be repeated here because it is described in detail in the Lucene documentation: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html. For a detailed overview of the scoring algorithm of the Lucene search engine, see this page: https://lucene.apache.org/core/3_6_0/scoring.html
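As a rough illustration of the Lucene query syntax, the sketch below builds a fielded query string in Python. The helpers `escape_term` and `field_query` are hypothetical, and the lowercase field names `contenttype` and `langid` are assumed spellings of fields mentioned in this document; check the actual index for the exact names. The escaped characters are the special characters listed in the Lucene query parser documentation.

```python
import re

# Characters with a special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&|])')


def escape_term(term: str) -> str:
    """Prefix every special character with a backslash."""
    return _SPECIAL.sub(r"\\\1", term)


def field_query(field: str, value: str) -> str:
    """Build a `field:value` clause, quoting values that contain spaces."""
    escaped = escape_term(value)
    if " " in value:
        return f'{field}:"{escaped}"'
    return f"{field}:{escaped}"


clauses = [field_query("contenttype", "page"), field_query("langid", "43"), "car"]
query = " AND ".join(clauses)
print(query)  # contenttype:page AND langid:43 AND car
```

Escaping matters mostly for values such as URLs, where a colon would otherwise be read as a field separator.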

...

The XperienCentral search engine has a variable set of fields, depending on the type of documents that have been indexed. Below is a list of the most important fields that are always part of the search index:

...

| Field Name | Description | Example Values |
|---|---|---|
| Children | Contains URLs to the child pages of this document. | http://127.0.0.1/web/show/id=26111/langid=43/dbid=2/typeofpage=75501, 127.0.0.1:9000/web/show/id=26111/langid=43/channel=pdf |
| Contenttype | The content type. | Possible values include: page, element_holder, image, flash, product, jellyfishdownload, jellyfishdocument |
| Description | The description of the document taken from the HTML description meta tag. | This combination enables continuous web innovation |
| Hostname | The hostname of the document. | 127.0.0.1 |
| Keyword | A keyword. | WebManager |
| Keywords | Meta keywords taken from the HTML keywords meta tag. | WebManager |
| Langid | The language ID of the document. | 43 (Dutch), 42 (English) |
| Location | The URL of the document. | http://127.0.0.1:8080/web/News/import-wcm.swf.htm |
| Longdate | The date the document was created (only relevant for Word documents or PDFs, less for HTML pages). | 20080922102830396 |
| Modified | The date the document was last modified. | 20080922102830396 |
| Indexed | The date the document was first indexed. | 20080922102830396 |
| Pagepath | The combination of web IDs that lead to the document. | p26111p70532 (the document is below the homepage (id=26111) and a subpage (id=70532) below the homepage) |
| Pagepath_00_name | The name of the root page of the document. | Home |
| Pagepath_00_url | The URL of the root page of the document. | http://127.0.0.1:8080/web/Home.htm |
| Pagepath_xx_name | The name of the level xx page that leads to the document. The range of xx is between 00 and the depth of the website. | |
| Pagepath_xx_url | The URL of the level xx page that leads to the document. | http://127.0.0.1:8080/web/DeveloperWeb.htm |
| WebID | The ID of the web initiative to which the document belongs. | 26111 |
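The example values for the date fields and for Pagepath can be decoded with a short Python sketch. The helpers `parse_index_date` and `parse_pagepath` are hypothetical, and the layout of the date value (year, month, day, hour, minute, second, then three digits of milliseconds) is an assumption inferred from the example value 20080922102830396, not a documented format.

```python
import re
from datetime import datetime


def parse_index_date(value: str) -> datetime:
    """Parse a 17-digit value, assumed to be yyyyMMddHHmmss plus milliseconds."""
    base = datetime.strptime(value[:14], "%Y%m%d%H%M%S")
    # The last three digits are assumed to be milliseconds.
    return base.replace(microsecond=int(value[14:17]) * 1000)


def parse_pagepath(value: str) -> list[int]:
    """Split a Pagepath value such as "p26111p70532" into page IDs, root first."""
    return [int(page_id) for page_id in re.findall(r"p(\d+)", value)]


print(parse_index_date("20080922102830396").isoformat())  # 2008-09-22T10:28:30.396000
print(parse_pagepath("p26111p70532"))                     # [26111, 70532]
```

The first ID in the parsed Pagepath is the homepage, and each following ID is one level deeper in the site structure.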

...

To view the contents of a search engine index, see chapter 5.1, Analyze the Search Index.
