The /wiki/spaces/PD/pages/24707083 page explains how to add or change the metadata of a SOLR document. However, the one piece of information you cannot change that way is the document's location, and sometimes that is exactly what you need to do. For example, when your website runs on HTTPS but the SSL offloading is performed by the load balancer in front of your server, the website cannot be indexed with its https URLs. You have to change them to http.
Fortunately, the wmasolrsearch add-on of GX WebManager defines a UrlProvider service in the nl.gx.product.wmasolrsearch.api package for exactly this purpose.
Adding URLs
If you just want to add your own set of URLs to the SOLR index, define a new service that implements the UrlProvider interface and its method:
String[] getUrls(boolean includeAll);
Through the OSGi dependency mechanism in GX WebManager, this new UrlProvider service is automatically picked up by the SearchService in the wmasolrsearch add-on, and its list of URLs is appended to the default list of GX WebManager.
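As a minimal sketch of such a provider, the class below returns a fixed set of extra URLs. The local UrlProvider interface is only a stand-in so that the example compiles on its own; in a real bundle you would import the interface from nl.gx.product.wmasolrsearch.api. The class name, the example.com URLs, and the reading of includeAll as "this is a full index run" are assumptions made for illustration, and the OSGi service registration is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for nl.gx.product.wmasolrsearch.api.UrlProvider, declared here only
// so the sketch is self-contained. Use the add-on's own interface in a real bundle.
interface UrlProvider {
    String[] getUrls(boolean includeAll);
}

// Hypothetical provider that contributes a fixed set of extra URLs.
class ExtraUrlProvider implements UrlProvider {

    // Hypothetical URLs that are returned on every call
    private static final String[] ALWAYS = {
        "http://www.example.com/news/index.htm"
    };

    // Hypothetical URLs that are only returned when includeAll is true.
    // Assumption: includeAll == true corresponds to a full index run.
    private static final String[] FULL_ONLY = {
        "http://www.example.com/archive/index.htm"
    };

    @Override
    public String[] getUrls(boolean includeAll) {
        List<String> urls = new ArrayList<>(Arrays.asList(ALWAYS));
        if (includeAll) {
            urls.addAll(Arrays.asList(FULL_ONLY));
        }
        return urls.toArray(new String[0]);
    }
}
```

Returning a fresh array on each call keeps the provider stateless, so it is safe for the caller to modify the result.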
Changing or excluding URLs
If you want to change the URLs determined by the default UrlProvider of GX WebManager, you can do this by first capturing these URLs, changing them, and then feeding them to the SearchService in the wmasolrsearch add-on:
    private UrlProvider m_urlProvider;     // Injected by OSGi
    private SearchService m_searchService; // Injected by OSGi

    private void index(boolean fullIndex, boolean clearRest) {
        // Get all the URLs from GX WebManager that should be indexed
        String[] urls = m_urlProvider.getUrls(fullIndex);

        // Update the URLs in the list as you see fit
        ...

        // Index the URLs, allowing all hostnames and setting the follow-URL depth to 0.
        // The SearchService class also has a different indexPages method in which you can
        // specify the allowed hostnames and the depth.
        m_searchService.indexPages(urls, clearRest);
    }
When you do this, you normally do not want the default indexer task of GX WebManager to run. You can disable it by emptying the wmasolrsearch.crontabschedule configuration setting on the GX WebManager setup page.
A more complete version of the indexing code logs the number of URLs and restricts the hostnames only when they are configured:

    // Get all the URLs that should be indexed
    String[] urls = m_urlProvider.getUrls(indexFullContent);
    LOG.info("The url provider returned " + urls.length + " url's to index.");

    // Add, remove or change the URLs
    urls = updateUrls(urls);
    LOG.info("Updating the url's resulted in " + urls.length + " url's to index.");

    // Index the URLs
    String allowedHostNames = m_configUtil.getAllowedHostNames();
    if (allowedHostNames != null && allowedHostNames.trim().length() > 0) {
        // Restrict indexing to the configured hostnames, with a follow-URL depth of 0
        m_searchService.indexPages(urls, allowedHostNames.split(","), 0, m_configUtil.getClearRest());
    } else {
        // Allow all hostnames and set the depth to 0
        m_searchService.indexPages(urls, m_configUtil.getClearRest());
    }
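The updateUrls step is not spelled out above. One possible implementation, covering the SSL offloading case from the introduction, could look like the following sketch. The class name UrlRewriter and the excluded /private/ path are hypothetical and only illustrate how URLs can be changed or dropped; nothing here is part of the wmasolrsearch API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; neither the class name nor the exclusion rule comes from the add-on.
class UrlRewriter {

    private static final String HTTPS_PREFIX = "https://";
    private static final String HTTP_PREFIX = "http://";

    // Hypothetical path fragment marking pages that should not be indexed
    private static final String EXCLUDED_PATH = "/private/";

    // Returns a new array in which excluded URLs are dropped and
    // https URLs are rewritten to http. The input array is not modified.
    static String[] updateUrls(String[] urls) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            // Exclude: skip null entries and anything under the excluded path
            if (url == null || url.contains(EXCLUDED_PATH)) {
                continue;
            }
            // Change: swap the scheme, matching "https://" case-insensitively
            if (url.regionMatches(true, 0, HTTPS_PREFIX, 0, HTTPS_PREFIX.length())) {
                url = HTTP_PREFIX + url.substring(HTTPS_PREFIX.length());
            }
            result.add(url);
        }
        return result.toArray(new String[0]);
    }
}
```

Only the scheme prefix is replaced, so host, path, and query string are passed on to the indexer unchanged.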
See also the attached example: urlProviderService.zip