Magnolia 5.4 reached end of life on November 15, 2018. This branch is no longer supported, see End-of-life policy.
The Solr module uses the Apache Solr search platform to index and crawl Magnolia content. Solr is a standalone enterprise search server with a REST-like API.
The Magnolia Solr bundle consists of two modules: Content Indexer and Solr Search Provider.
Solr uses the Lucene library for full-text indexing and provides faceted search, distributed search and index replication. You can use Solr to index content in an event-based or action-based fashion. From version 5.0 the module is compatible with Solr 5.3; older versions of the module are compatible with Solr 4.
Maven is the easiest way to install the modules. Add the following dependencies to your bundle:
```xml
<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-content-indexer</artifactId>
  <version>5.0.3</version>
</dependency>

<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-solr-search-provider</artifactId>
  <version>5.0.3</version>
</dependency>
```
If you install with JAR files, include the dependent third-party libraries.
Apache Solr is a standalone search server. You need the server in addition to the Magnolia Solr modules.
Download Apache Solr and extract the zip to your computer.
A schema file specifies what fields the Magnolia content can contain, how those fields are added to the index, and how they are queried. https://cwiki.apache.org/confluence/display/solr/Documents%2C+Fields%2C+and+Schema+Design
A SolrRequestHandler is a Solr Plugin that defines the logic executed for any request. https://wiki.apache.org/solr/SolrRequestHandler
Create a new magnolia config set by duplicating the $SOLR_HOME/server/solr/configsets/data_driven_schema_configs folder and naming the copy magnolia_data_driven_schema_configs ($SOLR_HOME/server/solr/configsets/magnolia_data_driven_schema_configs).
Download the Magnolia example configuration files (based on the Solr data_driven_schema_configs, see https://cwiki.apache.org/confluence/display/solr/Config+Sets) and overwrite the default files in the newly created magnolia_data_driven_schema_configs/conf:
Go to $SOLR_HOME/bin, start the Solr server and create a new core called magnolia:

```shell
cd $SOLR_HOME/bin
./solr start
./solr create_core -c magnolia -d magnolia_data_driven_schema_configs
```
This type of startup works for testing and development purposes. For a production installation, see Taking Solr to Production.
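Once the core is created, you can quickly check that it responds. This is only a smoke test, under the assumption that Solr runs on its default port 8983:

```
curl "http://localhost:8983/solr/magnolia/select?q=*:*&wt=json"
```

A freshly created, empty core returns a JSON response with `numFound` of 0.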
A schema file specifies what fields the Magnolia content can contain, how those fields are added to the index, and how they are queried. An ExtractingRequestHandler extracts searchable fields from Magnolia pages.
Download the configuration files and overwrite the default files in $SOLR_HOME/example/solr/collection1/conf/:
```
solr/
  bin/
  contrib/
  dist/
  docs/
  example/
    solr/
      collection1/
        conf/
          schema.xml
          solrconfig.xml
  licenses/
```
Go to the example directory and start Solr:

```shell
cd $SOLR_HOME/example
java -jar start.jar
```
This type of startup works for testing and development purposes. For a production installation, see Taking Solr to Production.
This version contains changes in solrconfig.xml and managed-schema. Please read the notes below before updating to 5.0.2.
Fixed the issue of two indexers/crawlers mutually overwriting the resulting index when indexing the same content, for example when one indexer indexed the English translation and another the German translation (MGNLEESOLR-102).
The problem was caused by using the JCR uuid (indexers) and the URL (crawlers) as the unique identifier for Solr indexes. Fixing this issue required changes in solrconfig.xml and managed-schema:
- `<uniqueKey>` in managed-schema was changed to `uuid`.
- The default unique key field in info.magnolia.search.solrsearchprovider.logic.providers.FacetedSolrSearchProvider was changed to `uuid`.
- solrconfig.xml now generates the `uuid` field from a combination of the `type` and `id` fields. The https://wiki.apache.org/solr/Deduplication method is used for generating the uuid. For more details, see the change in the code diff.

Option 1:
If you don't plan to index the same content with two different indexers or crawlers, then you don't need to update solrconfig.xml and managed-schema for your Solr core. The only change you need to make is to add a uniqueKeyField property with value id to your Solr search result page.
Option 2:
Use the new solrconfig.xml and managed-schema configuration files for your Solr core and for $SOLR_HOME/server/solr/configsets/magnolia_data_driven_schema_configs.
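The uuid generation referenced above can be sketched as a Solr deduplication chain. This is a hedged illustration based on the Solr Deduplication wiki linked earlier; the chain and processor names below are assumptions, and the actual shipped solrconfig.xml may differ:

```xml
<!-- Illustrative sketch only: generates the uuid field from type + id
     using Solr's SignatureUpdateProcessorFactory (see the Deduplication wiki).
     The real magnolia config set may use different chain/field names. -->
<updateRequestProcessorChain name="uuid-generation">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">uuid</str>
    <str name="fields">type,id</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Because the signature is computed from the same `type` and `id` values on every run, re-indexing the same content overwrites the existing document instead of creating a duplicate.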
Because of the changes in the configuration files, all Solr indexes need to be recreated. Probably the easiest way to do this is to recreate the Solr core and then retrigger indexing in Magnolia:
1. Use the new solrconfig.xml and managed-schema configuration files for the $SOLR_HOME/server/solr/configsets/magnolia_data_driven_schema_configs Magnolia config set.
2. Delete the magnolia core and create it again:

```shell
cd $SOLR_HOME/bin
./solr delete -c magnolia
./solr create_core -c magnolia -d magnolia_data_driven_schema_configs
```

3. Retrigger the indexers by changing their indexed property to false.
Solr Search Provider module version 5.0 brings support for Solr 5 (officially tested with version 5.3.1).
Full changelog for version 5.0 https://jira.magnolia-cms.com/browse/MGNLEESOLR/fixforversion/18141
Because of the changes in the module, it is recommended to completely recreate the Solr indexes after upgrading to version 5.0.
org.apache.solr.client.solrj.SolrServer is deprecated and was replaced by org.apache.solr.client.solrj.SolrClient in the solr-solrj 5.x library. Because of that, the info.magnolia.search.solrsearchprovider.MagnoliaSolrBridge#getSolrServer method was changed to info.magnolia.search.solrsearchprovider.MagnoliaSolrBridge#getSolrClient.
Solr Search Provider module version 3.0 delivers the following key fixes and enhancements:
The magnolia-solr-search-provider-theme module has been removed (MGNLEESOLR-66).
Full changelog for version 3.0 https://jira.magnolia-cms.com/browse/MGNLEESOLR/fixforversion/17434
Because of the changes in the module, it is recommended to completely recreate the Solr indexes after upgrading to version 3.0.
The Content Indexer module is a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The content indexer also allows you to crawl external websites using JSoup and CSS selectors. You then define the field mappings that are extracted from each node and indexed in the Solr index.
Configure an indexer in Configuration > /modules/content-indexer/config/indexers
. Example configurations for indexing a website and DAM assets are provided. Duplicate one of the examples to index another site or workspace.
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexers | |
websiteIndexer | |
fieldMappings | |
abstract | abstract |
author | author |
date | date |
teaserAbstract | mgnlmeta_teaserAbstract |
text | content |
title | title |
enabled | true |
indexed | false |
pull | false |
rootNode | / |
type | website |
workspace | website |
Properties:

Property | Description
---|---
enabled | required. Enables the indexer.
indexed | required. Indicates whether indexing was done. When Solr finishes indexing, content-indexer sets this property to `true`.
nodeType | optional. JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to `mgnl:asset`.
pull | optional. Pull URLs instead of pushing.
assetProviderId | optional. ID of the asset provider to use when indexing assets.
rootNode | required. Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.
type | required. Sets the type of the indexed content, such as `website`.
workspace | required. Workspace to index.
fieldMappings | required. Defines how fields in Magnolia content are mapped to Solr fields. Left side is Magnolia, right side is Solr. You can use the fields available in the schema. If a field does not exist in Solr's schema you can use a dynamic field, such as `mgnlmeta_*`.
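The direction of the mapping (Magnolia property name on the left, Solr field name on the right) can be illustrated with plain maps. This is a standalone sketch, not module code; `FieldMappingDemo` and `mapFields` are illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates the fieldMappings idea from the indexer configuration:
// Magnolia property names (left side) map to Solr field names (right side).
public class FieldMappingDemo {

    // Apply configured mappings to a node's properties, producing the
    // document fields that would be sent to Solr. Unmapped properties
    // are simply skipped.
    static Map<String, Object> mapFields(Map<String, Object> nodeProps,
                                         Map<String, String> fieldMappings) {
        Map<String, Object> solrDoc = new HashMap<>();
        for (Map.Entry<String, String> m : fieldMappings.entrySet()) {
            Object value = nodeProps.get(m.getKey());
            if (value != null) {
                solrDoc.put(m.getValue(), value);
            }
        }
        return solrDoc;
    }

    public static void main(String[] args) {
        // Mappings taken from the websiteIndexer example above.
        Map<String, String> mappings = new HashMap<>();
        mappings.put("teaserAbstract", "mgnlmeta_teaserAbstract");
        mappings.put("text", "content");

        Map<String, Object> props = new HashMap<>();
        props.put("teaserAbstract", "A short teaser");
        props.put("text", "Full body text");
        props.put("ignored", "not mapped");

        System.out.println(mapFields(props, mappings));
    }
}
```

Only the two mapped properties end up in the resulting document; the `ignored` property is dropped.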
The indexer uses an IndexService to handle the indexing of a node. A basic implementation, info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService, is configured by default. You can define and configure your own IndexService for specific needs.
Implement the IndexService interface:
```java
public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

    private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

    @Override
    public boolean index(Node node, IndexerConfig config) {
        ...
```
Register the IndexService in the Content Indexer module configuration:
Node name | Value |
---|---|
modules | |
content-indexer | |
config | |
indexService | |
class | info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService |
The crawler mechanism uses the Scheduler to crawl a site periodically.
From version 3.0, crawlers can also be connected to the activation process by adding info.magnolia.module.indexer.crawler.commands.CrawlerIndexerActivationCommand into the command chain with the activation command. By default this is done for these activation/deactivation commands:
If you are using a custom activation command and you wish to connect it with the crawler mechanism, you can use the info.magnolia.module.indexer.setup.AddCrawlerIntoCommandChainTask install/update task.
Example: Configuration to crawl bbc.com
Node name | Value |
---|---|
bbc_com | |
sites | |
bbc | |
url | http://www.bbc.co.uk/ |
fieldMappings | |
abstract | #story_continues_1 |
keywords | meta[name=keywords] attr(0,content) |
depth | 2 |
enabled | false |
nbrCrawlers | 2 |
type | news |
Properties:

Property | Description
---|---
enabled | required. Enables the crawler.
depth | required. The max depth of a page in terms of distance in clicks from the root page. This should not be too high; ideally 2 or 3 max.
nbrCrawlers | required. The max number of simultaneous crawler threads that crawl a site. 2 or 3 is enough.
crawlerClass | optional, since version 3.0, default is info.magnolia.module.indexer.crawler.MgnlCrawler. Implementation of edu.uci.ics.crawler4j.crawler.WebCrawler used by the crawler to crawl sites.
 | optional, since version 3.0, default is content-indexer. Name of the catalog where the command resides.
 | optional, since version 3.0, default is crawlerIndexer. Command used to instantiate and trigger the crawler.
activationOnly | optional, since version 3.0. If set to true, the crawler is triggered only during activation. No scheduler job is registered for the crawler.
delay | optional, since version 3.0, default is 5s. Defines the delay (in seconds) after which the crawler starts once activation is done.
cron | optional, default is every hour. A CRON expression that specifies how often the site is crawled. CronMaker is a useful tool for building expressions.
type | optional. Sets the type of the crawled content, such as `news`.
sites | required. List of sites to crawl. For each crawler you can define multiple sites to crawl.
&nbsp;&nbsp;(site node) | required. Name of the site.
&nbsp;&nbsp;url | required. URL of the site.
&nbsp;&nbsp;fieldMappings | required. Defines how fields parsed from the site pages are mapped to Solr fields. Left side is the Solr field, right side is a selector for the crawled site.
&nbsp;&nbsp;(field) | required. You can use any CSS selector to target an element on the page. You can also use a custom syntax to get content inside attributes; for example, meta keywords are extracted using `meta[name=keywords] attr(0,content)`.
jcrItems | optional, since version 3.0. List of JCR items. If any of these items is activated, the crawler is triggered.
&nbsp;&nbsp;name | optional, since version 3.0. Name of the JCR item.
&nbsp;&nbsp;workspace | required, since version 3.0. Workspace where the JCR item is stored.
&nbsp;&nbsp;path | required, since version 3.0. Path of the JCR item.
 | optional, since version 5.0.2. Authentication information to allow crawling a password-restricted area.
&nbsp;&nbsp;username | required, since version 5.0.2. Username used to log into the restricted area.
&nbsp;&nbsp;password | required, since version 5.0.2. User's password used to log into the restricted area.
&nbsp;&nbsp; | required, since version 5.0.2. URL of the page with the login form.
&nbsp;&nbsp; | required, since version 5.0.2, default is mgnlUserID. Name of the input field for entering the username in the login form.
&nbsp;&nbsp; | required, since version 5.0.2, default is mgnlUserPSWD. Name of the input field for entering the password in the login form.
&nbsp;&nbsp;logoutUrlIdentifier | required, since version 5.0.2, default is mgnlLogout. String that identifies the logout URL. The crawler does not crawl URLs containing logoutUrlIdentifier, to avoid logging itself out.
The Solr Search Provider module contains templates to display search results on the site. It also provides faceted search components for refining the results further. The faceted search gets related facets from the search context. Suggestions and available fields are available in Freemarker context.
Configure the Solr server address in Configuration > /modules/solr-search-provider/config/solrConfig@baseURL. baseURL should be http://<domain_name>:<port>/solr/<solr_core_name>. If the Solr server was installed as described in the installation section, the baseURL is http://localhost:8983/solr/magnolia.
See HttpSolrClient Javadoc for other properties.
Node name | Value |
---|---|
solr-search-provider | |
config | |
solrConfig | |
allowCompression | false |
baseURL | http://localhost:8983/solr/magnolia |
connectionTimeout | 100 |
followRedirects | false |
maxConnectionsPerHost | 100 |
maxRetries | 0 |
maxTotalConnections | 100 |
soTimeout | 1000 |
Create a search results page using one of the available templates. Which template you use depends on the type of project you have and the modules that are installed.
Module | Template | Configuration |
---|---|---|
mte | mteSolrSearchResult | /modules/solr-search-provider/templates/mteSolrSearchResult |
standard-templating-kit | solrSearchResult | /modules/solr-search-provider/templates/solrSearchResult |
To try it in the demo travel site:
You can filter results by URL domain in the Filter url prefix field.
The example query title^100 abstract^0.1 will boost the rank for matches in the title field 1000 times more than equivalent matches in the abstract field.
The query will give the following results:
If instead you boost the abstract over the title you would get the following results for the same search. The returned snippets are now primarily from page titles.
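Boosts like these are typically passed to Solr through the dismax/edismax `qf` (query fields) parameter. A hedged example request, assuming the magnolia core from the installation section and the edismax parser (the search page may assemble the query differently):

```
http://localhost:8983/solr/magnolia/select?q=conference&defType=edismax&qf=title^100%20abstract^0.1
```

Swapping the two boost factors in `qf` reverses which field dominates the ranking, producing the effect described above.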
Positive filtering: Return only results where the keyword `conference` is present.

Negative filtering: Don't return results where the keyword `conference` is present.
You can add more filters by separating them by spaces.
The autocomplete search bar provides suggestions while you type into the search field. The jQuery UI Autocomplete widget and info.magnolia.search.solrsearchprovider.logic.servlets.SearchServlet are used for this functionality.
<script src="path to jquery.js" type="text/javascript"></script>
<script src="path to jquery-ui.js" type="text/javascript"></script>
Add this small JavaScript snippet to the search result page:
```javascript
var jq = jQuery.noConflict();
jq(document).ready(function () {
    jq("#searchbar, #nav-search, #search").autocomplete({
        open: function () {
            jq(this).autocomplete('widget').css('z-index', 999);
        },
        source: function (request, response) {
            jq.get("${contextPath}/searchservlet/",
                {search: request.term.toLowerCase(), queryType: "SUGGEST", fields: "collation", fq: "*"},
                function (data) {
                    response(data);
                },
                "json");
        },
        minLength: 2
    });
});
```
For more information, see this series of blog posts:
Suggestions