Magnolia 5.3 reached end of life on June 30, 2017. This branch is no longer supported, see End-of-life policy.
The Solr module uses the Apache Solr search platform to index and crawl Magnolia content. Solr is a standalone enterprise search server with a REST-like API. It uses the Lucene library for full-text indexing and provides faceted search, distributed search, and index replication. You can use Solr to index content in an event-based or action-based fashion. The module is compatible with Apache Solr 4.
The Magnolia Solr bundle consists of three modules:
See Installing a module on how to install the bundle using JARs or Maven dependencies.
If you install with JAR files, also include the dependent third-party libraries. Without them, the Tika parser and suggestions won't work.
```xml
<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-solr-search-provider</artifactId>
  <version>2.2</version>
</dependency>
<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-content-indexer</artifactId>
  <version>2.2</version>
</dependency>
<dependency>
  <!-- Optional. The theme module provides an autocomplete search bar. -->
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-solr-search-provider-theme</artifactId>
  <version>2.2</version>
</dependency>
```
Apache Solr is a standalone search server. You need the server in addition to the Magnolia Solr modules.
Download Apache Solr and extract the zip to your computer:
```
solr/
  bin/
  contrib/
  dist/
  docs/
  example/
    solr/
      collection1/
        conf/
          schema.xml
          solrconfig.xml
  licenses/
```
Go to the example directory and start Solr.
```
cd $SOLR_HOME/example
java -jar start.jar
```
This type of startup works for testing and development purposes. For a production installation, see Taking Solr to Production.
A schema file specifies what fields the Magnolia content can contain, how those fields are added to the index, and how they are queried. An ExtractingRequestHandler extracts searchable fields from Magnolia pages.
Download the configuration files and overwrite the default files in $SOLR_HOME/example/solr/collection1/conf/:
The Magnolia Solr Search Provider module changed its API in version 2.2; this was necessary to fix several issues. See the full changelog for version 2.2: https://jira.magnolia-cms.com/projects/MGNLEESOLR/versions/16739
Because of these changes, it is recommended to completely recreate the Solr indexes after upgrading to version 2.2.
Pagination logic was extracted from info.magnolia.search.solrsearchprovider.logic.model.AbstractSearchResultModel into the new info.magnolia.search.solrsearchprovider.logic.model.page.SolrPager class. The pager can be obtained with the AbstractSearchResultModel#getPager method. For example, AbstractSearchResultModel#getResult is replaced by SolrPager#getItems, AbstractSearchResultModel#getCount by SolrPager#getCount, and so on.
info.magnolia.search.solrsearchprovider.logic.providers.SearchService#search and info.magnolia.search.solrsearchprovider.logic.providers.SolrSearch#getSchemaFields now throw org.apache.solr.client.solrj.SolrServerException and java.io.IOException. Exceptions thrown during communication with the Solr server are no longer swallowed, so the client can take appropriate action.
Indexers now require a workspace Solr schema field. If you use indexers, you have to add the workspace field to your Solr schema configuration and retrigger reindexing by setting the indexed property to false for your indexers (see Indexer configuration).
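A minimal sketch of the schema addition, assuming the default schema.xml field types shipped with Solr 4; the field type is an assumption, so adjust it to match your schema:

```xml
<!-- Hypothetical example: the "workspace" field required by indexers in 2.2.
     The type "string" is an assumption; verify against your schema.xml. -->
<field name="workspace" type="string" indexed="true" stored="true"/>
```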
The Content Indexer module is a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The content indexer also allows you to crawl external websites using JSoup and CSS selectors. You then define field mappings that determine which fields are extracted from each node and indexed in the Solr index.
Configure an indexer in Configuration > /modules/content-indexer/config/indexers
. Example configurations for indexing a website and DAM assets are provided. Duplicate one of the examples to index another site or workspace.
Node name | Value |
---|---|
modules | |
&nbsp;&nbsp;content-indexer | |
&nbsp;&nbsp;&nbsp;&nbsp;config | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;indexers | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;websiteIndexer | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fieldMappings | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;abstract | abstract |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;author | author |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;date | date |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;teaserAbstract | mgnlmeta_teaserAbstract |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;text | content |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;title | title |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;enabled | true |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;indexed | false |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pull | false |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;rootNode | / |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;type | website |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;workspace | website |
Properties:

Property | Description |
---|---|
enabled | required. Enables the indexer. |
indexed | required. Indicates whether indexing was done. When Solr finishes indexing, content-indexer sets this property to true. Set it back to false to retrigger indexing. |
nodeType | optional. JCR node type to index. For example, if you were indexing assets in the Magnolia DAM you would set this to mgnl:asset. |
pull | optional. Pull URLs instead of pushing. |
assetProviderId | optional. ID of the asset provider; relevant when indexing DAM assets. |
rootNode | required. Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch. |
type | required. Sets the type of the indexed content, such as website. |
workspace | required. Workspace to index. |
fieldMappings | required. Defines how fields in Magnolia content are mapped to Solr fields. The left side is the Magnolia field, the right side the Solr field. You can use the fields available in the schema. If a field does not exist in Solr's schema you can use a dynamic field such as mgnlmeta_teaserAbstract. |
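The mgnlmeta_teaserAbstract mapping in the example above relies on such a dynamic field. A hedged sketch of what that declaration could look like in schema.xml (the field type is an assumption; check the schema files you downloaded):

```xml
<!-- Assumed dynamic field pattern matching the mgnlmeta_* names used in the
     fieldMappings example; verify the type against your schema.xml. -->
<dynamicField name="mgnlmeta_*" type="text_general" indexed="true" stored="true"/>
```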
The indexer uses an IndexService to handle the indexing of a node. A basic implementation, info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService, is configured by default. You can define and configure your own IndexService for specific needs.
Implement the IndexService interface:
```java
public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

    private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

    @Override
    public boolean index(Node node, IndexerConfig config) {
        ...
```
Register the IndexService in the Content Indexer module configuration:
Node name | Value |
---|---|
modules | |
&nbsp;&nbsp;content-indexer | |
&nbsp;&nbsp;&nbsp;&nbsp;config | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;indexService | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;class | info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService |
The crawler mechanism uses the Scheduler to crawl a site periodically.
Example: Configuration to crawl bbc.com
Node name | Value |
---|---|
bbc_com | |
&nbsp;&nbsp;sites | |
&nbsp;&nbsp;&nbsp;&nbsp;bbc | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;url | http://www.bbc.co.uk/ |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fieldMappings | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;abstract | #story_continues_1 |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;keywords | meta[name=keywords] attr(0,content) |
&nbsp;&nbsp;depth | 2 |
&nbsp;&nbsp;enabled | false |
&nbsp;&nbsp;nbrCrawlers | 2 |
&nbsp;&nbsp;type | news |
Properties:

Property | Description |
---|---|
enabled | required. When a crawler is enabled, the Scheduler triggers it periodically. |
depth | The max depth of a page in terms of distance in clicks from the root page. This should not be too high; 2 or 3 at most is ideal. |
nbrCrawlers | The max number of simultaneous crawler threads that crawl a site. 2 or 3 is enough. |
cron | optional, default is every hour. A CRON expression that specifies how often the site is crawled. CronMaker is a useful tool for building expressions. |
type | optional. Sets the type of the crawled content, such as news. |
sites | required. List of sites to crawl. For each crawler you can define multiple sites to crawl. |
(site name) | required. Name of the site node, for example bbc. |
url | required. URL of the site. |
fieldMappings | required. Defines how fields parsed from the site pages are mapped to Solr fields. The left side is the crawled site, the right side is Solr. |
(field) | required. You can use any CSS selector to target an element on the page, for example #story_continues_1. You can also use custom syntax to get content inside attributes; for example, meta keywords are extracted with meta[name=keywords] attr(0,content). |
The Solr Search Provider module contains templates to display search results on the site. It also provides faceted search components for refining the results further. The faceted search gets related facets from the search context. Suggestions and available fields are exposed in the Freemarker context.
Configure the Solr server address in Configuration > /modules/solr-search-provider/config/solrConfig@baseURL. See the HttpSolrServer Javadoc for the other properties.
Node name | Value |
---|---|
solr-search-provider | |
&nbsp;&nbsp;config | |
&nbsp;&nbsp;&nbsp;&nbsp;solrConfig | |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;allowCompression | false |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;baseURL | http://localhost:8983/solr/ |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;connectionTimeout | 100 |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;followRedirects | false |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;maxConnectionsPerHost | 100 |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;maxRetries | 0 |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;maxTotalConnections | 100 |
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;soTimeout | 1000 |
Create a search results page using the solrSearchResult page template. To try it in the demo-project, edit the home page properties and select your Solr results page in the Search Page field.
You can filter results by URL domain in the Search Results component dialog.
The example query title^100 abstract^0.1 boosts the rank of matches in the title field 1000 times higher than equivalent matches in the abstract field.
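The boost factors above attach to field names in the query string. A minimal illustrative sketch of how such a query could be assembled; the class and method names are hypothetical and not part of the Solr module API:

```java
// Illustrative helper for building a Solr query string with per-field boosts.
// BoostedQuery and withBoosts are hypothetical names, not a Magnolia or Solr API.
public class BoostedQuery {

    static String withBoosts(String term, String[] fields, double[] boosts) {
        StringBuilder query = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                query.append(' ');
            }
            // Emits field:term^boost, e.g. title:conference^100.0
            query.append(fields[i]).append(':').append(term).append('^').append(boosts[i]);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Matches in "title" are weighted 1000x more than matches in "abstract".
        System.out.println(withBoosts("conference",
                new String[]{"title", "abstract"},
                new double[]{100, 0.1}));
    }
}
```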
With the title boosted, the query returns results ranked by title matches first. If you instead boost the abstract over the title, the same search returns different results: the snippets now come primarily from the abstract field.
Positive filtering: return only results where the keyword conference is present.
Negative filtering: don't return results where the keyword conference is present.
You can add more filters by separating them with spaces.
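Assuming the filter field accepts standard Lucene query syntax (an assumption; verify against your Search Results component), the filters could look like:

```
+conference            positive filter: only results containing "conference"
-conference            negative filter: exclude results containing "conference"
+conference -archive   multiple filters separated by spaces
```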
Localized search
Suggestions