Magnolia 5.3 reached end of life on June 30, 2017. This branch is no longer supported. See the End-of-life policy.


The Solr module uses the Apache Solr search platform to index and crawl Magnolia content. Solr is a standalone enterprise search server with a REST-like API. Solr uses the Lucene library for full-text indexing and provides faceted search, distributed search and index replication. You can use Solr to index content in an event-based or action-based fashion. The module is compatible with Apache Solr 4.

Installing the Magnolia Solr bundle

The Magnolia Solr bundle consists of three modules:

  • Content Indexer indexes Magnolia workspaces. It can also crawl a published website.
  • Search Provider provides templates for displaying Solr search results on the site and faceted search components.
  • Search Provider Theme provides autocomplete functionality that predicts the search term the user is typing.

See Installing a module for how to install the bundle using JARs or Maven dependencies.

If you install with JAR files, also include the dependent third-party libraries. Without them, the Tika parser and suggestions won't work.

Maven dependencies
<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-solr-search-provider</artifactId>
  <version>2.2</version>
</dependency>

<dependency>
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-content-indexer</artifactId>
  <version>2.2</version>
</dependency>
 
<dependency>
  <!-- Optional. The theme module provides an autocomplete search bar. -->
  <groupId>info.magnolia.solr</groupId>
  <artifactId>magnolia-solr-search-provider-theme</artifactId>
  <version>2.2</version>
</dependency>

Installing Apache Solr

Apache Solr is a standalone search server. You need the server in addition to the Magnolia Solr modules.

Download Apache Solr and extract the zip to your computer:

solr/
  bin/
  contrib/
  dist/
  docs/
  example/
    solr/
      collection1/
        conf/
          schema.xml
          solrconfig.xml
  licenses/

Starting Apache Solr

Go to the example directory and start Solr.

cd $SOLR_HOME/example
java -jar start.jar

This type of startup works for testing and development purposes. For production installation see Taking Solr to Production.

Configuring a schema and request handlers

A schema file specifies what fields the Magnolia content can contain, how those fields are added to the index, and how they are queried. An ExtractingRequestHandler extracts searchable fields from Magnolia pages.

Download the configuration files and overwrite the default files in $SOLR_HOME/example/solr/collection1/conf/:

Updating the Solr Search Provider module to version 2.2

The Magnolia Solr Search Provider module changed its API in version 2.2. This was necessary to fix several issues. For the full changelog for version 2.2, see https://jira.magnolia-cms.com/projects/MGNLEESOLR/versions/16739

Because of these changes, we recommend that you completely recreate the Solr indexes after upgrading to version 2.2.

 

Refactoring of pagination

The pagination logic was extracted from info.magnolia.search.solrsearchprovider.logic.model.AbstractSearchResultModel into the info.magnolia.search.solrsearchprovider.logic.model.page.SolrPager class. You can obtain the pager with the info.magnolia.search.solrsearchprovider.logic.model.AbstractSearchResultModel#getPager method.

For example, AbstractSearchResultModel#getResult is replaced by SolrPager#getItems, AbstractSearchResultModel#getCount by SolrPager#getCount, and so on.

Changes in exception handling when an error occurs during communication with the Solr server

info.magnolia.search.solrsearchprovider.logic.providers.SearchService#search and info.magnolia.search.solrsearchprovider.logic.providers.SolrSearch#getSchemaFields now throw org.apache.solr.client.solrj.SolrServerException and java.io.IOException. Exceptions thrown during communication with the Solr server are no longer swallowed, so the client can take appropriate action.

Changes in Solr schema configuration

Indexers now require a workspace field in the Solr schema. If you use indexers, add the workspace field to your Solr schema configuration and retrigger indexing by setting the indexed property to false for your indexers (see Indexer configuration).

Indexing Magnolia workspaces

The Content Indexer module is a recursive repository indexer and an event-based indexer. You can configure multiple indexers for different sites and document types. The content indexer also allows you to crawl external websites using JSoup and CSS selectors. You then define field mappings that are obtained for each node and indexed in the Solr index.

Indexer configuration

Configure an indexer in Configuration > /modules/content-indexer/config/indexers. Example configurations for indexing a website and DAM assets are provided. Duplicate one of the examples to index another site or workspace.

Node name                          Value

modules
  content-indexer
    config
      indexers
        websiteIndexer
          fieldMappings
            abstract               abstract
            author                 author
            date                   date
            teaserAbstract         mgnlmeta_teaserAbstract
            text                   content
            title                  title
          enabled                  true
          indexed                  false
          pull                     false
          rootNode                 /
          type                     website
          workspace                website

Properties:

  • enabled (required): true enables the indexer configuration. false disables the indexer configuration.
  • indexed (required): Indicates whether indexing was done. When Solr finishes indexing, content-indexer sets this property to true. You can set it to false to trigger re-indexing.
  • nodeType (optional, default is mgnl:page): JCR node type to index. For example, if you are indexing assets in the Magnolia DAM, set this to mgnl:asset.
  • pull (optional, default is false, which means push): Pull URLs instead of pushing. When true, Solr uses Tika to extract information from a document, for instance a PDF. When false, the indexer pushes the collected information using a Solr document.
  • assetProviderId (optional, default is jcr): If pull is set to true, specify an assetProviderId to obtain an asset correctly.
  • rootNode (required): Node in the workspace where indexing starts. Use this property to limit indexing to a particular site branch.
  • type (required): Sets the type of the indexed content such as website or documents. When you search the index you can filter results by type.
  • workspace (required): Workspace to index.
  • fieldMappings (required): Field mappings define how fields in Magnolia content are mapped to Solr fields. Left side is Magnolia, right side is Solr.

      <Magnolia_field>    <Solr_field>

    You can use the fields available in the schema. If a field does not exist in Solr's schema, you can use a dynamic field mgnlmeta_*. For instance, if you have information nested in a deep leaf of your page and stored in the property specComponentAbstract, you can map this field to mgnlmeta_specComponentAbstract. The indexer contains a recursive call which explores the node's child leaves until it finds the property.
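The mapping and the recursive lookup described above can be sketched as follows. This is only an illustration under stated assumptions: nested Map objects stand in for JCR nodes, and the names FieldMappingSketch, findProperty and toSolrDocument are hypothetical, not part of the Content Indexer API. The traversal order used by the module itself may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldMappingSketch {

    /** Depth-first search for a property. Nested maps play the role of child nodes (assumption). */
    public static Object findProperty(Map<String, Object> node, String name) {
        Object direct = node.get(name);
        if (direct != null && !(direct instanceof Map)) {
            return direct; // property found on this node
        }
        for (Object child : node.values()) {
            if (child instanceof Map) {
                @SuppressWarnings("unchecked")
                Object found = findProperty((Map<String, Object>) child, name);
                if (found != null) {
                    return found;
                }
            }
        }
        return null; // not found on this node or anywhere below it
    }

    /** Applies Magnolia-to-Solr field mappings and returns Solr field names with their values. */
    public static Map<String, Object> toSolrDocument(Map<String, Object> page,
                                                     Map<String, String> fieldMappings) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : fieldMappings.entrySet()) {
            Object value = findProperty(page, mapping.getKey());
            if (value != null) {
                doc.put(mapping.getValue(), value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical page: the property lives two levels below the page node.
        Map<String, Object> component = new LinkedHashMap<>();
        component.put("specComponentAbstract", "Deeply nested abstract");
        Map<String, Object> area = new LinkedHashMap<>();
        area.put("component0", component);
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("title", "Home");
        page.put("main", area);

        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("title", "title");
        mappings.put("specComponentAbstract", "mgnlmeta_specComponentAbstract");

        System.out.println(toSolrDocument(page, mappings));
    }
}
```

Running main prints the resulting field map, in which the nested property ends up under the dynamic field name mgnlmeta_specComponentAbstract.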

IndexService

The indexer uses an IndexService to handle the indexing of a node. A basic implementation is configured by default: info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService. You can define and configure your own IndexService for specific needs.

Implement the IndexService interface:

IndexService
import javax.jcr.Node;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class I18nIndexerService implements info.magnolia.module.indexer.indexservices.IndexService {

   private static final Logger log = LoggerFactory.getLogger(I18nIndexerService.class);

   @Override
   public boolean index(Node node, IndexerConfig config) {
      ...
   }
}

Register the IndexService in the Content Indexer module configuration:

Node name                          Value

modules
  content-indexer
    config
      indexService
        class                      info.magnolia.search.solrsearchprovider.logic.indexer.BasicSolrIndexService

Crawling a website

The crawler mechanism uses the Scheduler to crawl a site periodically.

Example: Configuration to crawl bbc.com

Node name                          Value

bbc_com
  sites
    bbc
      url                          http://www.bbc.co.uk/
  fieldMappings
    abstract                       #story_continues_1
    keywords                       meta[name=keywords] attr(0,content)
  depth                            2
  enabled                          false
  nbrCrawlers                      2
  type                             news

Properties:

  • enabled (required): true enables the crawler. false disables the crawler. When a crawler is enabled, info.magnolia.module.indexer.CrawlerIndexerFactory automatically registers a new scheduler job for the crawler.
  • depth: The maximum depth of a page, measured as the distance in clicks from the root page. Keep this low, ideally 2 or 3 at most.
  • nbrCrawlers: The maximum number of simultaneous crawler threads that crawl a site. 2 or 3 is enough.
  • cron (optional, default is every hour: 0 0 0/1 1/1 * ? *): A CRON expression that specifies how often the site is crawled. CronMaker is a useful tool for building expressions.
  • type (optional): Sets the type of the crawled content such as news. When you search the index you can filter results by type.
  • sites (required): List of sites to crawl. For each crawler you can define multiple sites to crawl.
      • <site> (required): Name of the site.
          • url (required): URL of the site.
  • fieldMappings (required): Field mappings define how fields parsed from the site pages are mapped to Solr fields. Left side is the crawled site, right side is Solr.
      • <site_field> (required): You can use any CSS selector to target an element on the page. For example, #story_continues_1 targets an element by ID.

        You can also use custom syntax to get content inside attributes. For example, meta keywords are extracted using meta[name=keywords] attr(0,content). This extracts the first value of the keywords meta element. If you don't specify anything after the CSS selector, the text contained in the element is indexed. meta[name=keywords] alone would return an empty string because a meta element does not contain any text; the keywords are in the attributes. To get the value of a specific attribute, specify attr(<index>,<Solr_field_name>). If you set index=-1, all attributes are extracted and separated by a semicolon (;).
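The mapping syntax described above (a CSS selector, optionally followed by attr(<index>,<name>)) can be split apart with a small parser. The following is a sketch under assumptions, not the crawler's implementation: the class SelectorMappingSketch and its methods are hypothetical, and the pick helper reflects one possible reading of the index rule, in which the index selects one of the values found and -1 joins all of them with a semicolon.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectorMappingSketch {

    // "<css selector>" optionally followed by whitespace and "attr(<index>,<name>)".
    private static final Pattern MAPPING =
            Pattern.compile("^(.*?)(?:\\s+attr\\(\\s*(-?\\d+)\\s*,\\s*([^)\\s]+)\\s*\\))?$");

    public final String selector;   // CSS selector part
    public final Integer index;     // null when no attr(...) suffix is present
    public final String attribute;  // null when no attr(...) suffix is present

    private SelectorMappingSketch(String selector, Integer index, String attribute) {
        this.selector = selector;
        this.index = index;
        this.attribute = attribute;
    }

    /** Splits a mapping value into selector, index and attribute name. */
    public static SelectorMappingSketch parse(String mapping) {
        Matcher m = MAPPING.matcher(mapping.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported mapping: " + mapping);
        }
        Integer index = m.group(2) == null ? null : Integer.valueOf(m.group(2));
        return new SelectorMappingSketch(m.group(1), index, m.group(3));
    }

    /** Assumed index semantics: -1 joins all values with ';', otherwise pick one value. */
    public static String pick(List<String> values, int index) {
        if (index == -1) {
            return String.join(";", values);
        }
        return values.get(index);
    }

    public static void main(String[] args) {
        SelectorMappingSketch keywords = parse("meta[name=keywords] attr(0,content)");
        System.out.println(keywords.selector + " | " + keywords.index + " | " + keywords.attribute);

        SelectorMappingSketch story = parse("#story_continues_1");
        System.out.println(story.selector + " | " + story.index + " | " + story.attribute);

        System.out.println(pick(Arrays.asList("news", "world"), -1));
    }
}
```

A mapping without the attr(...) suffix yields a null index and attribute, which corresponds to indexing the text of the element.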

Providing a Solr search

The Solr Search Provider module contains templates to display search results on the site. It also provides faceted search components for refining the results further. The faceted search gets related facets from the search context. Suggestions and available fields are exposed in the FreeMarker context.

Configuring the Solr server base URL

Configure the Solr server address in Configuration > /modules/solr-search-provider/config/solrConfig@baseURL. See HttpSolrServer Javadoc for other properties.

Node name                          Value

solr-search-provider
  config
    solrConfig
      allowCompression             false
      baseURL                      http://localhost:8983/solr/
      connectionTimeout            100
      followRedirects              false
      maxConnectionsPerHost        100
      maxRetries                   0
      maxTotalConnections          100
      soTimeout                    1000
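For orientation, the following sketch shows how a raw query URL against the configured base URL might be assembled by hand, including a filter on the type field that indexers and crawlers set. It relies on Solr's standard select handler with the q and fq parameters. The class and method names are hypothetical, and it is an assumption that the default core answers at <baseURL>select. The Search Provider module itself is configured through the HttpSolrServer properties above and does not need hand-built URLs; the sketch is mainly useful for checking the server in a browser or with curl.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrlSketch {

    /** URL-encodes a parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /**
     * Builds a select URL with the query in q and, if a type is given,
     * a filter query fq=type:<type>.
     */
    public static String selectUrl(String baseUrl, String query, String type) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        StringBuilder url = new StringBuilder(base).append("select?q=").append(encode(query));
        if (type != null && !type.isEmpty()) {
            url.append("&fq=").append(encode("type:" + type));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/select?q=magnolia+cms&fq=type%3Awebsite
        System.out.println(selectUrl("http://localhost:8983/solr/", "magnolia cms", "website"));
    }
}
```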

Creating a search results page

Create a search results page using the solrSearchResult page template. To try it in the demo-project, edit the home page properties and select your Solr results page in the Search Page field.

URL domain filtering

You can filter results by URL domain in the Search Results component dialog.

Field boosting for relevance

The example query title^100 abstract^0.1 will boost the rank for matches in the title field 1000 times more than equivalent matches in the abstract.

The query will give the following results:

If instead you boost the abstract over the title you would get the following results for the same search. The returned snippets are now primarily from page titles.
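The factor of 1000 in the title^100 abstract^0.1 example follows directly from the two boost values: 100 / 0.1 = 1000. The sketch below parses such a boost string and computes the relative weight of one field against another. The class BoostSketch and its methods are hypothetical helpers for illustration. Treating a field without an explicit boost as 1.0 is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostSketch {

    /** Parses a boost string such as "title^100 abstract^0.1" into field name -> boost. */
    public static Map<String, Double> parseBoosts(String boosts) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String token : boosts.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input
            }
            int caret = token.indexOf('^');
            if (caret < 0) {
                result.put(token, 1.0); // no explicit boost: treated as 1.0 (assumption)
            } else {
                result.put(token.substring(0, caret),
                        Double.parseDouble(token.substring(caret + 1)));
            }
        }
        return result;
    }

    /** How many times more weight the first field carries compared to the second. */
    public static double relativeBoost(Map<String, Double> boosts, String field, String other) {
        return boosts.get(field) / boosts.get(other);
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = parseBoosts("title^100 abstract^0.1");
        System.out.println(boosts);
        System.out.println(relativeBoost(boosts, "title", "abstract"));
    }
}
```

Swapping the two values, as in title^0.1 abstract^100, gives the inverse ratio, so matches in the abstract then outweigh matches in the title by the same factor.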

Filtering search results

Positive filtering: Return only results where the keyword conference is present.

Negative filtering: Don't return results where the keyword conference is present.

You can add more filters by separating them by spaces.

Other features

  • Pagination
  • Faceting on all fields
  • Ranged faceting
  • Similar search
  • Localized search
  • Suggestions