This course is currently in draft.

Introduction

This course is for solution architects and frontend developers and provides an introduction to designing and implementing an enterprise search using Funnelback.

Special tool tips or hints will appear throughout the exercises as well, which will provide extra knowledge or tips for completing the exercise. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Overview of enterprise search

  • Securing your search

  • Indexing other file formats

  • Indexing other data source types

  • Implementing custom DLS

  • Troubleshooting enterprise search

Prerequisites to completing the course:

Students are expected to have completed the following courses:

  • FUNL201, FUNL202, FUNL203

What you will need before you begin:

  • Access to the Funnelback training VM for Funnelback that has been setup for this course.

  • Internet access.

1. Enterprise search overview

An enterprise search typically aims to provide a single search interface across an organisation’s internal data repositories.

This in essence is a fairly straightforward problem - but the solution can be very complicated.

1.1. How is an enterprise search structured?

A typical enterprise search consists of a number of collections that correspond to the different data repositories. Each collection handles how to connect to and retrieve content from the respective repository.

These collections are tied together with one or more meta collections which are configured to provide a search interface and search over one or more of the content repository collections.

1.2. What makes enterprise search difficult?

There are a number of factors common to enterprise search that can vastly increase the solution’s complexity and reduce its chance of success.

  • High levels of system integration. Search covers numerous repositories of different types, requiring a lot of integration points with other systems. Integration can go in both directions - Funnelback integrates with other systems to index their content, and other systems may integrate with Funnelback to nest search results within their applications or to submit content directly to the Funnelback push API. High levels of integration translate to many possible points of failure.

  • Data repositories are often very large. This affects the resourcing required for any setup and usually demands a multi-server environment. It also makes debugging a lot more difficult due to the size of log files.

  • Authentication and filtering of results based on security permissions is a common requirement. This vastly increases complexity of the search, the amount of work to process a query and also places some limitations on the search.

  • Internal systems are often personalised. Personalised content presents a unique problem - Funnelback is only able to index the view of content that is returned to it and cannot account for what different users will see when accessing the resource.

  • Internal network policies often hinder access to data. It is common for internal security configuration to present barriers to accessing data. This can also impose restrictions on where the Funnelback server is located and how it is configured.

  • Some of the repository types require the Windows version of Funnelback. Awareness of the dependencies is crucial when designing the solution.

Review questions: Enterprise search overview
  • What are some of the key challenges that you’ll face when implementing enterprise search?

  • What challenges does personalisation present when indexing data repositories?

  • Why is enterprise search often very slow?

2. Securing your search

Funnelback provides a few different options for securing the search results pages.

2.1. Authenticated access to the search results

2.1.1. SAML

Funnelback supports the use of SAML authentication using a single identity provider for both administration and search results interfaces. The configuration is separate for administration and the search results interface.

SAML can be used for authentication but not authorisation as there is no access to the group membership or user attributes of the users from the search data model.

2.1.2. Windows integrated authentication

The Windows version of Funnelback supports single sign-on from Active Directory using Windows integrated authentication for the search results interface. Note: AD logons are not supported for accessing the administration interface.

Windows Integrated Authentication supports reading the user attributes (such as group membership) to infer the user’s permissions and alter the DLS search accordingly. The principal node in the data model will contain the user’s attributes from Active Directory.

Configuration of single sign-on is required for searches that provide DLS-enabled fileshare or TRIM collections.

2.1.3. Using a CMS to authenticate users

If the Funnelback search page is accessed via a CMS (i.e. partial HTML integration) then this can be used to authenticate users.

Collection level security that restricts access to the CMS server’s IP address should be configured for this use case as it prevents a user from bypassing the authentication to access the search directly.

If document level security is configured on any of the collections then the CMS will need to pass through the relevant user details for DLS to work. Note: filecopy and TRIM collections do not support this method of integration.

2.2. Restricting access via IP address

Funnelback provides a mechanism known as collection level security that restricts the serving of search results to specified IP addresses.

Collection level security is useful for restricting a Funnelback search results page to specific IPs when the search is hosted on a publicly accessible server, or to restrict search results so that they must be served via a CMS integration.
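
As a sketch, assuming the access_restriction collection.cfg setting is used for this (check the documentation for your Funnelback version for the exact option name and value syntax), the configuration might look like:

    # Sketch only - hypothetical values: allow the CMS server and an internal address range to request results
    access_restriction=192.168.10.50,10.0.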

2.3. Document level security

In a nutshell, document level security (DLS) allows search results to be personalised so that the results returned to the current user do not include any items that they do not have permission to view.

2.3.1. Search complexity

Document level security adds a lot of complexity to a search as it requires:

  • the user details to form part of the query. For some repositories it requires the search to impersonate (or run as) the user.

  • the search to obtain the current user’s set of keys (the security groups/permissions to which the user has access).

  • custom code to run at gather time which looks up the document’s locks (the security permissions applied to the document) and adds this information to the index

  • custom code run at query time that determines if a user’s set of keys will grant it access to the document’s locks (called the locks-to-keys matcher)

The locks-to-keys matcher is specific to the system that is being accessed. It is possible to write your own locks to keys matcher that implements the matching algorithm. Funnelback has a generic portal locks-keys matcher that can be used as the starting point for a custom matcher.

2.3.2. Security model

Funnelback employs a form of document level security where the document’s permissions are indexed along with the content. This is known as early-binding security.

The implications of using an early-binding security model are:

  • The security applied to the document is point-in-time and reflects the permissions that were on the document when it was gathered. If the document’s permissions change between when the crawl occurred and when the index is queried, a user may see a result that they do not have access to (or not see a result that they do have access to). Updating these permissions requires the document to be re-gathered and indexed. Changes to a user’s permissions, however, are reflected almost immediately (depending on the key cache time that was set).

  • If a user does click on a document that they do not have access to, the underlying system housing the document will deny access - this means that the impact of a change of permissions is limited to what gets displayed in the search result listing (such as the title and summary snippet).

2.3.3. Translucent document level security

Translucent DLS is a variation of document level security where a user will see entries for every result that matches their query, but items that they are not permitted to see have identifying features suppressed. (e.g. result snippet, title, author)

The search administrator can configure what document metadata is displayed when a user doesn’t have permission to view the document.

2.3.4. Performance implications

Document level security has significant performance implications for search. For early binding security a lookup to fetch the user’s keys needs to be performed against each system that has document level security enabled. Each lookup will add to the search response time as Funnelback has to wait for the answers before it can start processing the query.

Funnelback supports key caching to improve performance. When enabled, a user’s keys will be cached for a set period of time by Funnelback. The cache can have a similar effect on the results as early-binding security: the user’s actual keys could change in the time between when the key was cached and when the search is made.

2.3.5. Supported collections

Funnelback has official document level security support for:

  • Windows NTFS/CIFS fileshares (requires the Windows version of Funnelback)

  • Squiz Matrix. Note: Squiz Matrix collections with DLS cannot currently be combined with DLS-enabled fileshares or TRIM, because Matrix collections are integrated using a REST asset while fileshares/TRIM require the user to connect directly to Funnelback. If there is a requirement to combine these then a locks-to-keys matcher will need to be implemented - please contact support@funnelback.com well in advance to discuss your options.

  • TRIM 6-7, HP RM8, HP CM9 (requires the Windows version of Funnelback)

Custom repositories can be supported by implementing the locks fetcher, keys fetcher and locks-keys matcher components.

The following Funnelback features do not support DLS:

  • Auto-completion. The completion system will return words from all the documents regardless of their permissions. As it could potentially be used to leak protected document content, disabling auto-completion may be desirable.

  • Knowledge graph

Cached pages are disabled automatically when document level security is enabled.

2.3.6. Implementing DLS support for a custom repository

Implementation of DLS for a system requires several custom components to be developed.

For DLS to work Funnelback must be able to:

  • access a URL’s access controls and associate these with the URL as security metadata. This is known as the set of document locks.

  • access the user’s set of permissions at request time. This is known as the set of user keys.

  • check the user’s keys against each URL’s locks and determine if the user’s permissions grant access to the document based on the system’s security model.

Each of these components is specific to the system that is being indexed and the method of obtaining each of these will be system dependent.

Obtaining a document’s set of locks

There are various techniques that can be used to expose the document’s set of locks to Funnelback. For example:

  • Return the set of document locks as an HTTP header when a specific user requests an item from the system.

  • Include the set of document locks in an XML or metadata field returned when an item is fetched.

  • Provide an API or web service that returns a set of document locks given a URI.

  • Provide a feed or external metadata that lists the document locks by URL.

Obtaining a user’s set of keys

There are various techniques that can be used by Funnelback to obtain the user’s set of keys. Functionality is required within both Funnelback (via a user to key mapper) and the target system (to provide the actual keys upon request).

For example a target system might be able to provide user keys:

  • via an API or web service that returns a set of user keys given a user ID.

  • as a CGI or request parameter when a search request is made (for example if a user is authenticated by a CMS and the CMS hosts the search page, then the CMS can pass additional information about the user to Funnelback when a search is made from the CMS).

  • using a custom plugin that will need to be written for the system.

  • by having Funnelback impersonate the user and call an API within the remote system that returns the user’s own permissions.

Funnelback’s user to key mapper then implements whatever is required to fetch the key from the service and provide the keys to the query processor.

Performing locks to keys matching

A locks to keys matcher takes a set of user keys, a set of document locks and a URI, and calculates if the user has access to the document.

The matcher must implement whatever logic is required by the system’s security model and returns a yes or no given the supplied parameters.

The authoring of a locks to keys matcher requires software development skills and knowledge of the C programming language.

Review questions: Security
  • Discuss some of the options available for securing a search. What are the constraints for each option?

  • Does Funnelback include support for OAuth and SAML? Are there any limitations?

  • Explain the concept of document level security.

  • Explain the concept of collection level security.

  • When is it appropriate to use DLS? And what are the costs?

  • How would you use collection level security to secure an intranet search?

  • What are the major components required when adding DLS support to a repository?

  • What is the benefit from increasing the length of time that a user key is cached? What is the trade-off?

  • What happens if a user’s permissions change and are different to what is recorded in the index?

  • Which common Funnelback features do not support DLS?

3. Configuring Funnelback to index additional file types

Out of the box, Funnelback supports the indexing of HTML, Microsoft Office (Word/Excel/PowerPoint), RTF and text documents.

The binary formats are converted to text using Apache Tika - which supports a large number of document formats.

Additional formats can easily be added to the file types indexed by Funnelback as long as Tika can process the file format.

When considering additional formats to index remember that Funnelback can only index text that is extracted from the file which limits the useful set of file formats to add to the search. For many file formats document metadata will be the only useful text that can be extracted.

Tika’s list of supported formats changes regularly so be sure to check the correct Tika version for the version of Funnelback that will be installed.

The first approach discussed below does not convert the documents to text but uses metadata to describe the binary documents. The other approaches use Tika and external filtering in order to extract text from the binary documents.

3.1. Indexing non-textual files

For Funnelback to successfully index a document it needs to have a textual representation of the document. For text-based documents such as PDFs or Microsoft Word documents filtering is used to extract the text contained within the document and this is what Funnelback indexes.

For other file types, such as multimedia files (e.g. images, movies, sound files), filtering will only extract any metadata that has been embedded within the file. This is normally not very useful as the embedded metadata usually describes attributes of the file such as the bit rate, duration or the camera used to take a photo.

The best approach for indexing non-textual files is to index text that has been written to describe the files and associate it with the file’s URL. For a sound file or movie, index a transcript or write descriptive metadata such as a title, description and keywords, which can then be used as the text that describes the file.

When you use this approach to index non-textual documents the actual files themselves do not need to be downloaded by Funnelback (e.g. if you’re doing a web crawl these can be in the exclude list). This is because the files themselves are not indexed - Funnelback is using the XML record to index the file and then attaching the file’s URL to the search result.
Tutorial: Index non-textual files using an XML file listing

Consider a site that has three non-text files:

  • An image (shakespeare.jpg) containing a picture of William Shakespeare.

  • A sound file (hamlet.mp3) containing a radio performance of Hamlet.

  • A video file (lear.mov) of a performance of King Lear.

This tutorial shows how to use a simple XML file to describe a number of non-text files to add them to the search index.

  1. Produce an XML file containing all the useful fielded information describing your files. This could be produced manually, or automatically generated from metadata/database information. e.g.

    <?xml version="1.0" encoding="UTF-8" ?>
    <files>
    	<file>
    		<title><![CDATA[Chandos portrait]]></title>
    		<uri>http://shakespeare.example.com/images/shakespeare.jpg</uri>
    		<description><![CDATA[The Chandos portrait is the most famous of the portraits that may depict William Shakespeare. Painted between 1600 and 1610, it may have
    served as the basis for the engraved portrait of Shakespeare used in the First Folio in 1623. It is named after the Dukes of Chandos, who formerly owned the
    painting.]]></description>
    		<author><![CDATA[John Taylor]]></author>
    		<date>1610</date>
    		<location><![CDATA[National Portrait Gallery, London]]></location>
    		<keywords>
    			<keyword><![CDATA[John Taylor]]></keyword>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    			<keyword><![CDATA[painting]]></keyword>
    		</keywords>
    		<filetype>Image</filetype>
    		<filesize>820kB</filesize>
    		<format>jpg</format>
    	</file>
    	<file>
    		<title><![CDATA[Hamlet]]></title>
    		<uri>http://shakespeare.example.com/radio/hamlet.mp3</uri>
    		<description><![CDATA[A full-text radio production of the play, co-produced by the BBC and the Renaissance Theatre Company. Features Kenneth Branagh as Hamlet,
    Derek Jacobi and Claudius, Judi Dench as Gertrude, and John Gielgud as the Ghost.]]></description>
    		<author><![CDATA[William Shakespeare]]></author>
    		<author><![CDATA[Kenneth Branagh]]></author>
    		<author><![CDATA[Derek Jacobi]]></author>
    		<author><![CDATA[Judi Dench]]></author>
    		<author><![CDATA[Renaissance Theatre Company]]></author>
    		<author><![CDATA[British Broadcasting Corporation]]></author>
    		<date>1992</date>
    		<keywords>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    			<keyword><![CDATA[audio]]></keyword>
    			<keyword><![CDATA[radio]]></keyword>
    			<keyword><![CDATA[BBC Radio 3]]></keyword>
    		</keywords>
    		<duration>235</duration>
    		<duration_units>min</duration_units>
    		<filetype>Sound recording</filetype>
    		<filesize>399.7MB</filesize>
    		<format>mp3</format>
    	</file>
    	<file>
    		<title><![CDATA[King Lear]]></title>
    		<uri>http://shakespeare.example.com/video/lear.mov</uri>
    		<description><![CDATA[King Lear is a 2018 British-American television film directed by Richard Eyre. An adaptation of the play of the same name by William
    Shakespeare, cut to just 115 minutes, was broadcast on BBC Two on 28 May 2018.]]></description>
    		<author><![CDATA[William Shakespeare]]></author>
    		<author><![CDATA[Richard Eyre]]></author>
    		<author><![CDATA[Jim Broadbent]]></author>
    		<author><![CDATA[Jim Carter]]></author>
    		<author><![CDATA[Tobias Menzies]]></author>
    		<author><![CDATA[Emily Watson]]></author>
    		<author><![CDATA[John Macmillan]]></author>
    		<author><![CDATA[Florence Pugh]]></author>
    		<author><![CDATA[Emma Thompson]]></author>
    		<author><![CDATA[Anthony Calf]]></author>
    		<author><![CDATA[Anthony Hopkins]]></author>
    		<author><![CDATA[Simon Manyonda]]></author>
    		<author><![CDATA[Chukwudi Iwuji]]></author>
    		<author><![CDATA[Karl Johnson]]></author>
    		<author><![CDATA[Samuel Valentine]]></author>
    		<author><![CDATA[Andrew Scott]]></author>
    		<author><![CDATA[Christopher Eccleston]]></author>
    		<date>2018</date>
    		<keywords>
    			<keyword><![CDATA[William Shakespeare]]></keyword>
    		</keywords>
    		<duration>115</duration>
    		<duration_units>min</duration_units>
    		<filetype>Video recording</filetype>
    		<filesize>4.8GB</filesize>
    		<format>mov</format>
    	</file>
    </files>
  2. Make the XML available at a web accessible address. (e.g. http://shakespeare.example.com/files.xml)

  3. Ensure that the XML file is included in your search. e.g. for a web collection you could add the XML’s URL to your start URLs.

  4. Update your search collection.

  5. Set the following XML processing options (Note the paths here are specific to the example XML above). This will split the XML document into multiple records, and assign the URL and filetype based on the contents of specified fields in the XML.

    • XML document splitting: /files/file

    • Document URL: /files/file/uri

    • Document filetype: /files/file/format

  6. Create metadata mappings for all of the fields that you wish to include in the index. e.g.

    • t: //title

    • author: //author

  7. Re-index the live view to incorporate the metadata.

  8. At this point you should see the additional results appearing in your search results. You will need to modify your template to display the result appropriately.

3.2. Add an additional filetype that is supported by Tika

The steps for adding additional filetypes vary depending on the collection type being used.

Tutorial: Add additional tika-supported filetypes to web collections

The following tutorial outlines the collection configuration settings that need to be set to add additional file types to a web collection. A combined example follows the steps.

  1. Ensure the filetype extension is not present in the crawler.reject_files collection configuration setting. The default value is:

    crawler.reject_files=Z,asc,asf,asx,avi,bat,bib,bin,bmp,bz2,c,class,cpp,css,deb,dll,dmg,dvi,exe,fits,fts,gif,gz,h,ico,jar,java,jpeg,jpg,lzh,man,mid,mov,mp3,mp4,mpeg,mpg,o,old,pgp,png,ppm,qt,ra,ram,rpm,svg,swf,tar,tcl,tex,tgz,tif,tiff,vob,wav,wmv,wrl,xpm,zip
  2. Set the parser mime types. If you wish links to be extracted (for crawl purposes) from the document then ensure the mime type is listed in the crawler.parser.mimeTypes list. note: only text documents should be listed here. The default value is:

    crawler.parser.mimeTypes=text/html,text/plain,text/xml,application/xhtml+xml,application/rss+xml,application/atom+xml,application/json,application/rdf+xml,application/xml
  3. Configure the non html files list. Add the file extension of the new filetype to the crawler.non_html list. The default value in collection.cfg is:

    crawler.non_html=pdf,doc,ps,ppt,xls,rtf
  4. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value in collection.cfg is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  5. Run a full update of the collection.
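
Putting the steps above together: as a worked sketch, adding EPUB files to a web collection only requires a change to crawler.non_html, because epub is not in the default crawler.reject_files list and is already present in the default filter.tika.types list.

    crawler.non_html=pdf,doc,ps,ppt,xls,rtf,epub
    # epub is already present in the default list below, so no change is needed
    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm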

Tutorial: Add additional tika-supported filetypes to filecopy collections

The following tutorial outlines the collection configuration settings that need to be set to add additional file types to a filecopy collection.

  1. Add the file extension of the new filetype to the filecopy.filetypes list. The default value is:

    filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx
  2. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  3. Run a full update of the collection.

Tutorial: Add additional tika-supported filetypes to trimpush collections

The following tutorial outlines the collection configuration settings that need to be set to add additional file types to a trimpush collection.

  1. Add the file extension of the new filetype to the trim.extracted_file_types list. The default value is:

    trim.extracted_file_types=*,doc,docx,pdf,ppt,pptx,rtf,xls,xlsx,txt,htm,html,jpg,gif,tif,vmbx
  2. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  3. Run a full update of the collection.

Tutorial: Add additional tika-supported filetypes to other collection types

The following tutorial outlines the collection configuration settings that need to be set to add additional file types to other collection types.

This applies to all other collection types except local collections, which do not filter binary documents.
  1. Set the Tika processed file types. Check that the file extension is listed in the filter.tika.types list. The default value is:

    filter.tika.types=doc,dot,ppt,xls,rtf,docx,pptx,xlsx,xlsm,pdf,png,gif,jpg,jpeg,tif,tiff,epub,vsd,msg,odt,odp,ods,odg,docm
  2. Run a full update of the collection.

3.3. Add an additional filetype using an external converter

An external converter can also be used for conversion of binary documents. The external conversion program must be able to run on the command line, accept the input document and return the extracted text.

Tutorial: Add an additional filetype using an external converter
The use of external converters is generally discouraged as there may be a significant impact on performance as a separate system process is run for each document that is being filtered.
  1. Install binaries. Ensure any extra binaries are installed onto the Funnelback server and made executable by the search user (or relevant Windows user account used to run updates).

  2. Add any new binaries to executables.cfg and create a textify.cfg containing extension to command mappings.

  3. Ensure that ExternalFilterProvider is included in the filter chain for the collection. The default value is:

    filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider
  4. Ensure that the filetype is added to the acceptable files for the collection using the Tika instructions above (web: crawler.non_html and optionally crawler.parser.mimeTypes; filecopy: filecopy.filetypes; TRIM/HP RM: trim.extracted_file_types)

  5. If the external filter is overriding Tika then ensure that the file extension is removed from filter.tika.types.

  6. Run a full update of the collection.

4. Configuring collections for different data sources

The following guides present examples of configuration for different repository types.

4.1. Intranet websites with no document level security

This guide relates to internal websites where the content is not restricted in any way based on user logon. The site is either unauthenticated, or a logged in user has access to everything.

This includes internal websites delivered using various content management systems such as Microsoft Sharepoint or IBM WebSphere if there is no document level security.

Before crawling any intranet ensure that there are no links that perform destructive operations via GET requests, as the web crawler will follow every link it finds on the site. E.g. ensure that no page contains links that will delete content without requiring the action to be submitted via a POST request or using JavaScript.
Before you start it is also a good idea to check that the internal site does not have a dependence on JavaScript (for example it is a single page app, has JavaScript-generated menus, or all the content is generated via JavaScript or pulled in via AJAX requests).

4.1.1. Unauthenticated intranets

An unauthenticated intranet can be crawled in the same way as any public website.

4.1.2. Configuring an authenticated web crawl

Most intranets will require some form of authentication in order for a user to access the content.

When configuring a user it is best practice to use an account that has been created for Funnelback, that only has read access to the portions of the website that you wish to expose in the search results.
Configure HTTP Basic or NTLM authentication

Basic HTTP authentication is configured using the following collection configuration settings:

  • http_user: set this to the valid username for the website

  • http_passwd: set this to the password

NTLM authentication is configured using the following collection configuration settings

  • crawler.ntlm.domain: set this to a valid NTLM domain

  • crawler.ntlm.username: set this to the valid username in the NTLM domain

  • crawler.ntlm.password: set this to the password
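
For example, a sketch of the relevant collection.cfg entries (the account names, domain and passwords below are placeholders only):

    # Basic HTTP authentication
    http_user=svc-funnelback
    http_passwd=examplePassword

    # ...or NTLM authentication
    crawler.ntlm.domain=EXAMPLEDOMAIN
    crawler.ntlm.username=svc-funnelback
    crawler.ntlm.password=examplePassword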

Configure form-based authentication

Form-based authentication allows crawling of the website by configuring Funnelback to automatically fill in a login form using Funnelback’s form interaction feature. This method can often be used to crawl websites that are protected by SAML authentication.

Debugging form-based authentication

Funnelback provides a debug API that can be used to assist with debugging of form-based authentication. See: the enterprise search debugging section (below) for more information.

4.1.3. Using a proxy

Note: Funnelback currently only supports basic HTTP authentication when authenticating with the proxy server.

The Funnelback web crawler can be configured to connect using a proxy server. This is sometimes required to crawl within an organisation’s network.

Four collection configuration options configure the proxy:

  • http_proxy: The hostname (e.g. proxy.company.com) of the HTTP proxy to use during crawling. This hostname should not be prefixed with 'http://'.

  • http_proxy_passwd: The proxy password to be used during crawling.

  • http_proxy_port: Port of HTTP proxy used during crawling.

  • http_proxy_user: The proxy user name to be used during crawling.
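
As a sketch, the corresponding collection.cfg entries might look like the following (hostname, port and credentials are placeholders):

    http_proxy=proxy.example.com
    http_proxy_port=8080
    http_proxy_user=proxyuser
    http_proxy_passwd=examplePassword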

4.1.4. Crawling personalised websites

Care needs to be taken when crawling a personalised website.

When Funnelback crawls a personalised website it will index the view of the website that is returned to the web crawler - this means that special care needs to be taken to ensure that the crawled version of the site is relevant for the users of the search.

The best technique for crawling a personalised site is to ensure that any personalisation is suppressed for the user that Funnelback uses to crawl the site. Personalised content should either be removed within the CMS before it is returned to Funnelback or should be wrapped in noindex tags so that these regions of the page are not indexed.

4.2. Fileshares

Funnelback supports the indexing of local or network connected fileshares via the SMB or CIFS protocol using a filecopy collection.

The Windows version of Funnelback includes support for NTFS/CIFS fileshares with document level security.

The file copier gathers content by accessing each source path then traversing the directory structure. FTP and WebDAV repositories can usually be crawled using a web collection.

4.2.1. Configuring a filecopy collection

Setup of a filecopy collection is similar to a web collection.

It requires:

  • one or more source directories. Source directories are either local paths, UNC paths or Samba URLs.

  • include and exclude rules

Funnelback includes some indexer security settings that prevent the indexing of file paths that exist beneath the $SEARCH_HOME path. This setting can be disabled by setting the -check_url_exclusion indexer option to off.
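
As a sketch, a minimal filecopy collection configuration might contain entries like the following (the Samba URL is a placeholder, and filecopy.filetypes shows the default list). If the source path sits beneath the Funnelback install path, -check_url_exclusion=off also needs to be added to the indexer options, as covered in the exercise below.

    filecopy.source=smb://fileserver.example.com/shared/documents/
    filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx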

4.2.2. Configuring fileshare DLS

Some additional settings need to be configured in order to collect and index the document locks.

4.2.3. Local collections

The use of local collections is discouraged as various Funnelback features are disabled when a local collection is used. Existing local collections should be replaced with an alternate collection type such as a filecopy, web or custom collection (dependent on what the local collection is indexing).

Local collections produce an index of files located at a local file path. The files are indexed in place without any modification.

This can save on disk space but because the files are indexed in place filtering is disabled for the collection.

Local collections do not support DLS.

A filecopy collection provides similar functionality to a local collection but copies and filters the content meaning that you receive the full benefits available when creating an index.

4.2.4. Limitations and gotchas

  • DLS support is limited to Windows fileshares and only available when using the Windows version of Funnelback with NTFS fileshares.

  • Windows share permissions are not supported - when indexing Windows fileshares only the document’s permissions are indexed.

  • The crawl must run as a user account with read access to all files that should be in the search index

  • For very large fileshares it is sometimes necessary to break up into multiple collections

  • All of the fileshares in the seed list must be available or the crawl will fail

  • Large fileshare collections will need a large heap.

  • Initial fileshare crawls are slow due to the fact that the majority of documents must be filtered.

  • The search must be accessed directly (not via a CMS) when querying DLS-enabled fileshares.

  • Filtering is disabled for local collections meaning that these collections can only include text-based content.

Exercise 1: Index a fileshare

In this exercise you will create an index of a small fileshare containing a number of e-books in various formats.

The fileshare consists of a small set of files organised in the following structure:

/opt/funnelback/share/training/build/resources/main/training-resources/training-data/e-library/Shakespeare-Complete-Works.pdf
/opt/funnelback/share/training/build/resources/main/training-resources/training-data/e-library/children/grimmscompletefa00grim.pdf
/opt/funnelback/share/training/build/resources/main/training-resources/training-data/e-library/children/pg16-images.epub
/opt/funnelback/share/training/build/resources/main/training-resources/training-data/e-library/children/pg19033-images.epub
/opt/funnelback/share/training/build/resources/main/training-resources/training-data/e-library/crime/pg2852-images.epub
/opt/funnelback/share/training/build/resources/main/training-resources/training-data/e-library/crime/Selections_from_The_Improbable_Adventures_of_Sherlock_Holmes.rtf
  1. Create a new collection with the following details

    • Project group ID: Training 204

    • Collection ID: elibrary

    • Collection type: filecopy

      exercise index a fileshare 01
    • service_name: e-library

    • filecopy.source: $SEARCH_HOME/share/training/build/resources/main/training-resources/training-data/e-library/

    • Other fields - leave as their default values

      exercise index a fileshare 02
  2. Update the collection then view the search results.

  3. Observe that the collection update completes successfully but there are no items in the index. Inspect the log files (remember to view the live log files because the update was successful) and observe in the Step-Index.log that each item is flagged with an [Excluded by pattern] message. This indicates that Funnelback excluded the file because the URL matched an exclusion rule. The exclusion rule referred to here is one that excludes any URL that contains Funnelback’s install path - this rule is designed to prevent Funnelback from indexing internal configuration. This check can be disabled by setting an indexer option. Note: this is not a problem for most filecopy collections, which will generally index a network fileshare, but it is worth being aware of in case you index files beneath Funnelback’s install folder.

  4. Add the following to the indexer_options to disable the path check, then reindex the collection: -check_url_exclusion=off

    exercise index a fileshare 03
  5. Observe that there are now 4 items in the search results. This is less than expected and looking at the results shows that the epub files are missing from the search results.

    exercise index a fileshare 04
  6. Configure the collection to include the epub filetype. Add the file extension of the new filetype to the filecopy.filetypes list in the collection configuration:

    filecopy.filetypes=doc,docx,rtf,pdf,html,xls,xlsx,txt,htm,ppt,pptx,epub
    exercise index a fileshare 05
  7. Run a full update of the collection and then search for !showall. Observe that the missing epub files are now returned as expected.

    exercise index a fileshare 06

4.3. Indexing SQL database content

Funnelback has the ability to connect to any SQL DBMS (with a valid JDBC driver) and index the output from running a valid SQL query within that database.

Each row from the result of the query is indexed as a separate document within Funnelback with fields mapped to Funnelback metadata classes.

Funnelback indexes the output of an SQL query. It doesn’t preserve or index the database structure or relationships within the index and is not a drop-in replacement for SQL database queries.

4.3.1. Supported databases

Funnelback ships with built-in support for PostgreSQL. Indexing of other database systems requires installation of an appropriate JDBC driver.

Funnelback has also been used successfully to index data from:

  • IBM DB2

  • Microsoft SQL server

  • MySQL

  • Oracle

  • SQLite

4.3.2. Limitations and gotchas

  • Funnelback query language does not support table joins or grouping at query time.

  • Best practice is to have the database owner create a view containing all the relevant fields in a de-normalised form so that Funnelback can run a select * from view SQL query against the table view. This simplifies the Funnelback configuration and also makes it clearer within the DBMS what will be available within Funnelback.

  • Each row in the resulting table is indexed as an item or document in Funnelback.

  • Records are downloaded and stored internally as XML.

  • Fields within the indexed view are mapped to metadata classes within Funnelback.

  • There is no built-in DLS support for database record access

4.3.3. Installing a JDBC driver

Installation of JDBC drivers requires back-end server access to Funnelback and involves copying the JDBC driver .jar files into a specified location on the server ($SEARCH_HOME/lib/java/dbgather/).

Before installing a JDBC driver ensure that the installed version of Java used by Funnelback is compatible with the driver, and also ensure that no existing drivers are installed for the database system to avoid any Java class conflicts. E.g. there are several different versions of the Microsoft SQL Server JDBC driver available - only one of these can be installed into Funnelback otherwise there will be class conflicts when the driver is loaded.
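
For example, to index a Microsoft SQL Server database you would copy the vendor’s JDBC driver .jar into $SEARCH_HOME/lib/java/dbgather/ and then configure the collection along the following lines. The driver class and URL format are those published for the Microsoft JDBC driver; the hostname, database, view name and credentials are placeholders.

    db.jdbc_class=com.microsoft.sqlserver.jdbc.SQLServerDriver
    db.jdbc_url=jdbc:sqlserver://dbserver.example.com:1433;databaseName=intranet
    db.username=funnelback_read
    db.password=examplePassword
    db.full_sql_query=select * from funnelback_search_view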

Exercise 2: Index an SQL database

This exercise covers the steps required to index an SQL database. For this exercise we will index an SQLite database using the SQLite JDBC driver included with Funnelback.

Further information about the Chinook database that is used in this exercise (including schema information): http://www.sqlitetutorial.net/sqlite-sample-database/

  1. Create a new database collection:

    • Project group ID: Training 204

    • Collection ID: chinook

    • Collection type: database

      exercise index an sql database 01
    • service_name: Chinook database

    • db.jdbc_class: set this to the appropriate JDBC driver class string. Set this to: org.sqlite.JDBC

    • db.jdbc_url: jdbc:sqlite:$SEARCH_HOME/share/training/build/resources/main/training-resources/training-data/chinook/chinook.db

    • db.username and db.password: leave these as is

    • db.full_sql_query: select * from albums

    • db.primary_id_column: Albumid

      exercise index an sql database 02
  2. Click the create button to create the database collection.

  3. Update the collection.

  4. Run a query for !showall and confirm that results are being returned.

    exercise index an sql database 03
  5. View the cached copy of For Those About To Rock We Salute You and observe the XML record that is stored by Funnelback

    exercise index an sql database 04

The cached XML record highlights the requirement for Funnelback to be provided with denormalised data - the XML record includes only an ID for the album artist rather than the artist’s name. For useful search results to be displayed, the artist IDs need to be expanded to artist names. For a traditional database query this join would be performed at query time.

With Funnelback you would need to update the query that Funnelback runs to perform any table joins, or ideally have the database owner provide you with a view tailored for Funnelback to query.
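
As a sketch (based on the standard Chinook schema, where the albums table carries an ArtistId referencing the artists table), such a denormalised query might look something like the following - try building and templating this yourself in the extended exercises below.

    db.full_sql_query=select albums.AlbumId, albums.Title, artists.Name as ArtistName from albums join artists on albums.ArtistId = artists.ArtistId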

Extended exercises: database collections
  1. Map fields from the results table to metadata fields using the metadata mappings configuration in the administration interface. Note: DB fields will appear as available XML field mappings. Configure the Freemarker template to display the mapped fields.

  2. Modify the database SQL query to perform a join with the artists table so that you can present the artist name in the search results.

4.4. Directories (LDAP)

Directory collections index the result of a query to a directory service such as LDAP or Microsoft Active Directory.

Directory collections are often used to provide staff directory search, though they are not limited to this - search is only limited by the types of objects available within the directory, and could be used to provide search over other objects such as rooms or assets.

4.4.1. Limitations and gotchas

  • Uses the Java Naming and Directory Interface (JNDI) to connect to the directory service.

  • Requires a user with permission to read the directory.

  • Requires a well-structured directory to work effectively, otherwise lots of manual exclusions will be required to clean the index.

  • There is no built-in DLS support for access to directory collections

  • Records are downloaded and stored internally as XML with fields mapped to Funnelback metadata classes.

4.5. Social media channels

Funnelback includes built-in support for the following social media channels:

  • YouTube

  • Facebook

  • Twitter

  • Flickr

The following channels have been successfully indexed using web and custom collections with differing levels of support for the channel:

  • Instagram

  • Vimeo

  • Linkedin

  • Soundcloud

Additional social media channels can be indexed by implementing a custom gatherer to communicate with the custom channel’s APIs to fetch and index the channel’s data feeds. See the following section on indexing APIs for further information. In some cases a web collection may be used to crawl the content as though it is a website.

Most social media gatherers will interact with a REST API and process the returned data, usually JSON or XML.

When working with social media channels it is often useful to separate these from the main search results (e.g. on a social media tab or via extra searches) and to use channel-specific presentation for different result types (e.g. using a video lightbox for YouTube results).

4.6. Delimited text, XML and JSON data sources

Data contained within delimited text files, XML and JSON can be indexed natively by Funnelback as additional data sources within any search.

Funnelback includes filters for the conversion of delimited text and JSON data to XML.

The CSV filter includes support for CSV (comma-delimited text), TSV (tab-delimited text), Microsoft Excel and SQL files.

Delimited text, XML and JSON data sources that are available via HTTP should normally be indexed using a web collection and the filter framework used to convert the data to XML which is natively indexed by Funnelback.

The fields (once converted to XML) are then mapped to Funnelback metadata classes.

If the delimited text file is available via HTTP then the recommended way of indexing the file is using a web collection.

If the delimited text is returned via an API or similar then a custom collection is probably the best way to index the data source.

4.6.1. Limitations and gotchas

  • There is no built-in DLS support for CSV, XML or JSON data

  • Records are downloaded and stored internally as XML with fields mapped to Funnelback metadata classes.

Tutorial: Delimited text, XML and JSON data sources
  • Create a web collection

  • Ensure that the URLs for the CSV/XML/JSON file(s) are added to the start URLs list and include patterns.

  • If any of the files being downloaded are larger than 10MB in size add the following collection configuration settings to ensure the file(s) are all stored. <SIZE-IN-MB> should be set to a value greater than the size of the largest file being downloaded.

    crawler.max_download_size=<SIZE-IN-MB>
    crawler.max_parse_size=<SIZE-IN-MB>
  • If crawling JSON or CSV add the following to the filter classes (a combined sketch follows this list):

  • For CSV: CSVToXml

  • For JSON: JSONToXml

  • It may also be necessary to chain this filter with the corresponding filter to force the mime type if the web server is returning the files using an incorrect mime type. E.g. ForceCSVMime:CSVToXml or ForceJSONMime:JSONToXml

  • For CSV there may be some additional settings to configure (such as adding csv to the crawler.non_html setting, indicating whether the file has a header row, or specifying the type of delimited file, as this filter supports some other delimited file formats).

    Filters are not required for XML as this is natively supported by Funnelback. However, if the XML file contains incorrect headers or is missing the XML declaration it may be detected as HTML and have the HTML parser applied. This can be seen in the Step-Index.log, with the entry for the file being marked as HTML rather than XML. There is a ForceXMLMime filter that can be added to the filter chain to force all downloaded documents to be identified as XML (similar to ForceJSONMime above).
  • Run a crawl

  • Edit the metadata mappings. Note that CSV and JSON files will be converted to XML so the metadata mappings need to use the XPaths that correspond to the converted format. See the documentation for the CSVToXml and JSONToXml filters for more information on the mappings.
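
A combined sketch of the collection.cfg changes for a web collection indexing a large CSV feed is shown below. The download size value and the placement of the ForceCSVMime:CSVToXml pair at the end of the default filter chain are assumptions for illustration only - check the filter documentation for the recommended arrangement.

    crawler.max_download_size=50
    crawler.max_parse_size=50
    crawler.non_html=pdf,doc,ps,ppt,xls,rtf,csv
    filter.classes=CombinerFilterProvider,TikaFilterProvider,ExternalFilterProvider:JSoupProcessingFilterProvider:DocumentFixerFilterProvider:ForceCSVMime:CSVToXml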

4.7. Indexing systems via an API

Funnelback provides a generic custom collection type that allows the implementation of custom logic to interact with APIs to gather content.

These custom gatherers are written using the Groovy programming language and need to implement any interactions required to authenticate with the target repository, communicate with the APIs and process the returned content. The exact steps required will vary from system to system and may require use of a set of libraries or a Java SDK.

The custom gatherer can then feed the output through Funnelback’s filter framework which can be used to convert JSON/CSV to XML as detailed above.

4.7.1. Custom gatherer design

  • Consider what a search result should be and investigate the available APIs to see what steps will be required to get to the result level data. This will often involve several calls to get the required data. It may also involve writing code to follow paginated API responses.

  • Consider what type of store is suitable for the custom gatherer - custom gatherers can operate in a similar manner to a normal collection storing the data in a warc file or can be configured to use a push collection to store the data. Using a push collection is suitable for collections that will be incrementally updated whereas the more standard warc stores are better if you fully update the data each time as you have an offline and live view of the data.

  • If working with a push collection consider how the initial gather will be done and how this might differ from the incremental updates. When working with push collections it’s also critical to think about how items will be updated and removed.

  • Start by implementing basic functionality in your custom gatherer and then iteratively enhance this.

4.7.2. Implement a custom gatherer

Tutorial: Index Jira via the Jira REST API

Step 1: figure out what you want to have as search result level records

For a search of Jira data we are interested in having the Jira issues/tickets as the result level information.

Step 2: figure out how to get access to the relevant items

The next step is to determine the best way to access the items relevant for the search index.

For example, investigate the APIs to see if it’s possible to get the issues returned via the API and what chain of API calls might be required to reach the final goal of getting back all the issue data. This might require several API calls (for example, one to get a set of project IDs, then one to list how many issues are part of each project, then a loop that fetches each ticket by ID from 1 through to the number of issues in the project).

Inspecting the developer documentation shows that Jira has a REST API, and that has a search API call that will return a set of issues that match a particular Jira query (specified in JQL). This includes time based queries (e.g. return all tickets updated since a given date).

For Jira the following API call returns all the issues.

  • Make an API call to get a list of all issues updated since 1970/01/01 00:00: https://jiraserver.mysite.com/rest/api/2/search?jql=updated%20%3E%3D%20%221970%2F01%2F01%2000%3A00%22

Some additional parameters can be supplied to this to return a specific number of results, and to allow us to paginate through the results. E.g. return 100 results starting at result 1001: https://jiraserver.mysite.com/rest/api/2/search?jql=updated%20%3E%3D%20%221970%2F01%2F01%2000%3A00%22&maxResults=100&startAt=1001

The response also includes a value that indicates the total number of results - we can make use of this to implement the logic to paginate through the results.

It is also important to check the authentication requirements when looking over the API documentation - is authentication required to access the API and if so what is it?

For Jira Basic HTTP authentication is the recommended method of accessing the REST API as long as the Jira server is accessed via HTTPS.

Step 3: validate the approach

Run a few tests directly against the REST API in a browser or using a tool such as curl. This shows that we can get the information we need back using the search API call, and that we will get the full ticket contents in the body of the response. We can paginate through the results to get a full set of issues.
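
For example, a sketch of such a check using curl (the hostname matches the example above; the credentials are placeholders, and -u sends the Basic HTTP authentication header):

    curl -u funnelback-crawl:examplePassword \
      "https://jiraserver.mysite.com/rest/api/2/search?jql=updated%20%3E%3D%20%221970%2F01%2F01%2000%3A00%22&maxResults=2"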

Step 4: design the custom gatherer

This can be translated into a custom gatherer that performs the following operations:

  1. Make an API call to determine the total number of tickets. The total number can be read from the updatedIssues.total field in the response.

  2. Because the number of tickets is large the API calls will need to be paginated. Use the total number of tickets to create a loop that creates paginated search API requests to fetch all the tickets.

  3. For each result page read the issues into a JSON object then iterate over each individual issue, storing each as a separate item.

Step 5: implement basic custom gatherer logic

The following code implements the basic custom gathering logic, reading the Jira server, username and password from the collection.cfg.

A debug mode is also used to print additional debug information to the log.

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;
import java.net.URL;
// JSON imports
import org.json.*;

/*
 * Custom gatherer for crawling content from Jira
 * Tested with Jira server 7.1.7
 */

// Create a configuration object to read collection.cfg
def config = new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore();

// Upper limit on the number of issues to return in a page of results for Jira search API calls.
def maxResults = 100;

// Read host details from configuration
def jiraHost = config.value("jira.host", null) //"https://jira.cbr.au.funnelback.com"
def jiraUser = config.value("jira.user", null)
def jiraPassword = config.value("jira.password", null)

// Set the debug flag (if applicable)
def debug = config.value("jira.debug", "false").toBoolean()

// Configure the authentication header for Basic HTTP Authentication. Note Basic HTTP Auth requires the auth string to be base64 encoded.
def auth = jiraUser+":"+jiraPassword
def authString = auth.bytes.encodeBase64().toString()

// Set the HTTP request properties
def reqProps = ["Authorization":"Basic "+authString, "User-Agent":"FunnelBack"]

// HTTP connection and read timeouts (in ms) used by the API calls below
def reqTimeout = 60000
def readTimeout = 60000

// Open the store (where the individual issue records are stored for indexing)
store.open()

// Print out a status message to the update logs (gather_executable.log)
println("Performing gather of Jira repository at "+jiraHost)

try {
    // Set the JQL query to fetch items updated since the prevUpdateStartTime
    def updatedFrom = URLEncoder.encode("1970/01/01 00:00","UTF-8")

    // Determine the number of updated issues since the last update
    def updatedIssues = new JSONObject(new URL(jiraHost+"/rest/api/2/search?jql=updated%20%3E=%20%22"+updatedFrom+"%22&maxResults=0").getText(connectTimeout: reqTimeout, readTimeout: readTimeout, requestProperties: reqProps))
    def numUpdated = updatedIssues.total
    println("Found: "+numUpdated+" issues")

    // Get the issue IDs in lots of maxResults
    if (debug) { println("DEBUG: "+(numUpdated.intdiv(maxResults)+1)+" results pages to process.") }
    for (i=0; i<numUpdated; i+=maxResults) {
        def start = i + 1

        // If debug mode is enabled print out a status message indicating which results page is being fetched
        if (debug) { println("DEBUG: Fetching "+jiraHost+"/rest/api/2/search?jql=updated%20%3E=%20%22"+updatedFrom+"%22&maxResults=100&startAt="+start) }

        // Fetch the next page of issues
        def issuesPage = new JSONObject(new URL(jiraHost+"/rest/api/2/search?jql=updated%20%3E=%20%22"+updatedFrom+"%22&maxResults=100&startAt="+start).getText(connectTimeout: reqTimeout, readTimeout: readTimeout, requestProperties: reqProps))

        // Iterate over each issue in the page of results.
        issuesPage.issues.each { issue ->

            // If debug mode is enabled print out the JSON packet for the current issue
            if (debug) { println("DEBUG: "+issue.toString()) }

            // Derive the project key from the issue key (the enhanced version below uses this for project include/exclude rules)
            def projKey = issue.key.replaceAll("-\\d+\$","")
            println("Processing issue: "+issue.key)

            // Set the Issue URL
            def issueUrl = jiraHost+"/browse/"+issue.key
            def record = new RawBytesRecord(issue.toString().getBytes("UTF-8"), issueUrl)

            // Set the correct Content-Type of the record.
            def metadata = ArrayListMultimap.create()
            metadata.put("Content-Type", "application/json")

            // Add the item and its associated metadata to the document store
            store.add(record,metadata)

        }
    }
}
catch (e) {
    // Log any errors from the Jira API calls (variables declared inside the try block are not visible here)
    println("Error gathering from '"+jiraHost+"/rest/api/2/search': "+e)
}

// Close the document store
store.close()

Step 6: enhance the gatherer to add additional functionality

The code below enhances the functionality to add a number of features:

  • Allows for hardcoded values to be set via collection.cfg options

  • Adds HTTP timeouts and a request delay

  • Listens for a user stop (the user requesting a stop update)

  • Sets update feedback messages in the administration interface

  • Adds an incremental update mode and support for push collections. Note: there is no allowance for removing issues that have been deleted or moved.

  • Adds the ability to include/exclude Jira issues based on project.

import com.funnelback.common.*;
import com.funnelback.common.config.*;
import com.funnelback.common.io.store.*;
import com.funnelback.common.io.store.bytes.*;
import com.funnelback.common.io.store.xml.*;
import com.funnelback.common.utils.*;
import com.google.common.collect.*;
import java.io.File;
import java.net.URL;
// JSON imports
import org.json.*;

/*
 * Custom gatherer for crawling content from Jira
 * Tested with Jira server 7.1.7
 */

// Create a configuration object to read collection.cfg
def config = new NoOptionsConfig(new File(args[0]), args[1]);

// Create a Store instance to store gathered data
def store = new RawBytesStoreFactory(config)
                .withFilteringEnabled(true) // enable filtering.
                .newStore();

// Read host details from configuration
def jiraHost = config.value("jira.host", null) //"https://jira.cbr.au.funnelback.com"
def jiraUser = config.value("jira.user", null)
def jiraPassword = config.value("jira.password", null)

// Read the update mode from configuration. Supported options: 'full' and 'refresh'
def jiraUpdateMode = config.value("jira.update.mode", "full")

// Include and exclude rules
def jiraInclude = config.value("jira.project.include", "").split(",")
def jiraExclude = config.value("jira.project.exclude", "").split(",")

// Upper limit on the number of issues to return in a page of results for Jira search API calls.
def maxResults = config.valueAsInt("jira.api.maxResults",100);

// Set the debug flag (if applicable)
def debug = config.value("jira.debug", "false").toBoolean()

// Read store type from collection configuration.
def storeClass = config.value("store.raw-bytes.class", "com.funnelback.common.io.store.bytes.WarcFileStore")

// Configure the authentication header for Basic HTTP Authentication. Note Basic HTTP Auth requires the auth string to be base64 encoded.
def auth = jiraUser+":"+jiraPassword
def authString = auth.bytes.encodeBase64().toString()

// Set the HTTP request properties
def reqProps = ["Authorization":"Basic "+authString, "User-Agent":"FunnelBack"]
// HTTP connection settings
def reqDelay = config.valueAsInt("jira.api.requestDelay",250); // Sleep between API requests (ms)
def reqTimeout = config.valueAsInt("jira.api.requestTimeout",60000); // Request timeout (ms)
def readTimeout = config.valueAsInt("jira.api.readTimeout",60000); // Read timeout (ms)

// Open the store (where the individual issue records are stored for indexing)
store.open()

// store and skip counter
def storeCounter = 0
def skipCounter = 0

// Recording update start time (for use with future updates)
def now = new Date()
def updateStartTime = now.format("yyyy/MM/dd HH:mm") // 24-hour format, to match the JQL date comparisons

// Default the previous update time to the epoch so that a full gather is performed.  This is overwritten below if a refresh update is run and a previous update time is available.
def prevUpdateStartTime = "1970/01/01 00:00"

// Print out a status message to the update logs (gather_executable.log)
println("Performing gather of Jira repository at "+jiraHost)

// If doing a refresh update then read the previous update time from last-update-time.cfg
if (jiraUpdateMode == "refresh") {
    // Attempt to read the previous update start time
    def prevUpdateStartTimeFile = new File(args[0], "conf" + File.separator + args[1] + File.separator + "last-update-time.cfg")
    if (prevUpdateStartTimeFile.exists()) {
        prevUpdateStartTime = prevUpdateStartTimeFile.getText('UTF-8')
    }
    // If we're not using a push store and it's a refresh update then copy the existing data from the live view
    if (storeClass != "com.funnelback.common.io.store.bytes.Push2Store") {
        def offlineData = new File(args[0], "data" + File.separator + args[1] + File.separator + "offline" + File.separator + "data");
        def liveData = new File(args[0], "data" + File.separator + args[1] + File.separator + "live" + File.separator + "data");
        org.apache.commons.io.FileUtils.copyDirectory(liveData, offlineData);
    }
}

try {
    // Set the JQL query to fetch items updated since the prevUpdateStartTime
    def updatedFrom = URLEncoder.encode(prevUpdateStartTime,"UTF-8")

    // Determine the number of updated issues since the last update
    def updatedIssues = new JSONObject(new URL(jiraHost+"/rest/api/2/search?jql=updated%20%3E=%20%22"+updatedFrom+"%22&maxResults=0").getText(connectTimeout: reqTimeout, readTimeout: readTimeout, requestProperties: reqProps))
    def numUpdated = updatedIssues.total
    println(numUpdated+" issues have been modified since "+prevUpdateStartTime)

    // Get the issue IDs in lots of maxResults
    if (debug) { println("DEBUG: "+(numUpdated.intdiv(maxResults)+1)+" results pages to process.") }
    for (i=0; i<numUpdated; i+=maxResults) {
        def start = i // startAt is zero-based in the Jira search API

        // If debug mode is enabled print out a status message indicating which results page is being fetched
        if (debug) { println("DEBUG: Fetching "+jiraHost+"/rest/api/2/search?jql=updated%20%3E=%20%22"+updatedFrom+"%22&maxResults="+maxResults+"&startAt="+start) }

        // Fetch the next page of issues
        def issuesPage = new JSONObject(new URL(jiraHost+"/rest/api/2/search?jql=updated%20%3E=%20%22"+updatedFrom+"%22&maxResults="+maxResults+"&startAt="+start).getText(connectTimeout: reqTimeout, readTimeout: readTimeout, requestProperties: reqProps))

        // Iterate over each issue in the page of results.
        issuesPage.issues.each { issue ->

            // Check for a user stop every 100 records processed, and update the admin UI status message
            if (((storeCounter+skipCounter) % 100) == 0)
            {
                // Check to see if the update has been stopped
                if (config.isUpdateStopped()) {
                    store.close()
                    throw new RuntimeException("Update stop requested by user.");
                }
                // Set the collection update progress message
                config.setProgressMessage("Stored "+storeCounter+" issues, skipped "+skipCounter+" issues.");
                println("Progress: Stored "+storeCounter+" issues, skipped "+skipCounter+" issues.")
            }

            // If debug mode is enabled print out the JSON packet for the current issue
            if (debug) { println("DEBUG: "+issue.toString()) }

            // Check the issue's project ID and compare this to the include/exclude list.  Only include the issue if it's in the include list and not present in the exclude list.
            def projKey = issue.key.replaceAll("-\\d+\$","")
            if ( (jiraInclude.contains(projKey) || jiraInclude.contains("")) && (!jiraExclude.contains(projKey) || jiraExclude.contains("")) ) {
                println("Processing issue: "+issue.key)

                // Set the Issue URL
                def issueUrl = jiraHost+"/browse/"+issue.key
                def record = new RawBytesRecord(issue.toString().getBytes("UTF-8"), issueUrl)

                // Set the correct Content-Type of the record.
                def metadata = ArrayListMultimap.create()
                metadata.put("Content-Type", "application/json")

                // Add the item and associated metadata to the document store
                store.add(record,metadata)

                // Increment the store counter
                storeCounter++
            }
            else {
                // Increment the skip counter
                println("Skipping '"+issue.key+"' due to include/exclude rules")
                skipCounter++
            }
        }
        // Add a request delay
        sleep(reqDelay)
    }
}
catch (e) {
    // Note: variables declared inside the try block (such as the request URL) are not visible here
    println("Error gathering from '"+jiraHost+"': "+e)
}

// Close the document store
store.close()

// Record the update start time in last-update-time.cfg (only reached when the update completes successfully)
prevUpdateStartTimeFile = new File(args[0], "conf" + File.separator + args[1] + File.separator + "last-update-time.cfg")
prevUpdateStartTimeFile.write updateStartTime
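
The enhanced gatherer is driven entirely by collection.cfg options. A minimal configuration might look like the following (hostnames, account names and passwords are placeholders; the option names correspond to the config.value() calls in the script above):

jira.host=https://jira.example.com
jira.user=svc-funnelback
jira.password=CHANGEME
jira.update.mode=refresh
jira.project.include=PROJA,PROJB
jira.project.exclude=
jira.api.maxResults=100
jira.api.requestDelay=250
jira.api.requestTimeout=60000
jira.api.readTimeout=60000
jira.debug=false
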
Extended exercise: Integrating Jira with Funnelback using Jira’s Webhooks and the Push API

The code above adds support for push collections but doesn’t provide the ability to remove documents from the collection.

Unfortunately there is no way to obtain a list of tickets that have been deleted from Jira via the APIs.

Supporting the removal of deleted tickets requires either a full gather of Jira for each update, or for Jira to communicate directly with the push API when a deletion occurs.

Jira has a feature called Webhooks that allows an external URL to be called when certain events occur within Jira.

These can be used to maintain a Funnelback index. Once you have fully gathered the issues from an instance of Jira you can set up Jira webhooks to communicate directly with the Funnelback Push API to keep the index up-to-date (an abridged example webhook payload is shown after the list below).

e.g.

  • When an issue is created or updated - call the Push API and add the document

  • When an issue is deleted - call the Push API to remove the document.
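
The exact payload varies between Jira versions, but each webhook callback is a JSON document that identifies the event and the affected issue. An abridged, illustrative example (all values are placeholders) looks something like:

{
  "webhookEvent": "jira:issue_updated",
  "issue": {
    "id": "10024",
    "key": "PROJA-42",
    "fields": { "summary": "Example issue", "updated": "2016-05-04T10:15:30.000+1000" }
  }
}

A small web service that receives these callbacks can map jira:issue_created and jira:issue_updated events to Push API additions, and jira:issue_deleted events to Push API deletions.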

4.8. Squiz Matrix

Matrix collections support document level security on Squiz Matrix websites.

A standard web collection should be used for publicly accessible websites hosted in Squiz Matrix.

Squiz Matrix is crawled using the web crawler (so all the usual web crawler options are available), which also indexes some additional security-related metadata supplied by Squiz Matrix.

The query-time document level security requires that Squiz Matrix nest the Funnelback search within a REST asset. Other integration methods are not currently supported.

4.8.1. Limitations and gotchas

  • Matrix DLS cannot be combined with other collections that have DLS, as nested integration is currently the only supported form of integration.

  • A Squiz Matrix security plugin would need to be developed for Funnelback to enable other integration methods to be supported, and the combining of Squiz Matrix DLS with other collections secured with DLS.

4.9. Push collections

Push collections are a generic collection type within Funnelback that are updated via a REST API.

It is possible to integrate with a push collection directly from a third party system if that system supports making external requests when events occur within it.

4.9.1. Basic integration of a push collection with a CMS

A push collection should be created within Funnelback along with a non-expiring API token for the CMS to use to authenticate with Funnelback.

Event handlers within the CMS should be configured to handle the following (an illustrative sketch of the corresponding Push API calls is shown after this list):

  1. When an item is added or updated call the Funnelback Push API with a PUT request that submits the content.

  2. When an item is deleted call the Funnelback Push API with a DELETE request to remove the content.

  3. When an item is moved call the Funnelback Push API with a PUT request submitting the content on the new URL, and a DELETE request to remove the content from the old URL.
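
The Groovy sketch below illustrates the shape of these calls. It is a sketch only, not a drop-in implementation: the endpoint path, the key parameter and the X-Security-Token header are assumptions that must be verified against the Push API documentation for your Funnelback version, and the server, collection name, API token and document URLs are placeholders.

import java.net.HttpURLConnection

// Placeholders - replace with your Funnelback server, push collection and API token.
def pushServer = "https://funnelback.example.com"
def collection = "example-push"
def apiToken   = "REPLACE_WITH_API_TOKEN"

// Perform a single Push API call.
// method is "PUT" (add or update, with content) or "DELETE" (remove, no content).
def pushCall = { String method, String docUrl, String content, String contentType ->
    // Assumed endpoint format - verify against the Push API documentation.
    def key = URLEncoder.encode(docUrl, "UTF-8")
    HttpURLConnection conn = new URL(pushServer+"/push-api/v1/collections/"+collection+"/documents?key="+key).openConnection()
    conn.requestMethod = method
    conn.setRequestProperty("X-Security-Token", apiToken) // assumed authentication header
    if (method == "PUT") {
        conn.doOutput = true
        conn.setRequestProperty("Content-Type", contentType)
        conn.outputStream.withWriter("UTF-8") { it << content }
    }
    def status = conn.responseCode
    println(method+" "+docUrl+" returned HTTP "+status)
    return status
}

// 1. Item added or updated: submit the content on its URL
pushCall("PUT", "https://cms.example.com/news/article-1", "<html><body>Article 1</body></html>", "text/html")

// 2. Item deleted: remove the document
pushCall("DELETE", "https://cms.example.com/news/article-1", null, null)

// 3. Item moved: add the content on the new URL, then remove the old URL
pushCall("PUT", "https://cms.example.com/news/article-1-moved", "<html><body>Article 1</body></html>", "text/html")
pushCall("DELETE", "https://cms.example.com/news/article-1", null, null)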

A method of performing an initial or bulk update should also be considered. This may involve writing a Funnelback custom gatherer that fetches the initial content from the CMS, or creating a data load script or process within the CMS that sequentially submits all of the content to the push collection.

The system that interacts with Funnelback’s push API must also perform error handling in the event that the push API returns an error.

This could include logging the error or implementing a process that queues requests and retries any failures.

Error handling by the application responsible for submitting the content to the push API is extremely important because push collection updates always modify an existing index in place. Any failures to add or remove content that are not successfully retried will result in an index that is missing content, or that contains content which has been deleted. This differs from standard Funnelback collections, which have processes to fully refresh the index periodically.
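
As a simple illustration of this kind of defensive handling, the hypothetical Groovy helper below retries a failed submission a few times before giving up and logging the failure for later replay. It is a sketch only; a production integration would typically persist failed requests to a durable queue so that they can be replayed once the Push API is available again.

// Retry a Push API submission (or any closure that returns an HTTP status code).
// Returns true if a 2xx response was eventually received.
def submitWithRetries = { int maxAttempts, long delayMs, Closure<Integer> submit ->
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            int status = submit()
            if (status >= 200 && status < 300) {
                return true
            }
            println("Attempt "+attempt+" failed with HTTP "+status)
        } catch (Exception e) {
            println("Attempt "+attempt+" failed: "+e)
        }
        sleep(delayMs)
    }
    // In a real integration the failed request would be queued and retried later.
    println("Giving up after "+maxAttempts+" attempts - the request should be queued for later replay")
    return false
}

// Example usage with a dummy submission that always returns HTTP 503
submitWithRetries(3, 1000) { 503 }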

4.9.2. Push collections and document level security

Push collections can be used with document level security (for example the trimpush collection type that ships with Funnelback is underpinned by a push collection).

As with other collection types the implementation of DLS requires implementation of a number of custom components that are responsible for handling locks/keys extraction and matching. The push collection itself is purely concerned with the storage and retrieval of the information contained within the indexes.

4.10. ManifoldCF enterprise connectors

Funnelback includes support for connecting to a number of enterprise repository systems through an open source connector framework called ManifoldCF. This framework, along with the associated Funnelback ManifoldCF connector, allows Funnelback to be populated with content from repositories supported by ManifoldCF and to apply document level security for repositories where ManifoldCF supports fetching security information.

A separate ManifoldCF server is required to connect to the enterprise content repositories and fetch the content. ManifoldCF then submits the documents to a push collection that has been configured within Funnelback using the Funnelback output connector that must be installed into ManifoldCF.

Mixed levels of success have been experienced when using ManifoldCF connectors. Caution is advised when using any of the existing connectors.
Review questions: configuring collections for different data sources
  1. Can you use a local collection to index a folder full of documents?

  2. What’s the difference between filecopy and local collection types?

  3. Can you index a fileshare on Linux? Are there any limitations?

  4. What are some of the things that you need to account for when crawling an intranet that requires authentication?

  5. Could I use a directory collection to produce a building map showing staff locations? What would be required?

  6. Could I use a directory collection to provide a search of available resources such as rooms and printers?

  7. Explain the limitations of indexing a SQL database.

  8. How would you exclude content from a database search?

  9. What database systems are supported by Funnelback?

  10. What social media channels does Funnelback support?

  11. How would you add support for another social media channel?

5. Implementing custom DLS

Tutorial: DLS for Sitecore

The following was implemented for an older version of Sitecore and may no longer be functional. However, it is a good example of what is typically required when implementing custom DLS.

Step 1: design and implement a strategy to provide the document locks to Funnelback

After investigating the options available it was decided to enable Sitecore to return the document locks in a custom HTTP response header when the Funnelback crawl user requests a URL from the site.

The following (Sitecore) code was written as a Sitecore plugin and when enabled returns the set of document locks as a custom HTTP response header. The code also returns some additional headers exposing some other Sitecore metadata:

using System;
using System.Web;
using System.Text;
using System.Reflection;
using System.Collections;
using Sitecore;
using Sitecore.Pipelines.HttpRequest;
using Sitecore.Data.Fields;
using System.Configuration;

namespace FunnelbackSitecore
{
    public class ProvideFunnelbackHeaders : HttpRequestProcessor
    {
        public override void Process(HttpRequestArgs args)
        {
            if (Context.Item == null)
                return;
            if (Context.Database == null)
                return;

            String username = ConfigurationManager.AppSettings["Funnelback.CrawlerAccountName"];

            if (Context.GetUserName() != username)
            {
                String debugMode = ConfigurationManager.AppSettings["Funnelback.DebugMode"];
                if (debugMode == "true")
                {
                    args.Context.Response.Headers.Add("X-Funnelback-Debug", "Username is " + Context.GetUserName() + " but we are configured to use " + username);
                }
                return;
            }

            args.Context.Response.Headers.Add("X-Funnelback-Id", Context.Item.ID.Guid.ToString());

            String title = "";
            foreach (Field field in Context.Item.Fields)
            {
                if (field.DisplayName.ToLowerInvariant() == Hyro.SitecoreHelper.SitecoreConstants.TitleFieldName.ToLowerInvariant())
                {
                    title += " " + field.Value;
                }
            }
            if (title.Length == 0)
            {
                title = Context.Item.DisplayName;
            }
            if (title.Length > 0)
            {
                args.Context.Response.Headers.Add("X-Funnelback-Title", title);
            }

            String description = "";
            foreach (Field field in Context.Item.Fields)
            {
                if (field.DisplayName.ToLowerInvariant() == Hyro.SitecoreHelper.SitecoreConstants.SummaryFieldName.ToLowerInvariant())
                {
                    description += " " + field.Value;
                }
            }
            if (description.Length > 0)
            {
                args.Context.Response.Headers.Add("X-Funnelback-Description", description);
            }

            String permissions = "master:Allow";
            permissions += permissions_for_item(Context.Item);
            args.Context.Response.Headers.Add("X-Funnelback-Lockstring", permissions);
        }

        private string permissions_for_item(Sitecore.Data.Items.Item item)
        {
            string result = "";

            var rules = item.Security.GetAccessRules();

            string allow = "";
            string deny = "";
            string inherit_allow = "";
            string inherit_deny = "";
            string debug = "";
            foreach (var rule in rules)
            {
                debug += "[" + item.DisplayName + "-" + rule.Account.Name + "-" + rule.SecurityPermission + "-" + rule.AccessRight + "],";
                if (rule.SecurityPermission == Sitecore.Security.AccessControl.SecurityPermission.AllowAccess)
                {
                    if (rule.AccessRight == Sitecore.Security.AccessControl.AccessRight.ItemRead || rule.AccessRight == Sitecore.Security.AccessControl.AccessRight.Any)
                    {
                        allow += "," + rule.Account.Name + ":Allow";
                    }
                }
                else if (rule.SecurityPermission == Sitecore.Security.AccessControl.SecurityPermission.DenyAccess)
                {
                    if (rule.AccessRight == Sitecore.Security.AccessControl.AccessRight.ItemRead || rule.AccessRight == Sitecore.Security.AccessControl.AccessRight.Any)
                    {
                        deny += "," + rule.Account.Name + ":Deny";
                    }
                }
                else if (rule.SecurityPermission == Sitecore.Security.AccessControl.SecurityPermission.AllowInheritance)
                {
                    inherit_allow += "," + rule.Account.Name + ":InheritAllow";
                }
                else if (rule.SecurityPermission == Sitecore.Security.AccessControl.SecurityPermission.DenyInheritance)
                {
                    inherit_deny += "," + rule.Account.Name + ":InheritDeny";
                }
            }
            // Denies override allows (Security Admin's Cookbook pg 7)
            result += deny + allow;

            // TODO - is this correct?
            result += inherit_deny + inherit_allow;

            if (item.Parent != null)
            {
                result += ",parent_item(" + item.Parent.DisplayName + ")" + permissions_for_item(item.Parent);
            }

            // result = result + "," + debug;
            return result;
        }
    }

}

This code is built (using Microsoft Visual Studio) into a component that must be installed within Sitecore.

Sitecore is configured to call this code as part of the request processing pipeline. Once this is configured, additional response headers are returned when a page is requested from Sitecore:

X-Funnelback-Id:4aea0172-2f00-4d62-9455-06fd8ef12ac6
X-Funnelback-Lockstring:master:Allow,sitecore\user3:Deny,sitecore\user1:Deny,sitecore\groupB:Allow,sitecore\groupB:Allow,sitecore\user2:Allow,parent_item(events),sitecore\groupB:Deny,sitecore\groupA:Deny,sitecore\groupC:Deny,sitecore\groupB:Allow,sitecore\groupA:Allow,sitecore\groupC:Allow,parent_item(news-and-events),parent_item(Home),Everyone:Deny,Everyone:Allow,sitecore\groupA:Allow,sitecore\groupA:Allow,sitecore\groupB:Allow,sitecore\groupB:Allow,sitecore\funnelback:Allow,sitecore\funnelback:Allow,sitecore\groupC:Allow,sitecore\groupC:Allow,parent_item(Content),Everyone:Allow,Everyone:Allow,parent_item(sitecore),Everyone:Allow,Everyone:Allow
X-Funnelback-Title:content-c
X-Funnelback-Description:

Step 2: design and implement a user to key mapper

A Sitecore web service was created to enable Funnelback to look up a user’s set of keys at query time.

The following code implements a web service within Sitecore that returns a user’s set of keys. This code is installed into Sitecore and sets up the web service.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Sitecore;

/*
 * A web service for getting keys for the current user.
 *
 * Possibly there is some way to look up a user given their name and
 * get keys for that (hence why it takes a username parameter, though
 * this is currently not used).
 *
 * Can be called to test with...
 * curl -vkX POST --data-binary '@post.txt' -H 'SOAPAction: "http://sitecore.example.com/UserKeys"' -H 'Content-Type: text/xml; charset=utf-8' https://sitecore.example.com/sitecore/shell/WebService/funnelback.asmx
 *
 * post.txt should contain...
 *
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <UserKeys xmlns="http://sitecore.example.com/">
      <username>user2</username>
    </UserKeys>
  </soap:Body>
</soap:Envelope>
 *
 *
 * To make this available in sitecore, add Funnelback.asmx containing
 *
 * <%@ WebService Language="c#" Codebehind="KeyService.cs" Class="FunnelbackSitecore.KeyService" %>
 *
 * to <SITECORE>\Website\sitecore\shell\WebService
 *
 */
namespace FunnelbackSitecore
{
    public class KeyService
    {
        [System.Web.Services.WebMethod]
        public string UserKeys(string username)
        {
            string keystring = "";
            keystring += Context.GetUserName();
            foreach (Sitecore.Security.Accounts.Role role in Context.User.Roles)
            {
                keystring += "," + role.Name;
            }
            // TODO - is this correct?
            // There seems to be no "Everyone role", so just assume they're in it
            keystring += ",sitecore\\Everyone,Everyone";

            return keystring;
        }
    }
}

An example of a request/response is:

GET http://sitecore.example.com/sitecore/shell/Userservice/userservice.asmx/UserGroups?username=org\testuser1 HTTP/1.1
Host: sitecore.example.com
Connection: keep-alive
Cache-Control: max-age=0

HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Content-Type: text/xml; charset=utf-8
Server: Microsoft-IIS/8.0
X-Powered-By: UrlRewriter.NET 2.0.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Mon, 14 Oct 2013 00:30:36 GMT
Content-Length: 128

<?xml version="1.0" encoding="utf-8"?>
<roles>org\STAFF,org\Internal users,sitecore\Everyone,Everyone</roles>

A plugin for Funnelback (a user to key mapper) is then required to enable Funnelback to access the user’s keys by communicating with the web service.

The mapper is a Java class within the modern UI and uses the user’s current login information to communicate with Sitecore to get back the set of user keys.

package com.funnelback.publicui.search.lifecycle.input.processors.userkeys;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.commons.io.IOUtils;
import org.w3c.dom.Document;

import lombok.extern.log4j.Log4j2;

import com.funnelback.common.config.Keys;
import com.funnelback.publicui.search.model.collection.Collection;
import com.funnelback.publicui.search.model.transaction.SearchTransaction;

/**
 * Authenticates search users via Sitecore-provided XML.
 */
@Log4j2
public class SitecoreMapper implements UserKeysMapper {

    private static final String xmlTagOfInterest = "roles";

    @Override
    public List<String> getUserKeys(Collection currentCollection, SearchTransaction transaction) {

        //Storage for keys to return
        List<String> userKeys = new ArrayList<String>();

        String urlStem = null;
        String searchName = null;
        try {

            //Assemble the url to query
            try {
                //Pull the start of the URL from collection.cfg
                //e.g. http://sitecore.example.com/sitecore/shell/userservice/userservice.asmx/UserGroups?username=
                urlStem = currentCollection.getConfiguration().value(Keys.SecurityEarlyBinding.SITECORE_SERVICE_URL);
                if(urlStem == null) {
                    throw new Exception ("urlStem was null");
                }
            } catch (Exception e) {
                log.error("Unable to get sitecore service url out of collection.cfg", e);
                //Return immediately
                return userKeys;
            }

            try {
                //Attach the current search user to the end
                searchName = transaction.getQuestion().getPrincipal().getName();
                if(searchName == null) {
                    throw new Exception("searchName was null");
                }
            } catch (Exception e) {
                log.error("Unable to get principle name from search transaction", e);
                //Return immediately
                return userKeys;
            }

            URL url = new URL(urlStem + searchName);
            InputStream is = url.openStream();

            StringWriter writer = new StringWriter();
            IOUtils.copy(is, writer, "UTF-8");
            String xmlContents = writer.toString();

            is.close();

            log.debug("XML Contents to follow: ");
            log.debug(xmlContents);

            InputStream xmlIs = new ByteArrayInputStream(xmlContents.getBytes());

            //Parse the xml
            Document doc =
                DocumentBuilderFactory
                    .newInstance()
                    .newDocumentBuilder()
                    .parse(xmlIs);

            //create a debug log string
            log.debug("LOOKED IN: " + urlStem + searchName);

            String payload = doc
                .getElementsByTagName(xmlTagOfInterest)
                .item(0)
                .getTextContent();

            //Split the comma-separated values and add them to the return list
            for (String s : payload.split(",")) {

                //Remove internal spaces - experimental to try to get it working
                userKeys.add(s.replace(" ", "").trim());
            }

            String dbgUserKeys = "FOR USER " + searchName + ", RETURNED USERKEYS:\r\n";
            for(String s : userKeys) {
                dbgUserKeys += s + "\r\n";
            }
            log.debug(dbgUserKeys);

        } catch (Exception e) {
            log.error("Hit " + e.toString() + " trying to parse xml from " + urlStem + " with searchName " + searchName, e);
        }
        return userKeys;
    }
}
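
For the Modern UI to use this class it must be compiled onto the Modern UI classpath and registered against the collection via the early binding security settings in collection.cfg. The option names shown below are indicative only and must be verified against the documentation for your Funnelback version; the service URL option corresponds to the Keys.SecurityEarlyBinding.SITECORE_SERVICE_URL constant read by the mapper:

security.earlybinding.user-to-key-mapper=SitecoreMapper
security.earlybinding.sitecore-service-url=http://sitecore.example.com/sitecore/shell/userservice/userservice.asmx/UserGroups?username=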

Step 3: Implement a security plugin

The security plugin implements the locks to keys matcher. It takes the set of user keys (obtained by communicating with the Sitecore web service) and implements an algorithm that compares these keys with the set of locks (security permissions) recorded against each document. The job of the security plugin is to determine whether a user has access to a document based on the set of keys supplied and the set of locks recorded against that document.

/*
 * Security plugin for Sitecore
 */

/*
 * PLUGIN_DEFINITION must be defined in top level plugin source
 */
#define PLUGIN_DEFINITION

#include <sys/types.h>
#include "common/windows.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "common/utility_nodeps.h"
#include "queries/secPlugin.h"

#ifdef __MINGW32__
char *strtok_r(char *str, const char *delim, char **save)
{
    char *res, *last;

    if( !save )
        return strtok(str, delim);
    if( !str && !(str = *save) )
        return NULL;
    last = str + strlen(str);
    if( (*save = res = strtok(str, delim)) )
    {
        *save += strlen(res);
        if( *save < last )
            (*save)++;
        else
            *save = NULL;
    }
    return res;
}
#endif

/*
 * session information structure
 *
 * Persistent for life of plugin
 */
typedef struct sampleInfo {
    char name[100];
    int count;
} sampleInfo_t;

int secPluginOpen(secPluginInfo_t *info, void **userfield) {

#ifdef VERBOSE
    printf("in secPluginOpen info=%p userfield=%p\n", info, userfield);

    printf("secPluginSample open: name=%s\n", info->name);

    printf("Passed parameters\n");
    printf("  name=%s\n", info->name);
    printf("  script=%s\n", info->script);
    printf("  search_home=%s\n", info->search_home);
    printf("  collection_name=%s\n", info->collection_name);
    printf("  profile name=%s\n", info->profile_name);
    printf("  document_level_security=%d\n", info->document_level_security);
    printf("  document level security limit=%d\n", info->dls_max2check);
#endif

    /*
     * create the persistent structure
     */
    sampleInfo_t *p = (sampleInfo_t *) malloc(sizeof (sampleInfo_t));
    *userfield = p;

    /*
     * do something with session persistent  struct
     */
    strcpy(p->name, "secPlugin: ");
    strcat(p->name, info->name);
    p->count = 0;

    return 0;
}

int secPluginClose(secPluginInfo_t *info, void **userfield) {

    if (*userfield == NULL) {
#ifdef VERBOSE
        printf("secPluginSample close: userfield is null!!!\n");
#endif
    } else {
#ifdef VERBOSE
        sampleInfo_t *p = (sampleInfo_t *) (*userfield);
        printf("secPluginSample close: name=%s used=%d\n", p->name, p->count);
#endif

        free(*userfield);
#ifdef VERBOSE
        printf("secPluginSample close: userfield struct freed");
#endif
        *userfield = NULL; // be tidy
    }

    return 0;
}

char *strmalcpy_char(char *instring) {
	char *rslt;
	int len, arg;

	if (instring == NULL) return NULL;

	len = strlen(instring);
	arg = len + 1;

	rslt = malloc (arg);
	if (rslt == NULL) {
		fprintf(stderr, "strmalcpy_char: Malloc failed in Sitecore security checker.");
		exit(-1);
	}
	if (arg > 1) strcpy(rslt, instring);
	else rslt[0] = 0;
	return rslt;
}

int keystring_contains(char *keystring, char *user) {
	char *keystring_buffer, *candidate, *save_ptr;

	keystring_buffer = strmalcpy_char(keystring); // non-const copy
	candidate = strtok_r(keystring_buffer, "\r\n", &save_ptr);

	while (candidate != NULL) {
		if (strcmp(candidate, user) == 0) {
			free(keystring_buffer);
			return 1;
		}

		candidate = strtok_r(NULL, "\r\n", &save_ptr);
	}

	free(keystring_buffer);
	return 0;
}

int key_matches_lock(byte *keystring, byte *lockstring, secPluginInfo_t *info, void **userfield) {
	char *lockstring_buffer, *save_ptr, *lock, *colon, *user, *permission;

    if (lockstring == NULL || lockstring[0] == 0) return 0;
    if (keystring == NULL || keystring[0] == 0) return 0;

	lockstring_buffer = strmalcpy_char(lockstring); // non-const copy

	lock = strtok_r(lockstring_buffer, "\r\n", &save_ptr);

    while (lock != NULL) {

		colon = strchr(lock, ':');
		if (colon != NULL) {
			*colon = 0; // Null it out
			user = lock;
			permission = colon + 1;

			if (keystring_contains(keystring, user)) {

				if (strcmp(permission, "Allow") == 0) {
					free(lockstring_buffer);
					return 1;
				} else if (strcmp(permission, "Deny") == 0) {
					free(lockstring_buffer);
					return 0;
				} else if (strcmp(permission, "InheritDeny") == 0) {
					// If we haven't got permission yet, and can't inherit, deny.
					free(lockstring_buffer);
					return 0;
				}
			}
		}

		lock = strtok_r(NULL, "\r\n", &save_ptr);
    }

	free(lockstring_buffer);
    return 0;
}
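
The following Groovy transcription of the same matching logic can be useful for experimenting with lock and key strings outside of padre (as in the plugin above, locks and keys are assumed to be newline-separated, and each lock takes the form account:permission):

// Returns true if the supplied keys grant access according to the supplied locks.
// Mirrors key_matches_lock() above: the first lock held by the user decides the outcome.
def keyMatchesLock = { String keystring, String lockstring ->
    if (!keystring || !lockstring) return false
    def keys = keystring.split(/\r?\n/) as List
    for (lock in lockstring.split(/\r?\n/)) {
        def parts = lock.split(":", 2)
        if (parts.size() != 2) continue
        def user = parts[0]
        def permission = parts[1]
        if (keys.contains(user)) {
            if (permission == "Allow") return true // first Allow grants access
            if (permission == "Deny" || permission == "InheritDeny") return false // first Deny blocks access
        }
    }
    return false // no applicable lock: deny by default
}

// Example: a user holding the sitecore\user2 key is allowed, a user holding only sitecore\user1 is denied
assert keyMatchesLock("sitecore\\user2\nEveryone", "sitecore\\user1:Deny\nsitecore\\user2:Allow")
assert !keyMatchesLock("sitecore\\user1\nEveryone", "sitecore\\user1:Deny\nsitecore\\user2:Allow")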

6. Troubleshooting enterprise search

Enterprise search is notoriously difficult to debug because of the complexity and level of integration that is typical of such searches.

There are two main areas of troubleshooting that are specific to enterprise search: troubleshooting the gather process and troubleshooting query-time issues.

6.1. Troubleshooting the gather process

6.1.1. Manually test connections

It is a good idea to manually test your connection when experiencing connection issues with an update. The advantage of manually testing the connection is that you can connect directly to the system using mechanisms that are completely independent of Funnelback. This helps to determine whether the problem lies within the Funnelback configuration, or elsewhere, such as a blocked connection or incorrectly configured credentials within the target system.

Most connections can be tested manually using a tool such as curl. This can test websites and also API connections.

Attempt to connect to the seed URL or connection URL using curl, passing all the relevant parameters and inspect the response. It is often useful to also inspect the response headers.
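
For example, to manually test a seed URL that is protected by NTLM/Windows integrated authentication you might run something like the following from the Funnelback server (the hostname and account are placeholders; curl will prompt for the password):

curl -v --ntlm -u 'MYDOMAIN\svc-funnelback' https://intranet.example.com/ -o /dev/null

The -v output includes the request and response headers, which is usually enough to see whether the authentication handshake succeeded, or whether the request was redirected or blocked.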

Other systems may require specific tools to test manually. For example, to test a database connection use the database’s native client to connect to the database with the credentials configured within Funnelback.

6.1.2. Increase log levels

Web collections

Set the crawler.verbosity in collection.cfg to increase the logging provided by the web crawler.
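
e.g. increase the crawler verbosity (higher values produce more detail; check the crawler.verbosity documentation for the range of accepted values):

crawler.verbosity=6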

Trimpush collections

Alter the logging key’s value in $SEARCH_HOME/wbin/trim/Funnelback.TRIM.Gather.exe.config

e.g. Set the log level to debug. Update:

<arg key="level" value="INFO"/>

to

<arg key="level" value="DEBUG"/>
Push collections

Push collections are a service provided by Jetty. The logging configuration for push collections is set globally in $SEARCH_HOME/web/conf/push/log4j2.xml

e.g. increase logging level to debug. Update:

<Logger name="com.funnelback" level="info"/>

to

<Logger name="com.funnelback" level="debug"/>
Other collection types

The log4j2 framework is used for logging for most collection types. Logging settings are usually inherited from the global log4j2 settings contained in $SEARCH_HOME/conf/log4j2.xml.default

To override these, copy the log4j2.xml.default file to the collection’s configuration folder as log4j2.xml and modify the level (similar to push collections above).

6.1.3. Debugging form authentication

The Funnelback admin API provides a debug API call that can be used to assist with the debugging of form authentication.

It executes the form authentication using the collection configuration and logs all the communication and redirects that occur as part of the connection.

The debug call is accessed from the API-UI option in the administration interface system menu and is part of the admin API.

6.2. Troubleshooting query time issues

Debugging can be tricky due to the authentication and other factors that are specific to the site hosting the enterprise search. E.g.

  • The Funnelback server will often be installed on Windows and the server won’t have suitable tools for testing and debugging. A list of some suitable tools is available in this KB article: Useful programs to have installed on a Windows environment for Funnelback.

  • Internet Explorer used to be the only option for interacting with enterprise search due to the need for Windows integrated authentication support (with Kerberos), but both Chrome and Firefox can now also be configured for this sort of integration (although it requires some advanced configuration). See: http://woshub.com/enable-kerberos-authentication-in-browser/

  • The security zones (e.g. Intranet, Trusted sites) in Internet Explorer often interfere with authentication. When testing, ensure you use a local hostname (not fully qualified) and that IE treats the site as being in the intranet zone.

  • Group policies may mean that you will be prompted for a username and password even when single sign-on is enabled.

  • Always start any testing using the service account used to gather the content. This user should always have access to content within the search.

  • Testing via the server that hosts Funnelback may yield different results to a normal network connected user and this may be different again from a user that is connected via a VPN.

6.2.1. Troubleshooting document level security

There are three fundamental areas that need testing for document level security. Test for the following in this order.

  • Test that lock strings are being correctly extracted. Lock string extraction should result in lock strings being written to the security type metadata field.

  • Test that user key lookup is functioning.

  • Test that locks-keys matching is working correctly.

If Funnelback is returning unexpected documents (for the expected permissions) attempt to access the file directly before debugging inside Funnelback. It is common for incorrectly set permissions to be the cause of any unexpected documents (rather than there being an issue with how the DLS is functioning). Attempt searches using the user account used for gathering before testing with other accounts. The gather user should have access to all documents in the index if DLS is correctly functioning.

For TRIM collections there are debug options for each of the binaries and these are set in the same way as outlined above for the Trim gatherer.

6.2.2. Modern UI debug options

The modern UI provides increased logging levels to assist with debugging user key lookup and locks-keys matching.

Edit $SEARCH_HOME/web/conf/modernui/log4j2.xml. Towards the end where the <Loggers> are defined, add a few more loggers:

<Loggers>
  <Root level="warn">
    <AppenderRef ref="ModernUILog"/>
  </Root>
  <Logger name="com.funnelback" level="info"/>

  [...]

  <!-- To display the raw XML packet from PADRE -->
  <Logger name="com.funnelback.publicui.search.lifecycle.data.fetchers.padre.AbstractPadreForking" level="trace" />

  <!-- Additional logging for the single sign-on / Modern UI authentication -->
  <Logger name="waffle" level="trace" />

  [...]
</Loggers>

Changes to this file are reloaded every 30s, so no Jetty restart is required.

Additional TRACE logs should appear in the Modern UI logs.

$SEARCH_HOME/data/<COLLECTION NAME>/log/modernui.(Admin|Public).log

Padre has a -deb_security=true query processor option that writes additional information to the XML response. To view this output the Modern UI needs to be configured to log the raw Padre XML response (this is what the AbstractPadreForking trace logger shown above enables).
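
The option can be appended to the collection’s existing query processor options in collection.cfg, for example (the other options shown here are illustrative only):

query_processor_options=-stem=2 -daat=1000 -deb_security=true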

Review questions: Troubleshooting
  • A customer complains that they are seeing items returned in a DLS enabled search that they believe they should not have access to. Explain how you would resolve this.