This course is currently in draft.

Introduction

This course is for solution architects and frontend developers and provides an introduction to designing and implementing an enterprise search using Funnelback.

Special tool tips and hints will appear throughout the exercises, providing extra knowledge and advice. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Overview of enterprise search

  • Designing an enterprise search

  • Collection security

  • Enterprise repositories

Prerequisites to completing the course:

  • FUNL101, FUNL201, FUNL202, FUNL203, FUNL204, FUNL301

  • Extensive experience with Funnelback implementation.

  • System administration skills.

1. Background

This course looks at what is involved in designing an enterprise search solution and extends the methodology for gathering requirements and translating these into a functional design outlined in FUNL301.

The course is aimed primarily at solution architects who are designing the search on behalf of a customer, but the information is also useful if you are designing the search for yourself.

2. Enterprise search overview

An enterprise search typically aims to provide a single search interface across an organisation’s internal data repositories.

[Figure: Enterprise search overview]

This in essence is a fairly straightforward problem - but the solution can be very complicated.

2.1. How is an enterprise search structured?

A typical enterprise search consists of a number of collections that correspond to the different data repositories. Each collection handles how to connect to and retrieve content from the respective repository.

These collections are tied together with one or more meta collections which are configured to provide a search interface and search over one or more of the content repository collections.
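
For example, a hypothetical intranet search might be structured as follows (the collection names are illustrative only):

    intranet-web       web collection crawling the intranet websites
    staff-directory    directory collection indexing staff records from Active Directory
    records-trim       TRIM collection gathering records from the records management system
    intranet-search    meta collection combining the three collections above and providing the search interface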

2.2. What makes enterprise search difficult?

There are a number of factors common to enterprise search that can vastly increase the solution’s complexity and chance of failure.

  • High levels of system integration. Search covers numerous repositories of different types, requiring a lot of integration points with other systems. Integration can go in both directions - Funnelback integrates with other systems to index their content, and other systems may integrate with Funnelback to nest search results within an application, or to submit content directly to the Funnelback push API.

  • Data repositories are often very large. This affects the resourcing required for any setup and usually demands a multi-server environment. It also affects the crawl time, with some repositories taking weeks or months to crawl fully, and makes debugging a lot more difficult due to the size of the log files.

  • Authentication and filtering of results based on security permissions is a common requirement. This vastly increases the complexity of the search and the amount of work required to process a query, and also places some limitations on the search.

  • Internal systems are often personalised. Personalised content presents a unique problem - Funnelback is only able to index the view of content that is returned to it and cannot account for what different users will see when accessing the resource.

  • Internal network policies often hinder access to data. It is common for internal security configuration to present barriers to accessing data. This can also impose restrictions on where the Funnelback server is located and how it is configured - Funnelback typically requires an on-premises installation. It can often be difficult to obtain user accounts that have non-expiring passwords or the level of access required to produce the index.

  • Some of the repository types require the Windows version of Funnelback. Awareness of the dependencies is crucial when designing the solution.

3. Designing an enterprise search

The importance of careful planning cannot be overstated when designing an enterprise search.

3.1. Gather requirements

The requirements gathering process for enterprise search should follow a similar methodology to that outlined in the FUNL301 designing website search course, but with additional questions that are specific to the enterprise search requirements.

Some of the key areas of additional requirements are highlighted in the sections below.

3.2. Identify candidate repositories

The first step when planning an enterprise search is to identify the different content repositories that should be included in the search.

Avoid the temptation to include everything that is available - include repositories if their presence in the search results is useful.

Fewer repositories will improve response times, reduce noise in the search results and also reduce integration issues.

For each repository that should be included:

  • Note the repository type (including version).

  • Identify if any authentication is required to access the content, and if so what type of authentication is used.

  • Consider if document level security is required at search time.

  • Consider what content should be included and excluded from the index.

  • Estimate the size of the repository (in both MB and number of documents)

Special connectors are usually not required for web systems such as Sharepoint or Sitecore if no document level security is required. For this use case the Sharepoint or Sitecore system is just another website to Funnelback that can be accessed using the web crawler.
Some repository types are only available on the Windows version of Funnelback.

3.3. Consider repository access requirements

It goes without saying that the Funnelback server will need to be able to access each of the repositories that will be indexed for the enterprise search.

There are many factors to consider including:

  • Where (in terms of network architecture) does each repository sit, and will the Funnelback server be able to communicate with it? Different networks (dev vs prod), network zones (such as internal vs. DMZ) and other factors (such as firewalling) may prevent Funnelback from communicating with the repository.

  • What sort of authentication is required in order for Funnelback to access the repository content? Is this supported? Some repositories support multiple authentication methods and Funnelback may not be compatible with all of them.

3.4. Consider security requirements

When planning an enterprise search carefully consider the security requirements.

Enabling security (especially document level security) should be avoided unless absolutely necessary as it adds a huge amount of complexity to the search, can dramatically affect the search response times and also the reliability of the search once it has been released.

When considering the security requirements:

  • Consider if security is actually required for the search. It is quite common for security to start out as a firm requirement that is later descoped once the complexity of the search and the cost/benefit are considered. Also consider if minor compromises (such as excluding some sensitive content) could mean that an unsecured search could be provided without significantly reducing the usefulness of the search.

    • As an example, a web intranet search should be secured so that it is available only to internal staff and not to the general public. However, it may be available to all staff, meaning that there is no requirement to respect the permissions on each document / page (i.e. no document level security).

  • If security is required what level of security will be appropriate?

    • Collection level security can be used to restrict the search to specified IP addresses.

    • Document level security can be used to filter the result set to return only the results a user has permissions to view.

  • If document level security is required further investigation is needed:

    • Document level security is only supported for certain repository types.

    • Document level security restricts the results to those that the current user can see. This requires the same usernames to be used across the different authenticated repositories. E.g. if you’re searching as user=jsmith, then this will be applied to each of the repositories.

    • Some repository types apply extra restrictions on how the search can be configured. E.g. TRIM repositories are restricted to direct searching only, and only on Windows - Funnelback search cannot be nested within another system (such as a CMS) as TRIM search requires user impersonation to be configured. These restrictions can also impose restrictions on who can access the search (for example impersonation may only work for users on the same network zone as the Funnelback server).

    • Some repository types have other specific requirements. E.g. TRIM collections require single sign-on using Windows integrated authentication with Kerberos, and delegation to be configured.

    • Users will need to be authenticated to access the search. Funnelback will need to be integrated with the authentication mechanism (Active Directory, SAML, etc.)

3.5. Consider search-time authentication requirements

If the search needs to be authenticated some consideration needs to be given to how the user will be interacting with the search.

The included repository types and types of security applied to the search results will restrict how a user can interact with the search and limit the options available for search results integration with other systems.

  • A content management system (CMS) can be used to authenticate and personalise a user’s search experience in some cases. For this scenario collection level security can be used to restrict the search to return results only to the CMS server meaning that all search requests must be handled via the CMS.

  • Because collection level security applies IP restrictions to the search it can also be used to provide un-authenticated search to a set of controlled IP addresses.

  • Funnelback includes support for SAML single sign-on authentication of users at query time via a single identity provider. However, at this time the SAML attributes (such as group membership) cannot be used for search purposes / DLS.

  • Funnelback (running on Windows) also includes support for single sign-on using Microsoft Active Directory.

3.6. Design the search interfaces

The design of the search results screen was covered in some detail in the FUNL301 course and the questions posed there are equally applicable to enterprise searches.

Think about each of the search interfaces that need to be provided by the solution.

A search interface in this context is any point where the search is called to return a set of results. This includes both interactive searches run by users and search powered content where the Funnelback search index is used in a similar manner to a database to provide content for web pages and other similar services.

For each search interface consider:

  • what sort of response Funnelback should provide when it returns the results. For example, a fully templated and formatted HTML page, or a JSON packet containing a specified set of fields.

  • what repositories should be included in the overall set of results. (e.g. a people search screen would only include results from the people data source.)

  • how each type of result will be formatted and what information should be displayed. This is more important to consider for enterprise repositories which often have a much higher reliance on metadata.

  • faceted navigation (filters) that will be applied across the search, and what the filters will be based on. Also consider if tab facets are appropriate to segment the search results listing.

  • which data sources are suitable for use in concierge auto-completion. For example for most internal searches it is desirable to show people directory results in the auto-completion.

  • which data sources should be listed in the main results, and which ones should perhaps be separated as an extra search that supplements the main set of results.

  • if any result collapsing (grouping) should be applied to the search results.

  • useful relationships that exist in the data, that could be used to configure knowledge graph.

  • specific curator rules, synonyms and best bets that should be applied.

4. Search solution design

The solution design process for enterprise search extends what was covered in FUNL301 Designing website search.

A similar process should be followed as for web search solution design to create a high-level design and functional design.

4.1. High level design

The high-level design should cover the same elements as for web search:

  • Analysis of requirements obtained via the requirements gathering session

  • Determination of the collection structure, accounting for the additional constraints imposed by the different enterprise collection types. The enterprise collection types are covered in more detail later in this course.

  • Identification of the profiles and frontend services

  • Consideration of custom functionality

In addition the hosting architecture needs careful consideration.

4.2. Functional design

The same methodology used for web search should be followed when designing an enterprise search.

As for web search consider:

  • Data source collections

  • Frontend meta collections

  • Data source requirements

Special consideration should be given to the system integration related requirements. This includes:

  • Pre-requisites such as the user accounts and any special access requirements of these accounts

  • Any document level security requirements

5. Search hosting design

The system architecture requires careful planning for an enterprise solution to ensure that it meets basic performance and integration requirements.

The information below can also be used for the design of installed web search solutions.

5.1. Consider hosting options

Almost all enterprise searches will require an installed version of Funnelback on customer infrastructure due to security constraints placed on the search.

Even the location of the server within the customer’s network will require careful planning due to the security constraints that apply to the search.

Any servers used for Funnelback should be dedicated for Funnelback and not host other services. This will avoid any conflicts due to software dependencies or required services/ports and also make it easier to solve problems when they occur.

5.2. Choose a suitable platform

Funnelback is available for Windows and Linux. There are many factors which can dictate which version to use. Consider:

  • any repository types that require a specific platform (e.g. TRIM and Windows fileshare collections with DLS require the Windows version of Funnelback).

  • the environment into which Funnelback will be deployed. There is little sense in deploying a Linux version of Funnelback into an environment that consists mostly of Windows servers. If the search is being implemented for a customer remember that they will be responsible for performing the day-to-day management of the server.

5.3. Server roles

Any deployment of Funnelback must handle a number of server roles covering the administration, crawl, index and query functions.

Funnelback has an architecture that can be scaled to multiple servers, with different servers performing different roles (although all servers in a Funnelback cluster are full Funnelback installations capable of all server roles). The key roles are:

5.3.1. Administration

The administration role is responsible for providing various search administration services including:

  • Web administration

  • APIs

  • Analytics and reporting

5.3.2. Search indexes

The search indexes role is responsible for the gathering of all content and building of the search indexes

5.3.3. Query processing

A query processing role is responsible for serving search results to end users.

5.4. Hosting architectures

Funnelback supports a number of different architectures.

A single server architecture (with or without a disaster recovery or failover server) covers all three server roles.

The multi-server architecture supports a single active server with administration/search indexes role and multiple query processing servers. Redundancy can be added to the administration/search indexing server via a failover server.

Having multiple servers enables query processing performance to be scaled and redundancy to be added to the architecture, but adds complexity to the setup and maintenance of the system.

Load balancers can be used with multiple query processors to provide redundancy that can be spread across a number of different physical locations.

Using a multi-server environment is recommended for most enterprise searches unless it is a small search with few repositories.

Various architectures can be specified but for most installations a dual server environment containing an admin/crawl server and a query processor server will be sufficient.

The following are the most commonly used hosting architectures:

5.4.1. Single server

[Figure: Single server architecture]

Advantages:

  • Simplest structure

  • No workflow required for index and configuration publishing

Disadvantages:

  • Gathering and admin functions can affect query processing performance due to shared resources

  • No redundancy

5.4.2. Single server with disaster recovery

[Figure: Single server with disaster recovery]

Advantages:

  • Simplest structure

  • No workflow required for index and configuration publishing

  • Basic disaster recovery

Disadvantages:

  • Gathering and admin functions can affect query processing performance due to shared resources

  • Some work required to keep DR server in sync

  • Downtime when switching to DR server

5.4.3. Single administration/indexing server, single query processor

[Figure: Single administration/indexing server, single query processor]

Advantages:

  • Gathering and query processing is separated so query processing performance is not affected by collection updates.

Disadvantages:

  • No redundancy in query processing or administration (although the administration server could be used to serve queries if required).

5.4.4. Single administration server, multiple query processors

[Figure: Single administration server, multiple query processors]

Advantages:

  • Gathering and query processing is separated so query processing performance is not affected by collection updates.

  • Query processing has redundancy, and this can be further enhanced by distributing the query processors across different data centres/network connections.

  • Additional query processors can be added as query demand increases.

Disadvantages:

  • No redundancy in administration.

5.4.5. Single administration server, multiple query processors, failover administration

[Figure: Single administration server, multiple query processors, failover administration]

Advantages:

  • Gathering and query processing is separated so query processing performance is not affected by collection updates.

  • Query processing has redundancy, and this can be further enhanced by distributing the query processors across different sites/network connections.

  • Additional query processors can be added as query demand increases.

  • Failover administration server provides redundancy, and the ability to maintain limited administration services during upgrades/outages. Administration servers can be distributed across data centres and network connections.

Disadvantages:

  • Complexity of design

6. Capacity planning

There are many factors that affect the hardware requirements for Funnelback. It is almost impossible to accurately estimate the hardware requirements for any given Funnelback installation.

The hardware requirements presented below should be used as a very rough guide for a starting point when provisioning the Funnelback VMs and will vary widely depending on the search that is being implemented.

The performance of the servers should be monitored and adjusted as required.

The details below can also be used for capacity planning of installed web searches.

6.1. Estimated hardware requirements

The following are baseline recommendations for different types of searches.

6.1.1. Small website / intranet

  • 10K documents

  • Predominantly HTML content, some binary files (averaging 1MB each)

  • 1000+ search queries/day

Server                     HDD      CPU         RAM
Administration/indexing    100GB    Quad-Core   4GB
Query processor            100GB    Quad-Core   4GB

6.1.2. Large website

  • 100K documents

  • Predominantly HTML content, some binary files (averaging 1MB each)

  • 10,000+ queries/day

Server                     HDD      CPU         RAM
Administration/indexing    200GB    Quad-Core   8GB
Query processor            200GB    Quad-Core   8GB

6.1.3. Medium enterprise

  • 2 million documents

  • Predominantly binary files (averaging 1MB each), some HTML / textual content

  • 1,000+ search queries/day

  • 2% content changes daily

Server                     HDD      CPU         RAM
Administration/indexing    600GB    Quad-Core   8GB
Query processor            600GB    Quad-Core   8GB

6.1.4. Large enterprise

  • 10 million+ documents

  • Predominantly binary files (averaging 1MB each), some HTML / textual content

  • 10,000+ search queries/day

  • 2% content changes daily

Server                     HDD      CPU         RAM
Administration/indexing    2TB      Quad-Core   32GB
Query processor            2TB      Quad-Core   16GB

6.2. Factors affecting server resourcing

6.2.1. Factors affecting RAM usage

  • The memory required to update the various collections can vary widely. Large collections, or collections with complex filters will require increased RAM allocation on the server responsible for indexing.

  • Concurrently running updates each require a RAM allocation.

  • The memory allocated to the Funnelback daemon service on the server responsible for administration and indexing services should be increased for an enterprise search that includes TRIM as the Funnelback daemon is responsible for the filtering of binary documents for this collection type.

  • The memory allocated to the Jetty webserver should be increased on high volume query processing servers and administration servers with high volumes of API usage.

  • Query processing servers should ideally have enough free RAM equivalent to 80% of the size of the search indexes. This enables the OS to keep search indexes resident in memory which improves the query response time.

  • Analytics updates on the administration server also require available RAM when the update runs.

  • The update of knowledge graph can have an impact on RAM usage, especially if the graph consists of a large number of nodes. The funnelback-graph service is given a tiny amount of memory by default (which is not sufficient for anything but the smallest of knowledge graphs).

6.2.2. Factors affecting load/CPU requirements

  • The number of concurrently running updates will affect the CPU usage and server load.

  • Analytics updates will also impact on CPU usage.

  • Query volume can affect the server load - if queries are received by the Funnelback query processing server faster than they can be processed then the load will increase. There are some options available in Funnelback to discard queries if the load passes a threshold.

6.2.3. Factors affecting storage

  • Non-push collections maintain two sets of data, indexes and log files - a live and an offline view of the indexes.

  • Collections containing a lot of binary content (such as fileshare and TRIM collections) and large web collections require significantly more storage space per URL than collections based on structured data (such as social media collections, databases, directories etc.)

7. Securing the search results

Funnelback provides a few different options for securing the search results pages.

7.1. Authenticated access to the search results

7.1.1. SAML

Funnelback supports the use of SAML authentication using a single identity provider for both administration and search results interfaces. The configuration is separate for administration and the search results interface.

SAML can only be used for authentication (not for authorisation, as there is no access to the group membership or user attributes of the users from the search data model).

7.1.2. Windows integrated authentication

The Windows version of Funnelback supports single sign-on from Active Directory using Windows integrated authentication for the search results interface. Note: AD logons are not supported for accessing the administration interface.

Windows Integrated Authentication supports reading the user attributes (such as group membership) to infer the user permissions and alter the DLS search accordingly. The principal node in the data model will contain the attributes of the user from Active Directory.

Configuration of single sign-on is required for searches that provide DLS-enabled fileshare or TRIM collections.

Tutorial: Enabling Windows authentication

This tutorial shows how to enable single sign on using Windows authentication for access to the search results. This is required for most searches that use DLS.

Enabling Windows authentication on a collection does not enable DLS access - this requires additional steps that are outlined separately.
  1. Edit the collection.cfg (for the collection being queried) and add the following option:

    ui.modern.authentication=true
  2. Save the collection.cfg. The authentication should apply as soon as the file is saved (and published if on a multi-server environment).

7.1.3. Using a CMS to authenticate users

If the Funnelback search page is accessed via a CMS (i.e. partial HTML integration) then this can be used to authenticate users.

Collection level security that restricts access to the CMS server’s IP address should be configured for this use case as it prevents a user from bypassing the authentication to access the search directly (see section below).

If document level security is configured on any of the collections then the CMS will need to pass through the relevant user details for DLS to work.

filecopy and TRIM collections with DLS enabled do not support this method of integration.

7.2. Restricting access via IP address

Funnelback provides a mechanism known as collection level security that restricts the serving of search results to specified IP addresses.

Collection level security is useful for restricting a Funnelback search results page to specific IPs when the search is hosted on a publicly accessible server, or to restrict search results so that they must be served via a CMS integration.

There is also an access_alternate setting that allows you to redirect requests to an alternate collection if the access_restriction fails.

It is best practice to apply the access restriction to all the collections that should be secured as it may be possible to access the individual collections directly if you know the collection ID.
Tutorial: Configuring collection level security

In order to configure collection level security you will need a list of IP addresses to whitelist.

  1. Edit the collection.cfg (for the collection being queried) and add the following option, adding any IP addresses that should be whitelisted for access. Ensure this list contains the CMS’s public IP address (as seen by Funnelback) and any other IP addresses that should have direct access to the search results.

    access_restriction=<LIST OF IP ADDRESSES>
  2. Save the collection.cfg. The access restriction should apply as soon as the file is saved (and published if on a multi-server environment).
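
For example, the following hypothetical value whitelists a CMS server and one additional internal address (both addresses are placeholders):

    access_restriction=203.0.113.10,10.20.30.40

If required, the access_alternate setting mentioned above can then be used to send rejected requests to an alternate collection.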

7.3. Document level security

In a nutshell, document level security (DLS) filters the search results so that the current user does not receive any results that they do not have permission to view.

7.3.1. Search complexity

This adds a lot of complexity to a search as it requires:

  • the user details to form part of the query. For some repositories it requires the search to impersonate (or run as) the user.

  • the search to obtain the current user’s set of keys (the security groups/permissions to which the user has access).

  • custom code to run at gather time which looks up the document’s locks (the security permissions applied to the document) and adds this information to the index.

  • custom code to run at query time that determines if the user’s set of keys grants access to the document’s locks (called the locks-to-keys matcher).

The locks-to-keys matcher is specific to the system that is being accessed. It is possible to write your own locks to keys matcher that implements the matching algorithm. Funnelback has a generic portal locks-keys matcher that can be used as the starting point for a custom matcher.

7.3.2. Security model

Funnelback employs a form of document level security where the document’s permissions are indexed along with the content. This is known as early-binding security.

The implications of using an early-binding security model are:

  • The security applied to the document is point-in-time and reflects the permissions that were on the document when it was gathered. If the document’s permissions change between when the crawl occurred and when the index is queried, a user may see a result that they do not have access to (or not see a result that they do have access to). Updating these permissions requires the document to be re-gathered and indexed. Permission changes for users, however, are reflected almost immediately (depending on the cache time that was set).

  • If a user does click on a document that they do not have access to the underlying system housing the document will deny access - this means that the impact of a change of permissions is limited to what gets displayed in the search result listing (such as the title and summary snippet).

7.3.3. Translucent document level security

Translucent DLS is a variation of document level security where a user will see entries for every result that matches their query, but items that they are not permitted to see have identifying features suppressed. (e.g. result snippet, title, author)

The search administrator can configure what attributes (metadata) of the document are displayed when a user doesn’t have permission to view the document.

7.3.4. Performance implications

Document level security has significant performance implications for search. For early binding security a lookup to fetch the user’s keys needs to be performed against each system that has document level security enabled. Each lookup will add to the search response time as Funnelback has to wait for the answers before it can start processing the query.

  • Funnelback supports key caching to improve performance. When enabled, a user’s keys will be cached for a set period of time by Funnelback. Caching can have a similar effect on the results as early-binding security, as the user’s actual permissions may change between when the keys were cached and when the search is made.

7.3.5. Supported collections

Funnelback has official document level security support for:

  • Windows NTFS/CIFS fileshares (requires the Windows version of Funnelback)

  • Squiz Matrix (note: Squiz Matrix collections with DLS cannot currently be combined with DLS-enabled fileshares or TRIM. This is because Matrix collections are integrated with Matrix using a REST asset, while fileshares/TRIM require the user to connect directly to Funnelback. If there is a requirement to combine these then a locks-to-keys matcher will need to be implemented. Please contact support@funnelback.com well in advance to discuss your options.)

  • TRIM 6-7, HP RM8, HP CM9 (requires the Windows version of Funnelback)

Custom repositories can be supported by implementing the locks fetcher, keys fetcher and locks-keys matcher components.

The following Funnelback features do not support DLS:

  • Auto-completion. The completion system will return words from all the documents regardless of their permissions. As it could potentially be used to leak protected document content, disabling auto-completion may be desirable.

  • Knowledge graph

Cached pages are disabled automatically when document level security is enabled.

7.3.6. Configuring document level security

The configuration of document level security is collection specific and requires configuration of both the data source collection and any front end collections that provide search.

7.3.7. Implementing DLS support for a custom repository

Implementation of DLS for a system requires several custom components to be developed.

For DLS to work Funnelback must be able to:

  • access a URL’s access controls and associate these with the URL as security metadata. This is known as the set of document locks.

  • access the user’s set of permissions at request time. This is known as the set of user keys.

  • check the user’s keys against each URL’s locks and determine if the user’s permissions grant access to the document based on the system’s security model.

Each of these components is specific to the system that is being indexed and the method of obtaining each of these will be system dependent.

Obtaining a document’s set of locks

There are various techniques that can be used to expose the document’s set of locks to Funnelback. For example:

  • Return the set of document locks as an HTTP header when a specific user requests an item from the system.

  • Include the set of document locks in an XML or metadata field returned when an item is fetched.

  • Provide an API or web service that returns a set of document locks given a URI.

  • Provide a feed or external metadata that lists the document locks by URL.
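
As an illustration of the last approach, a feed of document locks keyed by URL might look like the following. Both the format and the lock values are hypothetical - the real format depends on how the feed is consumed (for example via external metadata or a custom filter):

    http://cms.example.com/finance/budget-2024.html   lockstring:"GROUP_FINANCE;GROUP_EXEC"
    http://cms.example.com/hr/leave-policy.html       lockstring:"GROUP_ALLSTAFF"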

Obtaining a user’s set of keys

There are various techniques that can be used by Funnelback to obtain the user’s set of keys. Functionality is required within both Funnelback (via a user to key mapper) and the target system (to provide the actual keys upon request).

For example a target system might be able to provide user keys:

  • via an API or web service that returns a set of user keys given a user ID.

  • as a CGI or request parameter when a search request is made (for example if a user is authenticated by a CMS and the CMS hosts the search page, then the CMS can pass additional information about the user to Funnelback when a search is made from the CMS).

  • using a custom plugin that will need to be written for the system.

  • by having Funnelback impersonate the user and call an API within the remote system that returns the user’s own permissions.

Funnelback’s user to key mapper then implements whatever is required to fetch the key from the service and provide the keys to the query processor.

The user to key mapper is implemented in either Groovy or Java.
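
The following is a minimal Groovy sketch of the key-fetching part of a user to key mapper. The web service URL, its response format and the way the returned keys are handed to the query processor are all assumptions that must be adapted to the target system and the Funnelback version in use:

    import groovy.json.JsonSlurper

    // Fetch the set of security keys for a user from a hypothetical web service
    // exposed by the target repository. Assumed response format:
    // {"user": "jsmith", "keys": ["GROUP_FINANCE", "GROUP_ALLSTAFF"]}
    List<String> fetchUserKeys(String username) {
        def endpoint = new URL("https://repository.example.com/api/user-keys?user=" +
            URLEncoder.encode(username, "UTF-8"))
        def response = new JsonSlurper().parse(endpoint)
        // Return the keys in whatever form the locks to keys matcher expects
        return response['keys'].collect { it as String }
    }

    // Example usage - in a real mapper the username would come from the
    // authenticated search request rather than being hard coded.
    def keys = fetchUserKeys("jsmith")
    println "Keys for jsmith: ${keys}"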

Performing locks to keys matching

A locks to keys matcher takes a set of user keys, a set of document locks and a URI, and calculates if the user has access to the document.

The matcher must implement whatever logic is required by the system’s security model and returns a yes or no given the supplied parameters.

The authoring of a locks to keys matcher requires software development skills and knowledge of the C programming language.

8. Enterprise repositories

8.1. Web, FTP and WebDAV sites

Most intranets can be crawled using a standard web collection. A special connector should only be required if document level security is a requirement for the repository.

If the website being crawled is personalised for users then extra steps need to be taken when crawling.

8.1.1. Websites requiring authentication

Funnelback offers several authentication methods for websites that require a login:

  • Basic HTTP and NTLM authentication allows crawling of the website by specifying only the username, password (and domain for NTLM) as collection.cfg parameters (a configuration sketch follows this list).

  • Form-based authentication allows crawling of the website by configuring Funnelback to automatically fill in a login form using Funnelback’s form interaction feature. This method can often be used to crawl websites that are protected by SAML authentication.
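
As a sketch of the first method, basic HTTP credentials for the web crawler are supplied via collection.cfg settings similar to the following (the account name and password are placeholders, and the setting names should be confirmed against the documentation for the Funnelback version in use):

    http_user=svc-funnelback-crawl
    http_passwd=XXXXXXXXXXXXXXX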

8.1.2. Crawling personalised websites

Care needs to be taken when crawling a personalised website.

When Funnelback crawls a personalised website it will index the view of the website that is returned to the web crawler - this means that special care needs to be taken to ensure that the crawled version of the site is relevant for the users of the search.

The best technique for crawling a personalised site is to ensure that any personalisation is suppressed for the user account that Funnelback uses to crawl the site. Personalised content should either be removed within the CMS before it is returned to Funnelback or should be wrapped in noindex tags so that these regions of the page are not indexed.
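
For example, a personalised region of a page can be excluded from the index by wrapping it in Funnelback’s noindex comment tags (the markup below is illustrative):

    <div class="welcome-panel">
      <!--noindex-->
      Welcome back Jane, you have 3 unread notifications.
      <!--endnoindex-->
    </div>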

8.1.3. FTP and WebDAV sites

Funnelback includes basic support for the crawling of FTP and WebDAV sites using the web crawler.

8.1.4. Limitations and gotchas

  • Ideally the Funnelback crawl user should have read only access to the site.

  • Before crawling an intranet check to ensure that any links or buttons within a page that may perform a destructive operation (such as deleting the page) are not crawlable. e.g. ensure that delete links are protected by JavaScript or submitted via a HTTP POST request.

  • Funnelback does not process JavaScript so ensure that the site can be viewed and navigated when JavaScript is disabled.

  • Use noindex tags and robots.txt directives to prevent Funnelback from crawling personalised content.

  • Web collections have no in-built support for DLS (but can be used with custom DLS integrations).

8.2. Fileshares

Funnelback supports the indexing of local or network connected fileshares via the SMB or CIFS protocol.

The Windows version of Funnelback includes support for NTFS/CIFS fileshares with document level security.

The file copier gathers content by accessing each source path then traversing the directory structure.

FTP and WebDAV repositories can usually be crawled using a web collection.

8.2.1. Limitations and gotchas

  • DLS support is limited to Windows fileshares and only available when using the Windows version of Funnelback.

  • Windows share permissions are not supported - when indexing Windows fileshares only the document’s permissions are indexed.

  • Crawl must occur as a user account with read access granted to all files (that should be in the search index)

  • For very large fileshares it is sometimes necessary to break up into multiple collections

  • All of the fileshares in the seed list must be available or the crawl will fail

  • Large fileshare collections will need a large heap.

  • Initial fileshare crawls are slow due to the fact that the majority of documents must be filtered.

  • The search must be accessed directly (not via a CMS) when querying DLS-enabled fileshares.

8.3. SQL databases

Funnelback has the ability to connect to any SQL DBMS (with a valid JDBC driver) and index the output from running a valid SQL query within that database.

Each row from the result of the query is indexed as a separate document within Funnelback with fields mapped to Funnelback metadata classes.

Funnelback indexes the output of an SQL query. It doesn’t preserve or index the database structure or relationships within the index and is not a drop-in replacement for SQL database queries.

8.3.1. Supported databases

Funnelback ships with built-in support for PostgreSQL. Indexing of other database systems requires installation of an appropriate JDBC driver.

Funnelback has also been used successfully to index data from:

  • IBM DB2

  • Microsoft SQL server

  • MySQL

  • Oracle

  • SQLite

8.3.2. Limitations and gotchas

  • There is no built-in DLS support for database record access. The permissions on the documents and for users are highly dependent on the structure and the content of the database making it impossible for Funnelback to provide a generic DLS implementation.

  • Funnelback query language does not support table joins or grouping at query time.

  • Best practice is to have the customer create a view containing all the relevant fields, denormalised, so that Funnelback can run a select * from view SQL query against the view. This simplifies the Funnelback configuration and also makes it clearer within the DBMS what will be available within Funnelback. A sketch of this approach follows this list.

  • Each row in the resulting table is indexed as an item or document in Funnelback.

  • Records are downloaded and stored internally as XML.

  • Fields within the indexed view are mapped to metadata classes within Funnelback.
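
The sketch below illustrates the view-based approach described above, using a hypothetical product database. The customer’s DBA creates a denormalised view and Funnelback’s database gatherer is then configured to select everything from it:

    -- Denormalised view created in the source database (all names are illustrative)
    CREATE VIEW search_products AS
    SELECT p.product_id,
           p.name,
           p.description,
           c.category_name,
           s.supplier_name
      FROM products p
      JOIN categories c ON c.category_id = p.category_id
      JOIN suppliers  s ON s.supplier_id = p.supplier_id;

    -- Query configured in Funnelback (e.g. as db.full_sql_query)
    SELECT * FROM search_products;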

8.4. Directories

Directory collections index the result of a query to a directory service such as LDAP or Microsoft Active Directory.

Directory collections are often used to provide staff directory search, though they are not limited to people - they can provide search over any object types available within the directory, such as rooms or assets.

8.4.1. Limitations and gotchas

  • Uses the Java Naming and Directory Interface (JNDI) to connect to the directory service.

  • Requires a user with permission to read the directory.

  • Requires a well-structured directory to work effectively, otherwise lots of manual exclusions will be required to clean the index.

  • There is no built-in DLS support for access to directory collections

  • Records are downloaded and stored internally as XML with fields mapped to Funnelback metadata classes.

8.5. Social media channels

Funnelback includes built-in support for the following social media channels:

  • YouTube

  • Facebook

  • Twitter

  • Flickr

The following channels have been indexed in the past with differing levels of success but are not officially supported:

  • Instagram

  • Vimeo

  • Linkedin

  • Soundcloud

Additional social media channels can be indexed by implementing a custom gatherer that communicates with the channel’s APIs to fetch and index the channel’s data feeds. See the following section on indexing APIs for further information.

Most custom gatherers will interact with a REST API and process the returned data, usually JSON or XML.

When working with social media channels it is often useful to separate these from the main search results (e.g. on a social media tab or via extra searches) and to use channel-specific presentation for different result types (e.g. using a video lightbox for YouTube results).

Social media providers will often change their APIs with little notice. It is important to manage expectations and have mechanisms in place to fix a broken social media integration at short notice.

8.6. Delimited text, XML and JSON data sources

Data contained within delimited text files, XML and JSON can be indexed by Funnelback as additional data sources within an enterprise search.

The CSV filter includes support for CSV, TSV, Microsoft Excel and SQL files.

Delimited text, XML and JSON data sources that are available via HTTP should normally be indexed using a web collection, with the filter framework used to convert the data to XML, which Funnelback indexes natively.

The fields (once converted to XML) are then mapped to Funnelback metadata classes.
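
As a simple illustration (the data and field names are hypothetical), a JSON record such as:

    {"id": 42, "title": "Leave policy", "department": "HR"}

would typically be converted by the filter framework into XML along the lines of:

    <item>
      <id>42</id>
      <title>Leave policy</title>
      <department>HR</department>
    </item>

with the id, title and department fields then mapped to Funnelback metadata classes.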

8.7. Indexing systems via an API

Funnelback provides a generic custom collection type that allows the implementation of custom logic to interact with APIs to gather content.

These custom gatherers are written using the Groovy programming language and need to implement any interactions required to authenticate with the target repository, communicate with the APIs and process the returned content. The exact steps required will vary from system to system and may require use of a set of libraries or a Java SDK.

The custom gatherer can then feed the output through Funnelback’s filter framework which can be used to convert JSON/CSV to XML as detailed above.
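
The following is a minimal Groovy sketch of the gathering logic for a hypothetical paginated REST API. The endpoint, the response shape and the simple file-based store are all assumptions standing in for the real Funnelback custom gathering and storage APIs, which vary between versions:

    import groovy.json.JsonSlurper
    import groovy.xml.MarkupBuilder

    def slurper = new JsonSlurper()
    def storeDir = new File("gathered-data")   // stand-in for the real document store
    storeDir.mkdirs()

    def page = 1
    def more = true
    while (more) {
        // Fetch one page of records from the (hypothetical) API
        def data = slurper.parse(new URL("https://api.example.com/v1/articles?page=${page}&pageSize=100"))

        data.items.each { item ->
            // Convert each record to simple XML; the fields would later be mapped
            // to Funnelback metadata classes
            def writer = new StringWriter()
            new MarkupBuilder(writer).article {
                title(item.title)
                author(item.author)
                body(item.body)
            }
            new File(storeDir, "${item.id}.xml").text = writer.toString()
        }

        // Follow pagination until the API reports there are no further pages
        more = data.hasMore
        page++
    }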

8.7.1. Custom gatherer design

  • Consider what a search result should be and investigate the available APIs to see what steps will be required to get to the result level data. This will often involve several calls to get the required data. It may also involve writing code to follow paginated API responses.

  • Consider what type of store is suitable for the custom gatherer - custom gatherers can operate in a similar manner to a normal collection storing the data in a warc file (similar to a zip file of a website) or can be configured to use a push collection to store the data. Using a push collection is suitable for collections that will be incrementally updated whereas the more standard warc stores are better if you fully update the data each time as you have an offline and live view of the data.

  • If working with a push collection consider how the initial gather will be done and how this might differ from the incremental updates. When working with push collections it’s also critical to think about how items will be updated and removed. The custom gatherer also needs to deal with any errors that may occur when attempting to push a document (for example implementing a queue mechanism or retrying failures).

  • Take into account any rate limiting that the API may apply. It may be necessary to throttle the API requests or implement a retry policy.

8.8. Squiz Matrix

Matrix collections support document level security on Squiz Matrix websites.

Squiz Matrix will be crawled using the web crawler (so all the usual web crawler options are available) which will also index some additional security related metadata supplied by Squiz Matrix.

The query-time document level security requires that Squiz Matrix nest the Funnelback search within a REST asset. Other integration methods are not currently supported.

8.8.1. Limitations and gotchas

  • Matrix DLS cannot be combined with other collections that have DLS as nested integration is currently the only supported form of integration.

  • A Squiz Matrix security plugin would need to be developed for Funnelback to enable other integration methods to be supported, and the combining of Squiz Matrix DLS with other collections secured with DLS.

8.9. TRIM / Records Manager / Content Manager

Gatherer support is limited to the listed versions of content manager / records manager / TRIM. If you need to connect to a different version then contact Funnelback support to discuss your requirements.

The Windows version of Funnelback supports indexing of records for Micro Focus / HPE Content Manager 9.0 and 9.1, HP Records Manager 8 and HP TRIM 6+.

Integration with CM/RM/TRIM is complex with many requirements at the network and server architecture level.

8.9.1. Limitations and gotchas

  • The initial gather will take a long time to complete (this can take weeks for a large repository). The collection can be queried while the initial crawl occurs, though will only search across the subset of data that has been gathered.

  • Requires a user account with the ability to see all relevant records, as well as permission to read ACLs.

  • The relevant desktop client software needs to be installed on the Funnelback server as this includes the SDK libraries required to communicate with the relevant APIs.

  • User impersonation must be correctly configured on the Funnelback server in order for the document level security to correctly function. This relies on both the use of Kerberos (for the single sign on) as well as the Funnelback server having delegation rights.

  • The memory allocation to the Funnelback daemon, which is used for filtering, should be increased.

  • When querying, the push collection used for storage should be included in the meta collection (and all security should be configured on this collection)

  • Trim gather code is located in $SEARCH_HOME/wbin/trim or $SEARCH_HOME/wbin/rm8

  • Logging on the various components can be increased by editing the relevant Funnelback.TRIM.X.exe.config file

8.10. Push collections

Push collections are a generic collection type within Funnelback that are updated via a REST API.

It is possible to integrate a third-party system directly with a push collection if the system supports triggering external requests when events occur within it.

Tutorial: Basic integration of a push collection with a CMS

A push collection should be created within Funnelback, and a non-expiring API token created for the CMS to use to authenticate with Funnelback.

Event handlers within the CMS should be configured that handle the following events:

  1. When an item is added or updated call the Funnelback Push API with a PUT request that submits the content.

  2. When an item is deleted call the Funnelback Push API with a DELETE request to delete the content.

  3. When an item is moved call the Funnelback Push API with a PUT request submitting the content on the new URL, and a DELETE request to remove the content from the old URL.

A method of performing an initial or bulk update should also be considered. This may involve writing a Funnelback custom gatherer that fetches the initial content from the CMS, or having a data load script or process created within the CMS that sequentially submits all the content to the push collection.

The system that interacts with Funnelback’s Push API must also perform error handling in the event that the Push API returns an error.

This could include logging the error, or implementing a process that queues requests and retries any failures.

This is extremely important because push collection updates always modify an existing index in place and any failures to add or remove content that are not retried successfully will result in an index that is missing content or contains content that has been deleted. This differs from standard Funnelback collections that have processes to fully refresh the index periodically.
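
The following Groovy sketch illustrates the event handler calls described above. The Push API endpoint path, key parameter and token header shown here are assumptions and must be confirmed against the Push API documentation for the Funnelback version in use:

    def pushBase = "https://funnelback.example.com/push-api/v2/collections/cms-content/documents"
    def apiToken = "XXXXXXXXXXXXXXX"   // the non-expiring API token created for the CMS

    // Add or update a document (CMS "item saved" event handler)
    def putDocument = { String docUrl, String html ->
        def conn = new URL("${pushBase}?key=${URLEncoder.encode(docUrl, 'UTF-8')}").openConnection()
        conn.requestMethod = 'PUT'
        conn.doOutput = true
        conn.setRequestProperty('X-Security-Token', apiToken)   // assumed authentication header
        conn.setRequestProperty('Content-Type', 'text/html')
        conn.outputStream.withWriter('UTF-8') { it << html }
        // A production integration should queue and retry failures rather than assert
        assert conn.responseCode in 200..299
    }

    // Remove a document (CMS "item deleted" event handler)
    def deleteDocument = { String docUrl ->
        def conn = new URL("${pushBase}?key=${URLEncoder.encode(docUrl, 'UTF-8')}").openConnection()
        conn.requestMethod = 'DELETE'
        conn.setRequestProperty('X-Security-Token', apiToken)
        assert conn.responseCode in 200..299
    }

    // A move is a PUT of the content at the new URL followed by a DELETE of the old URL.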

8.10.1. Push collections and document level security

Push collections can be used with document level security (for example the trimpush collection type that ships with Funnelback is underpinned by a push collection). As with other collection types the implementation of DLS requires implementation of a number of custom components that are responsible for handling locks/keys extraction and matching. The push collection itself is purely concerned with the storage and retrieval of the information contained within the indexes.

8.11. Slack

Funnelback includes basic support for the indexing of Slack message content, sourced from an export from Slack.

The update of Slack collections is currently a manual process that requires an export of Slack content to be uploaded into Funnelback.

8.12. Raytion enterprise connectors

Work is currently underway to integrate Funnelback with the Raytion suite of enterprise search connectors.

The back-end framework required to support Raytion’s connectors will be incorporated into an upcoming version of Funnelback.

Once this is in place it will be possible for Raytion to adapt any of their existing connectors to communicate with Funnelback. However, this will be implemented as the connectors are required.

This will provide a pathway to support the indexing of systems including:

  • Microsoft Sharepoint

  • Salesforce

  • Google Drive

  • Atlassian Jira and Confluence

  • ServiceNow

  • Adobe AEM

  • Documentum

  • Box

  • SAP

  • Facebook Workplace

  • Microsoft OneDrive

The connectors handle communication with the data source, with the gathered content being added to a push collection.

Contact Funnelback before considering a Raytion connector for a project as the connectors are only suitable for certain projects and there are a number of additional arrangements that need to be put in place with Raytion in order to deliver a project.

8.12.1. Limitations and gotchas

  • Using Raytion connectors carries additional costs including:

  • Raytion connectors are licenced per-connector, per server.

  • A consulting component (required by Raytion) is needed for each connector that is implemented for a customer.

  • Adapting existing connectors will be done as required and only if there is a guarantee of return on the time invested to adapt the connectors.

8.13. Manifold-CF enterprise connectors

Funnelback is not actively tested with new releases of Manifold-CF. The Manifold-CF connector that ships with Funnelback has been tested with Manifold-CF v2.3 and is not guaranteed to work with newer versions.

Funnelback includes support for connecting to a number of enterprise repository systems through an open source connector framework called Manifold-CF. This framework, along with the associated Funnelback Manifold-CF connector, allows Funnelback to be populated with content from supported repositories and to apply document level security for repositories where Manifold-CF supports fetching security information.

Mixed levels of success have been experienced when using Manifold-CF - it is an option in the toolkit but should be used with caution.

8.13.1. Repository connectors

Manifold-CF includes connectors for:

  • Alfresco

  • Alfresco WebScripts

  • Amazon S3

  • Confluence

  • CMIS

  • DropBox

  • EMC Documentum

  • Email

  • IBM FileNet

  • File System

  • Google Drive

  • GridFS

  • HDFS

  • JDBC

  • Jira

  • Kafka

  • OpenText LiveLink

  • Meridio

  • RSS

  • Microsoft SharePoint (2003/2007/2010/2013)

  • Web

  • Windows Shares

  • Wiki

8.13.2. Limitations and gotchas

  • Repository connectors seem to work, but at a very basic level

  • Using a separate PostgreSQL database server is recommended as the embedded Derby database is unreliable.

  • Manifold-CF should be run on a separate server from Funnelback.

  • Good points

    • Great support from Manifold-CF team at Apache

    • Product documentation is kept reasonably up to date

    • It allows multiple authority connectors for a single repository connector

    • Codebase is messy but quite easy to understand and extend

    • Authority connectors packaged with Manifold-CF work well out of the box (e.g. Active Directory)

  • Not so good points

    • The user interface is confusing

    • Manifold-CF’s database is a bottleneck for large repositories

    • Scheduling updates is hit and miss

    • Continuous updating is not reliable

    • Updates to Manifold-CF are not tested thoroughly before release

    • A separate server is required for Manifold-CF to operate with acceptable performance when deployed for enterprise.

    • It’s difficult to gauge the status of a job running in Manifold-CF.

    • The user interface becomes unresponsive when processing multiple repositories

    • Setting include and exclude patterns is time consuming and doesn’t always work as you would expect.

8.14. Other repository examples

This section provides examples of how a wide variety of systems have been successfully integrated in the past, in order to illustrate the different techniques that can be applied to create an index of an enterprise system.

None of the systems in the section below are supported by Funnelback and it should not be assumed that there is a current working solution for any of these systems.

The examples below are provided to illustrate the multitude of ways that content can be indexed from a repository.

8.14.1. Microsoft Sharepoint

Sharepoint search with DLS has been implemented with varying levels of success using Manifold-CF.

There was a long list of problems encountered including:

  • Major performance issues (Manifold-CF was run on the same server as Funnelback for this project).

  • Some of the code didn’t work correctly and had to be rewritten (but the code is open-source and changes were contributed back to the main codebase).

  • Include/exclude rules didn’t always work as expected.

Many public facing websites using Microsoft Sharepoint have also been indexed as a standard website using the web crawler.

8.14.2. Sitecore

An unsupported Sitecore module was developed to index a Sitecore-powered intranet. It supports DLS over Sitecore.

It requires:

  • Sitecore to be configured to output lock strings in the HTTP headers returned to Funnelback. Code was written that is compiled into a .dll that needs to be installed into the Sitecore site. Sitecore then needs to be configured to call this as part of the processing pipeline.

  • A webservice to be configured within Sitecore that implements a user-key mapper (it takes the user as a parameter and returns the user’s keys).

Note that the code has not been maintained for a number of years and may no longer function correctly with newer versions of Sitecore.

The code used for this implementation forms the custom DLS example in the FUNL204 Implementing enterprise search training.

Many public facing websites built on Sitecore have also been indexed as standard websites using the web crawler.

8.14.3. Atlassian Confluence

Confluence can be indexed using a standard web collection with form authentication, using a Confluence user that has read-only access to all the spaces that should be included in the search results. This provides a searchable index of Confluence but does not support document level security.

Confluence also has extensive APIs that could be used to access the content using a custom collection.
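
As an illustration only, a custom gatherer could page through Confluence’s standard REST content API along the lines of the following Python sketch. This is not a supported gatherer; the base URL, credentials and the store_document helper are assumptions, and a real implementation would plug into Funnelback’s custom gatherer framework and handle errors and incremental updates.

import requests

BASE = "https://confluence.example.com"  # assumption: Confluence base URL
AUTH = ("search-svc", "********")        # assumption: read-only search account

def store_document(url, title, html):
    # Placeholder: a real gatherer would hand the document to Funnelback for indexing.
    print("stored %s (%s, %d bytes)" % (url, title, len(html)))

def gather_confluence_pages(page_size=50):
    start = 0
    while True:
        # Standard Confluence REST content API, expanded to include the page body.
        response = requests.get(
            BASE + "/rest/api/content",
            params={"type": "page", "start": start, "limit": page_size,
                    "expand": "body.storage"},
            auth=AUTH)
        response.raise_for_status()
        results = response.json().get("results", [])
        for page in results:
            store_document(BASE + page["_links"]["webui"],
                           page["title"],
                           page["body"]["storage"]["value"])
        if len(results) < page_size:
            break  # last page of results reached
        start += page_size

if __name__ == "__main__":
    gather_confluence_pages()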

8.14.4. Atlassian Jira

Jira can be indexed using a custom gatherer that connects to the Jira REST API to fetch issues. A basic custom gatherer was developed and is outlined in the FUNL204 Implementing enterprise search course.

The gatherer is not officially supported, but is written in a reusable way and could be used to create an index of the issues contained within a Jira instance. It includes basic options for excluding issues by project. The gatherer does not support document level security.
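
The FUNL204 gatherer itself is not reproduced here, but the general approach of paging through Jira’s REST search API looks something like the following Python sketch. The JQL query, credentials and the store_issue helper are assumptions and would differ in a real implementation.

import requests

BASE = "https://jira.example.com"   # assumption: Jira Server base URL
AUTH = ("search-svc", "********")   # assumption: read-only search account

def store_issue(key, summary, description):
    # Placeholder: a real gatherer would hand the issue to Funnelback for indexing.
    print("stored %s: %s" % (key, summary))

def gather_issues(jql="project not in (EXCLUDEDPROJECT)", page_size=100):
    start_at = 0
    while True:
        # Standard Jira REST search endpoint, returning a page of issues at a time.
        response = requests.get(
            BASE + "/rest/api/2/search",
            params={"jql": jql, "startAt": start_at, "maxResults": page_size,
                    "fields": "summary,description"},
            auth=AUTH)
        response.raise_for_status()
        data = response.json()
        for issue in data["issues"]:
            fields = issue["fields"]
            store_issue(issue["key"], fields.get("summary") or "",
                        fields.get("description") or "")
        start_at += len(data["issues"])
        if not data["issues"] or start_at >= data["total"]:
            break

if __name__ == "__main__":
    gather_issues()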

Jira can also be indexed as a database collection that connects directly to the Jira SQL database. This method also does not support document level security.

Although this is not the recommended way of indexing Jira, it does highlight that there are often several different approaches to creating a search of an enterprise repository. Configuration similar to the following was used for this method of integration.

db.full_sql_query=SELECT   MAX (project.pkey || '-' || jiraissue.issuenum) AS pkey,   MAX (jiraissue.project) AS issue_project,   project.pname,   MAX (jiraissue.summary) AS summary,   MAX (jiraissue.description) AS description,   MAX (resolution.pname) AS resolution,   MAX (issuestatus.pname) as issuestatus,   MAX (jiraissue.reporter) AS reporter,   MAX (jiraissue.assignee) AS assignee,   MAX (jiraissue.created) AS date_created,   MAX (jiraissue.updated) AS date_updated,   array_agg (actionbody) as comments,   MAX (projectversion.vname) AS fixversion    FROM jiraissue  LEFT JOIN jiraaction   ON jiraaction.issueid = jiraissue.id    LEFT JOIN project    ON jiraissue.project = project.id  LEFT JOIN issuestatus   ON jiraissue.issuestatus = issuestatus.id  LEFT JOIN resolution   ON jiraissue.resolution = resolution.id  LEFT JOIN nodeassociation   ON (     nodeassociation.source_node_id = jiraissue.id   AND nodeassociation.association_type = 'IssueFixVersion'   ) LEFT JOIN projectversion   ON nodeassociation.sink_node_id = projectversion.id  GROUP BY jiraissue.id, project.pname
db.jdbc_class=org.postgresql.Driver
db.jdbc_url=jdbc:postgresql://SERVER:5432/jira
db.password=XXXXXXXXXXXXXXX
db.primary_id_column=pkey

8.14.5. Atlassian Jira Service Desk

Issues in Jira Service Desk can be indexed using the Jira custom gatherer described above, as Jira Service Desk also supports the base Jira Server API. Additional metadata mappings may be required to access the JSD-specific fields.

8.14.6. Sugar CRM

Customer and contacts data has been successfully indexed from Sugar CRM using a custom gatherer that communicates with Sugar CRM’s REST API.

8.14.7. ServiceNow

A search of FAQs stored within ServiceNow has been implemented using a custom gatherer to download and index XML or JSON data containing the FAQ items, accessed via the ServiceNow REST API.

8.14.8. GitLab

GitLab project pages have been successfully indexed using a custom gatherer that communicates with the GitLab API. Note: this gatherer doesn’t attempt to index the code itself.

8.14.9. HipChat

HipChat logs from the public chatrooms can be indexed using a custom gatherer that fetches the last day’s log.

8.14.10. Egnyte

A custom gatherer has been successfully used to interact with the Egnyte API and produce an index of the documents stored within Egnyte.

8.14.11. Zendesk

A custom gatherer has been successfully used to index Zendesk tickets via the API. It also implements some additional logic to display the customer names by cross-referencing with a secondary database.

8.14.12. Vimeo

Small repositories can be indexed using a web collection that retrieves XML pages from Vimeo’s simple API. The simple API limits usage to public content and limits the number of items that can be retrieved.

Vimeo also has a full-featured API that could be integrated with using a custom collection.

8.14.13. Instagram

There is a basic Instagram connector included in the Funnelback stencils.

8.14.14. Soundcloud

Small repositories can be indexed using a web collection.

8.14.15. Discourse

Content from the Discourse discussion platform has been indexed using a custom gatherer that connects to the Discourse API, retrieving the data as a series of JSON packets.

8.14.16. Kaltura

An index of video content was successfully created using a custom gatherer that interacts with the Kaltura API to retrieve metadata records for each of the videos.

9. Meta collection design

Meta collections are particularly important for an effective enterprise search, as a meta collection is the mechanism used to provide a single search across the set of enterprise repository collections.
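
In classic Funnelback the set of component collections is recorded in the meta collection’s meta.cfg file, which simply lists the component collection IDs one per line. The collection names below are hypothetical:

intranet-web
fileshare-docs
trim-push
staff-directory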

Key things to consider:

  • What data sources should be included in the meta collection?

  • Is it useful to organise the search into different tabs that group one or more of the data sources?

    • Show different auto-completion for different tabs

    • Show different faceted navigation for different tabs

  • Make use of faceted navigation

    • Use collection facets to provide filtering by data source

  • Define result templates that are tailored to the data. e.g.

    • For videos display a thumbnail and open a video lightbox when clicking on the result.

    • For structured data (e.g. people results) display relevant fielded information.

  • What auto-completion is appropriate?

    • Recall that concierge auto-completion can be made up of several different auto-completion data sources.

  • Use meta collection component weighting to prioritise the importance of different data sources

  • Use result diversification techniques such as same site suppression or result collapsing to avoid the search results being flooded by a single data source.

9.1. Profiles

Profiles are incredibly useful for a number of reasons:

  • Each profile has its own templating, best bets, curator rules, synonyms and faceted navigation

  • Analytics reports are generated at the profile level

  • Profiles can be separately tuned with different ranking and display settings.
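
At query time a profile is selected via the profile URL parameter, for example (the hostname, collection and profile names below are hypothetical):

https://search.example.com/s/search.html?collection=enterprise-meta&profile=intranet&query=leave+policy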

Things to consider:

  • Use of profiles for different searches

  • Use of profiles for different audiences

  • Use of dedicated profiles for search as a service (to avoid pollution of analytics)

  • Use of a new profile on a meta collection vs. setting up a completely separate meta collection.

10. Configuring Funnelback to index additional file types

The file types downloaded and indexed by Funnelback by default depend on the type of collection.

Funnelback generally supports the indexing of HTML, Microsoft Office (Word/Excel/PowerPoint), RTF and text documents out of the box.

Additional file types can be supported via the Apache Tika framework used by Funnelback, or by adding a custom external filter.

Before you add an additional file type, consider whether any useful information can be extracted from the document:

  • Keep in mind that the filter converts the document to text and this is what is indexed by Funnelback. Some file formats don’t make a lot of sense to index as the extracted text is non-existent or not meaningful. For this sort of document Funnelback will be relying on available metadata to identify and rank the document.

  • Funnelback doesn’t have any built-in support for OCR conversion of scanned documents (this is possible to implement using a custom filter, and possibly even with Tika’s Tesseract support, but is likely to be slow and resource-intensive, and the quality of the extracted text may not be adequate).

  • Funnelback will usually exclude any unconverted binary documents from the index, however there are some instances where including them is desirable, and options are available to do so (though only document metadata such as the URL will be available in the index and search results).

10.1. Additional file formats using Apache Tika

Binary formats are converted to text using Apache Tika, which supports a large number of document formats.

Additional formats can easily be added to the file types indexed by Funnelback, as long as Tika can process the format.
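
As an indicative example only, adding a Tika-supported format such as Visio to a web collection involves allowing the crawler to download the extension and ensuring the Tika filter is part of the filter chain. The key names below are assumptions and should be checked against the documentation for the Funnelback version being installed:

# Hypothetical collection.cfg fragment - verify key names for your Funnelback version
# Add the new extension (.vsd) to the non-HTML file types the crawler will download
crawler.non_html=doc,docx,pdf,ppt,pptx,xls,xlsx,rtf,vsd
# Ensure the Tika filter is included in the filter chain so the format is converted to text
filter.classes=TikaFilterProvider,ExternalFilterProvider,JSoupProcessingFilterProvider,DocumentFixerFilterProvider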

When considering additional formats to index, remember that Funnelback can only index text that is extracted from the file, which limits the useful set of file formats to add to the search. For many file formats, document metadata will be the only useful text that can be extracted.

Tika’s list of supported formats changes regularly, so be sure to check the list for the Tika version that ships with the version of Funnelback that will be installed.

Tika also advertises support for OCR capabilities - however this is untested and would add a significant processing overhead to the crawl.

10.2. Additional file formats using an external converter

Funnelback includes support for calling an external program to perform the binary-to-text conversion of a document.

The external program must be compatible with the OS that Funnelback is installed on and must be executable silently from the command line.

Funnelback calls the external program, passing the binary document as a parameter, and the program needs to return the extracted text to Funnelback.
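
As an illustration only, an external converter can be as simple as a script that takes the path of the downloaded document as its only argument and writes the extracted text to standard output. The Python sketch below shows the shape of such a converter; the exact calling convention and configuration keys should be checked against the external filter documentation for the Funnelback version in use.

import sys

def main():
    # Hypothetical external converter: read the document passed as the first
    # argument and write the extracted text to standard output for Funnelback.
    path = sys.argv[1]
    with open(path, "rb") as f:
        raw = f.read()
    # A real converter would parse the binary format properly; this sketch just
    # keeps any printable ASCII characters found in the file.
    text = "".join(chr(b) for b in raw if 32 <= b < 127 or b in (9, 10, 13))
    sys.stdout.write(text)

if __name__ == "__main__":
    main()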

The use of external converters is generally discouraged because a separate system process is run for each document being filtered, which may have a significant impact on performance.

11. Enterprise SEO and ranking

  • Avoid the temptation to include everything. Instead focus on content that is useful.

  • Use robots.txt, robots meta tags and Funnelback no-index tags to control what Funnelback indexes.

  • SEO Auditor and the automated tuning don’t really work in an enterprise search scenario that includes DLS (because DLS means different users see different things).

  • Ranking within enterprise collections is often a lot poorer, and manual intervention is often required to improve the results.

  • Ranking evidence is much poorer (than say for a web site) due to enterprise repositories having fewer ranking indicators.

  • Use same site suppression and other diversification options to prevent result sets being flooded by results from one index.

  • Use collection weighting to up or down weight collections (-cool.21).

  • Use query independent evidence to weight sections within a site based on the URL structure.

  • Use scoring mode 2 (-sco=2[]) for metadata-heavy collections (a combined example follows this list).
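
Several of the options above are applied through the collection’s query processor options. A hypothetical fragment combining the options mentioned in this section is shown below; the values are illustrative only and need to be tuned for each solution.

query_processor_options=-cool.21=0.4 -sco=2[]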

12. Other considerations

  • Once initial crawls are complete, calculate the index sizes (the size of the live idx folders) to estimate the total index size. This will help to determine an appropriate amount of system memory.

  • Ensure spelling is configured for appropriate metadata fields

  • File locking is a big problem on Windows.

  • Close all of your open windows when you log out.

  • Consider disabling the meta_dependencies phase on all but one collection (the last collection to update), especially if Windows is used.

  • Process Explorer can be used to find out what process is locking a file.

  • Successful debugging on Windows really needs Cygwin or a similar tool such as Cmder (so you can read/grep large log files). Windows PowerShell can also be used to search and tail log files.

12.1. Gotchas

  • If collections are combined into a meta collection, the usernames used by the different systems must be consistent.

  • Some collection types (e.g. TRIM, Windows fileshares) require the Windows version of Funnelback

  • Some collection types (e.g. TRIM, Windows fileshares) require user impersonation - this means direct access to Funnelback is required (not via a Matrix REST asset or similar).

  • Check what type of authentication is used in the customer’s environment and check that Funnelback supports it for single sign-on (e.g. Active Directory). Don’t assume that AD is used on a Windows site.

  • The Funnelback server must have delegation rights, and Kerberos must be used, for impersonation to work correctly.

  • Query completion and knowledge graph do not enforce DLS.

12.2. Some real world examples

  • One customer site used Oracle Identity Management for their Windows authentication.

  • One customer was sold DLS for ‘their intranet’ - it turned out that the intranet was unsupported for DLS.

  • One customer had a major panic when HR records appeared in their enterprise search. This turned out to be a case of incorrectly set permissions which the search engine exposed.

  • One customer had servers on one Windows domain and users on a different domain.

  • One customer wanted to have their Enterprise Funnelback server located in a DMZ but the DMZ did not allow access to the internal repositories.

  • One organisation was using more than one Active Directory.

  • A customer attempted to install Funnelback onto a server that hosted the organisation’s intranet and DBMS. Funnelback failed to start due to a conflict between the ports required by Funnelback and the intranet’s web server, and there were performance issues due to the different systems sharing the same hardware.

13. Review questions

  • What are some of the key challenges that you’ll face when implementing enterprise search?

13.1. Server architecture

  • Describe the different roles that a Funnelback server can assume.

  • If a search is performing poorly what are some options available for improving the performance?

13.2. Web

  • What are some of the things that you need to account for when crawling an intranet that requires authentication?

13.3. Fileshares

  • What is the difference between filecopy and local collection types?

  • Can you index a fileshare on Linux? Are there any limitations?

13.4. TRIM

  • Why do you need to create a push collection when configuring TRIM?

  • If you add TRIM to a meta collection which collection(s) do you choose?

  • When indexing TRIM you require at least two collections - explain the difference between the collections.

13.5. Directories

  • Could I use a directory collection to produce a building map showing staff locations? What would be required?

  • Could I use a directory collection to provide a search of available resources such as rooms and printers?

13.6. Databases

  • Explain the limitations of indexing a SQL database.

  • What database systems are supported by Funnelback?

13.7. Push collections

  • When would you use a push collection?

  • What is the main advantage of using a push collection?

  • What sort of error handling is required when working with push collections?

13.8. Security

  • Discuss some of the options available for securing a search. What are the constraints of each option?

13.9. DLS

  • Explain what DLS is.

  • When is it appropriate to use DLS? And what are the costs?

  • What are the major components required when adding DLS support to a repository?

  • What is the benefit of increasing the length of time that a user key is cached?

  • What is the trade-off?

  • What happens if a user’s permissions change and are different to what’s recorded in the index?

  • Which common Funnelback features do not support DLS?

13.10. Social media

  • What social media channels does Funnelback support?

  • How would you add support for another social media channel?

13.11. Troubleshooting

  • A customer complains that they are seeing items returned in a DLS enabled search that they believe they should not have access to. Explain how you would resolve this.

14. Exercises

You have been asked to design an enterprise search for a customer. You are provided with the following high-level requirements:

Search to cover:

  • TRIM

  • fileshares

  • intranet

  • internet

  • staff directory

  • social

  • Google Drive

Exercise 1: Clarify requirements

You are to meet with the customer to find out more details about the project. Compile a list of questions that you will ask the customer. Include some follow-up questions that you will ask depending on the answers provided.

If the training session is instructor-led, put your questions to the instructor, who will play the role of the customer.

Exercise 2: Produce a design

  1. Based on the requirements obtained, produce a high-level design of the enterprise search solution.

  2. If the training is self-directed, make a note of any assumptions you have made based on the follow-up questions you would have asked the customer.

  3. Start to produce a detailed design of the search solution.