This course is currently in draft.

Introduction

This course is aimed at system administrators responsible for setup and maintenance of a Funnelback instance.

Special tips and hints appear throughout the exercises, providing extra background or help with completing the exercise. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Installation and patching

  • Application configuration

  • Funnelback services

  • Web and backend administration

  • Funnelback filesystem

  • Application logging

  • Multi-server environments

  • Licence management

  • User management

  • Scheduling updates

  • Backup and monitoring

  • Securing Funnelback

  • Upgrading Funnelback

  • Performance tuning

  • Maintenance tasks

  • Capacity planning

Prerequisites to completing the course:

Students are expected to have completed the following courses:

  • FUNL201, FUNL202, FUNL203, FUNL401

1. Installing and patching

1.1. Installing Funnelback

Prior to commencing a Funnelback installation, ensure that you have met the following pre-requisites:

  • Supported OS (Windows Server 2012/2016 or Red Hat Enterprise Linux/CentOS 6 or 7) installed on a suitable server (or servers if performing a multi-server installation)

  • Access as administrator or root for installation

  • A copy of the Funnelback installer, obtained from the Funnelback download site.

  • A valid Funnelback licence key.

  • Check that required ports are free (e.g. ports 80/443 are not used by another service)

  • Open required ports on the firewall

Installations on Linux require:

  • The following font packages: Freetype, fontconfig, dejavu-fonts-common, dejavu-sans-fonts, dejavu-sans-mono-fonts, dejavu-serif-fonts, dejavu-lgc-sans-mono-fonts

  • Increasing the open file handle limit

Installations on Windows also require:

  • A valid Windows system account (used for scheduling updates)

  • SMTP server details (optional, required if Funnelback is to send out status emails)

  • The .NET Framework v4+ (if indexing HP TRIM/HP RM)

  • The HP TRIM/HP RM desktop client (if indexing HP TRIM/HP RM)

1.2. Downloading Funnelback

Funnelback can be downloaded from https://download.funnelback.com/

Access to files on https://download.funnelback.com requires a valid download site user account and a current valid licence key.

1.3. Using the Funnelback installer

Installation of Funnelback uses an installer that can be run interactively, or unattended for scripted installation.

Funnelback installers are available for Linux and Windows with official support for the following operating system versions:

  • Red Hat Enterprise Linux 6 (64 bit) or CentOS 6 (64 bit)

  • Red Hat Enterprise Linux 7 (64 bit) or CentOS 7 (64 bit)

  • Microsoft Windows Server 2016 (64 bit)

  • Microsoft Windows Server 2012 (64 bit)

Tutorial: Install Funnelback on Linux

This tutorial starts with a blank Linux server running CentOS 7.

  1. Log in to the server as the root user

  2. Ensure SELinux is disabled or running in permissive mode. Check if SELinux enforcement is on:

    getenforce
  3. If it returns Enforcing, the following command will disable enforcement until the next reboot:

    setenforce 0
  4. Then edit the SELinux config file and set it to disabled to make the change permanent.

    nano /etc/sysconfig/selinux
  5. Set the value to

    SELINUX=disabled
  6. Increase the open file handle limit.
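
    For example (a sketch; size the limit appropriately for your deployment), adding the following lines to /etc/security/limits.conf raises the limit for the search user created in the following steps:

    search soft nofile 8192
    search hard nofile 8192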

  7. Install the following packages onto the Funnelback server

    yum install -y wget libstdc++.i686 libxslt mailx tar cronie freetype fontconfig
    yum install -y dejavu-fonts-common dejavu-sans-fonts dejavu-sans-mono-fonts dejavu-serif-fonts dejavu-lgc-sans-mono-fonts
  8. Create a search user and group. Funnelback should be installed and managed as the search user. In this example the search user’s password is set to se@rch

    groupadd -g 10000 search
    useradd -g search -u 10000 search
    echo search:se@rch | chpasswd
  9. Change the permissions on the Funnelback installer to make it executable

    chmod a+x ./funnelback-{funnelback_version}-linux-x64-installer.run
  10. Run the Funnelback installer

    ./funnelback-{funnelback_version}-linux-x64-installer.run
  11. When prompted for the installation directory leave as the default: /opt/funnelback.

  12. When prompted for the Funnelback user use search. Note: it is best practice to use a standalone search user for Funnelback installations - the use of root is not recommended.

  13. HTTP search port: 80. This is the server port used for delivering Funnelback search results over HTTP.

  14. HTTPS search port: 443. This is the server port used for delivering Funnelback search results over HTTPS. The HTTPS search port cannot be used to serve administration functions.

  15. Administration server port: 8443. This is the server port used for delivering Funnelback administration services including the administration interface and APIs. Access to Funnelback on this port requires a valid Funnelback administration account, or an API key. This port should not be used for the delivery of Funnelback search results over HTTPS.

  16. Enter your Funnelback licence key.

  17. Enter details for an administrator account - this account is used to access the Funnelback administration interface.

    • Administration account username: admin

    • Administration account password: admin

  18. Mail settings - leave these as default to use the server’s inbuilt sendmail service. This is used to send out update status emails (by default on failure only) as well as analytics emails if addresses are configured.

    • Mail server: leave as default

    • SMTP Port: 25

    • SMTP Username: leave as default

    • SMTP Password: leave as default

  19. Continue with the installation and accept the remaining dialogs to complete the install.

Tutorial: Install Funnelback on Windows

This exercise starts with a freshly installed Windows server running Windows Server 2016.

Prior to installing Funnelback on Windows, please ensure that the server’s date, time and time-zone have been set.

Note: you should install Funnelback onto the same drive where you plan to house the collections. This usually means that Funnelback will be installed in a location such as D:\Funnelback rather than on the C:\ drive.
  1. Run the funnelback-15.24.0-windows-installer.exe as an administrator (right-click on the installer and choose Run as administrator)

  2. Specify the directory where Funnelback will be installed. Install Funnelback to c:\funnelback.

    You should install Funnelback to the drive where Funnelback will store all of its data. This is often the non-system drive on Windows systems (e.g. d:\funnelback).

    Installation within the program files folders should be avoided as these folders have special permissions applied by Windows.

  3. Specify the port to use for delivering the public search interface via HTTP. Normally you would choose the default value of port 80.

  4. Specify the port to use for delivering the public search interface via HTTPS. Normally you would choose the default value of port 443.

  5. Specify the port to use for serving Funnelback’s administration interface. Normally you would choose the default value of port 8443. This port must be different from the public search ports defined in the previous two steps.

  6. Enter your Funnelback licence key.

  7. Specify the username and password for the Funnelback administration account. Additional accounts can be created from the administration interface.

  8. (Optional) Provide mail server details - this step can be skipped and configured later. Providing mail server details will allow Funnelback to send email reports when collections are updated.

  9. Provide Windows account details. A valid Windows account is required to set up Funnelback’s initial scheduled tasks. The account details are only required for the setup and are not stored.

  10. (Optional) Download .NET Framework - this step can be skipped and the framework installed later. The .NET Framework is required for Funnelback to crawl and access TRIM repositories.

  11. Installation complete - the Funnelback administration interface will open in a browser window; log in using the administration account details created in step 7.


1.4. Verifying installations

Once installation is complete, several additional environment variables will be defined in the search user’s environment:

  • $SEARCH_HOME (Linux) or %SEARCH_HOME% (Windows): defaults to the install folder (e.g. /opt/funnelback under Linux or D:\Funnelback under Windows)

  • $HTTP_SEARCH_HOME (Linux) or %HTTP_SEARCH_HOME% (Windows): defaults to the install folder (e.g. /opt/funnelback under Linux or D:\Funnelback under Windows)
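
A quick way to confirm the variable has been set (a sketch for Linux, run as root or an administrator):

sudo -i -u search bash -c 'echo $SEARCH_HOME'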

Ensure that the public search interface is working as expected by visiting (locally):

http://localhost<:PUBLIC-SEARCH-HTTP-PORT>/s/search.html?collection=funnelback_documentation
https://localhost<:PUBLIC-SEARCH-HTTPS-PORT>/s/search.html?collection=funnelback_documentation

or:

http://<FUNNELBACK-SERVER-NAME><:PUBLIC-SEARCH-HTTP-PORT>/s/search.html?collection=funnelback_documentation
https://<FUNNELBACK-SERVER-NAME><:PUBLIC-SEARCH-HTTPS-PORT>/s/search.html?collection=funnelback_documentation

The port values are the ones you defined in the installer.
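
For a quick command-line check from the server itself (assuming the default ports; curl’s -k flag accepts the self-signed certificate generated at installation):

curl -I "http://localhost/s/search.html?collection=funnelback_documentation"
curl -kI "https://localhost/s/search.html?collection=funnelback_documentation"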

Ensure that the administration interface is working as expected by logging in to (locally):

https://localhost<:ADMIN-PORT>/search/admin

or:

https://<FUNNELBACK-SERVER-NAME><:ADMIN-PORT>/search/admin

1.5. Automated installations

The Funnelback installer supports a mode that allows installation to be run without any user interaction. This allows the installation to be scripted, for example into a VM provisioning script.

The information that is interactively configured in a standard installation can be provided to the installer by command line options. For an unattended installation the following is required at a minimum:

  • Funnelback licence key

  • administration user password

The example command below performs the same installation as the tutorial above, without any prompts, allowing it to be included in a VM build script. The command:

  • runs the Funnelback installer in automated (unattended) mode

  • installs into /opt/funnelback

  • uses ports 80 and 443 for the public search interface

  • uses default SMTP settings

  • creates a Funnelback administration interface user (admin, with password admin)

  • runs as the search user

The following command can be run on Linux to install Funnelback in unattended mode:

funnelback-{funnelback_version}-linux-installer.run --mode unattended --unattendedmodeui minimal --license_key '<FUNNELBACK-LICENCE-KEY>' --admin_username admin --admin_password 'admin' --prefix /opt/funnelback --search_port 80 --search_https_port 443 --unix_web_user search --schedule_username search --schedule_password not_used --smtp_server localhost

The following command can be run on Windows to install Funnelback in unattended mode:

funnelback-{funnelback_version}-windows-installer.exe --mode unattended --unattendedmodeui none --license_key '<FUNNELBACK-LICENCE-KEY>' --admin_password 'admin' --prefix 'C:\funnelback' --search_port 80 --search_https_port 443 --unix_web_user search --schedule_username vagrant --schedule_password vagrant --smtp_server smtp.example.com

1.6. Patching Funnelback

Patches are regularly released for Funnelback and are available for registered users from download.funnelback.com.

The patches are distributed as .tar.gz files and are cumulative, containing all the previous patch files for a given version of Funnelback.

The details of each patch are documented on the website, and also within the $SEARCH_HOME/VERSION/patches folder included with the patch file.

Application of a patch is currently a manual process.

Tutorial: Installing a patch

Patches are downloaded from https://download.funnelback.com

Ensure that you download a patch for the installed version of Funnelback.
  1. Read the release notes for the patch for details on what it addresses and any specific instructions (such as files that require deletion).

  2. Stop all of the Funnelback services. (see the section on starting/stopping/restarting Funnelback services)

  3. Replace all of the files in the Funnelback installation with the files included in the patch. The folder structure of the .tar.gz archive mirrors the Funnelback folder structure. The following command overwrites all existing files with replacements from the patch file. On a production system it is a good idea to back up any files prior to replacement.

    On Linux use the following command to decompress the {funnelback_version}.03.tar.gz patch file and overwrite existing files:

    sudo -i -u search tar zxvf ./{funnelback_version}.03.tar.gz -C /opt/funnelback
    On Windows: decompress the .tar.gz file with a tool like 7zip then copy the files over the top of the existing files.
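
    On a production system, the patch archive’s own file list can drive a quick pre-patch backup (a sketch for Linux: tar will warn about paths that are new in this patch and don’t yet exist locally, and --no-recursion stops directory entries pulling in whole folders):

    tar ztf ./{funnelback_version}.03.tar.gz | sudo -i -u search tar czf /tmp/funnelback-pre-patch.tar.gz -C /opt/funnelback --no-recursion -T -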
  4. It may be necessary to delete some files (usually specific version .jar files that have been updated to a later version) - the release notes will detail any files that need to be removed.

  5. Start all of the Funnelback services (see the section on starting/stopping/restarting Funnelback services). Once started the patch should be successfully applied.

  6. Inspect the contents of $SEARCH_HOME/VERSION and observe that a patches folder containing information about each patch has been created within the folder.

2. Application configuration

2.1. Setting global server configuration

Never change values in global.cfg.default as these will be overwritten on upgrade.

The global configuration settings for Funnelback are stored within the global.cfg configuration file.

global.cfg is created and populated during installation, and records information such as the web server ports and host names, as well as multi-server configuration.

Default values for the global.cfg are listed in the $SEARCH_HOME/conf/global.cfg.default file.

Tutorial: Configure the server hostnames

This exercise modifies the server names used by Funnelback when generating URLs within the administration interface.

Funnelback uses the hostnames configured in the global.cfg to generate links used in various parts of the system.

  1. Log in to the administration interface and view the analytics report for the analytics demo collection.

  2. Click on the queries report, then one of the query terms. Hover over the search results link in the actions.

  3. Log in to the server as the search user.

  4. Edit $SEARCH_HOME/conf/global.cfg and update/add the following:

    urls.admin_hostname=admin.mycompany.com
    urls.search_hostname=search.mycompany.com
  5. Changes to global.cfg require a restart of Funnelback services. Restart the Funnelback services (see the section on starting/stopping/restarting Funnelback services).

  6. Links that are generated will now include the configured hostnames. e.g. Observe the search results link generated from the analytics reports uses search.mycompany.com

  7. The administration interface links will appear relative to the URL you are accessing, but the admin URL will be used in links that are generated outside the admin interface (e.g. report links included in analytics emails).

2.2. Setting server-wide configuration defaults

Never change values in collection.cfg.default as these will be overwritten on upgrade.

Default values for all the collection.cfg settings are listed in the $SEARCH_HOME/conf/collection.cfg.default file.

These values are inherited by the server’s collections, with values specified in the collection.cfg file for the specific collection overriding the defaults.

An administrator can override defaults listed in the collection.cfg.default file by creating the $SEARCH_HOME/conf/collection.cfg and adding any values that should replace the server-wide defaults. These values can still be overridden in individual collection.cfg files for the different hosted collections.

$SEARCH_HOME/conf/collection.cfg is preserved during an upgrade.

Values that are commonly set to custom server defaults include:

admin_email=<comma-separated list of default emails to send update emails to>
max_download_size=<maximum file size to download in MB for web crawls>
Tutorial: Configure server-wide collection configuration defaults

This exercise will set a default admin email address for collection updates and configure a few other defaults.

  1. Log in to the server as the search user.

  2. Edit (or create) $SEARCH_HOME/conf/collection.cfg and update/add the following:

    admin_email=webmaster@mydomain.com
    crawler.request_timeout=30000
    crawler.max_timeout_retries=3
  3. Save the collection.cfg.

This has the effect of changing the server default for the admin email address, as well as configuring the crawler with an increased default timeout and retries for connections that time out.

These settings will now apply to all collections on the Funnelback server unless the collection-level collection.cfg overrides the value.

2.3. Configuring the mail server

Funnelback can be configured to send out emails containing collection update statuses or analytics reports.

Under Linux the sending of email is handled by the server’s built-in mail service.
Under Windows BLAT must be configured to set an SMTP server to use for sending of mail. This can be set during the installation but can also be configured post-installation.

2.4. Configuring redirects

Funnelback can automatically redirect specified domain names to a particular collection.

This means that several domains (e.g. search.example.com, mysearch.example.com.au) can be configured to reference the Funnelback server, with each one redirecting to a different search.

The redirects apply as soon as the configuration file is saved.

Tutorial: Configure domain redirects

This exercise configures the Funnelback server to redirect requests on three different domains to different default searches.

  1. Log in to the Funnelback server and create or edit the $SEARCH_HOME/conf/redirects.cfg file

  2. Add the following to the file then save:

    search.example.com=/s/search.html?collection=example-web&profile=web
    mobile.search.example.com=/s/search.html?collection=example-web&form=mobile
    api.search.example.com=/s/search.json?collection=example-api

Once configured, a request for http://search.example.com/ will automatically redirect to http://search.example.com/s/search.html?collection=example-web&profile=web and so on.

2.5. Configuring search analytics

2.5.1. Reporting blacklist

The reporting blacklist can be used to prevent specified queries and IP addresses being included in Funnelback analytics reports.

Terms contained within the blacklist are compared as a case insensitive exact match against the query.

The blacklist only accepts complete single IP addresses (IP ranges and partial addresses are not supported).

The reporting blacklist can be set at server, collection or profile/frontend service level.

The server-wide reporting blacklist should be used to configure IP addresses and words that should be excluded from the analytics for all collections on the server.

This will include things like:

  • IP addresses of monitoring services

  • IP addresses of local/internal users (commonly included if you wish to see analytics based only on users external to the organisation’s website)

  • Organisational banned words - terms that should be excluded from appearing in any analytics report.

Collection-level and profile-level blacklist items will contain similar items, however these will be specific to the organisation or profile (e.g. monitoring traffic from a specific organisation relevant only to the collection in question).

Editing of the server-wide and profile-level reporting blacklist files is currently only available via the backend.

Tutorial: Edit the server-wide reporting blacklist
  1. Log in to the Funnelback server and create a $SEARCH_HOME/conf/reporting_blacklist.cfg file.

  2. Add the following then save the file. This will ignore some test queries as well as any query originating from localhost or the organisation’s Paris office:

    # Misc testing queries
    "nagios index check"
    !padrenullquery
    !padrenull
    !null
    test
    testing
    
    # Disregard localhost queries
    @127.0.0.1
    
    # Paris office
    @123.4.56.78
  3. In order for these changes to be applied all analytics databases on the server will need to be rebuilt. Add the following to $SEARCH_HOME/conf/collection.cfg. This will disable incremental updating of analytics reports.

    analytics.reports.disable_incremental_reporting=true
  4. Log in to the administration interface and select update analytics from the system menu.

  5. Ensure all the collections are ticked then click the update button. Note: this can take a long time if the server hosts a lot of collections, or the collections contain a lot of analytics data.

  6. Once the analytics update is complete, edit $SEARCH_HOME/conf/collection.cfg and either remove the analytics.reports.disable_incremental_reporting option or set it to false:

    analytics.reports.disable_incremental_reporting=false

A full rebuild of affected analytics databases is required to apply reporting blacklist changes to historical data.

For server-wide reporting_blacklist.cfg changes this will require all collection databases to be rebuilt.

For collection and profile level reporting_blacklist.cfg files this will require only the collection’s database to be rebuilt.

2.5.2. Reporting email

The analytics system can be configured, per collection, to send out notification emails which provide a regular analytics summary, or trend alert notifications.

Email notifications can be configured for the following:

  • Query report summary - this is basically a PDF of the analytics overview screen. The notification frequency can be chosen.

  • Trend alert notifications - these can be sent when a spike in queries is detected.

The alerts are sent to one or more email addresses.

Tutorial: Set up analytics notifications
  1. Log in to the administration interface and switch to the desired collection.

  2. Select edit analytics email settings from the analytics tab.

  3. The edit screen allows you to set a sender, the list of email addresses to send the reports to, and the frequency of the analytics summary and trend alert notifications.

  4. Once you’ve selected the desired settings, save to apply the changes.

  5. Email alerts will then be sent out for future analytics updates (as long as the email server is configured correctly). 

3. Funnelback services

Funnelback utilises a number of services that are installed when Funnelback is installed onto a server.

3.1. Starting/stopping/restarting services

Starting and stopping of services in Funnelback requires administrator access to the server.

Funnelback uses four services:

  • funnelback-daemon

  • funnelback-jetty-webserver

  • funnelback-graph

  • funnelback-redis

All services need to be running for correct functioning of Funnelback.

The order in which services are started and stopped is not important.

Under Linux:

  • Services are installed with other system services.

  • Services are started (as the root user) using

    /bin/systemctl start <servicename>
  • Services are stopped (as the root user) using

    /bin/systemctl stop <servicename>
  • Services are restarted (as the root user) using

    /bin/systemctl restart <servicename>
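
    For example, to restart the Jetty webserver service:

    /bin/systemctl restart funnelback-jetty-webserver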

Under Windows:

  • Services are installed with other system services.

  • Services are controlled (as administrator) using the Microsoft Windows Services Console (services.msc) in the same manner as other Windows services.

3.2. Adjusting service memory allocation

Memory defaults which are written to the service files can be adjusted with the daemon.max_heap_size, neo4j.max_heap_size, jetty.max_heap_size and jetty.max_metaspace_size global configuration options.

Values for these settings are in the form accepted by Java’s -Xmx option (e.g. 128m).
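
For example (illustrative values only; size these to the memory available on your server), the following could be set in $SEARCH_HOME/conf/global.cfg:

jetty.max_heap_size=2048m
daemon.max_heap_size=512m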

After the settings are changed the following needs to be run - this re-installs the services with the updated settings:

$SEARCH_HOME/bin/setup/start_funnelback_on_boot.pl

3.3. Funnelback Jetty webserver service

Jetty is Funnelback’s embedded web server. It is used to provide search and administration interfaces and associated APIs.

3.3.1. Changing Jetty ports

The ports that Jetty binds to are configured in $SEARCH_HOME/conf/global.cfg.

Jetty is configured by default to serve the search interface on ports 80 (HTTP) and 443 (HTTPS). The administration interface is served from port 8443 (HTTPS).

To change the ports edit the global.cfg and update the values below to use the desired ports, then restart the funnelback-jetty-webserver service.

jetty.search_port=80
jetty.search_port_https=443
jetty.admin_port=8443

The administration and public search interfaces are provided by different Jetty instances - as a result the administration and public HTTPS interfaces bind to different ports.

The jetty.search_port_https and jetty.admin_port need to bind to different ports.

While possible, it is not recommended to serve administration and public HTTPS search from the same port, as this requires reconfiguring the administration instance of Jetty to also return the public HTTPS search results.

All accesses to Funnelback made on the administration port require authentication using a valid Funnelback administration user account.

3.3.2. Installing an SSL certificate

When Funnelback is installed a self-signed SSL certificate is generated.

This will cause a certificate warning to be displayed by web browsers when a user attempts to access the search or administration interfaces.

A signed SSL certificate can be installed into Jetty.

  1. Once you have received a certificate from a CA or client (or a user generated one), it needs to be loaded into a JSSE keystore.

  2. Use keytool to load a certificate in PEM format to a keystore.

    e.g. for a file called server.crt containing the following (note that the example below isn’t a valid certificate):

    -----BEGIN CERTIFICATE-----
    AMC0jMnWSJeVfVEtpaolQ2fYIxSjEFn8EGNR7fXejW7P0GCmY1mX36AGkS1ujiOD
    VTETMBEGA1UECBMKU29tZS1TdGF0ZTEhMB8GA1UEChMYSW50ZXJuZXQgV2lkZ2l0
    cyBQdHkgTHRkMB4XDTE1MDcwODA1MDEwOVoXDTE2MDcwNzA1MDEwOVowRTELMAkG
    A1UEBhMCQVUxEzARBgNVBAgTClNvbWUtU3RhdGUxITAfBgNVBAoTGEludGVybmV0
    IFdpZGdpdHMgUHR5IEx0ZDCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEB
    yoiu4jkht7rg3ysAKJYfg73GFSCOluiefgli7W3CBKadefFC37EasfdgweFH83FD
    J80Pr1jzj2NrSgbgox3oHpQPGPYU76+ftWTc7URNgFzSjL1JKJfoUkIchIr/S7U6
    X9MLz7OIeLjYIq51w1abgMWoHoj06pkrPH8jtmGBL/Pef+NHftlSxw6cfkfH3X3e
    Y5hDPMdJqYNMDF0iLQ6Nzt1nrr7yBugFGaKvhwqf/LGPzVwr3sORddvW7CLtbcCq
    SgbdV2xreglX7jNxb8BpxyAWZuIT0UXwtHyyB6IEQ6DKO+EcsMVQbkbh5Bm5HI8I
    2aBsNcRaCAwWI4CZUMwtZFUCAwEAATANBgkqhkiG9w0BAQUFAAOCAQEAozv107pE
    TdkTkyRk6Nh1syjcnL5j3R44Ks0H9AvPEFT6g0f11Ch5NZCJ5slk5puG0nWwIIJ3
    WulmQS8rU9Fs0eCyhSggtnx1zZmB6Ni+5j2FL53qWYeIvIgnYDMKb8z3ev4rayLf
    S87ZerxFrZM0oDoNvedVZmuUN2bKY1jG45ZUs+DRz52Y9KAdUgEA6XpzNxV+sMSC
    JAfmCX6GnQzcWqlLvkfpLzZlvgeb2Vdr1giEicl6M/ABquzDBlSNc3ORS1M5pBca
    ygCS1nN2ma6snwbjoueL7We8KE0/ujSJ+ASUqwTZm7z5C8MSpsaqTwJx7xzVtQRY
    iYFzGpW+4oiMVQ==
    -----END CERTIFICATE-----
  3. Run the following command to load the certificate above into a custom JSSE keystore in /opt/funnelback/web/conf (there will already be a default keystore file in this folder):

    keytool -keystore <installdir>/web/conf/custom-keystore -import -alias jetty -file server.crt
  4. Update the global configuration to reference the custom keystore. Edit $SEARCH_HOME/conf/global.cfg and update or add the following:

    jetty.ssl.keymanager-password=password
    jetty.ssl.keystore-path=web/conf/custom-keystore
    jetty.ssl.keystore-password=password
    jetty.ssl.truststore-path=web/conf/custom-keystore
    jetty.ssl.truststore-password=password
  5. Restart Jetty, then check the Jetty logs in $SEARCH_HOME/log/ for any errors.

  6. Browse to https://server-name/s/search.html and ensure that the browser returns the correct certificate information.

3.4. Funnelback Redis service

This service runs Redis, an in-memory storage system, which is required for correct functioning of the administration interface.

3.5. Funnelback daemon service

This provides several subservices used by various Funnelback components:

  • FilterService: Provides an HTTP interface to the text filtering system. Required for TRIMPush collections

  • GenerateCertificates: Generates SSL certificates needed by Matrix collections

  • RemoveStaleLocks: Removes collection update locks left over after an unexpected system reboot.

  • TRIMWatcher: Watches for changes in TRIM records so they can be re-gathered as soon as possible. Required for advanced TRIMPush collections with a TRIM event processor.

  • WebDavService: Provides a WebDAV interface to the Funnelback installation folder, to support transferring files in multi-server configurations. Note: not to be confused with the WebDAV access to configuration files introduced in Funnelback 15.20.

3.6. Funnelback graph service

This service runs Neo4j, which is required for Funnelback knowledge graph.

4. Web administration

Funnelback’s administration interface provides access to a number of system administration functions:

  • Collection overview

  • System overview

  • System logs

  • User management

  • Update scheduling

  • Collection-level configuration and logs

  • Licence key management

5. Backend administration

Backend access to the Funnelback server (via SSH or Windows remote desktop) is required for some system administration tasks including:

  • System configuration

  • Installation, patching and upgrades

  • Backup and monitoring

6. Funnelback filesystem

6.1. Funnelback home folder

The Funnelback home folder, or installation root, is the base folder where the Funnelback software has been installed. This is the folder referred to by the $SEARCH_HOME environment variable, which appears in many configuration files. (If on Windows the environment variable is %SEARCH_HOME%).

The home folder, $SEARCH_HOME, usually points to one of the following locations:

  • Windows: C:\Funnelback or D:\Funnelback

  • Linux: /opt/funnelback/

6.2. Top level folders

The following folders sit within the Funnelback home folder ($SEARCH_HOME):

conf

Global and collection-specific configuration files.

data

Collection-specific index, log and data files.

admin

Admin related files, administration users and collection-specific reports (WCAG, analytics, data reports).

web

Web server configuration and logs. Also includes the macro libraries and log files related to the modern UI.

log

Global log files.

bin

Funnelback binaries.

lib

Funnelback libraries.

share

Shared files such as custom collection templates, security plugins and language files.

linbin

Linux specific helper binaries. Only exists on the Linux version of Funnelback.

wbin

Windows specific helper binaries. Only exists on the Windows version of Funnelback.

6.3. Funnelback configuration folders

Note: all paths below are relative to $SEARCH_HOME/conf/.

<COLLECTION-ID>

Collection-specific configuration files.

<COLLECTION-ID>/@workflow

Collection-specific workflow scripts.

Note: this folder does not exist by default

<COLLECTION-ID>/@groovy

Collection-specific Groovy filters.

Note: this folder does not exist by default.

<COLLECTION-ID>/<PROFILE-ID> and <COLLECTION-ID>/<PROFILE-ID>_preview

Collection profile/frontend service folder. These exist in pairs and correspond to the live and preview views.

Contains profile/frontend service-specific configuration including templates, best bets, synonyms.

Note: only folders for the _default profile will exist by default.

<COLLECTION-ID>/<PROFILE-ID>/web and <COLLECTION-ID>/<PROFILE-ID>_preview/web

Web resources folder. Used for storage of custom static web resources (such as CSS, JS and image files referenced locally from templates). These exist in pairs and correspond to the live and preview views.

Corresponding web path: http://<FUNNELBACK-SERVER>/s/resources/<COLLECTION-ID>/<PROFILE-ID>/

6.4. Funnelback reporting folders

Note: all paths below are relative to $SEARCH_HOME/admin/.

reports/<COLLECTION-ID>

Analytics and collection update history database files.

data_report/<COLLECTION-ID>

Data reports (web collections only).

6.5. Funnelback data folders

Note: all paths below are relative to $SEARCH_HOME/data/.

<COLLECTION-ID>

Cached data and log files for collection.

<COLLECTION-ID>/archive

Historic query and click logs used for analytics generation.

<COLLECTION-ID>/live / <COLLECTION-ID>/offline

Live and offline views.

Live view (contains all files relating to currently live search index).

Offline view (contains all files relating to previously live search index OR currently running update OR last failed update).

Note: live and offline are symbolic links to folders one and two.

<COLLECTION-ID>/one / <COLLECTION-ID>/two

These are the physical folders that contain the live and offline views and are the targets of the live and offline symbolic links.

<COLLECTION-ID>/log

Collection-level update log files and modern UI logs. Location of the top level update-<COLLECTION-ID>.log file.

<COLLECTION-ID>/databases

Database history files for the accessibility auditor.

<COLLECTION-ID>/live/data / <COLLECTION-ID>/offline/data

Live/offline view cached data. Used when building indexes, and for supplying cached pages.

<COLLECTION-ID>/live/idx / <COLLECTION-ID>/offline/idx

Live/offline view index files.

<COLLECTION-ID>/live/log / <COLLECTION-ID>/offline/log

Live/offline view log files. Detailed logs for the live and offline views. Includes collection update logs.

<COLLECTION-ID>/live/databases / <COLLECTION-ID>/offline/databases

Live/offline view database files. Includes web crawler database (where appropriate), content auditor summaries and recommender databases (where enabled).

<COLLECTION-ID>/knowledge_graph

Generated knowledge graph files for loading into Neo4j. Organised by profile.

7. Application level logging

Funnelback produces a vast array of log files which are extremely useful for debugging issues with Funnelback.

The log file reference in the Funnelback documentation provides information on most of the log files produced by Funnelback.
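
For example, to follow the progress of a running collection update on Linux (using the top-level update log described in the filesystem section):

tail -f $SEARCH_HOME/data/<COLLECTION-ID>/log/update-<COLLECTION-ID>.log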

8. Creating a multi-server deployment

8.1. Server roles

Funnelback can be scaled to multiple servers, with different servers performing different roles (although all servers in a Funnelback cluster are full Funnelback installations, capable of performing either role).

8.1.1. Administration / crawl server

An administration / crawl server is responsible for providing administration services for the Funnelback multi-server cluster.

Administration services include:

  • Web administration

  • APIs

  • Analytics and reporting

The server is also responsible for the gathering of all content and building of the search indexes. These indexes are pushed out to query processor machines at the end of an update.

Analytics and audit reports are also managed by the administration server, with query logs pulled from the query processors during an update.

Funnelback supports a single active administration server. A passive failover server can be configured for redundancy.

8.1.2. Query processor

A query processor server is responsible for serving search results to end users.

Funnelback supports multiple query processors. The query processors can be load balanced and placed across multiple physical locations.

9. Configuring multiple servers

Funnelback supports various multi-server architectures. The examples below should provide the necessary tools to configure other architectures.

For the following sections we will configure a cluster containing:

  • A single administration/crawl server (ad1.mydomain.com)

  • A single failover administration server (ad2.mydomain.com)

  • Two query processors (qp1.mydomain.com and qp2.mydomain.com)

9.1. Configuring the administration / crawl server

  1. Configure the global configuration to specify the query processor(s) and failover administration server. Add the server names to the query_processors option in $SEARCH_HOME/conf/global.cfg on ad1.mydomain.com. e.g.

    query_processors=qp1.mydomain.com,qp2.mydomain.com,ad2.mydomain.com
  2. Disable the WebDAV service (this prevents any other machine in the cluster from writing files to the admin server). In $SEARCH_HOME/conf/global.cfg override the default setting by adding:

    daemon.services=GenerateCertificates,FilterService
  3. Create a script for syncing the administration servers. Create $SEARCH_HOME/conf/sync_ad.bat (which calls $SEARCH_HOME/conf/sync_ad.groovy). See below for code.

  4. Configure a scheduled task to perform configuration replication to the failover admin server. Open the Windows task scheduler and create a new task with the following attributes.

    • Name: Sync configuration

    • Trigger: At system startup, repeat every 10 minutes indefinitely

    • Action: D:\funnelback\conf\sync_ad.bat <FAILOVER ADMIN SERVER>

    • Run the task as a user that has permissions over the Funnelback folders (usually the same service account that is used to schedule tasks).

      On Linux you would create an equivalent cron job.
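
      For example, a crontab entry along these lines (a sketch, assuming an equivalent sync_ad shell wrapper script exists on the Linux server):

      */10 * * * * /opt/funnelback/conf/sync_ad.sh ad2.mydomain.com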

  5. (Optional) Start the task. The task will be started automatically on boot.

  6. Enable the REMOTE publish target in $SEARCH_HOME/conf/file-manager.ini as described in the multiserver configuration instructions.

  7. Configure the workflow publish hook command by editing $SEARCH_HOME/conf/collection.cfg and adding the line:

    workflow.publish_hook=$SEARCH_HOME/bin/publish_hook.pl

9.2. Configuring query processors

  1. Configure the server secret to match the primary admin. In $SEARCH_HOME/conf/global.cfg replace the server_secret line with the following (note this value is read from the server_secret defined on the primary admin server):

    # set same server secret as ad1
    server_secret=<SERVER SECRET FROM PRIMARY ADMIN>
  2. Disable the WebDAV service (this prevents any other machine in the cluster from writing files to the admin server). In $SEARCH_HOME/conf/global.cfg add

    daemon.services=GenerateCertificates,FilterService
  3. Open the crontab (Linux) or Windows Task Scheduler (Windows) and disable the Update outliers and Update reports tasks.

  4. Set the admin UI to operate in read only mode. This prevents any changes being made via the administration interface on the query processor. In $SEARCH_HOME/conf/global.cfg add the following line:

    admin.read-only-mode=true

9.3. Configuring failover administration

  1. Configure the query processors: in $SEARCH_HOME/conf/global.cfg add the following line, keeping it commented out. It is uncommented in the event that the failover admin needs to act as the primary admin.

    #query_processors=qp1.mydomain.com,qp2.mydomain.com,ad1.mydomain.com
  2. Configure the server secret to match the primary admin. In $SEARCH_HOME/conf/global.cfg replace the server_secret line with the following (note this value is read from the server_secret defined on the primary admin server):

    # set same server secret as ad1
    server_secret=<SERVER SECRET FROM PRIMARY ADMIN>
  3. Open the crontab (Linux) or Windows Task Scheduler (Windows) and disable the Update outliers and Update reports tasks.

  4. Set the admin UI to operate in read only mode. In $SEARCH_HOME/conf/global.cfg add the following line:

    # Set read only mode to false when making this server the primary admin
    admin.read-only-mode=true
  5. Configure a scheduled task to perform configuration replication to the other administration server.

  6. Create $SEARCH_HOME/conf/sync_ad.bat (which calls $SEARCH_HOME/conf/sync_ad.groovy). See below for code.

    Under Linux: create a cron job scheduled to run the sync_ad command every 10 minutes.

    Under Windows: open the Windows task scheduler and create a new task with the following attributes:

    • Name: Sync configuration

    • Trigger: At system startup, repeat every 10 minutes indefinitely

    • Action: D:\funnelback\conf\sync_ad.bat ad1.mydomain.com

    • Run the task as a user that has permissions over the Funnelback folders (usually the same service account that is used to schedule tasks).

  7. Disable the task - it is only required when the server is used as the primary admin server.

9.4. sync_ad.groovy

This script is used to recursively transfer a specified list of folders to another server within the cluster.

import com.funnelback.common.config.*
import com.funnelback.mediator.core.webdav.*
import java.nio.file.*

/*
 * Command line tool to transfer arbitrary files to a remote Funnelback server
 * using the WebDAV protocol. It assumes that:
 *
 * - Both Funnelback instances share the same server_secret in global.cfg
 * - The WebDAV service is up and running on the remote side
 * - Access to the WebDAV port (usually 8076) has been opened in the firewall
 *
 * Only files from within the local Funnelback install can be transferred.
 */

// Port of the remote WebDAV service
def final DEFAULT_WEBDAV_PORT = 8076;

// Set up command line options
def cli = new CliBuilder(usage: getClass().simpleName+' -h <remote_host> <file or folder> [file or folder] ...')
cli.with {
    header = "Transfer a file to a remote Funnelback server.\n" + "Both Funnelback servers must have the same server_secret"
    h longOpt: "host", argName: 'hostname', args:1, type: String.class, "Remote host name"
    p longOpt: "port", argName: 'port', args:1, type: Number.class, "Remote host WebDAV port (Defaults to $DEFAULT_WEBDAV_PORT)"
    d longOpt: "delete", "Delete remote files that don't exist locally (Default: false)"
    s longOpt: "smart", "Do not transfer files that already exist remotely, based on the filesize (Default: false)"
    r longOpt: "recursive", "Transfer folders recursively (Default: false)"
    v longOpt: "verbose", "Verbose mode"
}

options = cli.parse(args)
if (!options) return;
if (!options.h || !options.arguments()) {
    // No host name, or no file to transfer
    cli.usage()
    return;
}

def port = DEFAULT_WEBDAV_PORT
if (options.p) port = options.p

fs = FileSystems.getDefault()
searchHome = fs.getPath(new File(System.getenv("SEARCH_HOME")).absolutePath)

def config = new GlobalOnlyConfig(searchHome.toFile())           // Global config data, for the server_secret
webdav = new FunnelbackSardine(config.value("server_secret"))    // WebDAV client, to push files
urlHelper = new WebDAVUrlFactory(options.'host', port)           // Helper to build URLs to the remote machine

// For each file or folder passed on the command line...
options.arguments().each {
    println "Seed: " + it
    transfer(new File(it))
}

// Transfer a single file, recursing into directories
public transfer(file) {
    // A path object makes it easier to manipulate
    def path = fs.getPath(file.absolutePath)

    if (! file.exists()) {
        println "Skipping '$path' as it doesn't exist"
    } else if (file =~ /global\.cfg$/) {
        println "Skipping '$path' as it's global.cfg"
    } else if (file =~ /company-info\.mapdb/) {
        println "Skipping '$path' as it's company-info.mapdb"
    } else if (! path.startsWith(searchHome)) {
        println "Skipping '$path' as it doesn't belong to SEARCH_HOME ($searchHome)"
    } else if (file.isDirectory()) {
        // Recurse into the directory's children
        file.listFiles().each {
            transfer(it)
        }
    } else {
        def targetUrl = urlHelper.root() + searchHome.relativize(path).toString().replace("\\", "/")
        println "Transferring '$path' to '$targetUrl'"
        try {
            webdav.put(targetUrl, file);
            println "Transfer complete for '$path' to '$targetUrl'"
        } catch (Exception e) {
            println "ERROR transferring '$path': " + e.getMessage()
        }
    }
}

9.5. sync_ad.bat

This script is used to replicate configuration to the failover administration server. It is called by a scheduled task that executes every 10 minutes.

Usage: d:\funnelback\conf\sync_ad.bat <FAILOVER ADMIN SERVER>

@ECHO off
REM Command to sync configuration to the failover admin server.  Runs every 10 minutes via a scheduled task.
REM USAGE: sync_ad.bat <failover admin host>

IF "%~1"=="" (
@ECHO Usage: sync_ad.bat ^<failover_admin_host^>
EXIT /B 1
) ELSE IF "%SEARCH_HOME%"=="" (
@ECHO ERROR: SEARCH_HOME is not defined
EXIT /B 1
) ELSE (
SET JAVA_HOME=%SEARCH_HOME%\wbin\java
%SEARCH_HOME%\tools\groovy\bin\groovy -cp "%SEARCH_HOME%\lib\java\all\funnelback-common.jar;%SEARCH_HOME%\lib\java\all\funnelback-mediator-core.jar;%SEARCH_HOME%\lib\java\all\sardine-314.jar;%SEARCH_HOME%\lib\java\all\log4j-1.2-api-2.6.2.jar;%SEARCH_HOME%\lib\java\all\log4j-api-2.6.2.jar;%SEARCH_HOME%\lib\java\all\log4j-core-2.6.2.jar;%SEARCH_HOME%\lib\java\all\jedis-2.1.0.jar;%SEARCH_HOME%\lib\java\all\commons-pool-1.5.5.jar;%SEARCH_HOME%\lib\java\all\commons-io-2.4.jar;%SEARCH_HOME%\lib\java\all\httpclient-4.3.2.jar;%SEARCH_HOME%\lib\java\all\httpcore-4.3.2.jar;%SEARCH_HOME%\lib\java\all\commons-collections-3.2.jar;%SEARCH_HOME%\lib\java\all\slf4j-log4j12-1.6.4.jar;%SEARCH_HOME%\lib\java\all\slf4j-api-1.6.4.jar;%SEARCH_HOME%\lib\java\all\commons-codec-1.9.jar;%SEARCH_HOME%\lib\java\all\guava-21.0.jar;%SEARCH_HOME%\lib\java\all\commons-csv-1.0.jar;%SEARCH_HOME%\lib\java\all\commons-lang3-3.3.2.jar" %SEARCH_HOME%\conf\sync_ad.groovy -h %1 -s -d -v -r %SEARCH_HOME%\conf\ %SEARCH_HOME%\admin\ %SEARCH_HOME%\databases\
)

9.6. sync_conf.bat

This helper script is used to replicate configuration for a specified collection to a specified server within the cluster.

Usage: d:\funnelback\conf\sync_conf.bat <remote server> <collection ID>

@ECHO off
REM Command to sync the conf folder for a specific collection to the specified remote host.
REM USAGE: sync_conf.bat <remote host> <collection>

IF "%~1"=="" (
@ECHO Usage: sync_conf.bat ^<remote_host^> ^<collection^>
EXIT /B 1
) ELSE IF "%~2"=="" (
@ECHO Usage: sync_conf.bat ^<remote_host^> ^<collection^>
EXIT /B 1
) ELSE IF "%SEARCH_HOME%"=="" (
@ECHO ERROR: SEARCH_HOME is not defined
EXIT /B 1
) ELSE (
SET JAVA_HOME=%SEARCH_HOME%\wbin\java
%SEARCH_HOME%\tools\groovy\bin\groovy -cp "%SEARCH_HOME%\lib\java\all\funnelback-common.jar;%SEARCH_HOME%\lib\java\all\funnelback-mediator-core.jar;%SEARCH_HOME%\lib\java\all\sardine-314.jar;%SEARCH_HOME%\lib\java\all\log4j-1.2-api-2.6.2.jar;%SEARCH_HOME%\lib\java\all\log4j-api-2.6.2.jar;%SEARCH_HOME%\lib\java\all\log4j-core-2.6.2.jar;%SEARCH_HOME%\lib\java\all\jedis-2.1.0.jar;%SEARCH_HOME%\lib\java\all\commons-pool-1.5.5.jar;%SEARCH_HOME%\lib\java\all\commons-io-2.4.jar;%SEARCH_HOME%\lib\java\all\httpclient-4.3.2.jar;%SEARCH_HOME%\lib\java\all\httpcore-4.3.2.jar;%SEARCH_HOME%\lib\java\all\commons-collections-3.2.jar;%SEARCH_HOME%\lib\java\all\slf4j-log4j12-1.6.4.jar;%SEARCH_HOME%\lib\java\all\slf4j-api-1.6.4.jar;%SEARCH_HOME%\lib\java\all\commons-codec-1.9.jar;%SEARCH_HOME%\lib\java\all\guava-21.0.jar;%SEARCH_HOME%\lib\java\all\commons-csv-1.0.jar;%SEARCH_HOME%\lib\java\all\commons-lang3-3.3.2.jar" %SEARCH_HOME%\conf\sync_ad.groovy -h %1 -s -d -v -r %SEARCH_HOME%\conf\%2
)

10. Licence management

11. Managing users

  • Access to functionality within the administration interface is controlled via granular permissions and system properties.

  • Permissions and properties cover allowed collections, profiles and licences, as well as edit and view permissions for configuration, reporting and updating.

  • Permissions and properties for different types of users can be assigned to a reusable role.

  • Users can inherit their permissions from one or more roles which are combined by adding the granted permissions.

11.1. User roles

When defining permissions for roles it is a good idea to define two types of roles, to maximise the reusability of the roles:

  1. roles that define the content scope - allowed collections, profiles and licences

  2. roles that define the user type - allowed configuration, reporting and updating permissions

11.2. User creation

Access to the Funnelback administration interface is controlled via administration interface user accounts.

The administration interface user accounts are Funnelback-specific accounts (LDAP/domain accounts are not currently supported).

Funnelback administration users have access to various administration functions depending on the set of permissions applied to the user. Sets of permissions can be grouped into roles and applied to a user. This allows default permission sets to be defined and reused across many users.

Funnelback ships with five default permission sets:

  • default-administrator: full access to all available permissions.

  • default-analytics: read only access to analytics.

  • default-marketing: provides access to functionality available via the marketing dashboard.

  • default-support: provides access to functionality suitable for users in a support role.

  • default-implementer: provides access to functionality suitable for most implementers.

Roles can also be used to define sets of resources (such as collections, profiles and licences) that may be shared by multiple users.

When creating a user it is best practice to separate the resources and permissions into different roles and combine these. This allows you to create a role that groups the collections, profiles and licences available for a set of users and combine this with one or more roles that define the permission sets that apply for the user.

Tutorial: Create a resources role

A resources role defines a set of collections that a user has permissions to access. A resources role can be granted to a number of users from the same organisation with differing levels of access, as the access levels are defined in the permissions roles.

  1. Log in to the administration interface as an administrator and select manage users and roles from the system menu.

  2. Select the manage roles tile.

  3. Roles available to the current user are listed in the table. Click the add button to create a new role.

  4. Enter a name for the role (e.g. resources-foodista) then click create role.

  5. The role management screen allows the attributes of the role to be defined. As this is a resources role we will only need to complete the general information, collections and profiles sections of the screen.

  6. Under general information set a description by clicking the pencil icon that appears when you hover over the description field. Enter a description. e.g. Resources that can be managed by the foodista organisation.

  7. Under collections define the set of collections that the users should be able to manage. (e.g. select the foodista collection)

  8. Optional: Under profiles define the set of profiles that the users should be able to manage. Setting this value will depend on how the service is being used. If the search applies to a single organisation with a single set of users that should have access to the same set of profiles, enter this here. However, you may need to define separate roles for the profiles if you are running a service that caters to groups of users requiring different profiles (such as a whole-of-government search that has profiles defined for different agencies). For this exercise, select Access to all profiles on this server.

  9. The role is saved every time you make a change.

Tutorial: Create a permissions role

A permissions role defines a set of permissions that can be granted to a user. A user may be granted multiple permissions roles when they are created, with all of the individual permissions combined based on the set of roles assigned.

  1. Log in to the administration interface as an administrator and select manage users and roles from the system menu.

  2. Select the manage roles tile.

  3. Roles available to the current user are listed in the table. Click the add button to create a new role.

  4. Enter a name for the role (e.g. permissions-campaign-manager) then click create role.

  5. The role management screen allows the attributes of the role to be defined. As this is a permissions role we will not complete the collections and profiles sections of the screen as these will be managed using a resources role.

  6. Under general information set a description by clicking the pencil icon that appears when you hover over the description field. Enter a description. e.g. Campaign managers have access to best bets, curator and search analytics.

  7. Under permissions define the set of permissions that the users should be able to manage. (e.g. select the sec.best-bet, sec.curator and sec.reporter permissions). Details for each permission are displayed when hovering over the small ? icon beside each permission.

  8. The role is saved every time you make a change.

Tutorial: Create a user

This tutorial shows how to create a user and assign existing roles.

  1. Log in to the administration interface as an administrator and select manage users and roles from the system menu.

  2. Click the manage users button.

  3. Create a user with the name john@foodista. Click on the add new button.

  4. On the general information screen, enter basic user details, then click the Next: create password button.

    • Username: john@foodista

    • Full name: John Smith

    • Email: jsmith@foodista.com

  5. Enter searchanalyticsarecool as the password then click the Next: what roles does this user belong to? button.

  6. Select the appropriate roles for the user. Roles are sets of permissions that are granted allowing the user to have access to different sets of functionality within Funnelback. Select the resources-foodista and permissions-campaign-manager roles. Observe how the derived permissions change as the roles are turned on and off.

  7. The next screen allows you to assign roles that can manage this user. We’ll skip this for now. Click on the review user button to see a summary of the user setup.

  8. Click the create user button to finish the user setup. The different panels on the screen allow you to edit the various permissions assigned to the user.

  9. Verify that the user has been created by checking that the user is listed on the manage users screen. Select manage users from the breadcrumb trail.


12. Scheduling updates

Collection and analytics updates are scheduled from the administration interface.

12.1. Scheduling collection updates

Tutorial: Schedule a collection update

There are two ways of scheduling collection updates.

Method 1:

  1. Log in to the administration interface and switch to the collection you wish to schedule.

  2. Select schedule automatic updates from the update tab.

  3. Fill in the desired update time for your collection. The scheduled update times of other collections are shown to assist in choosing a suitable time.

Method 2:

  1. Log in to the administration interface and select schedule automatic updates from the system menu.

  2. Fill in the desired update time for your collections.

Tutorial: Manage collection updates
  1. Log in to the administration interface

  2. Select schedule collection updates from the system menu

  3. A table showing all the collections and their scheduled update times is shown with the ability for the times to be edited inline.

12.2. Scheduling analytics updates

Analytics for all hosted collections are bulk updated via a single scheduled update task.

The update analytics task is automatically scheduled to run at 2am daily when Funnelback is installed. This can be changed as follows:

  1. Log in to the administration interface and select schedule automatic updates from the system menu.

  2. Update the schedule for the query reports update task.

Tutorial: Excluding a collection from automatic analytics updates

Collections can be excluded from the scheduled analytics update by setting a configuration option. This prevents automatic update of the collection’s analytics (though it is still possible to start a manual analytics update on the collection).

  1. Log in to the administration interface and switch to the collection that you do not wish to update.

  2. Edit the collection.cfg and add the following:

    analytics.scheduled_database_update=false
  3. Save the collection.cfg

Under Linux:

  • Updates scheduled via the administration interface are registered as cron jobs for the search user (see the example below).

  • The schedule can be updated either via the administration interface, or by editing the crontab.
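
For example, a scheduled entry in the search user's crontab might look like the following (illustrative only: the exact command and arguments are written by Funnelback and vary by version, and example is a placeholder collection ID):

    # crontab -l -u search (illustrative entry)
    30 1 * * * /opt/funnelback/bin/update.pl /opt/funnelback/conf/example/collection.cfg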

Under Windows:

  • Updates scheduled via the administration interface are registered in the Windows task scheduler, with tasks running as the user that is set when scheduling the task.

  • The user used to schedule tasks under Windows should have a non-expiring password, otherwise updates will fail silently when the password expires.

  • The schedule can be updated either via the administration interface, or by editing the tasks in the Windows task scheduler.

12.3. Scheduling knowledge graph updates

Knowledge graph updates can only be scheduled by manually creating a crontab entry (Linux) or task scheduler task (Windows) that calls the /admin-api/task/v1/queue/UPDATE_KNOWLEDGE_GRAPH admin API.
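
For example, on Linux a crontab entry along the following lines could queue a nightly update (a sketch only: it assumes the task is queued with an HTTP POST and that the admin API accepts basic authentication with administration interface credentials; -k tolerates a self-signed certificate):

    # Queue a knowledge graph update at 3am daily (illustrative):
    0 3 * * * curl -k -u admin:ADMIN_PASSWORD -X POST "https://localhost:8443/admin-api/task/v1/queue/UPDATE_KNOWLEDGE_GRAPH"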

13. Backing up Funnelback

Ensure that any backup software does not lock the Funnelback live and offline data folders while updates are running, otherwise updates can fail. Either exclude the Funnelback data/<COLLECTION-ID>/live and data/<COLLECTION-ID>/offline folders from the backups or ensure that the backup process runs when no scheduled updates are running.

13.1. Determining backup priorities

13.1.1. Collection-level backups

Several Funnelback components can be backed up independently of each other. Generally, backup priorities are focused on restoring:

  • Collection configuration: gatherer and indexer configuration, front-end configuration.

  • Administration: analytics, accessibility reporting, data reporting, search logs

For each collection, determine a set of backup tiers aligned with collection update schedules and target restoration times. The following table might be appropriate for backing up folders within $SEARCH_HOME:

Tier | Backup interval | Retention period | Functionality | Directories
T1 | Hourly | 2 days | Query processing | /conf/
T2 | Daily | 1 year | Users; custom filters | /admin/users, /lib/java/groovy
T3 | Daily | 4 weeks | Query and click logs; query and accessibility reports | /data/<COLLECTION-ID>/archive, /admin/reports/<COLLECTION-ID>/

Name | Directories | Backup tiers
Data reports | /admin/data_report | T4
Users | /admin/users | T1, T2
Binaries | /bin, /linbin, /lib | T4
Configuration | /conf | T1
Web | /web | T4
Query & click logs | /data/<COLLECTION-ID>/archive/ | T2
Indexes | /data/<COLLECTION-ID>/live (a symlink to either /data/<COLLECTION-ID>/one or /data/<COLLECTION-ID>/two) | T4

If collection indexes in /data/<COLLECTION-ID>/live/ are not backed up, query processing for a collection can only be restored after the next collection update.

13.1.2. Application-level backups

Any custom folders within $SEARCH_HOME created as part of an implementation should also be placed in one of the tiers suggested above.

Funnelback’s collection update schedule writes to the operating system’s schedule configuration. These OS-level configuration files should also be backed up.

13.2. Restoring from backups

Revisiting our service priorities, our first goal is to ensure that Funnelback is serving search results correctly:

  1. From the command line, restore the /conf folder

  2. From the command line, restore the /data/<COLLECTION-ID>/live folder

  3. From the public interface, run a test query against a restored collection:

    http://<FUNNELBACK-SERVER>/s/search.html?collection=<COLLECTION-ID>&query=TEST

It should be noted that indexes created on one operating system cannot be used on a different operating system.

Our next priority should be to restore access to reports and analytics in the Funnelback administration interface:

  1. From the command line, restore /admin/users

  2. From the command line, restore /admin/reports/<COLLECTION-ID>/

  3. From the marketing dashboard, confirm that analytics are available for your restored collection(s) when logging in as the restored user(s)

Finally, restore the data reports and archived logs for each collection:

  1. From the command line, restore /admin/data_report

  2. From the command line, restore /data/<COLLECTION-ID>/archive

  3. From the Admin UI, confirm that data reports are available for your restored collection(s)

13.3. Push collection backups

Push collections can’t be directly backed up by copying the files located on the file system.

Backing up and restoring a push collection requires snapshotting the collection.

13.4. Review questions

  1. Explain why a collection’s offline view might not be worth backing up.

  2. Funnelback defaults to generating search analytics databases on a daily basis using /data/<COLLECTION-ID>/live/log/clicks_queries.log as input. What is the impact of not backing up clicks_queries.log?

  3. The backup tier table above suggests that live indexes only be backed up on a weekly basis. What scenarios would justify reducing this timeframe?

  4. Assume that your backup regime did not have capacity for Tier 4 backups. What would be the alternative restoration method? How would you estimate time to restoration?

  5. Assume that Funnelback services are not stopped during a backup. What are the risks of backing up collections while they are being updated and/or queried?

  6. Assume a multi-server Funnelback setup (one crawler / administration machine, two query processors) where the crawling / administration server and your backup environment have become corrupted. How would you attempt to restore functionality?

14. Monitoring Funnelback

The Funnelback software can be configured to send out an email indicating whether an update has failed or completed successfully. This is built into the product, but requires the server to be configured to send email via an SMTP server.

The Funnelback server can also be monitored using a monitoring service such as Nagios.

Monitoring checks should implement any authentication required to gain access to the pages that form part of the check.

The following sections are suggested checks.

14.1. Service availability

14.1.1. Query service - HTTP

Request

HTTP request to http://<FUNNELBACK-SERVER>/s/search.html?collection=<collection-name>

Response

Check for HTTP 200 OK response code
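
A minimal sketch of such a check using curl, suitable for wrapping as a monitoring plugin (server and collection names are placeholders; the exit codes follow Nagios conventions):

    #!/bin/sh
    # Report OK (exit 0) when the query service answers with HTTP 200, CRITICAL (exit 2) otherwise.
    STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
      "http://<FUNNELBACK-SERVER>/s/search.html?collection=<collection-name>")
    if [ "$STATUS" = "200" ]; then
      echo "OK - HTTP $STATUS"; exit 0
    else
      echo "CRITICAL - HTTP $STATUS"; exit 2
    fi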

14.1.2. Query service - HTTPS

Request

HTTP request to https://<FUNNELBACK-SERVER>/s/search.html?collection=<collection-name>

Response

Check for HTTP 200 OK response code

14.1.3. Admin service (Classic)

Request

HTTP request to https://<FUNNELBACK-SERVER>:8443/search/admin/

Response

Check for HTTP 200 OK response code

14.1.4. Admin service

Request

HTTP request to https://<FUNNELBACK-SERVER>:8443/a/

Response

Check for HTTP 200 OK response code

14.2. Service response times

Request

HTTP/S request to http(s)://<FUNNELBACK-SERVER>/s/search.html?collection=<collection-name>&query=<known-query>

Response

Measure response time

14.3. Query check

Request

HTTP/S request to http(s)://<FUNNELBACK-SERVER>/s/search.html?collection=<collection-name>&query=<known-query>

Response

Check the response for the presence of an expected result, e.g. for query=mawson a response containing the word lakes may be expected.
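
A sketch of this check using curl and grep, with the mawson/lakes pair from the example above (Nagios-style exit codes):

    #!/bin/sh
    # Exit 0 if the expected word appears in the response, 2 otherwise.
    if curl -s "http://<FUNNELBACK-SERVER>/s/search.html?collection=<collection-name>&query=mawson" \
      | grep -q "lakes"; then
      echo "OK - expected result present"; exit 0
    else
      echo "CRITICAL - expected result missing"; exit 2
    fi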

14.4. Index freshness

Request

HTTP/S request to http(s)://<FUNNELBACK-SERVER>/s/search.xml?collection=<collection-name>&query=<known-query>

Response

Extract the Report last updated: DD MMM YYYY value from the page source.

14.5. Push collection checks

The Funnelback push API provides a number of calls grouped under the Push API health heading that can be used to monitor the status of push collections.

14.6. Analytics database age

Request

HTTP request to https://<FUNNELBACK-SERVER>:8443/search/admin/analytics/Dashboard?collection=<collection-name>

Response

Extract the Report last updated: DD MMM YYYY value from the page source.

14.7. Server attributes

Monitoring of server attributes (disk space, CPU load, memory utilisation etc.) should be performed using standard monitoring procedures.

15. Securing Funnelback

Tutorial: Suppress the collection listing screen
  1. The collection listing screen is displayed when you access the search.html endpoint without any parameters. e.g. http://training-search.clients.funnelback.com/s/search.html. The default behaviour is to return a list of collections that are hosted on the server.

  2. This screen is defined by the no-collection template. To override this to return a blank screen create a $SEARCH_HOME/web/templates/modernui/no-collection.ftl with the following contents:

    <!doctype html>
    <html>
    <head></head>
    <body></body>
    </html>
  3. The change takes effect as soon as the file is saved.

16. Firewalling

If a firewall is configured on the Funnelback server it must allow inbound access to the search and administration interfaces.

Outbound access will need to be allowed to facilitate Funnelback’s access to the repositories being indexed (e.g. for HP TRIM this often means allowing port 1137).

If the service is extended to a multi-server setup then ports to allow WebDAV access between the servers will need to be opened. The tables below summarise ports that are used for provision of Funnelback services, and commonly used for access to various repository types.

17. Upgrading Funnelback

17.1. Before you upgrade

Funnelback will attempt to upgrade as much as possible automatically; however, it is sometimes necessary to perform some manual upgrade steps.

Read the release notes for each version between the version you are currently running and the version you are upgrading to.

The release notes highlight any manual changes that are required between versions as well as other things that can affect a successful upgrade.

17.2. Prepare Funnelback for upgrade

If you are upgrading from Funnelback 14.0 or newer, Funnelback must be prepared for upgrade.

Preparing Funnelback for upgrade places any hosted push collections into the correct state for an upgrade.

17.2.1. Prepare for upgrade via the administration interface

Funnelback can be prepared by selecting prepare for upgrade from the system menu within the administration interface.

17.2.2. Prepare for upgrade via the push API

Access to the administration interface is not always possible when working on a Funnelback installation.

The push API can also be used to prepare Funnelback for upgrade, and is the preferred method when you don’t have access to the administration interface via a web browser.

POST /push-api/v1/upgrade/prepare

An interface to access this call directly is available within Funnelback’s administration interface (https://<FUNNELBACK-SERVER>:<ADMIN-PORT>/search/admin/api-ui/) under the Push API tab in the push-api-collection section.

This can be called using curl, as shown in the example below.

Note: Jetty must be running on the server for this call to work.
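
For example, a hedged sketch assuming the push API accepts HTTP basic authentication with administration interface credentials:

    # -u admin prompts for the admin user's password; -k tolerates a self-signed certificate:
    curl -k -u admin -X POST "https://<FUNNELBACK-SERVER>:<ADMIN-PORT>/push-api/v1/upgrade/prepare"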

Once the API call completes you can run the installer.

18. Performance tuning

18.1. General collection configuration and gatherer settings

Within the Funnelback application are a number of features that allow performance to be tuned in crawling, indexing and query processing.

18.1.1. Crawling/gathering

Crawl settings can be used to decrease the time required to download content from a target repository. Note that crawl times will still be limited by the maximum response rate of the targeted repository (e.g. a TRIM repository that can only process one record per second caps the overall gather rate).

Settings available vary depending on gatherer type.

Analyse the gatherer logs
  • Exclude items that are not required for the search index or that are generating errors in the logs.

  • If there are lots of repeated URLs it is likely that the JavaScript link extractor is the cause. Either disable the JavaScript link extractor, or exclude the items that are producing the repeated errors.

  • Look out for crawler traps and add appropriate exclusions.

  • Adjust the allowed maximum download size to an appropriate value (the default is 3MB).

Multithreaded crawling

When crawling web sites, the Funnelback crawler will only submit one request at a time per server by default (e.g. it waits for the first request to Site A to complete before requesting another page from Site A). If the web server has additional capacity, the Funnelback crawler can be configured to send simultaneous requests to the same server (e.g. request two pages at a time from Site A). Site profiles allow this to be controlled on a per-host basis, as sketched below.
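
As a purely hypothetical illustration (the key name and value syntax below are invented for illustration; consult the site profiles documentation for your Funnelback version before use):

    # conf/<COLLECTION-ID>/site_profiles.cfg (hypothetical syntax)
    # Allow two simultaneous requests to this host; other hosts keep the default of one.
    www.example.com max_simultaneous_requests=2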

TRIM gathering has similar settings – in peak times the crawler will run single-threaded, but can be configured to run with multiple threads in off-peak times (or all the time).

Incremental updates

When first updating a web collection, Funnelback will download all files matching its include patterns. For subsequent updates, however, the crawler can issue HEAD requests to the targeted web server, retrieving only the document’s headers (including its size in bytes). If the size is unchanged since the file was last downloaded, the crawler skips the full GET request.

Other collection types employ similar techniques. E.g. TRIM collections only gather items that have changed since a defined time. Fileshare collections will copy content from cache if it hasn’t changed since the last update.
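
The idea behind the HEAD request can be illustrated with curl: -I fetches only the response headers, so the size can be compared without transferring the body (where the server reports a Content-Length):

    # Fetch headers only and inspect the reported document size:
    curl -sI "http://www.example.com/page.html" | grep -i '^Content-Length'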

Disable the accessibility and content auditor reports
Accessibility and content auditing requires a lot of document processing during the crawl – to parse and analyse each document.

Disabling these reports means that there is no need to perform this extra document filtering.

Request throughput

Web collections only. When requesting content from a web site, the Funnelback crawler will wait between each subsequent request. The crawler can be configured to wait for greater or lesser periods, to consider the period as static or dynamic and to increase or decrease timeout periods.

In addition the crawler can be configured to retry connections to timed out pages. This setting may increase the likelihood of downloading certain pages (e.g. CMS pages that take a long time to generate) at the cost of increasing the overall update time required. Site profiles can be used to define this on a per-host basis.

WARC format

Funnelback historically stored downloaded content in a folder structure that mirrored the data source. This resulted in a file on the filesystem for each document that was downloaded.

Recent versions of Funnelback are now configured to use WARC storage, which causes all of the downloaded content to be stored within a single compressed file.

At indexing time, the Funnelback indexer will only have to open one file, rather than many, decreasing the time required. WARC storage also decreases hard disk size requirements, and is much more efficient for copying processes as a single large file is being copied compared to numerous small files.

18.1.2. Indexer settings

The indexer has various settings that can affect the time required to build the search indexes. The settings are predominantly memory-related.

18.1.3. Query processor and user interface settings

When a query is issued to Funnelback, padre-sw (the Funnelback query processor binary) takes the query and applies the configured query processor options to run it against the index. padre-sw returns an XML packet to the UI, which transforms it into the data model that underpins the modern UI.

The following techniques can be used to optimise the query processing within Funnelback. Optimising the query will save on processing time and memory usage and is encouraged for every implementation.

Reduce the size of the XML returned by padre-sw

This is controlled by the query processor options that are applied to the query. Query processor options are sourced from the following:

  • collection.cfg: query_processor_options configuration value

  • padre_opts.cfg: query_processor_options configuration value

  • faceted_navigation.cfg: qpopts value

  • CGI parameters passed in with the query

Try to address the following:

  • Minimise the value of num_ranks. Most of the time this is limited to 10 results. However, if you are only interested in result counts (e.g. for faceted navigation extra searches) then set num_ranks=1

  • Always use DAAT mode where possible (-daat=<non-zero value>), and set DAAT to the lowest appropriate value

  • If system generated summaries are not required then disable them. This is done by setting the SM value to either -SM=meta (use metadata summaries) or -SM=off (turns off summaries altogether)

  • If using -SM=meta ensure that the returned summary fields (-SF option) list only those metadata classes that are being used in the templates. Avoid wildcard parameters such as -SF=[.*]

  • Minimise the size of the metadata and summary buffers (-MBL, -SBL values respectively).

  • If best bets are not required disable them (v14.2 and earlier) (-bb=false). In v15 this is disabled by turning off curator (see next section).

  • Limit gscope counts to just those scopes required (e.g. -countgbits=0,1,2 vs -countgbits=all)

  • Limit the depth of -counturls (used by URL facets)

  • Turn off spelling suggestions (-spelling=off)

  • If results ranking is not required then use -vsimple=on

  • If contextual navigation is unused ensure it is disabled (contextual_navigation.cfg)

  • If quicklinks are unused ensure they are disabled (quicklinks.cfg)

  • Disable query blending if unused (-qsup=off)

  • If result collapsing is unused ensure it is disabled. If used then minimise the -collapsing_num_ranks and only apply -collapsing_SF values to those metadata values that are being used in the template.

  • Consider scoping the result set where appropriate

  • If using geospatial search limit the result set using the -maxdist parameter

  • Turn off the query syntax tree (-show_qsyntax_tree=off)

  • For enterprise search ensure that user key caching is configured and set to an appropriate value.

  • Disable extra searches that are not required.

  • Ensure hook scripts only run on the appropriate searches (e.g. disable them for extra searches if not required).
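
Bringing several of these together, a trimmed-down option set in collection.cfg might look like the following sketch (the flag values and the title,author metadata classes are illustrative only; tune them to your implementation):

    query_processor_options=-num_ranks=10 -daat=500000 -SM=meta -SF=[title,author] -spelling=off -qsup=off -countgbits=0,1 -show_qsyntax_tree=off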

Optimise the Modern UI

Consider doing the following:

  • Minimise the number of extra searches that run as part of any query.

  • Disable curator by setting the curator=off CGI parameter.

  • If extra searches are required always use extra results (per query) in preference to extra search (per result) as the overheads are much smaller.

  • Always use extra results (per query) extra searches in preference to other methods of running extra searches (e.g. via ajax)

  • If lookups are required as part of results formatting (e.g. postcode lookup) then consider loading this data into the custom data element of the data model instead of performing a query.

  • Minimise the amount of processing required within hook scripts (e.g. title rewrites can be done via a filter script at crawl time rather than dynamically at query time)

  • If integrating with search.json or search.xml consider removing data model elements that you are not using via hook scripts.

Use an XSL transform to display XML records to the end user instead of a custom Freemarker template

Using an XSL transform avoids the overheads from running a padre query as the XSL transform is accessed directly by the cache controller.

User key caching

User keys can be cached by Funnelback for a set period of time to speed up query processing. This means that a user’s keys will be looked up in the key cache instead of being requested from the target repository at request time. This is typically set to cache for 24 hours.

The trade-off is that key information can become out of date when a user’s permissions change. Changes in user permissions won’t be detected until the cache expires, meaning that a user may see results that should not be in the result set (or not see results that should be), depending on the difference in permissions. Note: this won’t affect the user’s real permissions on a document – this check is always enforced by the target system at the time the item is requested. The only effect is on whether or not a result appears in the result set for the given user.

18.2. Multi server architecture

Funnelback can be scaled horizontally across multiple servers (e.g. installing Funnelback on additional servers in a load balanced environment, or splitting server roles, such as having a server dedicated to admin/crawling and servers dedicated to serving queries) and vertically (e.g. providing more RAM, CPU cores or disk resources).

18.3. Virtual machine optimisation

Funnelback is commonly deployed on virtualised server environments. When deploying onto a virtual server there are a number of factors that can dramatically affect the performance of Funnelback:

  • Where possible, dedicate resources to the Funnelback VM. Latency when accessing resources (particularly memory) can have a dramatic effect on performance, both during updates and query processing.

  • Do not over-allocate resources on the VM host. If the server has X GB of RAM or Y CPUs then don’t allocate more than this in total to all the VMs.

  • Leave enough resources on the VM host for OS functions (the VM host normally handles IO for all the VM clients).

  • Access to high speed disks will improve performance.

19. Installing database drivers

Funnelback ships with JDBC database drivers for PostgreSQL.

Other databases are supported via JDBC, but this requires installation of the relevant JDBC drivers.

Database drivers are installed on Funnelback by copying the relevant jar file to a specific folder on the Funnelback server.

Funnelback will automatically detect the driver once the file is copied in to the correct location and has the appropriate permissions.

Make sure you only have one version of the JDBC driver for any particular database present in the dbgather folder, otherwise you may get errors when you attempt to start the database update. Remove other versions of the same driver, or rename them so they don’t have a .jar file extension.
Tutorial: Install a database driver

To install a database driver:

  1. Download the driver files from the database vendor and copy the relevant JDBC .jar file to $SEARCH_HOME/lib/java/dbgather/ on the Funnelback server.

  2. Ensure that the file is owned by the search user and has group and owner read/write permissions, and public read permission.

    chown search:search $SEARCH_HOME/lib/java/dbgather/mydriver.jar
    chmod 664 $SEARCH_HOME/lib/java/dbgather/mydriver.jar

20. Maintenance tasks

20.1. Rename a collection / profile / service

The renaming of collections/profiles/services is not recommended and risks losing analytics and WCAG data. Consider using generic naming when creating collections (e.g. for the Department of Health consider using a collection name of health instead of DOH). Using a generic name future-proofs against any need to change names.

If a new name is required it is better to create a new collection and migrate the configuration and analytics from the old collection to the new collection.

20.2. Rebuild the analytics database

  1. Log in to the administration interface and switch to the affected collection

  2. Edit the collection.cfg by selecting browse collection configuration files from the administer tab, then click on the collection.cfg link in the file manager display.

  3. Disable incremental analytics updates by adding the following line then save the file:

    analytics.reports.disable_incremental_reporting=true
  4. Select update analytics now from the analytics tab.

  5. The analytics reports for the collection will contain data once the database has finished rebuilding. This can take a while for large collections.

  6. During the rebuild messages are logged to

    $SEARCH_HOME/data/<COLLECTION-ID>/log/update_reports.log
  7. Once the analytics build is finished edit the collection.cfg and remove the configuration line added above, or set

    analytics.reports.disable_incremental_reporting=false

20.3. Migrating analytics from one collection to another

  1. Move the query and click log files from $SEARCH_HOME/data/<SOURCE-COLLECTION>/archive to $SEARCH_HOME/data/<DESTINATION-COLLECTION>/archive (see the sketch after this list).

  2. Log into the administration interface and switch to <DESTINATION-COLLECTION>.

  3. Rebuild the analytics database for <DESTINATION-COLLECTION>. (Note: a full rebuild is required - don’t just update the analytics report).
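
A sketch of step 1 on Linux, assuming both collections exist on the same server (run as the search user):

    # Move the archived query and click logs to the destination collection:
    mv $SEARCH_HOME/data/<SOURCE-COLLECTION>/archive/* \
       $SEARCH_HOME/data/<DESTINATION-COLLECTION>/archive/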

20.4. Moving a collection from one Funnelback server to another

20.4.1. Non-push collections

The following folders hold configuration and data for a Funnelback collection:

  • $SEARCH_HOME/conf/<COLLECTION-ID>

  • $SEARCH_HOME/data/<COLLECTION-ID>

  • $SEARCH_HOME/admin/reports/<COLLECTION-ID>

  • $SEARCH_HOME/admin/data_report/<COLLECTION-ID>

Of these only the $SEARCH_HOME/conf/<COLLECTION-ID>/ folder is absolutely required, along with $SEARCH_HOME/data/<COLLECTION-ID>/archive which contains all the historical query data (if analytics should be preserved on the new server).

  1. Transfer the collection’s configuration folder from the old server to the new server. It’s normally easiest to compress the collection’s conf folder on the source machine, transfer the file, then decompress it inside the conf folder on the target machine (see the sketch after this list). If you have $SEARCH_HOME/conf/<COLLECTION-ID> on the source machine you should end up with $SEARCH_HOME/conf/<COLLECTION-ID> on the target machine.

  2. If you are moving from Linux to Windows or vice versa, edit the collection.cfg file and ensure any expanded references to $SEARCH_HOME are converted to the equivalent for the target OS (e.g. if going from Windows to Linux you might convert any references to d:\Funnelback to /opt/funnelback). Any workflow will need to be similarly updated, and any bash/shell scripts converted to equivalents that are suitable for the target OS.

  3. Create the collection’s folder structure by running

    $SEARCH_HOME/bin/create_collection.pl $SEARCH_HOME/conf/<COLLECTION-ID>/collection.cfg -l=<LICENCE-ID>
  4. If you don’t know your licence ID you can omit the -l option and attach the licence from the administration interface after you’ve run the create collection command. However, until a licence is assigned the collection cannot be updated.

  5. Once the collection has an assigned licence you should be able to update the collection. However, you may wish to migrate the collection’s data and analytics across as well.

  6. If you wish to bring across all of the existing data, transfer the collection’s complete data folder from the source machine and unpack it on the target machine, ensuring paths and symlinks are preserved.

  7. If you wish to only transfer the collection’s historical query data then just transfer the $SEARCH_HOME/data/<COLLECTION-ID>/archive folder.

  8. If you wish to transfer existing analytics data across then transfer the admin folders listed above.

  9. If Funnelback on the source and target machines are both the same version (and same OS) you should be able to start querying the collection (assuming the complete data folder was transferred).

  10. If Funnelback on the source and target machines are different versions or you didn’t transfer all of the data across then you will need to update the collection. The type of update you need to run will depend on what you’ve copied. If you copied the complete data folders across then you should be able to rebuild the index from the data you’ve copied – run a reindex of the live view. Upon completion you should have a working index. If you didn’t copy the complete data folder across (and just copied archive logs, or nothing at all) then you’ll need to run a normal update of the collection – just start an update from the administration interface.

  11. If you just copied the archive logs across and wish to rebuild the analytics database start an analytics update from the analytics tab for the collection.
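
A sketch of the compress/transfer/decompress approach from step 1, on Linux (host names and paths are placeholders):

    # On the source server:
    cd $SEARCH_HOME/conf
    tar czf /tmp/<COLLECTION-ID>-conf.tar.gz <COLLECTION-ID>
    scp /tmp/<COLLECTION-ID>-conf.tar.gz search@<TARGET-SERVER>:/tmp/

    # On the target server:
    cd $SEARCH_HOME/conf
    tar xzf /tmp/<COLLECTION-ID>-conf.tar.gz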

20.4.2. Push collections

If you need to move a push collection you should ensure that the source and target machines are running the same Funnelback version.

  1. If possible block access to the push collections so no more add/remove operations can be run against the push collection (e.g. define a firewall rule to prevent access).

  2. Snapshot the push collection by calling the snapshot push API call.

    PUT /push-api/v1/collections/<COLLECTION-ID>/snapshot/<SNAPSHOT-NAME>
  3. On the target server create a new push collection with exactly the same collection ID.

  4. Ensure the new collection is stopped by checking the state of the collection

    GET /push-api/v1/collections/<COLLECTION-ID>/state
  5. If the collection is running then stop it

    POST /push-api/v1/collections/<COLLECTION-ID>/state/stop
  6. Copy the configuration folder for the push collection ($SEARCH_HOME/conf/<COLLECTION-ID>/) from the old server to the new server, overwriting all files.

  7. Copy the snapshot from the old server (from $SEARCH_HOME/data/<COLLECTION-ID>/snapshot/<SNAPSHOT-NAME>) to the live folder for the collection on the new server ($SEARCH_HOME/data/<COLLECTION-ID>/live/)

  8. Start the push collection on the new server

  9. Rebuild the analytics.

20.5. Starting, stopping or restarting a collection update

The administration interface allows an administrator to start, stop and restart collection updates.

Note: this does not apply to push collections or meta collections.

  • Meta collections are never updated

  • Push collection updates are controlled via the push collection API.

20.5.1. Starting an update

Collection updates are started either by an automatic schedule (see scheduling updates) or by manually starting an update from the administration interface. The types of available update modes will depend on the type of collection.

20.5.2. Stopping an update

A running collection update can be stopped by switching to the collection in the administration interface and clicking the stop update button from the update tab, or by clicking the stop icon in the actions column under the collection overview listing.

20.5.3. Restarting a stopped update

An update for a collection may be restarted by selecting advanced update from the update tab for a collection, then choosing the phase from which to restart the update.

You can restart an update from a different point than where it was stopped. For example:

  • Restart an update that was halted at crawl time at the indexing phase to just build an index on whatever was crawled so far.

  • Restart an update that completed after reaching the crawl timeout (the views were swapped) at the crawl phase, to continue crawling from the point that had been reached when the timeout expired.

20.5.4. Killing and resetting a stalled update

If an update becomes stuck it can usually be stopped via the administration interface (see the instructions above for stopping an update). If this fails to stop the update and it remains in a stopping state, the following steps can be used to reset the update:

  1. Log in to the server as the search user and check the running processes to see if there is an update running for the collection. e.g. ps -fewjaH | grep <COLLECTION-ID>.

  2. If there is an update running (there will be an update.pl process running with the collection ID as part of the arguments) you may need to kill the process.

  3. If there isn’t a process running for the update or you have killed it then you can safely clear the locks on the collection. This can be done in two ways:

    1. From the administration interface navigate to the update tab for the collection that is stuck and click on the clear locks link that appears as part of the stop update button

    2. or run the following command on the command line: $SEARCH_HOME/bin/mediator.pl ClearLocks collection=<COLLECTION-ID>

  4. Refresh the administration interface and the lock should be cleared.

  5. Optionally start or restart the update.

20.6. Administration interface read-only mode

The administration interface can be put into a read-only mode. When read-only mode is enabled, administration interface users are unable to make any configuration changes.

This allows a change freeze to be enforced on a Funnelback server, enabling maintenance or upgrade work to be carried out.

Enabling read-only mode requires back end access to the Funnelback server.

To enable read-only mode, log in to the Funnelback server and edit the global.cfg, adding the following line:

admin.read-only-mode=true

To disable read-only mode set the option to false (or remove it completely from global.cfg):

admin.read-only-mode=false

The administration interface displays a banner indicating that read-only mode is active when it is enabled. Changes to read-only mode should be reflected as soon as the global.cfg is saved.

20.7. Install a new licence key

The licence key manager can be accessed from the administration interface’s system menu and allows an administrator to add, remove and update licence keys.

The Funnelback admin API also provides a number of API calls that can be used to manage Funnelback licences.

Make sure you keep your licence keys safe. Once installed you cannot retrieve the licence key string from Funnelback and you will require the key string again if you move Funnelback to a new server.

20.7.1. Update an existing licence key

  1. Select manage licences from the system menu.

  2. Choose the licence that you wish to update from the list of installed licences. Fields that can be edited display a pencil icon at the end of the field. To install an updated licence, click the update icon that appears after the expires field and paste in the new licence key.

20.7.2. Install an additional licence key

  1. Select manage licences from the system menu.

  2. Click the add new button and complete the fields.

  3. After the licence is installed it can be associated with existing collections or added to new collections as they are created.

20.7.3. Remove a licence key

  1. Select manage licences from the system menu.

  2. Locate the licence that should be removed from the list of installed licences.

  3. Click the trash can icon that appears on the right hand side of the entry when hovering over the licence.

20.8. Set the administration interface banner message

A banner message can be displayed within the administration interface.

To set the content of the banner create a file $SEARCH_HOME/share/include/motd.txt containing the HTML code for the banner.

e.g. the following motd.txt file produces the licence message banner that is displayed within the training VM:

<p style="background-color:pink;border-style:solid;border-width:1px;padding:5px;font-weight:bold">This VM is licenced for training purposes only.</p>

21. Capacity planning

Several rules of thumb can assist in estimating hardware resources for a given Funnelback deployment; however, measuring resource usage during actual production deployments is likely to give a more accurate indication of any increases (or decreases) in hardware requirements over time.

Each Funnelback deployment will vary significantly from another with respect to:

  • Collection size

  • Collection type

  • Index size

  • Content types

  • Number of collections

  • Frequency of collection updates

  • Aggressiveness of collection updates

  • Query volume

  • Types of queries

  • Logging history

  • Reporting history

Each metric will have different requirements from underlying hardware and network resources.

Most reports on these metrics are available directly from the Funnelback Admin UI. Several OS-level figures can be generated from the command line.

21.1. Number of collections

The number of collections on its own is not necessarily a useful metric for capacity planning: a collection that is never updated or never queried only consumes hard disk resources.

Several methods are available to view the total number of collections:

  • From the Mediator, output results of the 'List Collections' command in your web browser:

    https://localhost:8443/search/admin/mediator/ListCollections
  • From the Admin UI, click the 'View full collection status' button.

21.2. Collection size

The size of a collection can be measured by:

  • The number of documents in a collection and/or

  • The uncompressed size of an index of a collection

  • The size on disk of the collection (index, logs, cached document copies)

Several methods are available to determine these metrics.

  • From the Admin UI, view the full collection status, noting the 'Num. Docs' column on the resulting report (Collection Overview > View full collection status)

  • Alternatively, from the Public UI, view the JSON output for a test query:

    http://localhost:8080/s/search.json?collection=funnelback_documentation&query=test
  • Observe the 'collectionSize' value (/response/resultPacket/details/collectionSize), which displays the number of documents in the index (1067):

    collectionSize: ".../data/funnelback_documentation/live/idx/index: 17.3 MB, 1067 docs",
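
  • If jq is installed, the same value can be extracted on the command line using the data model path noted above:

    curl -s 'http://localhost:8080/s/search.json?collection=funnelback_documentation&query=test' \
      | jq -r '.response.resultPacket.details.collectionSize'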

21.3. Index size

An index’s size on disk can best be estimated using operating system-level commands.

  • From the command line calculate the size of the live and offline indexes for the 'funnelback_documentation' collection using:

$ du -hsc $SEARCH_HOME/data/funnelback_documentation/live/idx $SEARCH_HOME/data/funnelback_documentation/offline/idx

This command outputs:

7.9M    /opt/funnelback/data/funnelback_documentation/live/idx
7.8M    /opt/funnelback/data/funnelback_documentation/offline/idx
16M     total

Giving us a total size of 16MB on disk for both the live and offline views of the index.

21.3.1. Push collections

21.4. Cached documents size

By default, most collection types keep filtered, compressed versions of indexed documents on disk in WARC store format.

  • From the command line calculate the size of the live and offline cached documents for the example collection using:

    $ du -hsc $SEARCH_HOME/data/example/live/data $SEARCH_HOME/data/example/offline/data

21.5. Log size

Funnelback retains collection update logs, click and query logs, and application logs.

  • From the command line, calculate the size of all logs associated with the funnelback_documentation collection using:

    $ du -hsc $SEARCH_HOME/data/funnelback_documentation/live/log $SEARCH_HOME/data/funnelback_documentation/offline/log $SEARCH_HOME/data/funnelback_documentation/log
  • From the command line, calculate the size of all application-level logs associated with Funnelback using:

    $ du -hsc $SEARCH_HOME/log $SEARCH_HOME/web/logs

21.6. Collection update monitoring

Detailed statistics are available for the current and next-most-current updates for a given collection, showing the resources (memory, bandwidth, threads) used during an update.

  • From the administration interface, observe the monitor graph for the live view of a collection (Administer > Browse Log Files > Live log files > monitor.log (Graph)).

21.7. Collection update history

A collection update history shows the occurrences and duration of a collection’s update phases over time - typically, the 'gather' phase will take longer than other phases, but each phase can be examined in isolation.

  • From the Admin UI, observe a collection update history graph (analyse > view collection update history); alternatively, from the collection overview > view full collection status, click a collection’s graph button.

  • Ideally, a collection’s update history times should be consistent: incremental growth in update times is expected as a collection’s contents grow over time. A graph showing sporadic update times and wildly inconsistent update durations warrants further investigation.

21.8. Query volume and history

Query volumes are best examined using several reports from usage analytics in the Funnelback administration interface.

  • From the administration interface display the query volume over the life of the service (Admin > Analyse > View Query Reports Dashboard > Select Report: Usage Summary > Select Timeframe: All Time)

  • From the administration interface display the queries per hour over the life of the service (Admin > Analyse > View Query Reports Dashboard > Select Report: Queries per Hour > Select Timeframe: All Time)