This course is currently in draft.

Introduction

This course is for solution architects and frontend developers and provides an introduction to designing and implementing a website search using Funnelback.

Special tool tips or hints will appear throughout the exercises as well, which will provide extra knowledge or tips. They will look like:

This box is used to provide links to further reading available on the current topic.
This box is used to provide tips or hints related to the current topic.
This box is used to provide important advice relating to the current topic. This includes advice to be aware of to avoid common errors.
This box contains advice specific to the Linux version of Funnelback.
This box contains advice specific to the Windows version of Funnelback.

What this workshop will cover:

  • Requirements gathering

  • Solution design

Prerequisites to completing the course:

  • Funnelback 203 training.

  • Experience with Funnelback implementation.

1. Background

This course looks at what is involved in designing a website search solution and provides a methodology for gathering requirements and translating these into a functional design.

The course is aimed primarily at solution architects who are designing the search on behalf of a customer, but the information is also useful if you are designing the search for yourself.

There are a number of things that should be considered when planning and designing a website search, and each of them should be addressed as part of the process.

2. Planning considerations

2.1. Understand the purpose of the search

It is critically important to understand the purpose of the search and the problems it aims to solve.

This should be considered before anything else, because the design of an effective search depends entirely on it.

As part of understanding the purpose of the search:

  • Consider the audience that you are targeting with the search - e.g.

    • Is the audience technical or non-technical? (this will impact decisions such as whether or not to offer an advanced search)

    • Is the audience highly familiar with the internal workings of your organisation/site? (e.g. understands internal language or how things are organised or are they the general public?)

    • Is the search aimed at answering very specific questions, or at addressing more general questions with relevant answers?

  • Consider how the content may need to be scoped

2.2. Understand the current pain points

Understanding the current pain points is an extremely important factor in delivering an effective search.

Think about the business goals that the search is trying to solve and focus on the pain points in the current solution.

If the goal is to replace an existing search then determine what currently works well and what causes frustration.

2.3. Mock up search results screens

It is always worth undertaking an exercise to create some basic mock ups of the search results screens.

This is a really important step that helps the understanding of the customer’s requirements and exposes underlying data requirements. It also helps the customer to understand more abstract requirements (such as the need for metadata to power features that are often requested).

3. High level design goals

When working through the requirements always try to deliver a solution that makes use of standard functionality within Funnelback.

Always keep in mind the purpose of the search and the problems that it should be solving.

3.1. Question requirements which do not make sense

This may sound quite obvious, but it is one area where many solution designers fail the customer.

It is important to remember that your expertise is a valuable service that the customer is paying for and that sometimes the customer may not really know what they want (even if they think they do know).

So, don’t treat the customer’s requirements as gospel and blindly agree to all the requests. While you must understand the customer’s perceived requirements, do not be afraid to provide advice when requirements don’t contribute to achieving the purpose of the search or if there is a better way to achieve the desired outcome. Explain why you feel a requirement will be detrimental to the search; if the customer still insists on the requirement being included then at least they have made an informed decision.

If designing a search for yourself carefully consider the requirements and assess each for the benefit it will provide when compared with any increase in complexity to the search design. This will help to determine which of the requirements are really important and which ones are just nice to have (and in fact which ones might actually diminish the effectiveness of the search).

3.2. Keep it simple

There are many reasons why keeping it simple will contribute to the effectiveness of a search. These include:

  • The solution focuses on solving the core problems. This makes the search more effective and easier to use as users are not presented with features that provide little benefit.

  • The solution is easier to understand, maintain and support.

3.3. Avoid customisation

Customisation beyond the result templates should be avoided where possible, as it can affect the ability to upgrade easily and can also have other unintended side effects on the correct functioning of the search.

Where possible avoid solutions that:

  • Require custom filters, especially if there is a dependency on the structure of the markup of the content.

  • Require manipulation of the query or response data via hook scripts.

  • Require customisation of the faceted navigation or query completion behaviour.

The following can be used as part of the argument for using standard functionality:

  • Implementation time is reduced so it’s cheaper for the customer to make use of standard functionality.

  • Upgrades are simplified as standard functionality will be upgraded automatically by Funnelback when an upgrade occurs. This reduces the cost of upgrading to a newer version of Funnelback.

  • Bugs within standard functionality can be reported to the developers and patched at no cost to the customer (depending on the nature of the bug). Bugs in custom functionality require custom development and are unsupported.

4. Initial investigation

There will usually be some very high-level requirements or goals that are known when starting a search solution design project.

It is often useful to spend some time using this information to conduct an initial investigation of the site or sites that are to be searched, and to put together a rapidly built prototype that can assist both yourself and the customer in understanding what can be achieved with the search and the available data.

4.1. Creating a prototype

The first step to creating a useful prototype is to spend some time familiarising yourself with any freely available content and building up a picture of what is possible given the available data, and of what could be used to provide relevant functionality for the search given any known design goals.

For example, if the broad goal is to replace an existing site’s search start by looking at the following:

  • Have a look at any existing search to see what features it has - this will provide a good starting point for any prototype. Evaluate each feature in terms of its current and potential effectiveness.

  • Have a look at the customer’s available content and evaluate this for potential useful search features.

  • This could include getting access to database content that might populate parts of a customer website, or investigating which social media channels are used by the customer. YouTube in particular is often a very useful addition to a customer site search as it provides another information channel that is often not hosted on their site.

  • Investigate what metadata is available and what might contribute to enhancement of the search.

  • Sometimes it is worth mocking up some content to also prototype how a feature might work if the information is not available or not directly available to the search. Having a working prototype will provide a useful reference for the customer and also for the requirements gathering session.

5. Requirements gathering

Requirements gathering is the process of determining all the details that are required for the search. The requirements will inform how the search is built. The requirements gathering should be undertaken with the owner and representatives of other important users of the search.

5.1. Consider what repositories should be covered

Determine all the different content repositories that should be covered by the search. This is determined by the individual result items and what is expected to be a possible search result.

Are there any useful social media repositories that would enhance the search?

Is there any database/XML/CSV/JSON record data that would be useful to add to the search (such as a people directory)?

For each repository:

  • Think about what should be included and excluded from the repository. The search will provide more accurate results if ephemeral content is not included, as this produces noise in the search index. E.g. consider excluding pages that just list items and focus on indexing the items themselves. Educate the customer on effective use of robots.txt, robots meta tags and also Funnelback no-index tags (see the example markup after this list).

  • Does the site have an appropriate sitemap file that could be used?

  • Are there any calendars or existing searches/directory browse functions that are likely to be crawler traps and should be excluded?

  • Are there any binary file types that should be included?

  • Is 10MB an appropriate maximum file size to accept for the index?
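As a concrete example of these exclusion controls, a sketch of the markup that can be discussed with the customer is shown below. The robots meta tag is a web standard; the noindex comment tags are Funnelback-specific and exclude a region of a page (such as headers, footers and navigation) from the index while still allowing the crawler to follow links within it.

    <!-- Robots meta tag: exclude this page from the index entirely -->
    <meta name="robots" content="noindex, nofollow">

    <!-- Funnelback no-index tags: the wrapped region is excluded from the
         index, but links within it are still followed by the crawler -->
    <!--noindex-->
      <nav>...site navigation...</nav>
    <!--endnoindex-->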

5.2. The search results screens

When investigating the requirements for a search always start with an exercise to determine all the different search results screens that will be required.

For most searches this will just be a single search results screen but for more complicated searches Funnelback may be required to power a number of search results pages.

A search results page is any listing of search results that are returned by Funnelback and includes:

  • Standard search results listings (either as templated HTML or as JSON/XML). This is any interactive search results page and includes site search as well as any specific searches that may be delivered (such as a course finder or people directory).

  • Search powered content (listings or content pages where the Funnelback search index is used as a database to populate content). This could include:

  • browse listings

  • CMS content pages that are populated from data provided by an embedded query to Funnelback.

  • GeoJSON data provided via search used to power a map.

  • Data export options that are populated with search results (e.g. Download as CSV)

The following questions should be considered for each of the search results screens identified.

5.2.1. What is the purpose of this search interface?

As previously mentioned, understanding the purpose of the search is the key to providing an effective solution.

  • Where does this interface fit in the overall solution to the stated purpose of the overall search?

5.2.2. What is the intended audience of this search interface?

Understanding the audience of the search interface will inform some of the key design decisions and the features that will be suitable.

  • If the audience is highly technical then features such as advanced search forms may be appropriate.

  • If the audience is made up of general users then a simpler interface that guides the user with features such as faceted navigation will be more appropriate.

If the search is targeting too many different audience groups it may be worth providing different searches for different audiences. This could be achieved via different audience profiles that contain different content profiles or ranking settings.

5.2.3. What content should be covered by this search interface?

Determine what the overall content set should be for the search interface. The set of information appropriate for the search interface doesn’t have to be the entire dataset and may include a subset of the data sources indexed, and even a subset of items within specific data sets.

  • What repositories/data sources are included?

  • Are binary files expected to be included? If so what file formats?

  • Is the default (10MB) file size limit acceptable? Does this need increasing?

5.2.4. Do filters need to be provided on the search results?

Use this as an opportunity to demonstrate how standard faceted navigation works and determine if standard functionality will deliver the required functionality.

Filtering of the search results covers more than just faceted navigation. It may include additional controls, such as slider widgets or date range pickers, that are used to filter the search results.

Concentrate on filters that provide benefit to search users (e.g. a file type filter is quite easy to provide, but often provides little benefit to users, who are generally more interested in finding content based on the topic with little care for the format).

  • What filters are required?

  • From where will the data be sourced? (e.g. metadata requirements for facets, numeric metadata for any ranges)

5.2.5. Auto completion

Think about how query completion works and determine if standard simple (organic) suggestions are sufficient, or if there is a requirement for structured (rich), faceted (query builder) or concierge (multi-channel) completion.

If structured suggestions are required, what are the data sources?

If concierge is required, what are the different auto-completion datasets? Wireframe the concierge completion and ensure all the desired metadata exists in the source content.
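If structured suggestions are under consideration, sketching a few rows of the auto-completion CSV with the customer is a quick way to expose the underlying data requirements. The row below is a sketch only, using hypothetical event data; confirm the exact column set and accepted codes against the documentation for the Funnelback version in use.

    key,weight,display,display type,category,category type,action,action type
    madame butterfly,100,Madame Butterfly,T,Opera,,https://www.example.com/whats-on/madame-butterfly,U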

5.2.6. Is an advanced/custom search form required?

Unless the audience is a group that is familiar with advanced search (such as librarians or legal practitioners), discourage the use of an advanced or custom search form, citing research that finds normal users get better results from a simple search form.

Explain that filters (faceted navigation) will provide the advanced search functionality, but without the traps that result in users specifying queries that return no results. Facets 'guide' the user to create an advanced query without them needing to understand and navigate an advanced search screen.

5.2.7. Wireframe the results screen

Draw wireframes of the results screen.

Is the design to be responsive, or is a separate mobile template required? (Yes to either of these will require additional wireframes.) A separate template may also require a separate profile.

Is the customer providing the design? (HTML markup patterns, CSS, JavaScript). The customer should be responsible for providing these unless specific arrangements are made with Funnelback.

Ensure you capture:

  • High level page layout

  • Result formats (i.e. a mini wireframe for each type of result)

  • Consider where each 'field' is going to be sourced

  • Don’t forget about thumbnail URLs. If a thumbnail is required, where does it come from (e.g. a metadata field), or what are the rules for constructing the URL to the image from information available to a search result? (See the template sketch after this list.)

  • Don’t forget about the no-results page format.

  • Consider appropriate stencils that can be used and articulate how these will work.

  • Non-standard filetypes that are required (we only index the following by default: HTML, MS Office (Word/Excel/PowerPoint), RTF, PDF and text documents). Make sure these can be filtered successfully before agreeing to include them.

  • Included features? Try to understand the goals of the search and use this information to advise the customer on features that would enhance their search. When talking to the customer don’t get bogged down in Funnelback terminology: explain in plain English what a feature does, as a widget that is required in the results template might be implemented using a combination of features that work together. E.g. you might discuss a 'filter panel' that comprises faceted navigation and some additional filter controls (such as a date picker or range slider). Things to consider include:

  • Spelling suggestion message location

  • Query blending message location

  • Filter (facets) interaction

  • Related searches (contextual navigation)

  • Query completion

  • Quick links? Quick links search box?

  • Best bets

  • Saved searches/search history (sessions/shopping cart)

  • Extra search panels
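To make the sourcing of each 'field' concrete, the sketch below shows how a wireframed result card might map onto Funnelback's classic Freemarker macros. The metadata class names (image, eventDate) are hypothetical; the point of the exercise is that every element on the wireframe must trace back to a mapped metadata field or a standard data model element.

    <#-- A minimal sketch of one result 'card' (assumes hypothetical metadata
         classes 'image' and 'eventDate' have been mapped for the collection) -->
    <@s.Results>
      <#if s.result.class.simpleName != "TierBar">
        <li>
          <#if s.result.metaData["image"]??>
            <img src="${s.result.metaData["image"]}" alt="">
          </#if>
          <a href="${s.result.clickTrackingUrl}">${s.result.title}</a>
          <#if s.result.metaData["eventDate"]??>
            <span>${s.result.metaData["eventDate"]}</span>
          </#if>
          <p>${s.result.summary!""}</p>
        </li>
      </#if>
    </@s.Results>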

5.2.8. Synonyms / thesauri

Is there a requirement for thesauri / synonyms?

5.2.9. Does the search interface have any geospatial or mapping components?

Is the data geo-coded? If so, is it in the correct format? (See the sketch after these questions.)

Postcode or location to geocode mapping requires additional work.

Does the search need to be aware of the user’s current location?

Is there a relevance component that is relative to a location?

Is there a requirement to plot the search results on a map?
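As a sketch of what 'correct format' can mean here: geospatial metadata needs to be mapped to a geospatial-type metadata class (values expressed as latitude;longitude), and location-relative ranking is then driven by query-time options. The metadata field name below is hypothetical and the coordinates are placeholders.

    <!-- Source page: a geo-coded event (hypothetical metadata field) -->
    <meta name="latlong" content="-33.8568;151.2153">

    Query time: pass the user's location to rank or filter relative to it
    origin=-33.8568,151.2153&maxdist=5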

5.2.10. Is any business logic required for the search interface?

This includes capturing any rules required for predictive segmentation or results curation as well as other business logic that will need to be implemented via workflow or hook scripts.

Be wary of the client having a long wish list and don’t be afraid to educate the client on why what they want might not be a good thing.

5.2.11. Identify if there is any custom functionality that needs to be developed

This isn’t a question for the customer, but something that needs to be noted for anything that is going to require custom implementation - this will be invaluable when planning the implementation.

You don’t need to consider basic template customisation.

For anything that is identified as custom, if there is an alternative approach that uses standard functionality to provide similar functionality then discuss this with the customer highlighting the benefits of using the standard functionality. Benefits include cost, maintenance overheads, ease of upgrade etc. Custom functionality may need to be re-implemented on upgrade or for other reasons (such as custom functionality that integrates with a 3rd party API that may change with little notice).

Capture as much detail as possible.

Can it be done vs should it be done

Can the custom functionality be implemented in Funnelback?

Should the custom functionality be implemented? Discuss with the customer if there are reasons why it would be better to be removed from the requirements or implemented outside of Funnelback. When thinking about this it is often necessary to think of the bigger picture and how the search fits into the overall system. The custom functionality may be better delivered using a different tool or not at all. There may be significant on-going maintenance costs for a piece of functionality and this needs to be considered. Just because it can be implemented doesn’t mean that it should be implemented.

5.2.12. Integration method

Discuss integration options and determine how the search interface will be integrated with the end system.

  • standalone search

  • standalone search with remote included headers/footers

  • integrated search (i.e. partial template that returns an HTML code chunk)

  • custom JSON or XML

  • direct access to JSON/XML (clearly explain the disadvantages of this approach)

The disadvantages of integrating with the standard JSON/XML endpoints include:

  • Any Funnelback features that rely on front-end implementation (such as faceted navigation, sessions, pagination controls) must be implemented by the code that interprets the JSON or XML.

  • New features that are added to Funnelback or those that were not implemented will not be available until further work is done to interpret the new data model elements and format the data appropriately.
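For reference during this discussion, the same search is typically available from the HTML, JSON and XML endpoints simply by switching the endpoint name; the hostname, collection and profile below are placeholders.

    https://search.example.com/s/search.html?collection=example-meta&profile=website&query=opera
    https://search.example.com/s/search.json?collection=example-meta&profile=website&query=opera
    https://search.example.com/s/search.xml?collection=example-meta&profile=website&query=opera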

5.2.13. HTTPS access

Do the search results need to be available via HTTPS?

Will an SSL certificate need to be supplied by the customer?

  • Funnelback installs a self-signed SSL certificate by default. This will need to be replaced with a signed SSL certificate to avoid browser warnings.

  • This will depend on the requirements. If a subdomain is required on the customer’s domain then they will need to supply a certificate.

  • If the search is hosted in the Funnelback Cloud then a certificate is only required if using a customer’s subdomain.

5.3. Repositories

  • For any non-standard repositories/data sources, where is the data coming from? (e.g. database query, XML). Is this a supported data-source?

  • Do any of the repositories require authentication in order to access the content?

  • If more than one collection requires authentication, are the usernames in common? Single sign-on means only one username can be applied across the repositories.

  • Are there any subsets of these that should be excluded?

Capture the different repositories that will be included in the overall search. This information should be derivable from the search interface discussion that has occurred.

The following should be completed for each repository.

5.3.1. What is the data source?

Is it a supported collection type or will custom gather code be required? (e.g. to download and process XML)

For websites - check that they are crawlable (e.g. content is not obscured by JavaScript, not reachable only via a search box, and not blocked by robots.txt).

How large is the data source? For web collections get approximate document counts. For XML/database style collections find out the number of records. For enterprise collection types (e.g. fileshares, TRIM, SharePoint) find out the number of documents, and also the size of the repository.

5.3.2. Data source access

How do we access the data? (e.g. standard web crawl, via a SOAP API, database crawl, XML export etc)

Make special note of

  • Any non-standard requirements.

  • Any authentication involved in accessing the data source.

5.3.3. External metadata

If the data source is a CMS, is external metadata required to ensure that Word and PDF files have the correct titles, descriptions etc.?

5.3.4. Sessions and history

Is there a requirement for sessions and history?

Sessions and history is a front-end feature that closely integrates with the default Freemarker template. When using sessions and history Funnelback should be used to return the full results page. It is possible to import the JavaScript into an existing template but it is quite complicated and not supported.

5.3.5. Security

Is any authentication required to get to the documents? If so what type of authentication is used? Is it single sign on? Do we need to perform document level security? This requires more detailed analysis to ensure that the content is crawlable.

5.3.6. Other requirements

Are there any other special requirements? E.g.

  • Client access to the Funnelback FTP/SFTP (this should only be used as a last resort) - always try to get data files made accessible from client infrastructure that can be gathered by an update-time pull operation such as a curl request.

5.4. Reporting requirements

5.4.1. Analytics

What analytics reports are required? It is typically useful to have separate analytics for each (interactive) search interface.

Understanding this will feed into the profiles that are required as part of the design.

5.4.2. Content auditing

Are content audit reports required for any collections?

Are there any custom metadata fields that should be reported on (in summary graphs and also in the search results tables)? Note: specific requirements for content auditor customisation may require additional time in the build quote.

Are data reports required for any web collections?

5.4.3. Accessibility auditing

Are WCAG reports required for any web collections?

Do the WCAG reports need to be segmented so that different groups of pages are reported on separately?

5.5. Administration requirements

  • What level of administration access is desired? (e.g. ability to edit templates, just view the reports)

  • Required users (beyond delivering two users with report and edit/report roles)

  • Level of administration access. Note the admin user limitations in the hosted environment (if applicable)

5.6. Hosting requirements

How will the search solution be hosted?

  • Is the Funnelback instance hosted on the Funnelback cloud? If yes, is it shared or dedicated?

  • If it’s hosted then there may be limitations on what an administration user can do.

  • Is the Funnelback instance installed on customer hardware? If yes:

  • Is remote (e.g. VPN) access to the system available?

  • Determine the recommended OS (repository types and business requirements will determine this)

  • Number of servers and roles of each (e.g. admin/crawl, QP, DR, dev, test, prod)

  • Licensing considerations (e.g. licence size, number of licences)

5.7. Update frequency

Are there any specific requirements for update frequency?

In the Funnelback cloud, indexes are refreshed on a daily cycle at an unfixed time (based on server load) as long as the repository doesn’t take longer than a day to crawl and index. This isn’t negotiable for shared hosted customers.

If there are any requirements to crawl on a fixed schedule or more/less frequently then upgrading the hosting to dedicated hosting may be required (will require discussion with the account manager).

Is there any requirement for instant updating (web and DB collections only)? Or push collections?

6. Search solution design

Design of the search solution requires a high level of understanding of what Funnelback features are available and how all of the functionality available within Funnelback works.

The design process involves taking the customer-facing requirements and turning them into an internally facing functional design that specifies a solution that can be built using Funnelback.

Where the customer-facing design focuses on functionality and the user interfaces from an end user perspective, the functional design needs to convert this into something that can be built, specified from an implementer’s perspective. This means there will be a focus on collections, and on the specific Funnelback features used to deliver the functionality outlined in the requirements.

This section provides guidance on designing a web search solution once the requirements have been specified.

6.1. High level design

6.1.1. Analyse the requirements

Start by analysing the different search interfaces noting:

  • The data sources required to power this interface

  • A list of metadata fields that will need to be extracted from each data source to be able to populate all the different fields in the output.

  • Any data sources that need to be excluded from the search results listing.

  • Any data sources that are used to provide an extra search.

  • Any data sources that are used to generate auto-completion

  • Any other sources of auto-completion.

6.1.2. Determine the collection structure

The data sources identified when analysing the requirements will suggest a suitable collection structure for the solution.

The collection structure will be determined by the following factors:

  • The type of repository - different repository types must be set up as separate collections. E.g. web collections must be separated from social media collections

  • Web data sources can be grouped into a single collection or split into several web collections depending on requirements. In general, it is best to group all the web collections together if feasible as the ranking will be improved by any cross-site links. Splitting is required if:

  • Update frequency needs to be varied (e.g. a section needs to be updated very frequently compared to the rest of the content).

  • Authentication needs to be varied across sites (though site profiles may allow you to do what you need in a single collection).

  • Use meta collections to serve the different frontends unless the search is really simple (e.g. a single web collection). For most searches a single meta collection with multiple profiles will suffice, but in some circumstances you may require multiple meta collections (e.g. if the collections use collection-level security and have different requirements for this). An advantage of using a meta collection to provide the search interfaces is the separation between the collections that hold the indexes of the various data sources and the interface, which often combines different data sources together.

6.1.3. Identify suitable profiles / frontend services

Most of the time it makes sense to designate a meta collection as the collection that a user of the search interacts with. This meta collection will service the different search interfaces.

Separate profiles will typically be required for each separate search interface as defined in the requirements, and also for any generated auto-completion. This is to allow for separate best bets, synonyms, curator rules and ranking to be applied. It also is required to provide separate analytics, content and accessibility reporting.

Profiles that are not set as services can be used for:

  • Generation of auto-completion

  • Search as a service (though the service will need to be created if you wish to edit templates or set up synonyms/best bets/curator rules)

6.1.4. Analytics

Be mindful of what will end up in the search analytics.

Make use of the system query field to hold values that shouldn’t show in the analytics (like metadata constraints or queries that are designed to return everything) or use a separate profile to serve 'canned' searches.
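A sketch of the difference, using placeholder values: terms passed in the query parameter are recorded in analytics, while terms passed in the system query parameter (s) are not, which makes it the right place for hidden constraints behind a canned listing.

    # User-entered query: appears in analytics
    /s/search.json?collection=example-meta&profile=events&query=opera

    # Canned 'browse all concerts' listing (sketch): query=!padrenull is a
    # common trick to match all documents; the constraint in s stays out of analytics
    /s/search.json?collection=example-meta&profile=events-browse&query=!padrenull&s=eventType:concert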

6.1.5. Custom functionality

For each of the pieces of custom functionality identified during requirements gathering:

  • Think about how the feature might be constructed and how it will integrate with Funnelback.

  • Try to maximise the use of standard functionality

  • Where possible design the functionality in a re-usable way and consider contributing the code to the Funnelback GitHub site.

Custom functionality usually falls into the following categories:

  • Custom gatherers and pre-gather workflow: code to implement a custom way of fetching the data for the data source.

  • Document filters and post-gather or pre-index workflow: code used to apply transformations or extract information from the gathered data prior to it being indexed.

  • Post-index workflow: code used to generate data from the search index

  • Query-time workflow: hook scripts that run when a query is executed

Workflow is not supported on push collections.
Custom gatherers and pre-gather workflow
  • For data sources that are web accessible use a web collection if feasible. This includes single file XML/CSV/JSON data sources where the data file is available via a web server.

  • For data sources that are accessed via an API a custom collection that uses a custom gatherer is recommended.

  • Pre-gather curl commands should generally be limited to fetching external metadata or data that is used to supplement (e.g. via filters) the gathered content.
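As an illustration of that last point, a pull of external metadata from client infrastructure might look like the following collection.cfg entry (the URL and filename are placeholders):

    # Fetch externally maintained metadata before each gather
    pre_gather_command=curl -f -o $SEARCH_HOME/conf/$COLLECTION_NAME/external_metadata.cfg https://www.example.com/exports/search-metadata.txt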

Document filters and post-gather/pre-index workflow
  • For each data source consider if there are any requirements for filters or post-gather/pre-index workflow.

  • Filters can be used to analyse, extract information from or transform the documents after they are gathered, but before they are indexed.

  • When filtering HTML documents implement Jsoup filters if possible.

  • Try to restrict each filter to performing a single function.

Post index workflow
  • This is generally used to generate configuration from the search index (such as generating auto-completion CSV files).

Query-time workflow (hook scripts)
  • Hook scripts are generally used to modify the data model.

  • The pre-process and pre-datafetch hook scripts can be used to manipulate the search question (the set of parameters that are passed to Funnelback to define the query that is to be run).

  • The post-datafetch and post-process hook scripts are used to modify the response that is returned by the query processor.

  • The extra searches hook script is used to modify the parameters that are used to setup any extra searches that run as part of the query.
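A minimal sketch of a query-time hook (the requirement shown, stripping tracking parameters from result URLs, is hypothetical):

    // hook_post_process.groovy - runs after the response has been built
    // Hypothetical requirement: remove utm_* tracking parameters from result URLs
    transaction?.response?.resultPacket?.results?.each { result ->
        result.liveUrl = result.liveUrl?.replaceAll(/[?&]utm_[^&]+/, "")
    }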

6.2. Functional design

The functional design (or functional specification) is a document that encapsulates the design in a format that maps the requirements into something that can be built with Funnelback.

Produce an overview or high-level design to show how all the collections fit together to form the solution.

Work through each of the collections capturing the Funnelback-specific detail required to satisfy the requirements.

Start with the collections that define the data sources, then the collections handling the search interfaces.

6.2.1. Data source collections

For each collection:

  • Define a suitable collection type

  • Define the seed URLs (or equivalent)

  • Define the include and exclude rules

  • Also consider included file types

  • For web collections

  • consider if accessibility auditing should be enabled

  • consider if Sitemap.xml support should be enabled

  • For multi-site crawls consider if there are any site profile settings that should be defined

  • For multi-site crawls consider if quick links should be enabled

  • Define the metadata that must be extracted

  • Consider which fields should be treated as indexable content, and if any fields contain special information (geo-spatial or numeric metadata).

  • Define any indexer options that should be applied
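Many of these decisions map directly onto collection configuration. A sketch for a web data source collection (all values are placeholders):

    # collection.cfg - web data source (sketch)
    start_url=https://www.example.com/
    include_patterns=example.com
    exclude_patterns=/calendar/,/search/,/login/
    # Maximum file size to download, in MB (see the file size requirement)
    crawler.max_download_size=10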

6.2.2. Frontend meta collections

  • Define the profiles that will be used

  • Define the set of data source collections that should be included

  • Define the methods of integration and search endpoints that will be accessed

  • Define any access restrictions that need to be applied

For each profile:

  • Is the profile to be designated as a front-end service?

  • Define any scoping of the data sources that should be applied - this includes a sub-set of data source collections, and even sub-sets of the data within a specific data source.

  • Define any ranking or display options that should be applied

  • Define any faceted navigation and the corresponding source of the facet categories

  • Define what query completion is required, and what the sources are

  • Consider which features are required:

  • Query blending

  • Stemming

  • Spelling suggestions

  • Curator rules

  • Best bets

  • Synonyms

  • Quick links

  • Search sessions/history

  • etc.

  • Consider extra search requirements

  • Define the source collection

  • Define any query processor options to apply

  • Define the no-results screen

  • Consider any specific content auditing requirements

  • Specify the integration method and template wireframe (if applicable)

  • Specify the metadata fields that are required to render the template or satisfy the frontend requirements.

  • Specify any query-time hook scripts that are required and outline the purpose and provide pseudo-code for the algorithm.

6.2.3. Data source requirements

There are often changes that must be made at the data source as part of the search solution.

  • Define any external metadata feeds that are required (to associate metadata with binary content such as PDFs) and specify where these will be sourced; they will probably require workflow to fetch. External metadata should ideally be supplied in the correct format, otherwise additional workflow will be required to transform it. External metadata should also be validated, as any errors in the file will cause the indexing to fail. (See the sketch after this list.)

  • Define any API keys, user accounts etc. that are required for the search solution.

  • Consider if there are any requirements to create a page to source headers and footers from

  • Specify any recommendations for updates to robots.txt or robots meta tags in source web content.

  • Funnelback no index tags should be added to source html templates to suppress headers, footers and navigation from being included in the search index if feasible. This will improve the result quality by reducing noise within the index.
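A sketch of an external metadata feed is shown below; the URL and metadata classes are placeholders, and the exact syntax should be confirmed against the documentation for the Funnelback version in use.

    # external_metadata.cfg - associates metadata with URLs matching each prefix
    https://www.example.com/reports/annual-2019.pdf title:"Annual Report 2019" author:"Example Corp"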

Workshop: Events search for the Sydney Opera House

Sydney Opera House has chosen Funnelback to deliver a search of upcoming events on their publicly-facing website. According to their web analytics, events are the most frequently searched-for item on their website. Historically, users seem to search by:

  • term

  • artist

  • location

  • date

  • any combination of the above.

Exercise 1: Investigate the data sources
  1. Investigate the Sydney Opera House website and formulate ideas for how an events search might be delivered.

    Note down:

    • possible data sources

    • any gaps in data that will be required

    • a high level plan of how this might be implemented

    • any follow-up questions that should be asked of the client

  2. Build a prototype/demo based on your initial concept.

Exercise 2: Gather requirements

Run a requirements gathering session interactively with the trainer. The outcome of this will be used for your solution design.

If there is no trainer to act as the customer, run through the session and answer the questions as though you were the customer, noting down the responses.

Don’t forget to ask the questions that you noted down and also have a discussion with the customer regarding functionality that you would advise should be descoped or altered.

In trainer-led requirements gathering some feedback will be provided on the questions asked and if there were any gaps in the questioning.

Exercise 3: Design a solution

Design a solution for the Sydney Opera House events search based on the outcome of the requirements gathering.

Produce the following project deliverables including:

  • Requirements gathering documentation

  • Solution design

7. Designing knowledge graph

Knowledge graph can be used to supplement a search by adding a browse function that allows a user to explore the relationships between nodes in the graph.

Knowledge graph has significant metadata requirements and will be most successful when there is scope to modify and define the metadata at the content sources.

7.1. Identify candidate entities

The first step in setting up a knowledge graph is to identify items that can be treated as entities (or nodes).

An entity is basically an object that has a type (known as a node label) and a series of identifying values (known as the node names).

For example, each item in a staff directory may be treated as an entity of type person that has various values made up of combinations of their first and last names. A person named John Smith may have several names defined, such as 'John Smith', 'J Smith', 'Smith, John' and 'Smith, J'.

Entities also have other attributes. E.g. a person entity might have attributes for phone number, address, people that they manage and office location.

These other attributes are used when defining relationships between the entities and are also fields that may be displayed when browsing the knowledge graph.

It is often useful to capture basic information for each entity type when planning a knowledge graph.

It is a good idea to capture the following for each type:

  • Fields containing suitable NodeName values.

  • Fields suitable for attributes, either for presentation in the graph interface or for defining relationships.

  • Field containing a suitable thumbnail URL

  • A suitable icon class. Icons from the Font Awesome 5 free icon set are available in the interface without any special configuration.

E.g.

Entity type: Person

  • Node names: FirstName+" "+LastName; LastName+", "+FirstName; EmailAddress; UserId

  • Attributes: Name, JobTitle, PhoneNumber, EmailAddress, PostalAddress, OfficeLocation, Biography

  • Thumbnail image source: ThumbnailUrl

  • Icon (Font Awesome): user

Entity type: Document

  • Node names: FileName; ReportTitle; ISBN

  • Attributes: FileName, Series, ISBN, ReportTitle, Author

  • Thumbnail image source: Not available

  • Icon (Font Awesome): file

Entity type: Event

  • Node names: og:title

  • Attributes: og:title, StartDate, EndDate, EventType, Organiser

  • Thumbnail image source: og:image

  • Icon (Font Awesome): calendar-alt

7.2. Identify relationships

There are two types of relationships that are tracked by Funnelback knowledge graph.

Mentions relationships are automatically identified by Funnelback when the knowledge graph is built. A mentions relationship is an unstructured relationship and occurs when one of the values defined in entity A’s node names is found in the text of entity B. In this case entity B mentions entity A. A mentions relationship is only generated if there are no user-defined relationships between the two entities.

When the graph is built each mentions relationship will be recorded in both directions as an incoming or outgoing relationship.

  • The outgoing form of the relationship is the relationship from the source node to the target node.

  • The incoming form of the relationship is the relationship from the target node back to the source node.

User-defined relationships are configured via the Funnelback administration interface and occur when one of the values defined in entity A’s node names appears as a value in a specific attribute of entity B. Using the person example above, we can define a manager/manages relationship between two person entities. If any of Person A’s node names appears as a value in Person B’s 'people that they manage' field then a manages relationship can be recorded: Person B manages Person A. A reverse relationship is also defined implicitly: Person A is managed by Person B.

A relationship is defined from the source entity (A) to a metadata field in the target entity (B). This is set up as a single named relationship. When the graph is built this named relationship will be recorded in both directions as an incoming or outgoing relationship in the same way as for a mentions relationship.

The relationship is only defined when one of entity A’s node name values appears as an exact (not substring) match to a value in the nominated metadata field of entity B. The value of the metadata field of entity B is split using the standard delimiting rules of Funnelback and the node name must fully match one of these split values.
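A worked example of the exact-match rule, using hypothetical values:

    Person A node names:       "John Smith", "Smith, John"
    Person B 'manages' field:  "John Smith|Jane Doe"  ->  splits to ["John Smith", "Jane Doe"]
    Result: "John Smith" exactly matches a split value, so Person B manages Person A.
            A field value such as "John Smithers" would not produce a match.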

A relationship matrix is a convenient way to map out how each entity type can be related to another entity type, and the field that contains the information (indicated below in brackets) e.g.

Source entity type: Person

  • Target Person: is manager of (manages); is managed by (manager)

  • Target Document: is author of; is contributor to; is referenced by

  • Target Event: is organiser of; is attendee of

Source entity type: Document

  • Target Person: is authored by (author); contains contributions by (contributor); references (@mentions)

  • Target Document: references; is referenced by

  • Target Event: -

Source entity type: Event

  • Target Person: is organised by (organiser); is attended by (attendees)

  • Target Document: -

  • Target Event: -

7.3. Identify search result filters

Search result filters can be presented on the search results screen of the knowledge graph widget.

These reflect the faceted navigation configured for the profile.