./lib/search.ts annotated source

Back to index

        

libsearch is the core text search algorithm that I've polished and reused over the years across many of my personal projects for fast and simple full-text search, packaged into a small single-purpose JavaScript library.

For how libsearch works, how to import and use in your own project, and canonical documentation, check out the GitHub repository page.

9

To turn every potential query into a regular expression, we need to be able to escape characters that are significant in RegExp.

12function escapeForRegExp(text: string): string {
13    return text.replace(/[.*+?^${}[\]()|\\]/g, '\\$1');
14}
15

Utility function for sorting an array by some predicate, rather than a comparator function. This implementation assumes by(it) is very cheap.

18function sortBy<T>(items: T[], by: (_it: T) => any): T[] {
19    return items.sort((a, b) => {
20        const aby = by(a);
21        const bby = by(b);
22        if (aby < bby) {
23            return 1;
24        }
25        if (bby < aby) {
26            return -1;
27        }
28        return 0;
29    });
30}
31

The search function takes:

  • items, the list of items to search
  • query, the search query text
  • by, which is a predicate function that takes an item from the items list and returns the string that should be matched with the query
  • options, a dictionary of options:

Options include

  • caseSensitive, which is self-explanatory
  • mode: which is 'word', 'prefix', or 'autocomplete' ('autocomplete' by default), determining the way in which partial matches are processed
43export function search<T>(items: T[], query: string, by: (_it: T) => string = x => String(x), {
44    caseSensitive = false,
45    mode = 'autocomplete',
46}: {
47    caseSensitive?: boolean;
48    mode?: 'word' | 'prefix' | 'autocomplete';
49} = {}) {

countMatches counts the number of times regexp occurs in the string s. We need this information for ranking, where documents that mention the keyword more times (relative to the total word count of the document) are ranked higher.

54    function countMatches(s: string, regexp: RegExp): number {
55        let i = 0;
56        while (regexp.exec(s) !== null) {
57            i ++;
58        }
59        return i;
60    }
61

We chunk up the query string into a list of "words", each of which will become a regular expression filter.

64    const words = query
65        .trim()
66        .split(/\s+/)
67        .filter(s => s !== '');
68

Short-circuit if the search query is empty -- return the original list. This is a sensible default because in most apps this corresponds to the "home view" of the list, where a search has not been performed.

72    if (words.length === 0) {
73        return items;
74    }
75

For every word in the search query, we're going to keep track of every document's TF-IDF value in this map, and aggregate them together by the end for sorting.

79    const tfidf = new Map<T, number>();
80

Iterate through every word in the query and progressively filter down items to just the documents that match every query word.

83    const results = words.reduce((results, word, i) => {
84        const isLastWord = i + 1 === words.length;
85        const regexp = new RegExp(
86            '(^|\\W)'
87                + escapeForRegExp(word)
88                + ((mode === 'autocomplete' && isLastWord) || mode === 'prefix' ? '' : '($|\\W)'),

The 'u' flag for Unicode used to be used here, but was removed because it was (1) across-the-board too slow, and removing it made a statistically significant speed improvement, and (2) caused at least Chrome to have strange performance cliffs in unpredictable ways where certain RegExp operations would take 10s of ms.

95            caseSensitive ? 'mg' : 'img'
96        );
97        return results.filter(result => {
98            const text = by(result);
99            const count = countMatches(text, regexp);
100            if (count === 0) {
101                return false;
102            }

Compute the TF-IDF value for this word, and add it to this result's TF-IDF value so far.

105            tfidf.set(
106                result,
107                (tfidf.get(result) || 0)
108                    + (count / text.length * Math.log(items.length / results.length))
109            );
110            return true;
111        })
112    }, items);
113

Sort the results list by our ranking metric, TF-IDF.

115    return sortBy(results, result => tfidf.get(result));
116}
117
118