Package 'levitate' reference manual

Title:	Fuzzy String Comparison
Description:	Provides string similarity calculations inspired by the Python 'thefuzz' package. Compare strings by edit distance, similarity ratio, best matching substring, ordered token matching and set-based token matching. A range of edit distance measures are available thanks to the 'stringdist' package.
Authors:	Lewin Appleton-Fox [aut, cre, cph]
Maintainer:	Lewin Appleton-Fox <[email protected]>
License:	GPL-3
Version:	0.2.0
Built:	2024-11-21 03:04:55 UTC
Source:	https://github.com/lewinfox/levitate

Get the best matched string from a list of candidates

Description

Given an input string and multiple candidates, return the candidate with the best score as calculated by .fn.

Usage

lev_best_match(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE)

Arguments

`input`	A single string
`candidates`	One or more candidate strings to score
`.fn`	The scoring function to use, as a string or function object. Defaults to `lev_ratio()`.
`...`	Additional arguments to pass to `.fn`.
`decreasing`	If `TRUE` (the default), the candidate with the highest score is ranked first. If using a comparison `.fn` that computes distance rather than similarity, or if you want the worst match to be returned first, set this to `FALSE`.

Value

A string

Examples

lev_best_match("bilbo", c("frodo", "gandalf", "legolas"))

String distance metrics

Description

Uses stringdist::stringdistmatrix() to compute a range of string distance metrics.

Usage

lev_distance(a, b, pairwise = TRUE, useNames = TRUE, ...)

Arguments

`a`, `b`	The input strings
`pairwise`	Boolean. If `TRUE`, only the pairwise distances between `a` and `b` will be computed, rather than the combinations of all elements.
`useNames`	Boolean. Use input vectors as row and column names?
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A numeric scalar, vector or matrix depending on the length of the inputs. See "Details".

Details

This is a thin wrapper around stringdist::stringdistmatrix() and mainly exists to coerce the output into the simplest possible format (via lev_simplify_matrix()).

The function will return the simplest possible data structure permitted by the length of the inputs a and b. This will be a scalar if a and b are length 1, a vector if either (but not both) is length > 1, and a matrix otherwise.
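The simplification rule can be illustrated with a minimal base R sketch. The helpers `simplify_sketch()` and `distance_matrix()` are hypothetical names and are not the package's lev_simplify_matrix(). As an assumption, base R's `utils::adist()` stands in for stringdist::stringdistmatrix() so that the snippet runs without 'levitate' or 'stringdist' installed.

```r
# Hypothetical helper for illustration only; not the package's
# lev_simplify_matrix(), whose exact behaviour may differ.
simplify_sketch <- function(m) {
  stopifnot(is.matrix(m))
  if (nrow(m) == 1 && ncol(m) == 1) {
    return(m[[1, 1]])  # both inputs have length 1: scalar
  }
  if (nrow(m) == 1) {
    return(m[1, ])     # single row: vector named by the columns
  }
  if (ncol(m) == 1) {
    return(m[, 1])     # single column: vector named by the rows
  }
  m                    # otherwise keep the full matrix
}

# Assumption: utils::adist() (generalised Levenshtein distance) stands in
# for stringdist::stringdistmatrix().
distance_matrix <- function(a, b) {
  m <- utils::adist(a, b)
  dimnames(m) <- list(a, b)
  m
}

scalar_result <- simplify_sketch(distance_matrix("Bilbo", "Frodo"))
vector_result <- simplify_sketch(distance_matrix("Bilbo", c("Frodo", "Merry")))
matrix_result <- simplify_sketch(
  distance_matrix(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))
)
```

The three results are a scalar, a named vector and a 2 x 2 matrix respectively, mirroring the shapes described above.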

Other options

In addition to useNames, stringdist::stringdistmatrix() provides a range of options to control the matching, which can be passed via the ... argument. Refer to the stringdist documentation for more information.

Examples

lev_distance("Bilbo", "Frodo")

lev_distance("Bilbo", c("Frodo", "Merry"))

lev_distance("Bilbo", c("Frodo", "Merry"), useNames = FALSE)

lev_distance(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))

Ratio of the best-matching substring

Description

Find the best lev_ratio() between substrings.

Usage

lev_partial_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)

Arguments

`a`, `b`	The input strings
`pairwise`	Boolean. If `TRUE`, only the pairwise distances between `a` and `b` will be computed, rather than the combinations of all elements.
`useNames`	Boolean. Use input vectors as row and column names?
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A numeric scalar, vector or matrix depending on the length of the inputs.

Details

If string a has length len_a and is shorter than string b, this function finds the highest lev_ratio() of all the len_a-long substrings of b (and vice versa).
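The sliding-window idea can be sketched in base R. The names `ratio_sketch()` and `partial_ratio_sketch()` are hypothetical, and the ratio used here (one minus the Levenshtein distance from `utils::adist()` divided by the longer string's length) is an assumed stand-in for lev_ratio(), not the package's implementation.

```r
# Assumption: a normalised Levenshtein similarity stands in for lev_ratio().
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) {
    return(1)  # two empty strings are treated as identical
  }
  1 - as.numeric(utils::adist(a, b)) / longest
}

# Hypothetical sketch: best ratio between the shorter string and every
# equally long substring of the longer string.
partial_ratio_sketch <- function(a, b) {
  if (nchar(a) > nchar(b)) {
    # Make sure `a` is the shorter string ("and vice versa").
    tmp <- a
    a <- b
    b <- tmp
  }
  len_a <- nchar(a)
  starts <- seq_len(nchar(b) - len_a + 1)
  windows <- substring(b, starts, starts + len_a - 1)
  max(vapply(windows, ratio_sketch, numeric(1), b = a))
}

short <- "Bruce Springsteen"
long <- "Bruce Springsteen and the E Street Band"

full_score <- ratio_sketch(short, long)
partial_score <- partial_ratio_sketch(short, long)
```

Because the first window of `long` is exactly `short`, `partial_score` is 1 while `full_score` is lower.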

Examples

lev_ratio("Bruce Springsteen", "Bruce Springsteen and the E Street Band")

# Here the two "Bruce Springsteen" strings will match perfectly.
lev_partial_ratio("Bruce Springsteen", "Bruce Springsteen and the E Street Band")

String similarity ratio

Description

Computes a similarity ratio between strings using stringdist::stringsimmatrix().

Usage

lev_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)

Arguments

`a`, `b`	The input strings
`pairwise`	Boolean. If `TRUE`, only the pairwise distances between `a` and `b` will be computed, rather than the combinations of all elements.
`useNames`	Boolean. Use input vectors as row and column names?
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A numeric scalar, vector or matrix depending on the length of the inputs.

Details

This is a thin wrapper around stringdist::stringsimmatrix() and mainly exists to coerce the output into the simplest possible format (via lev_simplify_matrix()).

Examples

lev_ratio("Bilbo", "Frodo")

lev_ratio("Bilbo", c("Frodo", "Merry"))

lev_ratio("Bilbo", c("Frodo", "Merry"), useNames = FALSE)

lev_ratio(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))

Score multiple candidate strings against a single input

Description

Given a single input string and multiple candidates, compute scores for each candidate.

Usage

lev_score_multiple(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE)

Arguments

`input`	A single string
`candidates`	One or more candidate strings to score
`.fn`	The scoring function to use, as a string or function object. Defaults to `lev_ratio()`.
`...`	Additional arguments to pass to `.fn`.
`decreasing`	If `TRUE` (the default), the candidate with the highest score is ranked first. If using a comparison `.fn` that computes distance rather than similarity, or if you want the worst match to be returned first, set this to `FALSE`.

Value

A list where the keys are candidates and the values are the scores. The list is sorted according to the decreasing parameter, so by default higher scores are first.
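The scoring and ranking behaviour can be sketched in base R. The functions `score_multiple_sketch()`, `best_match_sketch()` and `edit_distance()` are hypothetical illustrations rather than the package source. The scorer here is a plain Levenshtein distance from `utils::adist()`, which is why `decreasing = FALSE` is used.

```r
# Hypothetical re-implementation for illustration; not the package source.
score_multiple_sketch <- function(input, candidates, .fn, ..., decreasing = TRUE) {
  .fn <- match.fun(.fn)  # accept a function object or its name as a string
  scores <- vapply(
    candidates,
    function(candidate) as.numeric(.fn(input, candidate, ...)),
    numeric(1)
  )
  names(scores) <- candidates
  # Named list: keys are candidates, values are scores, sorted as requested.
  as.list(sort(scores, decreasing = decreasing))
}

# The best match is simply the first key of the ranked list.
best_match_sketch <- function(input, candidates, .fn, ..., decreasing = TRUE) {
  ranked <- score_multiple_sketch(input, candidates, .fn, ..., decreasing = decreasing)
  names(ranked)[[1]]
}

# Assumption: a distance (lower is better), so rank in increasing order.
edit_distance <- function(a, b) as.numeric(utils::adist(a, b))

candidates <- c("frodo", "bilbi", "gandalf")
ranked <- score_multiple_sketch("bilbo", candidates, edit_distance, decreasing = FALSE)
best <- best_match_sketch("bilbo", candidates, edit_distance, decreasing = FALSE)
```

With a distance as the scorer, the smallest value ranks first, so `best` is "bilbi". With a similarity scorer the default `decreasing = TRUE` would be the appropriate choice.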

Examples

lev_score_multiple("bilbo", c("frodo", "gandalf", "legolas"))

Matching based on common tokens

Description

Compare strings based on shared tokens.

Usage

lev_token_set_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)

Arguments

`a`, `b`	The input strings
`pairwise`	Boolean. If `TRUE`, only the pairwise distances between `a` and `b` will be computed, rather than the combinations of all elements.
`useNames`	Boolean. Use input vectors as row and column names?
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A numeric scalar, vector or matrix depending on the length of the inputs.

Details

Similar to lev_token_sort_ratio() this function breaks the input down into tokens. It then identifies any common tokens between strings and creates three new strings:

x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}

and performs three pairwise lev_ratio() calculations between them (x vs y, y vs z and x vs z). The highest of those three ratios is returned.
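The construction of x, y and z can be sketched in base R. All function names below are hypothetical. The tokenisation rule (lower-casing and splitting on non-alphanumeric characters), the alphabetical ordering of tokens and the `utils::adist()`-based ratio are assumptions made for illustration, so results may differ from lev_token_set_ratio().

```r
# Assumption: a normalised Levenshtein similarity stands in for lev_ratio().
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) {
    return(1)
  }
  1 - as.numeric(utils::adist(a, b)) / longest
}

# Assumption: tokens are lower-cased runs of letters and digits.
tokenise_sketch <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]
}

token_set_ratio_sketch <- function(a, b) {
  tokens_a <- unique(tokenise_sketch(a))
  tokens_b <- unique(tokenise_sketch(b))

  common <- sort(intersect(tokens_a, tokens_b))
  only_a <- sort(setdiff(tokens_a, tokens_b))
  only_b <- sort(setdiff(tokens_b, tokens_a))

  x <- paste(common, collapse = " ")
  y <- paste(c(common, only_a), collapse = " ")
  z <- paste(c(common, only_b), collapse = " ")

  # Highest of the three pairwise ratios.
  max(ratio_sketch(x, y), ratio_sketch(y, z), ratio_sketch(x, z))
}

same_words <- token_set_ratio_sketch("the quick brown fox", "fox brown quick the")
subset_words <- token_set_ratio_sketch("mets", "new york mets")
no_overlap <- token_set_ratio_sketch("abc", "xyz")
```

One consequence of this design is that a string whose tokens are all contained in the other string scores 1, because x and y (or x and z) are then identical.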

Examples

x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"

lev_ratio(x, y)

lev_token_sort_ratio(x, y)

lev_token_set_ratio(x, y)

Ordered token matching

Description

Compares strings by tokenising them, sorting the tokens alphabetically and then computing the lev_ratio() of the result. Word order therefore does not affect the score, which helps when the same words appear in a different order.
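A minimal base R sketch of the tokenise, sort and compare steps follows. The function names are hypothetical, and both the tokenisation rule and the `utils::adist()`-based ratio are assumptions standing in for the package's own behaviour.

```r
# Assumption: a normalised Levenshtein similarity stands in for lev_ratio().
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) {
    return(1)
  }
  1 - as.numeric(utils::adist(a, b)) / longest
}

# Assumption: tokens are lower-cased runs of letters and digits.
tokenise_sketch <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]
}

# Rebuild a string from its alphabetically sorted tokens.
sort_tokens_sketch <- function(s) {
  paste(sort(tokenise_sketch(s)), collapse = " ")
}

token_sort_ratio_sketch <- function(a, b) {
  ratio_sketch(sort_tokens_sketch(a), sort_tokens_sketch(b))
}

reordered <- token_sort_ratio_sketch("new york mets", "mets york new")
plain <- ratio_sketch("new york mets", "mets york new")
```

Here `reordered` is 1 because both inputs sort to the same token sequence, whereas `plain` is penalised for the different word order.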

Usage

lev_token_sort_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)

Arguments

`a`, `b`	The input strings
`pairwise`	Boolean. If `TRUE`, only the pairwise distances between `a` and `b` will be computed, rather than the combinations of all elements.
`useNames`	Boolean. Use input vectors as row and column names?
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A numeric scalar, vector or matrix depending on the length of the inputs.

Examples

x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"

# Because the order of words is different the simple approach gives a low match ratio.
lev_ratio(x, y)

# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)

Weighted token similarity measure

Description

Computes similarity but allows you to assign weights to specific tokens. This is useful, for example, when you have a frequently-occurring string that doesn't contain useful information. See examples.

Usage

lev_weighted_token_ratio(a, b, weights = list(), ...)

Arguments

`a`, `b`	The input strings
`weights`	List of token weights. For example, `weights = list(foo = 0.9, bar = 0.1)`. Any tokens omitted from `weights` will be given a weight of 1.
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A float

Details

The algorithm used here is as follows:

1. Tokenise the input strings
2. Compute the edit distance between each pair of tokens
3. Compute the maximum edit distance between each pair of tokens
4. Apply any weights from the weights argument
5. Return 1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distance))
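The steps above can be sketched in base R under several loudly stated assumptions, none of which is confirmed by the package documentation: tokens are paired by position, the shorter token list is padded with empty strings, the maximum edit distance of a pair is the length of its longer token, and a pair's weight is the mean of its two tokens' weights. All function names are hypothetical.

```r
# Assumption: tokens are lower-cased runs of letters and digits.
tokenise_sketch <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]
}

weighted_token_ratio_sketch <- function(a, b, weights = list()) {
  tokens_a <- tokenise_sketch(a)
  tokens_b <- tokenise_sketch(b)
  n <- max(length(tokens_a), length(tokens_b))
  if (n == 0) {
    return(1)  # nothing to compare
  }

  # Assumption: pair tokens by position, padding the shorter list with "".
  tokens_a <- c(tokens_a, rep("", n - length(tokens_a)))
  tokens_b <- c(tokens_b, rep("", n - length(tokens_b)))

  # Tokens omitted from `weights` get a weight of 1.
  token_weight <- function(tokens) {
    vapply(tokens, function(tok) {
      if (tok %in% names(weights)) as.numeric(weights[[tok]]) else 1
    }, numeric(1), USE.NAMES = FALSE)
  }

  # Edit distance and maximum possible edit distance for each pair.
  distances <- mapply(
    function(x, y) as.numeric(utils::adist(x, y)),
    tokens_a, tokens_b,
    USE.NAMES = FALSE
  )
  max_distances <- pmax(nchar(tokens_a), nchar(tokens_b))

  # Assumption: a pair's weight is the mean of its two tokens' weights.
  pair_weights <- (token_weight(tokens_a) + token_weight(tokens_b)) / 2

  1 - sum(pair_weights * distances) / sum(pair_weights * max_distances)
}

unweighted <- weighted_token_ratio_sketch("jim ltd", "tim ltd")
weighted <- weighted_token_ratio_sketch("tim ltd", "jim ltd", weights = list(ltd = 0.1))
```

Under these assumptions `unweighted` evaluates to 1 - 1/6 and `weighted` to 1 - 1/3.3, so down-weighting the uninformative token "ltd" makes the one-letter difference between "jim" and "tim" count for more. The package's own results may differ.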

Examples

lev_weighted_token_ratio("jim ltd", "tim ltd")

lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))

Weighted version of lev_token_set_ratio()

Description

Weighted version of lev_token_set_ratio()

Usage

lev_weighted_token_set_ratio(a, b, weights = list(), ...)

Arguments

`a`, `b`	The input strings
`weights`	List of token weights. For example, `weights = list(foo = 0.9, bar = 0.1)`. Any tokens omitted from `weights` will be given a weight of 1.
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A float

Weighted version of lev_token_sort_ratio()

Description

This function tokenises inputs, sorts tokens and computes similarities for each pair of tokens. Similarity scores are weighted based on the weights argument, and a total similarity score is returned in the same manner as lev_weighted_token_ratio().

Usage

lev_weighted_token_sort_ratio(a, b, weights = list(), ...)

Arguments

`a`, `b`	The input strings
`weights`	List of token weights. For example, `weights = list(foo = 0.9, bar = 0.1)`. Any tokens omitted from `weights` will be given a weight of 1.
`...`	Additional arguments to be passed to `stringdist::stringdistmatrix()` or `stringdist::stringsimmatrix()`.

Value

A float
