Title: | Fuzzy String Comparison |
---|---|
Description: | Provides string similarity calculations inspired by the Python 'thefuzz' package. Compare strings by edit distance, similarity ratio, best matching substring, ordered token matching and set-based token matching. A range of edit distance measures are available thanks to the 'stringdist' package. |
Authors: | Lewin Appleton-Fox [aut, cre, cph] |
Maintainer: | Lewin Appleton-Fox <[email protected]> |
License: | GPL-3 |
Version: | 0.2.0 |
Built: | 2024-11-21 03:04:55 UTC |
Source: | https://github.com/lewinfox/levitate |
Given an input string and multiple candidates, return the candidate with the best score as calculated by .fn.
lev_best_match(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE)
input | A single string |
candidates | One or more candidate strings to score |
.fn | The scoring function to use, as a string or function object. Defaults to |
... | Additional arguments to pass to |
decreasing | If |
A string
lev_best_match("bilbo", c("frodo", "gandalf", "legolas"))
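The selection step described above can be sketched in base R. This is only an illustration under stated assumptions: `best_match_sketch()`, `edit_distance()` and `similarity()` are hypothetical names, and `utils::adist()` (plain Levenshtein distance) stands in for the 'stringdist' metrics the package actually uses. It also shows why `decreasing` would plausibly be set to `FALSE` when `.fn` returns a distance, since smaller values are then better.

```r
# Base-R stand-ins for a distance and a similarity score (assumptions for this
# sketch; levitate itself builds on the 'stringdist' package).
edit_distance <- function(a, b) as.numeric(utils::adist(a, b))
similarity <- function(a, b) 1 - edit_distance(a, b) / max(nchar(a), nchar(b))

# Hypothetical helper: score each candidate, sort the scores, and return the
# name of the first one.
best_match_sketch <- function(input, candidates, .fn = similarity, ..., decreasing = TRUE) {
  scores <- vapply(candidates, function(cand, ...) .fn(input, cand, ...), numeric(1), ...)
  names(sort(scores, decreasing = decreasing))[1]
}

candidates <- c("frodo", "gandalf", "legolas")

# Similarity: the highest score wins.
best_match_sketch("bilbo", candidates)

# Distance: the smallest score wins, so the sort order is flipped.
best_match_sketch("bilbo", candidates, edit_distance, decreasing = FALSE)
```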
Uses stringdist::stringdistmatrix() to compute a range of string distance metrics.
lev_distance(a, b, pairwise = TRUE, useNames = TRUE, ...)
a, b | The input strings |
pairwise | Boolean. If |
useNames | Boolean. Use input vectors as row and column names? |
... | Additional arguments to be passed to |
A numeric scalar, vector or matrix depending on the length of the inputs. See "Details".
This is a thin wrapper around stringdist::stringdistmatrix() and mainly exists to coerce the output into the simplest possible format (via lev_simplify_matrix()).
The function will return the simplest possible data structure permitted by the length of the inputs a and b. This will be a scalar if a and b are length 1, a vector if either (but not both) is length > 1, and a matrix otherwise.
In addition to useNames, stringdist::stringdistmatrix() provides a range of options to control the matching, which can be passed via the ... argument. Refer to the stringdist documentation for more information.
lev_distance("Bilbo", "Frodo")
lev_distance("Bilbo", c("Frodo", "Merry"))
lev_distance("Bilbo", c("Frodo", "Merry"), useNames = FALSE)
lev_distance(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))
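The scalar / vector / matrix rule from the details can be illustrated with a short base-R sketch. `simplify_sketch()` is a hypothetical re-creation of the rule, not the source of lev_simplify_matrix(), and `utils::adist()` is used here as a stand-in for stringdist::stringdistmatrix().

```r
# Hypothetical re-creation of the simplification rule; not the package source.
simplify_sketch <- function(m) {
  if (all(dim(m) == 1)) {
    unname(m[1, 1])   # both inputs have length 1: return a bare scalar
  } else if (any(dim(m) == 1)) {
    drop(m)           # exactly one input is longer than 1: return a vector
  } else {
    m                 # both inputs are longer than 1: keep the matrix
  }
}

a <- c("Bilbo", "Gandalf")
b <- c("Frodo", "Merry")

# Base-R stand-in for the distance matrix, with the inputs as dimnames.
d <- utils::adist(a, b)
dimnames(d) <- list(a, b)

simplify_sketch(d[1, 1, drop = FALSE])  # scalar
simplify_sketch(d[1, , drop = FALSE])   # vector named by the elements of b
simplify_sketch(d)                      # 2 x 2 matrix
```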
Find the best lev_ratio() between substrings.
lev_partial_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
a, b | The input strings |
pairwise | Boolean. If |
useNames | Boolean. Use input vectors as row and column names? |
... | Additional arguments to be passed to |
A numeric scalar, vector or matrix depending on the length of the inputs.
If string a has length len_a and is shorter than string b, this function finds the highest lev_ratio() of all the len_a-long substrings of b (and vice versa).
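The sliding-window idea above can be sketched as follows. This is a minimal illustration for two single, non-empty strings: `partial_ratio_sketch()` and `similarity()` are hypothetical helpers, and the similarity is a base-R stand-in (1 minus the Levenshtein distance from `utils::adist()` divided by the longer length), not the package's own lev_ratio().

```r
# Base-R stand-in for a similarity ratio between two single strings.
similarity <- function(a, b) {
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# Hypothetical helper: slide a window the size of the shorter string along the
# longer string and keep the best similarity found.
partial_ratio_sketch <- function(a, b) {
  if (nchar(a) > nchar(b)) {
    short <- b
    long <- a
  } else {
    short <- a
    long <- b
  }
  n <- nchar(short)
  starts <- seq_len(nchar(long) - n + 1)
  windows <- substring(long, starts, starts + n - 1)
  max(vapply(windows, function(w) similarity(short, w), numeric(1)))
}

# The first window of the longer string is exactly "Bruce Springsteen".
partial_ratio_sketch("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
```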
lev_ratio("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
# Here the two "Bruce Springsteen" strings will match perfectly.
lev_partial_ratio("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
String similarity ratio
lev_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
a, b | The input strings |
pairwise | Boolean. If |
useNames | Boolean. Use input vectors as row and column names? |
... | Additional arguments to be passed to |
A numeric scalar, vector or matrix depending on the length of the inputs.
This is a thin wrapper around stringdist::stringsimmatrix() and mainly exists to coerce the output into the simplest possible format (via lev_simplify_matrix()).
The function will return the simplest possible data structure permitted by the length of the inputs a and b. This will be a scalar if a and b are length 1, a vector if either (but not both) is length > 1, and a matrix otherwise.
lev_ratio("Bilbo", "Frodo")
lev_ratio("Bilbo", c("Frodo", "Merry"))
lev_ratio("Bilbo", c("Frodo", "Merry"), useNames = FALSE)
lev_ratio(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))
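For intuition about what a similarity ratio measures, the sketch below turns an edit distance into a score between 0 and 1 by dividing by the longest possible distance for each pair. It is an approximation under stated assumptions: `ratio_sketch()` is a hypothetical helper, and it uses the plain Levenshtein distance from base R's `utils::adist()`, whereas the package delegates to stringdist::stringsimmatrix(), whose default metric may score some pairs (for example transpositions) differently. Empty strings are not handled.

```r
# Hypothetical helper: similarity = 1 - distance / longest possible distance.
ratio_sketch <- function(a, b) {
  d <- utils::adist(a, b)                     # length(a) x length(b) distances
  longest <- outer(nchar(a), nchar(b), pmax)  # longer string length for each pair
  1 - d / longest
}

# "Bilbo" and "Frodo" differ in four of five positions.
ratio_sketch("Bilbo", "Frodo")

# Every combination of the two input vectors, returned as a 2 x 2 matrix.
ratio_sketch(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))
```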
Given a single input string and multiple candidates, compute scores for each candidate.
lev_score_multiple(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE)
input | A single string |
candidates | One or more candidate strings to score |
.fn | The scoring function to use, as a string or function object. Defaults to |
... | Additional arguments to pass to |
decreasing | If |
A list where the keys are candidates and the values are the scores. The list is sorted according to the decreasing parameter, so by default higher scores are first.
lev_score_multiple("bilbo", c("frodo", "gandalf", "legolas"))
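The shape of the return value, a named list sorted by score, can be sketched in base R. `score_multiple_sketch()` and `similarity()` are hypothetical helpers used only for illustration, with `utils::adist()` standing in for the package's scoring functions.

```r
# Base-R stand-in for a similarity ratio between two single strings.
similarity <- function(a, b) {
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b))
}

# Hypothetical helper: one score per candidate, returned as a sorted named list.
score_multiple_sketch <- function(input, candidates, .fn = similarity, decreasing = TRUE) {
  scores <- vapply(candidates, function(cand) .fn(input, cand), numeric(1))
  as.list(sort(scores, decreasing = decreasing))
}

scores <- score_multiple_sketch("bilbo", c("frodo", "gandalf", "legolas"))

# Candidate names, best score first.
names(scores)
```

With this stand-in metric the names come back as frodo, legolas, gandalf; the package's own scores may differ because it uses a different underlying metric.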
Compare strings based on shared tokens.
lev_token_set_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
a, b | The input strings |
pairwise | Boolean. If |
useNames | Boolean. Use input vectors as row and column names? |
... | Additional arguments to be passed to |
A numeric scalar, vector or matrix depending on the length of the inputs.
Similar to lev_token_sort_ratio(), this function breaks the input down into tokens. It then identifies any common tokens between strings and creates three new strings:
x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}
and performs three pairwise lev_ratio() calculations between them (x vs y, y vs z and x vs z). The highest of those three ratios is returned.
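The three-string construction can be sketched in base R as below. Several details are assumptions made for the sketch rather than facts about the package: tokens are lower-cased, split on non-alphanumeric characters, de-duplicated and sorted, and the similarity is a Levenshtein-based stand-in built on `utils::adist()`. `tokenise()`, `similarity()` and `token_set_ratio_sketch()` are hypothetical names.

```r
# Base-R stand-in for a similarity ratio; the extra 1 guards against 0 / 0.
similarity <- function(a, b) {
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b), 1)
}

# Assumption: lower-case the input and treat anything non-alphanumeric as a separator.
tokenise <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  sort(unique(tokens[nzchar(tokens)]))
}

token_set_ratio_sketch <- function(a, b) {
  ta <- tokenise(a)
  tb <- tokenise(b)
  common <- intersect(ta, tb)
  x <- paste(common, collapse = " ")                    # common tokens only
  y <- paste(c(common, setdiff(ta, tb)), collapse = " ") # common + rest of a
  z <- paste(c(common, setdiff(tb, ta)), collapse = " ") # common + rest of b
  max(similarity(x, y), similarity(y, z), similarity(x, z))
}

# When one token set contains the other, x equals y or z and the score is 1.
token_set_ratio_sketch("jim ltd", "jim")
```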
x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"
lev_ratio(x, y)
lev_token_sort_ratio(x, y)
lev_token_set_ratio(x, y)
Compares strings by tokenising them, sorting the tokens alphabetically and then computing the lev_ratio() of the result. This means that the order of words is irrelevant, which can be helpful in some circumstances.
lev_token_sort_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
a, b | The input strings |
pairwise | Boolean. If |
useNames | Boolean. Use input vectors as row and column names? |
... | Additional arguments to be passed to |
A numeric scalar, vector or matrix depending on the length of the inputs.
x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"
# Because the order of words is different the simple approach gives a low match ratio.
lev_ratio(x, y)
# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)
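The tokenise, sort and compare idea can be sketched in base R. The tokenisation rule (lower-casing and splitting on non-alphanumeric characters) is an assumption, the similarity is a Levenshtein-based stand-in built on `utils::adist()`, and `sort_tokens()`, `similarity()` and `token_sort_ratio_sketch()` are hypothetical names.

```r
# Base-R stand-in for a similarity ratio between two single strings.
similarity <- function(a, b) {
  1 - as.numeric(utils::adist(a, b)) / max(nchar(a), nchar(b), 1)
}

# Assumption: lower-case, split on non-alphanumeric characters, sort, re-join.
sort_tokens <- function(s) {
  tokens <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  paste(sort(tokens[nzchar(tokens)]), collapse = " ")
}

token_sort_ratio_sketch <- function(a, b) {
  similarity(sort_tokens(a), sort_tokens(b))
}

# Same words in a different order: both strings sort to the same token
# sequence, so the sketch scores them as identical.
token_sort_ratio_sketch("Star Wars Episode IV", "Episode IV - Star Wars")
```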
Computes similarity but allows you to assign weights to specific tokens. This is useful, for example, when you have a frequently-occurring string that doesn't contain useful information. See examples.
lev_weighted_token_ratio(a, b, weights = list(), ...)
a, b | The input strings |
weights | List of token weights. For example, |
... | Additional arguments to be passed to |
A float
The algorithm used here is as follows:
Tokenise the input strings
Compute the edit distance between each pair of tokens
Compute the maximum edit distance between each pair of tokens
Apply any weights from the weights argument
Return 1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distance))
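The steps above can be sketched in base R. The description leaves some details open, so the sketch fills them with explicit assumptions: tokens are split on whitespace and paired by position (the shorter side padded with empty strings), a pair takes the smaller weight of its two tokens, unlisted tokens weigh 1, and the maximum edit distance of a pair is the length of its longer token. `weighted_token_ratio_sketch()` is a hypothetical helper, `utils::adist()` stands in for the package's edit distance, and non-empty inputs are assumed.

```r
weighted_token_ratio_sketch <- function(a, b, weights = list()) {
  # Step 1: tokenise (assumption: lower-case and split on whitespace).
  ta <- strsplit(tolower(a), "[[:space:]]+")[[1]]
  tb <- strsplit(tolower(b), "[[:space:]]+")[[1]]

  # Assumption: tokens are paired by position; pad the shorter side with "".
  n <- max(length(ta), length(tb))
  ta <- c(ta, rep("", n - length(ta)))
  tb <- c(tb, rep("", n - length(tb)))

  # Steps 2 and 3: edit distance and maximum possible distance for each pair.
  distances <- mapply(function(x, y) as.numeric(utils::adist(x, y)), ta, tb)
  max_distances <- pmax(nchar(ta), nchar(tb))

  # Step 4 (assumption): a pair takes the smaller weight of its two tokens.
  weight_of <- function(token) {
    if (is.null(weights[[token]])) 1 else weights[[token]]
  }
  pair_weights <- pmin(
    vapply(ta, weight_of, numeric(1)),
    vapply(tb, weight_of, numeric(1))
  )

  # Step 5: one minus the ratio of weighted distances to weighted maxima.
  1 - sum(pair_weights * distances) / sum(pair_weights * max_distances)
}

# Unweighted: the identical "ltd" token pulls the score up.
weighted_token_ratio_sketch("jim ltd", "tim ltd")

# Down-weighting "ltd" makes the jim / tim difference count for more.
weighted_token_ratio_sketch("jim ltd", "tim ltd", weights = list(ltd = 0.1))
```

Under these assumptions the unweighted score is 1 - 1/6 and the weighted score is 1 - 1/3.3, so giving "ltd" a small weight lowers the similarity of two names that differ only in their informative token.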
Other weighted token functions: lev_weighted_token_set_ratio(), lev_weighted_token_sort_ratio()
lev_weighted_token_ratio("jim ltd", "tim ltd")
lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))
Weighted version of lev_token_set_ratio()
lev_weighted_token_set_ratio(a, b, weights = list(), ...)
a, b | The input strings |
weights | List of token weights. For example, |
... | Additional arguments to be passed to |
Float
Other weighted token functions: lev_weighted_token_ratio(), lev_weighted_token_sort_ratio()
This function tokenises inputs, sorts tokens and computes similarities for each pair of tokens. Similarity scores are weighted based on the weights argument, and a total similarity score is returned in the same manner as lev_weighted_token_ratio().
lev_weighted_token_sort_ratio(a, b, weights = list(), ...)
a, b | The input strings |
weights | List of token weights. For example, |
... | Additional arguments to be passed to |
Float
Other weighted token functions: lev_weighted_token_ratio(), lev_weighted_token_set_ratio()