---
title: "levitate"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{levitate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(levitate)
```

This article walks through an example of using `levitate` to compare text strings in the wild, and
aims to give you a feel for the pros and cons of the different string similarity measures provided
by the package.

`levitate` comes with a `hotel_rooms` dataset that contains descriptions of the same hotel rooms
from two different websites, Expedia and Booking.com. The list was compiled by
[Susan Li](https://github.com/susanli2016); all credit to her for the work.

```{r}
head(hotel_rooms)
```

Let's add columns to the dataset showing how each algorithm scores every pair of descriptions.

```{r}
df <- hotel_rooms

df$lev_ratio <- lev_ratio(df$expedia, df$booking)
df$lev_partial_ratio <- lev_partial_ratio(df$expedia, df$booking)
df$lev_token_sort_ratio <- lev_token_sort_ratio(df$expedia, df$booking)
df$lev_token_set_ratio <- lev_token_set_ratio(df$expedia, df$booking)
```
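
To get a feel for what a similarity ratio measures, here is a minimal base R sketch. It assumes the
score is one minus the Levenshtein distance divided by the length of the longer string.
`sketch_ratio()` is a hypothetical helper written for this illustration, not part of `levitate`, and
the package's own functions may normalise differently.

```{r}
# Hypothetical helper for illustration only; not part of levitate.
# Assumes similarity = 1 - (edit distance / length of the longer string).
sketch_ratio <- function(a, b) {
  # utils::adist() returns a matrix of Levenshtein distances; flatten it
  d <- as.vector(utils::adist(a, b))
  longest <- pmax(nchar(a), nchar(b))
  1 - d / longest
}

# One query string scored against three candidates
sketch_ratio("cat", c("cot", "dog", "frog"))
```

Under this assumption, identical strings score 1 and strings with nothing in common score 0, which
makes scores comparable across strings of different lengths.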

## A simple matching model

We can write a function to return the best match from a list of candidates.

```{r}
best_match <- function(a, b, FUN) {
  # Score the query `a` against every candidate in `b`
  scores <- FUN(a = a, b = b)
  # Index of the highest-scoring candidate
  best <- order(scores, decreasing = TRUE)[1L]
  b[best]
}

best_match("cat", c("cot", "dog", "frog"), lev_ratio)
```
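
It can also be useful to see how strong the winning score was. The sketch below is one possible
variant that returns the score alongside the match. Both `best_match_with_score()` and the stand-in
scorer `toy_ratio()` are hypothetical names introduced here; `toy_ratio()` is built on base R's
`utils::adist()` so the example runs on its own, and it is only an assumption about how a ratio
might be computed.

```{r}
# Hypothetical variant of best_match(): returns the winner and its score.
best_match_with_score <- function(a, b, FUN) {
  scores <- FUN(a, b)
  # which.max() returns the first maximum, so ties go to the earliest candidate
  best <- which.max(scores)
  data.frame(match = b[best], score = scores[best])
}

# Stand-in scorer (an assumption, not levitate's implementation)
toy_ratio <- function(a, b) {
  1 - as.vector(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

best_match_with_score("cat", c("cot", "dog", "frog"), toy_ratio)
```

A low winning score is a hint that none of the candidates is a convincing match, even though one of
them is still returned.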

We can then use this to find out which of the Booking.com entries each function chooses for each of
the Expedia entries.

```{r}
best_match_by_fun <- function(FUN) {
  best_matches <- character(nrow(hotel_rooms))
  for (i in seq_along(best_matches)) {
    best_matches[i] <- best_match(hotel_rooms$expedia[i], hotel_rooms$booking, FUN)
  }
  best_matches
}

df$lev_ratio_best_match <- best_match_by_fun(FUN = lev_ratio)
df$lev_partial_ratio_best_match <- best_match_by_fun(FUN = lev_partial_ratio)
df$lev_token_sort_ratio_best_match <- best_match_by_fun(FUN = lev_token_sort_ratio)
df$lev_token_set_ratio_best_match <- best_match_by_fun(FUN = lev_token_set_ratio)
```
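
The same loop can be written more compactly with `vapply()`. The sketch below uses made-up room
descriptions and a stand-in scorer so that it runs without the package; `match_all()`,
`toy_scorer()` and the toy vectors are all hypothetical names, and the scorer is an assumption
rather than `levitate`'s implementation.

```{r}
# Stand-in scorer built on base R (an assumption, for illustration only)
toy_scorer <- function(a, b) {
  1 - as.vector(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# For each query, pick the highest-scoring candidate
match_all <- function(queries, candidates, FUN) {
  vapply(
    queries,
    function(q) candidates[which.max(FUN(q, candidates))],
    character(1),
    USE.NAMES = FALSE
  )
}

# Made-up descriptions, not taken from hotel_rooms
toy_expedia <- c("king room", "twin room")
toy_booking <- c("king rooms", "twin rooms")

match_all(toy_expedia, toy_booking, toy_scorer)
```

`vapply()` checks that every result is a single character string, which catches a scorer that
returns something unexpected.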

We can now see what proportion of the Expedia entries each algorithm matched to the correct
Booking.com entry.

```{r}
message("`lev_ratio()`: ", sum(df$lev_ratio_best_match == df$booking) / nrow(df))

message("`lev_partial_ratio()`: ", sum(df$lev_partial_ratio_best_match == df$booking) / nrow(df))

message("`lev_token_sort_ratio()`: ", sum(df$lev_token_sort_ratio_best_match == df$booking) / nrow(df))

message("`lev_token_set_ratio()`: ", sum(df$lev_token_set_ratio_best_match == df$booking) / nrow(df))
```
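
Each of these proportions is the mean of a logical vector, so the calculation can be wrapped in a
small helper. The sketch below uses made-up vectors; `accuracy()` and the example values are
hypothetical and are not results from `hotel_rooms`.

```{r}
# Hypothetical helper: share of predictions that equal the truth
accuracy <- function(predicted, truth) {
  stopifnot(length(predicted) == length(truth))
  mean(predicted == truth)
}

# Made-up example: three of the four predictions are correct
toy_truth <- c("a", "b", "c", "d")
toy_predicted <- c("a", "b", "x", "d")

accuracy(toy_predicted, toy_truth)
```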