---
title: "Introduction to datadiff"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to datadiff}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(datadiff)
```

`datadiff` compares two datasets — a **reference** and a **candidate** — using
validation rules defined in a YAML file. It is built on top of
[pointblank](https://github.com/rstudio/pointblank) and supports exact matching,
tolerance-based numeric comparisons, text normalization, and row count
validation.

The typical workflow is:

1. Generate a YAML rules template from the reference dataset.
2. Edit the YAML to configure tolerances, ignored columns, etc.
3. Call `compare_datasets_from_yaml()` and inspect the result.

---

## 1. A first comparison

```{r first-comparison}
library(datadiff)

ref <- data.frame(
  id       = 1:4,
  revenue  = c(1000.00, 2000.00, 3000.00, 4000.00),
  category = c("A", "B", "C", "D"),
  active   = c(TRUE, TRUE, FALSE, TRUE)
)

cand <- data.frame(
  id       = 1:4,
  revenue  = c(1000.005, 2000.001, 3000.009, 4000.00), # tiny differences
  category = c("a", "b", "c", "D"),                    # lowercase
  active   = c(TRUE, TRUE, FALSE, TRUE)
)
```

Generate a rules template and tune it:

```{r first-rules, results = "hide"}
rules_path <- tempfile(fileext = ".yaml")
write_rules_template(
  ref,
  key  = "id",
  path = rules_path,
  numeric_abs = 0.01,                # accept differences up to 0.01
  character_case_insensitive = TRUE  # ignore case for all char columns
)
```

Run the comparison:

```{r first-result}
result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)
result$all_passed
```

---

## 2. Understanding the return value

`compare_datasets_from_yaml()` returns a list with six elements:

```{r return-structure}
names(result)
```

| Element | Description |
|---------|-------------|
| `all_passed` | `TRUE` if every validation step passed |
| `agent` | The configured pointblank agent (before interrogation) |
| `reponse` | The interrogated pointblank agent (full results) |
| `missing_in_candidate` | Columns present in reference but absent from candidate |
| `extra_in_candidate` | Columns present in candidate but absent from reference |
| `applied_rules` | The effective per-column rules that were applied |

### Inspecting applied rules

`applied_rules` shows the exact rules used for each column — useful to verify
that `by_name` overrides were applied correctly:

```{r applied-rules}
result$applied_rules$revenue
result$applied_rules$category
```

### Column presence

```{r col-presence}
result$missing_in_candidate
result$extra_in_candidate
```

### Accessing failing rows

When `all_passed` is `FALSE`, use `pointblank::get_sundered_data()` to extract
the rows that failed at least one validation step:

```{r failing-rows}
ref_fail  <- data.frame(id = 1:5, value = c(1, 2, 3, 4, 5))
cand_fail <- data.frame(id = 1:5, value = c(1, 2, 99, 4, 99)) # rows 3 and 5 wrong

result_fail <- compare_datasets_from_yaml(ref_fail, cand_fail, key = "id")
result_fail$all_passed

# Rows that failed at least one step
failed_rows <- pointblank::get_sundered_data(result_fail$reponse, type = "fail")
failed_rows
```

The `type = "pass"` variant returns rows that passed all steps. This is useful
to understand the scope of the problem before investigating further.

### Controlling how many failing rows are extracted

For large datasets, extracting all failing rows can consume significant memory.
Three mutually exclusive parameters cap this:

```{r extract-params, eval = FALSE}
# Keep only the first 100 failing rows per validation step
result <- compare_datasets_from_yaml(ref, cand, key = "id", get_first_n = 100)

# Random sample of 50 failing rows per step
result <- compare_datasets_from_yaml(ref, cand, key = "id", sample_n = 50)

# 10% of failing rows, capped at 500
result <- compare_datasets_from_yaml(ref, cand, key = "id",
                                     sample_frac = 0.1, sample_limit = 500)

# Disable extraction entirely (fastest — only pass/fail counts are kept)
result <- compare_datasets_from_yaml(ref, cand, key = "id", extract_failed = FALSE)
```

---

## 3. Comparing without a YAML file

When `path = NULL` (the default), `datadiff` auto-generates rules from the
reference dataset structure. This is useful for a quick sanity check without
any configuration:

```{r no-yaml}
ref_quick  <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))
cand_quick <- data.frame(id = 1:3, x = c(1.0, 2.0, 3.0), label = c("A", "B", "C"))

# No path needed — rules are generated on the fly
result_quick <- compare_datasets_from_yaml(ref_quick, cand_quick, key = "id")
result_quick$all_passed
```

The auto-generated rules use near-exact numeric tolerance (`abs = 1e-9`) and
exact character matching — equivalent to calling `write_rules_template()` with
all defaults.

---

## 4. YAML rules in depth

`write_rules_template()` generates a fully annotated YAML.
Here is a complete example with all sections explained:

```yaml
version: 1

defaults:
  na_equal: yes            # treat NA == NA as a pass
  ignore_columns:          # columns excluded from comparison entirely
    - documentation
    - updated_at
  keys: id                 # join key (single or composite)
  label: ref vs cand       # label shown in the pointblank report

row_validation:
  check_count: yes
  expected_count: ~        # null = use reference row count
  tolerance: 0             # exact match required

by_type:                   # rules applied to all columns of a given type
  numeric:
    abs: 1.0e-09           # near-exact by default
    rel: 0
  integer:
    abs: 0                 # integers must match exactly
  character:
    equal_mode: exact
    case_insensitive: no
    trim: no
  date:
    equal_mode: exact
  datetime:
    equal_mode: exact
  logical:
    equal_mode: exact

by_name:                   # column-specific overrides (take precedence over by_type)
  id: []                   # no override — inherits integer rule
  revenue:
    abs: 0.01              # accept differences up to 0.01
  category:
    case_insensitive: yes
    trim: yes
```

Rules are **merged**: `by_name` entries extend or override `by_type` entries.
A field not listed in `by_name` keeps its `by_type` default.

| Column | Effective rule | Source |
|--------|----------------|--------|
| `id` | `abs: 0` | `by_type.integer` |
| `revenue` | `abs: 0.01, rel: 0` | `by_name` overrides `by_type.numeric` |
| `category` | `case_insensitive: yes, trim: yes` | `by_name` overrides `by_type.character` |

### Reading rules from a file

Use `read_rules()` to inspect what was actually loaded — useful for debugging
or building tooling on top of `datadiff`:

```{r read-rules}
loaded <- read_rules(rules_path)
loaded$defaults$na_equal
loaded$by_type$numeric
loaded$by_type$character
```

---

## 5. A realistic `by_name` example

The following dataset mixes several column types, each requiring a different
validation strategy:

```{r byname-example}
ref_full <- data.frame(
  id          = 1:4,
  price       = c(9.99, 19.99, 4.50, 149.00),   # numeric: small absolute tolerance
  quantity    = c(10L, 5L, 20L, 1L),            # integer: exact
  description = c("Widget A", "Widget B", " Gadget", "TOOL"), # needs trim + case
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),     # logical: exact
  created     = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"))
)

cand_full <- data.frame(
  id          = 1:4,
  price       = c(9.995, 19.99, 4.50, 149.00),  # row 1: diff = 0.005 < 0.01
  quantity    = c(10L, 5L, 20L, 1L),
  description = c("widget a", "Widget B", "Gadget", "tool"), # case + spaces
  in_stock    = c(TRUE, TRUE, FALSE, TRUE),
  created     = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"))
)
```

Build the YAML and write column-specific overrides:

```{r byname-rules, results = "hide"}
rules_full <- tempfile(fileext = ".yaml")
write_rules_template(
  ref_full,
  key  = "id",
  path = rules_full,
  numeric_abs = 1e-9,                  # conservative default
  character_case_insensitive = FALSE,  # strict default for character
  character_trim = FALSE
)

# Read, patch by_name, write back
rules_obj <- read_rules(rules_full)
rules_obj$by_name$price       <- list(abs = 0.01) # ±0.01 for price
rules_obj$by_name$description <- list(case_insensitive = TRUE, trim = TRUE)
yaml::write_yaml(rules_obj, rules_full)
```

```{r byname-result}
result_full <- compare_datasets_from_yaml(ref_full, cand_full, key = "id", path = rules_full)
result_full$all_passed

# Verify the effective rules for each column
result_full$applied_rules$price
result_full$applied_rules$description
result_full$applied_rules$quantity
```

---

## 6. Numeric tolerance

### Formula

For every numeric column, the comparison uses a single combined threshold:

```
threshold = abs + rel × |reference_value|
PASS if |candidate − reference| ≤ threshold
```

### Absolute tolerance (`abs`)

The threshold is constant, independent of the magnitude of the values:

```{r abs-tolerance}
ref_num  <- data.frame(id = 1:3, price = c(1.00, 1000.00, 1e6))
cand_ok  <- data.frame(id = 1:3, price = c(1.005, 1000.005, 1e6 + 0.005))
cand_nok <- data.frame(id = 1:3, price = c(1.02, 1000.02, 1e6 + 0.02))

rules_abs <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_abs, numeric_abs = 0.01)

compare_datasets_from_yaml(ref_num, cand_ok,  key = "id", path = rules_abs)$all_passed
compare_datasets_from_yaml(ref_num, cand_nok, key = "id", path = rules_abs)$all_passed
```

The same threshold `0.01` applies whether the value is `1` or `1 000 000`.

### Relative tolerance (`rel`)

The threshold is proportional to the reference value — useful when you want to
accept a percentage deviation:

```{r rel-tolerance}
rules_rel <- tempfile(fileext = ".yaml")
write_rules_template(ref_num, key = "id", path = rules_rel,
                     numeric_abs = 0, numeric_rel = 0.01)

# ref = 1000, diff = 9, threshold = 0.01 × 1000 = 10 → PASS
cand_pct <- data.frame(id = 1:3, price = c(1.009, 1009.0, 1e6 * 1.009))
compare_datasets_from_yaml(ref_num, cand_pct, key = "id", path = rules_rel)$all_passed
```

> **Warning**: if a reference value is `0`, the relative threshold is `0`
> and any difference will be flagged as an error. Use `abs` as a safety floor.
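The combined rule can be restated in a few lines of base R. The helper below is purely illustrative (it is not part of `datadiff`'s API) and simply spells out the threshold formula:

```{r tolerance-sketch}
# Illustrative only — not datadiff's implementation
passes_tolerance <- function(ref, cand, abs_tol = 0, rel_tol = 0) {
  threshold <- abs_tol + rel_tol * abs(ref)
  abs(cand - ref) <= threshold
}

passes_tolerance(1000, 1000.005, abs_tol = 0.01) # diff 0.005, threshold 0.01
passes_tolerance(1000, 1009,     rel_tol = 0.01) # diff 9, threshold 10
passes_tolerance(0,    0.001,    rel_tol = 0.01) # ref = 0 gives threshold 0
```

The last call fails: with `rel` alone, a reference of `0` yields a zero threshold, which is why the warning above recommends an `abs` floor.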
### Mixed mode

Combine both parameters when values span a wide range including near-zero:

```yaml
by_type:
  numeric:
    abs: 0.001   # floor: protects against false positives when ref ≈ 0
    rel: 0.005   # +0.5% for larger values
```

For `ref = 1 000 000`: `threshold = 0.001 + 0.005 × 1 000 000 = 5000.001`

> **Rule of thumb**: keep `rel: 0` (the default) unless you explicitly need
> a tolerance proportional to the magnitude of the data.

### IEEE 754 correction

Floating-point subtraction can introduce rounding errors:

```{r ieee754}
# In double precision, this is slightly above 0.01
100.01 - 100.00
```

`datadiff` automatically adds a correction of `8 × .Machine$double.eps × |ref|`
to the threshold to absorb these representation errors without meaningfully
widening the user-specified tolerance.

### `warn_at` and `stop_at`

These two parameters control the pointblank action thresholds, expressed as the
**fraction** of rows that fail a validation step:

```{r warn-stop, eval = FALSE}
result <- compare_datasets_from_yaml(
  ref, cand,
  key = "id",
  warn_at = 0.05, # warn if > 5% of rows fail any step
  stop_at = 0.20  # stop (error) if > 20% of rows fail any step
)
```

The default (`1e-14`) means that **any single failing row** triggers the
threshold, which is appropriate for data validation where zero differences are
expected. Raise these values if you want the report to remain green while a
small fraction of rows diverge.

---

## 7. Text comparison

Three independent options control character column comparison:

| Option | Effect |
|--------|--------|
| `case_insensitive: yes` | Convert both values to lowercase before comparing |
| `trim: yes` | Strip leading/trailing whitespace before comparing |
| `equal_mode: normalized` | Apply both transformations |

```{r text-comparison}
ref_txt  <- data.frame(id = 1:4, label = c("Hello", "World", "Foo", "Bar"))
cand_txt <- data.frame(
  id    = 1:4,
  label = c("hello", " World ", "FOO", "Baz") # case, spaces, mismatch
)

# Strict: rows 1, 2, 3 fail
rules_strict <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_strict)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id", path = rules_strict)$all_passed

# Relaxed: case + trim — only row 4 ("Baz" vs "Bar") fails
rules_relax <- tempfile(fileext = ".yaml")
write_rules_template(ref_txt, key = "id", path = rules_relax,
                     character_case_insensitive = TRUE,
                     character_trim = TRUE)
compare_datasets_from_yaml(ref_txt, cand_txt, key = "id", path = rules_relax)$all_passed
```

Column-level overrides in `by_name` apply only to the specified column, leaving
all other character columns unaffected.

---

## 8. Row count validation

The `row_validation` section checks that the candidate has the expected number
of rows.
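Conceptually, the check reduces to a single comparison. A base-R sketch (an illustrative helper, not the package's actual code):

```{r row-count-sketch}
# Illustrative only — the rule behind check_count / tolerance
row_count_ok <- function(n_candidate, expected, tolerance = 0) {
  abs(n_candidate - expected) <= tolerance
}

row_count_ok(7, expected = 5, tolerance = 0) # 2 extra rows, no tolerance: fails
row_count_ok(7, expected = 5, tolerance = 3) # |7 - 5| = 2, within 3: passes
```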
```{r row-validation}
ref_rows  <- data.frame(id = 1:5, value = 1:5)
cand_ok   <- data.frame(id = 1:5, value = 1:5) # 5 rows — exact match
cand_more <- data.frame(id = 1:7, value = 1:7) # 7 rows — 2 extra

rules_count <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_count,
                     check_count_default = TRUE,
                     expected_count_default = 5,
                     row_count_tolerance_default = 0)

compare_datasets_from_yaml(ref_rows, cand_ok,   key = "id", path = rules_count)$all_passed
compare_datasets_from_yaml(ref_rows, cand_more, key = "id", path = rules_count)$all_passed
```

With a tolerance:

```{r row-tolerance}
rules_tol <- tempfile(fileext = ".yaml")
write_rules_template(ref_rows, key = "id", path = rules_tol,
                     check_count_default = TRUE,
                     expected_count_default = 5,
                     row_count_tolerance_default = 3) # accept 5 ± 3

# 7 rows: |7 - 5| = 2 ≤ 3 → PASS
compare_datasets_from_yaml(ref_rows, cand_more, key = "id", path = rules_tol)$all_passed
```

When `expected_count` is `null` in the YAML (or `expected_count_default = NULL`
in `write_rules_template()`), the reference row count is used as the target.

---

## 9. Handling NA values

The `na_equal` setting controls whether `NA == NA` is treated as a pass:

```{r na-handling}
ref_na  <- data.frame(id = 1:3, value = c(1.0, NA, 3.0))
cand_na <- data.frame(id = 1:3, value = c(1.0, NA, 3.0)) # identical NAs

# na_equal: yes (default) — NA == NA passes
rules_na_yes <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_yes, na_equal_default = TRUE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id", path = rules_na_yes)$all_passed

# na_equal: no — NA == NA fails
rules_na_no <- tempfile(fileext = ".yaml")
write_rules_template(ref_na, key = "id", path = rules_na_no, na_equal_default = FALSE)
compare_datasets_from_yaml(ref_na, cand_na, key = "id", path = rules_na_no)$all_passed
```

`na_equal` applies to all column types, including numeric (with tolerance),
character, logical, and date columns.

---

## 10. Column management

### Ignoring columns

Columns listed in `ignore_columns_default` are excluded from comparison.
Presence/absence checks for those columns are also skipped:

```{r ignore-columns}
ref_ign  <- data.frame(id = 1:3, value = 1:3, updated_at = Sys.time())
cand_ign <- data.frame(id = 1:3, value = 1:3,
                       updated_at = Sys.time() + 3600) # different timestamp

rules_ign <- tempfile(fileext = ".yaml")
write_rules_template(ref_ign, key = "id", path = rules_ign,
                     ignore_columns_default = "updated_at")

compare_datasets_from_yaml(ref_ign, cand_ign, key = "id", path = rules_ign)$all_passed
```

### Missing and extra columns

Columns present in the reference but absent from the candidate generate a
dedicated failing step. Extra columns in the candidate are reported but do not
cause a failure:

```{r col-analysis}
ref_cols  <- data.frame(id = 1:2, a = 1:2, b = 1:2)
cand_cols <- data.frame(id = 1:2, a = 1:2, c = 1:2) # b missing, c extra

result_cols <- compare_datasets_from_yaml(ref_cols, cand_cols, key = "id")
result_cols$missing_in_candidate # b
result_cols$extra_in_candidate   # c
result_cols$all_passed           # FALSE: b is missing
```

### Utility: `analyze_columns()`

`analyze_columns()` exposes the column comparison logic independently — useful
for pre-flight checks before running the full validation:

```{r analyze-columns}
analysis <- analyze_columns(ref_cols, cand_cols, ignore_columns = character(0))
str(analysis)
```

---

## 11. Key-based vs positional comparison

### With a key

A key column joins the candidate to the reference, handling different row
orders and unequal row counts gracefully:

```{r with-key}
ref_key  <- data.frame(id = 1:3, value = c(10, 20, 30))
cand_key <- data.frame(id = c(3, 1, 2), value = c(30, 10, 20)) # shuffled

result_key <- compare_datasets_from_yaml(ref_key, cand_key, key = "id")
result_key$all_passed
```

### Without a key (positional)

Rows are compared position by position.
Both datasets must have the same number of rows:

```{r positional}
ref_pos  <- data.frame(value = c(1.0, 2.0, 3.0))
cand_pos <- data.frame(value = c(1.0, 2.0, 3.0))

result_pos <- compare_datasets_from_yaml(ref_pos, cand_pos)
result_pos$all_passed
```

### Composite keys

Multiple columns can form a composite key:

```{r composite-key}
ref_comp <- data.frame(
  year  = c(2023, 2023, 2024),
  month = c(1, 2, 1),
  value = c(100, 200, 300)
)
cand_comp <- data.frame(
  year  = c(2024, 2023, 2023),
  month = c(1, 2, 1),
  value = c(300, 200, 100)
)

result_comp <- compare_datasets_from_yaml(ref_comp, cand_comp, key = c("year", "month"))
result_comp$all_passed
```

### Key in YAML vs key parameter

The `key` parameter to `compare_datasets_from_yaml()` takes precedence over the
`keys` field in the YAML `defaults` section. This lets you reuse a shared YAML
file while overriding the join key programmatically:

```{r key-override, results = "hide"}
rules_key <- tempfile(fileext = ".yaml")
write_rules_template(ref_comp, key = "year", path = rules_key) # YAML says year

# Override at call time with the composite key
result_override <- compare_datasets_from_yaml(
  ref_comp, cand_comp,
  key  = c("year", "month"), # overrides YAML
  path = rules_key
)
result_override$all_passed
```

### Duplicate key detection

If key values are not unique, `datadiff` warns before running the comparison:

```{r duplicate-keys, warning = TRUE}
ref_dup  <- data.frame(id = c(1, 1, 2), value = c(10, 11, 20))
cand_dup <- data.frame(id = c(1, 2), value = c(10, 20))

tryCatch(
  compare_datasets_from_yaml(ref_dup, cand_dup, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
```

---

## 12. Type mismatch detection

When a column has incompatible types in reference and candidate, `datadiff`
warns and adds a dedicated failing step — instead of silently coercing or
crashing:

```{r type-mismatch, warning = TRUE}
ref_type  <- data.frame(id = 1:2, year = c(2023L, 2024L))   # integer
cand_type <- data.frame(id = 1:2, year = c("2023", "2024")) # character

tryCatch(
  compare_datasets_from_yaml(ref_type, cand_type, key = "id"),
  warning = function(w) message("Warning: ", conditionMessage(w))
)
```

`integer` and `numeric` are treated as compatible types — tolerance arithmetic
works correctly across them and no mismatch is raised.

---

## 13. Utility functions

### `detect_column_types()`

Returns the `datadiff` type inferred for each column (`"integer"`, `"numeric"`,
`"character"`, `"date"`, `"datetime"`, `"logical"`):

```{r detect-types}
df_types <- data.frame(
  id        = 1L,
  amount    = 1.5,
  label     = "x",
  flag      = TRUE,
  day       = Sys.Date(),
  timestamp = Sys.time()
)
detect_column_types(df_types)
```

These are the same types used to match columns against `by_type` rules in the
YAML.

### `derive_column_rules()`

Shows the merged per-column rules for a given dataset and rules object —
equivalent to `result$applied_rules` but callable without running the full
comparison:

```{r derive-rules}
rules_obj2 <- read_rules(rules_path)
merged <- derive_column_rules(ref, rules_obj2)
merged$revenue
merged$category
```

### `analyze_columns()`

Already shown in section 10. Useful to quickly check which columns are common,
missing, or extra before committing to a full validation run.

### `preprocess_dataframe()`

Applies text normalization rules to a dataframe.
Useful for inspecting what the data looks like after normalization, before
comparing:

```{r preprocess}
df_raw <- data.frame(label = c(" Hello ", "WORLD", " Foo "))
rules_norm <- list(
  label = list(equal_mode = "normalized", case_insensitive = TRUE, trim = TRUE)
)
preprocess_dataframe(df_raw, rules_norm)
```

### `add_tolerance_columns()`

Adds the `__absdiff`, `__thresh`, and `__ok` diagnostic columns to a comparison
dataframe. Useful for debugging which rows are right on the edge of the
tolerance threshold:

```{r tolerance-debug}
cmp <- data.frame(
  value            = c(1.005, 1.02, 1.0),
  value__reference = c(1.000, 1.00, 1.0)
)
rules_debug <- list(value = list(abs = 0.01, rel = 0))

cmp_annotated <- add_tolerance_columns(cmp, "value", rules_debug,
                                       ref_suffix = "__reference", na_equal = TRUE)
cmp_annotated[, c("value__absdiff", "value__thresh", "value__ok")]
```

---

## 14. Language and locale

By default, pointblank reports are rendered in English. You can change the
language per call or globally for a session.

### Per call

```{r lang-per-call, eval = FALSE}
result_fr <- compare_datasets_from_yaml(
  ref, cand,
  key    = "id",
  lang   = "fr",
  locale = "fr_FR"
)
```

### Global option

Set once in your script or `.Rprofile` and all subsequent calls will use it:

```{r lang-global, eval = FALSE}
options(datadiff.lang = "fr", datadiff.locale = "fr_FR")

# All calls now produce French reports without passing lang/locale every time
result <- compare_datasets_from_yaml(ref, cand, key = "id", path = rules_path)
```

Supported languages include `"en"`, `"fr"`, `"de"`, `"it"`, `"es"`, `"pt"`,
`"zh"`, `"ja"`, `"ru"`. See the pointblank documentation for the full list.

---

## 15. Working with large datasets

### Lazy tables (dbplyr)

Any SQL-backed table wrapped in `dplyr::tbl()` can be passed directly.
The join, normalization, and boolean expressions are pushed down to SQL — no
data is loaded into R until the final slim result table:

```{r lazy-dbplyr, eval = FALSE}
library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "reference", ref)
DBI::dbWriteTable(con, "candidate", cand)

tbl_ref  <- dplyr::tbl(con, "reference")
tbl_cand <- dplyr::tbl(con, "candidate")

result_lazy <- compare_datasets_from_yaml(
  tbl_ref, tbl_cand,
  key  = "id",
  path = rules_path
)
result_lazy$all_passed

DBI::dbDisconnect(con)
```

This works with any DBI-compatible backend: SQLite, PostgreSQL, Snowflake, etc.

### Arrow / Parquet (out-of-core)

For files too large to fit in RAM, pass `arrow::open_dataset()` directly — the
package handles the Arrow → DuckDB conversion internally with a single private
connection:

```{r large-arrow, eval = FALSE}
library(arrow)

ds_ref  <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")

# Generate a template from the schema (no data loaded into RAM)
write_rules_template(ds_ref, key = "id", path = "rules.yaml")

result <- compare_datasets_from_yaml(
  data_reference = ds_ref,
  data_candidate = ds_cand,
  key  = "id",
  path = "rules.yaml",
  duckdb_memory_limit = "8GB" # tune to your machine's RAM
)
result$all_passed
```

> Do **not** call `arrow::to_duckdb()` yourself before passing to `datadiff`.
> The package opens its own private DuckDB connection; passing pre-converted
> tables from a different connection will cause a cross-connection join error.
### Memory tuning

| Machine RAM | Recommended `duckdb_memory_limit` |
|-------------|-----------------------------------|
| 8 GB | `"3GB"` |
| 16 GB | `"6GB"` |
| 32 GB | `"8GB"` (default) |
| 64 GB+ | `"20GB"` |

---

## Summary

| Function | Role |
|----------|------|
| `write_rules_template()` | Generate a YAML rules template from a reference dataset |
| `read_rules()` | Load and validate a YAML rules file |
| `compare_datasets_from_yaml()` | Compare reference and candidate datasets |
| `detect_column_types()` | Inspect the type inferred for each column |
| `derive_column_rules()` | See the merged per-column rules for a dataset + rules pair |
| `analyze_columns()` | Compare column structure between two datasets |
| `preprocess_dataframe()` | Apply text normalization rules to a dataframe |
| `add_tolerance_columns()` | Add `__absdiff`, `__thresh`, `__ok` columns for debugging |