---
title: "Introduction to dbi.table"
output:
  rmarkdown::html_vignette:
    number_sections: true

vignette: >
  %\VignetteIndexEntry{Introduction to dbi.table}
  %\VignetteEngine{knitr::rmarkdown}x
  %\VignetteEncoding{UTF-8}
---

```{r, set_options, include = FALSE}
old_opts <- options(width = 90)
```

Query database tables and views over a DBI connection using `data.table`'s
`[i, j, by]` syntax, attach database schemas to the search path, and
programmatically load database catalogs.

This vignette assumes that you are already fluent with `data.table`'s syntax and
that you know how to open a database connection using the `DBI` package.

# Installation

If you haven't alreay done so, install the `dbi.table` package from CRAN.

```{r install, eval = FALSE}
install.packages("dbi.table")
```

# Getting Started

This section uses the
[Chinook Database](https://github.com/lerocha/chinook-database)
(included in the package) to demonstrate how to

  1. create a single `dbi.table` using the `dbi.table` function,
  2. maniuplate a `dbi.table` using `data.table`'s `[i, j, by]` syntax,
  3. attach a schema to the search path using the `dbi.attach` function, and
  4. load a database catalog using the `dbi.catalog` function.

The function `chinook.duckdb` returns an open `duckdb` (DBI) connection to
the sample Chinook Database.  This connection is a typical DBI connection as
returned by `DBI::dbConnect` that can be used as the `conn` argument in DBI
package functions. Let's get started by loading the package and opening the
connection.

```{r library, message = FALSE}
library(data.table) #needed for as.data.table
library(dbi.table)
chinook <- chinook.duckdb()
```


## Create a Single `dbi.table`

The `dbi.table` function takes 2 arguments: a DBI connection, and an Id
indentifying a database table or view.

```{r single_table}
my_album <- dbi.table(chinook, DBI::Id("Album"))
```

The object `my_album` is a `dbi.table`, a data structure that represents an SQL
query (which we refer to as the `dbi.table`'s *underlying SQL query*). The
`print` method displays a preview of the underlying SQL query. 

```{r my_album_print}
#print(my_album)
my_album
```

The preview has a format similar to a `data.table` with two notable exceptions.

1. The row numbers are omitted. SQL queries do not necessarily return the
   result set in a reliable order (even on subsequent evaluations of the same
   query), and `dbi.table` does not make any extra effort to order the rows by
   default.

2. Only the first 5 rows of the `dbi.table` are displayed (`data.table` displays
   the first 5 and the last 5). Again, since the result set does not have a
   reliable order, it is not possible to say which rows are the first and which
   are the last. The rows displayed are the first 5 returned by the RDBMS.

The function `as.data.table` executes the `dbi.table`'s underlying SQL query and
retrieves the result set as a `data.table`. Pro tip: calling the extract method
(`[]`) with no arguments is a shortcut for `as.data.table`.

```{r fetch_data_table}
#as.data.table(my_album)
my_album[]
```

Since the result set is instantiated locally as a `data.table`, the row numbers
and the last 5 rows are displayed.

Note: by default, `as.data.table` (and its `[]` shortcut) fetches a
maximum of 10,000 rows. To override this limit, either set the option
`dbitable.max.fetch` or call `as.data.table` and provide the `n` argument
(e.g., `n = -1` to fetch the entire result set). 

Lastly, the `csql` function displays the `dbi.table`'s underlying SQL query.

```{r my_album_csql}
csql(my_album)
```

The underlying SQL query of a newly created `dbi.table` selects all the columns
from the database table.

## Manipulate a `dbi.table` using `data.table` Syntax

This extract from `data.table`'s *Introduction to data.table* vignette pretty much
sums up what `dbi.table` does.

```{r xref, eval = FALSE}
DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by
```

In general, `dbi.table` should be able to handle basic `data.table` syntax. SQL
translation is done by `dbplyr::translate_sql_` which works with a variety
of R functions. However, complicated expressions (e.g., custom functions in `j`,
nested aggregation functions, most special symbols) do not work.

Best practice is to use `dbi.table` to subset and wrangle on the database, then
`data.table` to fine tune locally.

When `i` is a logical expression of the variables in the `dbi.table` then it
becomes the *WHERE* clause in the `dbi.table`'s underlying SQL query.

```{r i_where_query}
csql(my_album[AlbumId == ArtistId + 1])
```

```{r i_where}
my_album[AlbumId == ArtistId + 1]
```

When `i` is a call to `order` (or `chorder`), it becomes the *ORDER BY* clause
in the `dbi.table`'s underlying SQL query.

```{r i_order_query}
csql(my_album[order(nchar(Title), -AlbumId)])
```

```{r i_order}
my_album[order(nchar(Title), -AlbumId)]
```

When `j` is a list of expressions of the variables in the `dbi.table`, then `j`
becomes the *SELECT* clause in the `dbi.table`'s underlying SQL query.

```{r j_list}
my_album[, .(AlbumId, Title)]
```

When `by` is a list of expressions of the variables in the `dbi.table`, then
`by` becomes the *GROUP BY* clause in the `dbi.table`'s underlying SQL query.

```{r by_list_query}
csql(my_album[, .("# of Albums" = .N), .(ArtistId)])
```

```{r by_list}
my_album[, .("# of Albums" = .N), .(ArtistId)]
```

## Attach a Schema to the Search Path

The `dbi.attach` function *attaches* a DBI connection to the search path. This
means that `dbi.attach` creates a `dbi.table` for each table and view in the
schema associated with the DBI connection, then assigns these `dbi.table`s to an
environment on the search path.

```{r dbi.attach}
dbi.attach(chinook)
```

A quick look at the search path shows the database attached in position 2.

```{r search_path}
head(search(), 3)
```

The tables and views in the database schema are queriable as `dbi.table`s in the
attached environment `r search()[[2L]]`.

```{r ls_chinook}
ls("duckdb:chinook_duckdb")
```

Note: Attaching a DBI connection is intended for an interactive exploratory
analysis of a database (schema). For programatic use cases, see the *Load a
Database Catalog* section.

### Merging `dbi.table`s

Two `dbi.table`s that share the same connection (for example, all the
`dbi.table`s in an attached schema share the same connection) can be merged.
Merging two `dbi.table`s results in an SQL join that describes the same result
set as the associated `data.table` merge. That is,

```{r merge_dts, eval = FALSE}
merge(as.data.table(Album), as.data.table(Artist), by = "ArtistId")
```

and

```{r merge_dbit, eval = FALSE}
as.data.table(merge(Album, Artist, by = "ArtistId"))
```

are the same `data.table` up to row order.

```{r merge_dt_like}
csql(merge(Album, Artist, by = "ArtistId"))
```

If `dbi.table` can determine the *foreign key* constraints between `x` and `y`,
and if there is only one foreign key (either `x` referring to `y`, or `y`
referring to `x`), then this foreign key is used as the default `by` when
merging. Otherwise, `dbi.table` uses the same algorithm to determine the default
`by` columns as `data.table`.

```{r merge_no_by}
csql(merge(Customer, Employee))
```

In this case, the `Customer` table has a foreign key `SupportRepId` that refers
to the `Employee` table's primary key `EmployeeId`.

When the `y` argument is omitted, `dbi.table`'s `merge` uses the foreign key
constraints that `x` references to determine the `y` (or `y`s) to merge with.

When `y` is missing, `merge.dbi.table` merges with the tables referenced by
`x`'s foreign keys (one merge for each foreign key).

```{r merge_no_y}
csql(merge(Track))
```

When the optional `recursive` argument is `TRUE`, `merge.dbi.table` calls
`merge` with `recuresive = TRUE` on each referenced table and merges `x`
with the result. In this example, `Track` has a foreign key that references
`Album` and `Album` has a foreign key that references `Artist`. The columns
from `Artist` are included in the result set when `merge` is called with
`recuresive = TRUE`.

```{r merge_no_y_rec}
csql(merge(Track, recursive = TRUE))
```


## Load a Database Catalog

As a best practice for programatic use, it is better to load the catalog in
order to avoid modifying the search path.

```{r dbi_catalog}
catalog <- dbi.catalog(chinook)
```

Printing the catalog lists its schemas.

```{r print_dbi_catalog}
catalog
```

Individual tables can be accessed using `catalog$schema$table` syntax.

```{r dbi_catalog_table}
catalog$main$Album
```

# Scope

This section provides a brief explanation of what the `dbi.table` package is
trying to do.

Suppose that `x` is a `dbi.table` and that `e` is an expression involving `x`
that returns either a `dbi.table` or a `data.table`.

```{r scope_example}
x <- dbi.table(chinook, DBI::Id("Album"))
e <- quote(x[, .("# of Albums" = .N), .(ArtistId)])
```

Since `dbi.table`'s syntax is a subset of `data.table`'s syntax, if `e` can be
evaluated successfully (i.e., `eval(e)` does not throw an error), then `e`
should also be able to be successfully evaluated when `x` is a `data.table`.
There are thus 2 paths to the final `data.table` result:

1. evaluate `e` then coerce the result using `as.data.table`, or

2. coerce `x` to a `data.table` then evaluate `e`.

Path 2 is referred to as the reference implementation and describes the
*correct* answer: the *reference result set*. The design goal of `dbi.table` is
to get the same result set as the reference result set, up to row order.

```{r reference_check}
result_set <- as.data.table(eval(e))
x <- as.data.table(x)
reference_result_set <- eval(e)
all.equal(reference_result_set, result_set, ignore.row.order = TRUE)
```

The `dbi.table` package includes the function `reference.test` that compares the
result set to the reference result set in the more general case where `expr`
(the function's first argument) is an expression involving 1 or more
`dbi.table`s.

```{r reference.test}
x <- dbi.table(chinook, DBI::Id("Album"))
reference.test({
  x[, .("# of Albums" = .N), .(ArtistId)]
})
```

This function is used extensively in `dbi.table`'s unit/regression tests.

# Cleaning Up

We used the `chinook.duckdb` function to open a DBI connection at the beginning
of this vignette and now it is up to us to close it.

```{r disconnect}
DBI::dbDisconnect(chinook)
```

However, this leaves our R session in a wonky state. The environment
"duckdb:chinook_duckdb" is still attached and there are several `dbi.table`s in
the global environment - all of these `dbi.table`s are associated with an
invalid DBI connection.

```{r bork, error = TRUE}
#A dbi.table in the duckdb:chinook_duckdb environment
Genre
```

The R objects associated with our now-closed DBI connection need to be cleaned
up manually (or you could just restart R).

```{r clean_up_manually}
detach("duckdb:chinook_duckdb")
rm(catalog, my_album, x)
```

## Connection Management

Alternatively, when using either `dbi.attach` or `dbi.catalog`, the first
arguement can be a zero-argument function that returns an open DBI connection.
When `dbi.table` uses a function to open the DBI connection, then that
connection belongs to `dbi.table` and `dbi.table` will take care of closing it
when it is no longer needed.

```{r attach_function}
dbi.attach(chinook.duckdb)
```

When `dbi.table` is managing the connection, then all the user has to do is
detach (or delete if a catalog). The DBI connection will be closed when the
object is garbage collected.

```{r clean_up}
detach("duckdb:chinook_duckdb")
```

Further, when `dbi.table` owns the connection, it is able to reconnect in the
event that the connection unexpectedly drops.

```{r, restore_options, include = FALSE}
options(old_opts)
```