% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/scale_by.R
\name{scale_by}
\alias{scale_by}
\title{Center and scale a continuous variable conditioning on factors.}
\usage{
scale_by(object = NULL, data = NULL, scale = 1)
}
\arguments{
\item{object}{A \code{\link[stats]{formula}} whose left-hand side indicates
a numeric variable to be scaled and whose right-hand side indicates
factors to condition this scaling on; or the result of a previous call
to \code{scale_by}; or the \code{pred} attribute of such a result.
See 'Details'.}

\item{data}{A data.frame containing the numeric variable to be scaled and
the factors to condition on.}

\item{scale}{Numeric (default 1).  The desired standard deviation of the
numeric variable within each factor level.  If the numeric variable is a
matrix, then \code{scale} must have either one element (used for all columns)
or as many elements as there are columns in the numeric variable.  To center
the numeric variable without scaling, set \code{scale} to \code{0}.
See 'Details'.}
}
\value{
A numeric variable which is conditionally scaled within each level
  of the conditioning factor(s), with standard deviation \code{scale}.  It has
  an additional class \code{scaledby}, as well as an attribute
  \code{pred} with class \code{scaledby_pred}, which is a list containing
  the formula, the centers and scales for known factor levels, and the
  center and scale to be applied to new factor levels.  The variable returned
  can be used as the \code{object} argument in future calls to
  \code{scale_by}, as can its \code{pred} attribute.
}
\description{
\code{scale_by} centers and scales a numeric variable within each level
of a factor (or the interaction of several factors).
}
\details{
The behavior when \code{object} is a formula and \code{scale = 1}
is described first.
The left-hand side of the formula must indicate a numeric variable
to be scaled.  The full interaction of the variables on the right-hand side
of the formula is taken as the factor to condition scaling on (i.e., it does
not matter whether they are separated with \code{+}, \code{:}, or
\code{*}).  For the remainder of this section, the numeric variable will
be referred to as \code{x} and the variables on the right-hand side will be
referred to as \code{facs}.

If \code{facs} contains more than one variable, a single factor is
created as their full interaction.  When a factor has \code{NA} values,
\code{NA} is treated as a level.  For each level of the factor which has
at least two unique non-\code{NA} \code{x} values, the mean of \code{x}
is recorded as the level's center and the standard deviation of \code{x}
is recorded as the level's scale.  The mean of these centers is recorded
as \code{new_center} and the mean of these scales is recorded as
\code{new_scale}; these two values are used as the center and scale for
factor levels with fewer than two unique non-\code{NA} \code{x} values.
Then, for each level of the factor, the level's center is subtracted from
its \code{x} values, and the result is divided by the level's scale.
The result is that any level with at least two unique non-\code{NA} \code{x}
values now has mean \code{0} and standard deviation \code{1}, and levels
with fewer than two are placed on a similar scale (though their standard
deviation is undefined).  Note that the overall standard deviation of the
resulting variable (or standard deviations if \code{x} is a matrix) will not
in general be exactly \code{1} (but will be close).  Each value of the
resulting variable expresses how far an observation is from its level's
average \code{x}, in units of within-level standard deviations.

If \code{scale = 0}, then only centering (but not scaling) is performed.
If \code{scale} is neither \code{0} nor \code{1}, then \code{x} is scaled
such that the within-level standard deviation is \code{scale}.  Note that
this differs from the \code{scale} argument to \code{\link[base]{scale}},
which specifies the number the centered variable is divided by (the
inverse of its use here).  If \code{x} is a matrix with more than
one column, then \code{scale} must either be a vector with an element for
each column of \code{x} or a single number which will be used for all
columns.  If any element of \code{scale} is \code{0}, then all elements are
treated as \code{0}.  No element in \code{scale} can be negative.

If \code{object} is not a formula, it must be a numeric variable which
resulted from a previous \code{scale_by} call, or the \code{pred} attribute
of such a numeric variable. In this case, \code{scale}
is ignored, and \code{x} in \code{data} is scaled
using the \code{formula}, \code{centers} and \code{scales} in \code{object}
(with new levels treated using \code{new_center} and \code{new_scale}).
}
\examples{
# Simulate three groups with different means and standard deviations
set.seed(1)
dat <- data.frame(
  f1 = rep(c("a", "b", "c"), c(5, 10, 20)),
  x1 = rnorm(35, rep(c(1, 2, 3), c(5, 10, 20)),
    rep(c(.5, 1.5, 3), c(5, 10, 20))))

dat$x1_scaled <- scale(dat$x1)
dat$x1_scaled_by_f1 <- scale_by(x1 ~ f1, dat)

mean(dat$x1)
sd(dat$x1)
with(dat, tapply(x1, f1, mean))
with(dat, tapply(x1, f1, sd))

mean(dat$x1_scaled)
sd(dat$x1_scaled)
with(dat, tapply(x1_scaled, f1, mean))
with(dat, tapply(x1_scaled, f1, sd))
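
# Added illustration (a base-R sketch, not part of the package): the scale
# argument of base::scale() is a divisor, whereas Details describes the scale
# argument of scale_by() as the target within-level standard deviation.
# The names z, target_sd, z_divided and z_target are hypothetical.
z <- c(2, 4, 6, 8)
target_sd <- 0.5
# base::scale() with scale = 2 divides the centered values by 2
z_divided <- as.numeric(scale(z, center = TRUE, scale = 2))
# Reaching a target sd instead means dividing by sd(z) / target_sd
z_target <- (z - mean(z)) / (sd(z) / target_sd)
sd(z_divided)
sd(z_target)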

mean(dat$x1_scaled_by_f1)
sd(dat$x1_scaled_by_f1)
with(dat, tapply(x1_scaled_by_f1, f1, mean))
with(dat, tapply(x1_scaled_by_f1, f1, sd))
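
# Added illustration (a rough base-R sketch of the within-level computation
# described in Details, not the package implementation).  The names sk,
# sk_centers and sk_scales are hypothetical, and every level is assumed to
# have at least two unique non-NA values.
sk <- data.frame(
  f = rep(c("a", "b"), c(4, 6)),
  x = c(1, 2, 3, 4, 10, 12, 14, 16, 18, 20))
# Per-level centers (means) and scales (standard deviations)
sk_centers <- with(sk, tapply(x, f, mean))
sk_scales <- with(sk, tapply(x, f, sd))
# Subtract each level's center and divide by its scale
sk$x_manual <- with(sk, as.numeric(
  (x - sk_centers[as.character(f)]) / sk_scales[as.character(f)]))
# Within-level means should be 0 and standard deviations 1, up to rounding
with(sk, tapply(x_manual, f, mean))
with(sk, tapply(x_manual, f, sd))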

newdata <- data.frame(
  f1 = c("a", "b", "c", "d"),
  x1 = rep(1, 4))

newdata$x1_pred_scaledby <- scale_by(dat$x1_scaled_by_f1, newdata)

newdata
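
# Added illustration (a base-R sketch of the new-level fallback described in
# Details, not the package implementation).  All names and values below are
# hypothetical.
known_centers <- c(a = 1, b = 2, c = 3)
known_scales <- c(a = 0.5, b = 1.5, c = 3)
# Fallback center and scale: the means of the known centers and scales
new_center <- mean(known_centers)
new_scale <- mean(known_scales)
new_f <- c("a", "d")
new_x <- c(1, 1)
# Look up each level; unseen levels (here d) give NA and use the fallback
idx <- match(new_f, names(known_centers))
ctr <- ifelse(is.na(idx), new_center, known_centers[idx])
scl <- ifelse(is.na(idx), new_scale, known_scales[idx])
(new_x - ctr) / scl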
}
\seealso{
\code{\link[base]{scale}}.
}
\author{
Christopher D. Eager <eager.stats@gmail.com>
}
