Please see the tidycode website for full documentation.
The tidycode package is an attempt to make analyzing R code tidy. It is modeled after the tidytext package.
One way to analyze code is to read in existing R files. The read_rfiles() function parses your R files into individual R calls, recording the original file path along with the line number for each call. The tidycode package includes some example files, with the paths accessible via the tidycode_example() function. Let’s examine two: the example_plot.R file and the example_analysis.R file.
cat(readLines(tidycode_example("example_plot.R")), sep = '\n')
#> library(tidyverse)
#>
#> starwars %>%
#> select(height, mass) %>%
#> filter(!is.na(mass), !is.na(height)) %>%
#> ggplot(aes(height, mass)) +
#> geom_point()
cat(readLines(tidycode_example("example_analysis.R")), sep = '\n')
#> library(tidyverse)
#> library(rms)
#>
#> starwars %>%
#> mutate(bmi = mass / ((height / 100) ^ 2)) %>%
#> select(bmi, gender) -> starwars
#>
#> dd <- datadist(starwars)
#> options(datadist = "dd")
#>
#> mod <- ols(bmi ~ gender, data = starwars) %>%
#> summary()
#>
#> plot(mod)
Using the read_rfiles() function, we can read them in as a tidy data frame.
(d <- read_rfiles(
tidycode_example("example_plot.R"),
tidycode_example("example_analysis.R")
))
#> # A tibble: 9 x 3
#> file expr line
#> <chr> <list> <int>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 1
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 2
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 1
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 2
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 3
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 4
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 5
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 6
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 7
This tidy data frame has one row per R call in the original file. It places the file path in the file column, the R call in the expr column, and the line number in the line column. Since this is in a tidy format, we can manipulate it using common data manipulation functions.
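To make "one row per R call" concrete, here is a minimal base R sketch that splits a short script into its top-level calls. It uses base R's parse() on an inline character vector rather than read_rfiles(), and numbering the calls by position is an assumption made for illustration, not a description of how the line column is computed.

```r
# A small script held in a character vector; the second call spans three lines.
code <- c(
  "library(stats)",
  "",
  "x <- mean(",
  "  c(1, 2, 3)",
  ")"
)

# parse() returns one element per top-level call, however many lines it spans.
exprs <- parse(text = code)

# Pair each call's position with its deparsed text (hypothetical layout).
calls <- data.frame(
  position = seq_along(exprs),
  call = vapply(
    exprs,
    function(e) paste(deparse(e), collapse = " "),
    character(1)
  ),
  stringsAsFactors = FALSE
)
print(calls)
```

The five lines of text collapse to two rows, because a call that spans several lines is still a single call.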
Let’s examine the first row.
d[1, ]
#> # A tibble: 1 x 3
#> file expr line
#> <chr> <list> <int>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/T/Rtmpn… <langua… 1
This is the first line of the example_plot.R file. We can dig into the expr list column to see what R call was made on this first line.
The call is library(tidyverse).
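One way to look inside a list column of calls is to pull out a single element with [[ and deparse it. The sketch below uses a hand-built list as a stand-in for the expr column, so it runs without tidycode installed; with the data frame above, the analogous expression would be d$expr[[1]].

```r
# Stand-in for the expr list column: two unevaluated calls built with quote().
expr <- list(
  quote(library(tidyverse)),
  quote(plot(mod))
)

# [[ returns the call itself rather than a length-one list.
first_call <- expr[[1]]

is.call(first_call)   # TRUE
deparse(first_call)   # "library(tidyverse)"
```

Because quote() does not evaluate its argument, nothing is actually loaded or plotted here.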
Similar to the tidytext package, which unnests groups of words by token (such as by word or sentence) using the unnest_tokens() function, we can unnest these calls into individual functions using the unnest_calls() function. To do this, we pipe the data frame we just created, d, into the unnest_calls() function and specify the column that contains the R calls, in this case expr.
library(dplyr)
d_funcs <- d %>%
unnest_calls(expr)
d_funcs
#> # A tibble: 35 x 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 1 libra… <list […
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 + <list […
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 %>% <list […
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 %>% <list […
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 %>% <list […
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 select <list […
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 filter <list […
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 ! <list […
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 is.na <list […
#> 10 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn… 2 ! <list […
#> # … with 25 more rows
This added two columns to our data frame: func, a character column indicating each function called, and args, a list column containing the arguments for each function. Let’s examine that first row again.
d_funcs[1, ]
#> # A tibble: 1 x 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0000gn/… 1 libra… <list […
Here the function is library, which matches what we previously observed. Examining the args list column, we see the following.
The argument for the library function on this first line is tidyverse. This aligns with what we observed: the first R call is library(tidyverse).
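The idea of unnesting a call into its functions and arguments can be sketched in base R with a recursive walk over the call. This is an illustration only: walk_call() is a hypothetical helper, not part of tidycode, and it makes no claim about how unnest_calls() is implemented.

```r
# Hypothetical helper: visit a call depth-first, recording each function name
# and its deparsed arguments. Not part of tidycode.
# Assumes no empty arguments such as the first slot in x[, 1].
walk_call <- function(x) {
  if (!is.call(x)) {
    return(NULL)  # symbols and constants contain no function calls
  }
  fn <- x[[1]]
  args <- unname(as.list(x)[-1])
  here <- data.frame(
    func = if (is.symbol(fn)) as.character(fn) else paste(deparse(fn), collapse = ""),
    args = paste(
      vapply(args, function(a) paste(deparse(a), collapse = " "), character(1)),
      collapse = ", "
    ),
    stringsAsFactors = FALSE
  )
  # Record this call first, then descend into each argument in order.
  do.call(rbind, c(list(here), lapply(args, walk_call)))
}

# A single call yields one row: the function and its argument.
first <- walk_call(quote(library(tidyverse)))

# A piped expression yields one row per nested function.
calls <- walk_call(
  quote(starwars %>% select(height, mass) %>% filter(!is.na(mass)))
)
calls$func
```

In this sketch the operators come first because the pipe is the outermost call in the parsed expression, which is the same ordering visible in the d_funcs output above.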
In text analysis, there is the concept of “stopwords”. These are often small, common filler words you want to remove before completing an analysis, such as “a” or “the”. In a tidy code analysis, we can use a similar concept to remove some functions. For example, we may want to remove the assignment operator, <-, before completing an analysis. We have compiled a list of common stop functions, available via the get_stopfuncs() function, that can be anti-joined from the data frame.
d_funcs %>%
anti_join(get_stopfuncs())
#> Joining, by = "func"
#> # A tibble: 17 x 4
#> file line func args
#> <chr> <int> <chr> <list>
#> 1 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 1 library <list [1]>
#> 2 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 select <list [2]>
#> 3 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 filter <list [2]>
#> 4 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 is.na <list [1]>
#> 5 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 is.na <list [1]>
#> 6 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 ggplot <list [1]>
#> 7 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 aes <list [2]>
#> 8 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 2 geom_po… <list [0]>
#> 9 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 3 library <list [1]>
#> 10 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 4 library <list [1]>
#> 11 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 5 mutate <named lis…
#> 12 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 5 select <list [2]>
#> 13 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 6 datadist <list [1]>
#> 14 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 7 options <named lis…
#> 15 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 8 ols <named lis…
#> 16 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 8 summary <list [0]>
#> 17 /private/var/folders/9x/fpr7_fbn4ln194by50dnpgyw0… 9 plot <list [1]>
Akin to the tidytext get_sentiments() function for sentiment analysis, the tidycode package has a get_classifications() function that outputs a classification data frame. By default, this data frame contains two classification lexicons, crowdsource and leeklab. The crowdsource lexicon was developed by Twitter users who tried out the classify Shiny application. The leeklab lexicon was curated by members of Jeff Leek’s Lab. In both lexicons, the same functions were classified multiple times by different users. The score column indicates the proportion of times a function was classified as a given class. To use only the most prevalent classification, set the include_duplicates parameter to FALSE in the get_classifications() function. To get just one lexicon, specify the lexicon parameter. Here we will merge in the crowdsource lexicon, picking the most prevalent classification by setting the include_duplicates parameter to FALSE.
d_funcs %>%
anti_join(get_stopfuncs()) %>%
inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
select(func, classification)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 15 x 2
#> func classification
#> <chr> <chr>
#> 1 library setup
#> 2 select data cleaning
#> 3 filter data cleaning
#> 4 is.na data cleaning
#> 5 is.na data cleaning
#> 6 ggplot visualization
#> 7 aes visualization
#> 8 geom_point visualization
#> 9 library setup
#> 10 library setup
#> 11 mutate data cleaning
#> 12 select data cleaning
#> 13 options setup
#> 14 summary exploratory
#> 15 plot visualization
Notice we now have one classification per function. If we left the include_duplicates parameter at its default, TRUE, we would end up with more than one classification per function, along with a score column.
d_funcs %>%
anti_join(get_stopfuncs()) %>%
inner_join(get_classifications("crowdsource")) %>%
select(func, classification, score)
#> Joining, by = "func"
#> Joining, by = "func"
#> # A tibble: 115 x 3
#> func classification score
#> <chr> <chr> <dbl>
#> 1 library setup 0.687
#> 2 library import 0.213
#> 3 library visualization 0.0339
#> 4 library data cleaning 0.0278
#> 5 library modeling 0.0134
#> 6 library exploratory 0.0128
#> 7 library communication 0.00835
#> 8 library evaluation 0.00278
#> 9 library export 0.00111
#> 10 select data cleaning 0.636
#> # … with 105 more rows
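To see how a score column relates to a single "most prevalent" classification, the following base R sketch keeps the highest-scoring class for each function. It uses a few rows copied from the output above and is only a conceptual illustration; it is not necessarily how get_classifications() selects a class when include_duplicates = FALSE.

```r
# A few rows copied from the output above (library and select only).
scores <- data.frame(
  func = c("library", "library", "library", "select"),
  classification = c("setup", "import", "visualization", "data cleaning"),
  score = c(0.687, 0.213, 0.0339, 0.636),
  stringsAsFactors = FALSE
)

# Sort so the highest score comes first within each function,
# then keep only the first row per function.
ordered <- scores[order(scores$func, -scores$score), ]
top <- ordered[!duplicated(ordered$func), c("func", "classification")]
print(top)
```

For these rows the result is setup for library and data cleaning for select, which agrees with the single-classification output shown earlier.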