Review of the first day’s material

Luke Johnston

Daniel Witte

Anna Schritz


Project management


Best practices for project management

  • Use R Projects
  • Use here::here()
  • Use a style guide for code and file naming
  • Use a consistent folder layout
  • Document your code or let your code speak for itself
  • Use sections in scripts to divide each file into logical parts
  • Save data in data/ and R scripts in R/
  • Keep R scripts concise and focused on a single goal
  • Use source() to run code in another script
  • Don't repeat yourself (DRY), aka create functions
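
A minimal sketch of two of the practices above, source() and DRY, using only base R. The file name functions.R and the helper mean_by_group() are hypothetical, and a temporary directory stands in for the project's R/ folder so the example runs anywhere. In a real R Project the path would more likely be built with here::here("R", "functions.R").

```r
# Stand-in for a project file such as R/functions.R (hypothetical name);
# a temporary directory is used so the example runs anywhere.
script_path <- file.path(tempdir(), "functions.R")

# The script defines one reusable function instead of repeating the same
# split-and-average code for every dataset (DRY).
writeLines(c(
    "mean_by_group <- function(data, value, group) {",
    "    vapply(split(data[[value]], data[[group]]), mean, numeric(1), na.rm = TRUE)",
    "}"
), script_path)

# source() runs the other script, making mean_by_group() available here.
source(script_path)

sleep_means <- mean_by_group(sleep, "extra", "group")
sleep_means
#>    1    2
#> 0.75 2.33
```

Keeping helpers like this in their own script means an analysis script only needs a single source() call at the top.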

Basics of R

# vector
1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
c("a", "b")
#> [1] "a" "b"
# data.frame
head(sleep, 2)
#>   extra group ID
#> 1   0.7     1  1
#> 2  -1.6     1  2
# Object assignment
my_name <- "Luke"
my_name
#> [1] "Luke"
# Viewing data.frames
colnames(sleep)
#> [1] "extra" "group" "ID"
str(sleep)
#> 'data.frame': 20 obs. of  3 variables:
#>  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
#>  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(sleep)
#>      extra        group        ID
#>  Min.   :-1.600   1:10   1      :2
#>  1st Qu.:-0.025   2:10   2      :2
#>  Median : 0.950          3      :2
#>  Mean   : 1.540          4      :2
#>  3rd Qu.: 3.400          5      :2
#>  Max.   : 5.500          6      :2
#>                          (Other):8

Data management and wrangling


Best practices for wrangling

  • Don't edit your raw data
  • Wrangle and manage your data using code
  • Save the final wrangled data as a CSV file in the data/ folder
  • Try to keep data "tidy" (each column is a variable and each row is an observation, so a column and row together uniquely identify a data value)
  • Make use of the %>% pipe to chain functions together
  • Use the common data wrangling "verbs":
    • dplyr: mutate(), select(), rename(), filter(), arrange(), group_by(), summarise()
    • tidyr: gather(), spread()
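
On a small scale, the practices above could look like the following. This is a sketch, assuming dplyr is installed. It uses the built-in sleep dataset instead of NHANES, and the new column names and the output file name are made up for illustration. A temporary directory stands in for the project's data/ folder.

```r
library(dplyr)

# The raw data (here the built-in `sleep` dataset) is never modified;
# the wrangled version is stored in a new object.
sleep_wrangled <- sleep %>%
    rename(ExtraSleepHours = extra, Drug = group) %>%
    mutate(SleptLonger = ExtraSleepHours > 0) %>%
    filter(!is.na(ExtraSleepHours)) %>%
    arrange(Drug, ID)

# Hypothetical output file; in a project this would be saved under data/.
output_path <- file.path(tempdir(), "sleep_wrangled.csv")
write.csv(sleep_wrangled, output_path, row.names = FALSE)

# One row per drug, with the mean extra sleep in each group.
sleep_summary <- sleep_wrangled %>%
    group_by(Drug) %>%
    summarise(MeanExtraSleep = mean(ExtraSleepHours))
sleep_summary
```

Because every step is code, the wrangled CSV can be regenerated from the raw data at any time.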

Final exercise: Review of mutate and select

# The output below is the intermediate result after mutate() and select(),
# before rename() and filter() are applied.
nhanes_wrangled <- NHANES %>%
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
    filter(Age >= 18, Age <= 75)
nhanes_wrangled
#> # A tibble: 10,000 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotChol DiabetesAge
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>   <dbl>       <int>
#> 1 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 2 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 3 2009_10  male      34    1.36  32.2      113       85    3.49          NA
#> 4 2009_10  male       4    1.07  15.3       NA       NA   NA             NA
#> # … with 9,996 more rows, and 3 more variables: nBabies <int>,
#> #   MoreThan5DaysActive <lgl>, AlcoholDay <int>

Final exercise: Review of rename

# The output below is the intermediate result after rename(),
# before filter() is applied.
nhanes_wrangled <- NHANES %>%
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
    filter(Age >= 18, Age <= 75)
nhanes_wrangled
#> # A tibble: 10,000 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotalCholesterol
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>            <dbl>
#> 1 2009_10  male      34    1.36  32.2      113       85             3.49
#> 2 2009_10  male      34    1.36  32.2      113       85             3.49
#> 3 2009_10  male      34    1.36  32.2      113       85             3.49
#> 4 2009_10  male       4    1.07  15.3       NA       NA            NA
#> # … with 9,996 more rows, and 4 more variables:
#> #   AgeDiabetesDiagnosis <int>, NumberOfBabies <int>,
#> #   MoreThan5DaysActive <lgl>, DrinksOfAlcoholInDay <int>

Final exercise: Review of filter

nhanes_wrangled <- NHANES %>%
    mutate(MoreThan5DaysActive = if_else(PhysActiveDays >= 5, TRUE, FALSE)) %>%
    select(SurveyYr, Gender, Age, Poverty, BMI, BPSysAve, BPDiaAve, TotChol,
           DiabetesAge, nBabies, MoreThan5DaysActive, AlcoholDay) %>%
    rename(TotalCholesterol = TotChol, NumberOfBabies = nBabies,
           DrinksOfAlcoholInDay = AlcoholDay, AgeDiabetesDiagnosis = DiabetesAge) %>%
    filter(Age >= 18, Age <= 75)
nhanes_wrangled
#> # A tibble: 6,964 x 12
#>   SurveyYr Gender   Age Poverty   BMI BPSysAve BPDiaAve TotalCholesterol
#>   <fct>    <fct>  <int>   <dbl> <dbl>    <int>    <int>            <dbl>
#> 1 2009_10  male      34    1.36  32.2      113       85             3.49
#> 2 2009_10  male      34    1.36  32.2      113       85             3.49
#> 3 2009_10  male      34    1.36  32.2      113       85             3.49
#> 4 2009_10  female    49    1.91  30.6      112       75             6.7
#> # … with 6,960 more rows, and 4 more variables:
#> #   AgeDiabetesDiagnosis <int>, NumberOfBabies <int>,
#> #   MoreThan5DaysActive <lgl>, DrinksOfAlcoholInDay <int>

Final exercise: Review of gather

# The output below is the intermediate result after gather() only.
nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
#> # A tibble: 69,640 x 4
#>   SurveyYr Gender Measure Value
#>   <fct>    <fct>  <chr>   <dbl>
#> 1 2009_10  male   Age        34
#> 2 2009_10  male   Age        34
#> 3 2009_10  male   Age        34
#> 4 2009_10  female Age        49
#> # … with 69,636 more rows

Final exercise: Review of group_by and summarise

# The output below is the intermediate result after group_by() and summarise().
nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Gender [4]
#>   SurveyYr Gender Measure               Mean
#>   <fct>    <fct>  <chr>                <dbl>
#> 1 2009_10  female Age                   44.0
#> 2 2009_10  female AgeDiabetesDiagnosis  48.1
#> 3 2009_10  female BMI                   29.0
#> 4 2009_10  female BPDiaAve              67.7
#> # … with 36 more rows

Final exercise: Review of arrange

# The output below is the intermediate result after arrange(),
# before spread() is applied.
nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Gender [4]
#>   SurveyYr Gender Measure  Mean
#>   <fct>    <fct>  <chr>   <dbl>
#> 1 2009_10  female Age      44.0
#> 2 2011_12  female Age      44.2
#> 3 2009_10  male   Age      43.1
#> 4 2011_12  male   Age      43.9
#> # … with 36 more rows

Final exercise: Review of spread

nhanes_wrangled %>%
    gather(Measure, Value, -SurveyYr, -Gender) %>%
    group_by(SurveyYr, Gender, Measure) %>%
    summarise(Mean = round(mean(Value, na.rm = TRUE), 2)) %>%
    arrange(Measure, Gender, SurveyYr) %>%
    spread(SurveyYr, Mean)
#> # A tibble: 20 x 4
#> # Groups:   Gender [2]
#>   Gender Measure              `2009_10` `2011_12`
#>   <fct>  <chr>                    <dbl>     <dbl>
#> 1 female Age                       44.0      44.2
#> 2 female AgeDiabetesDiagnosis      48.1      46.5
#> 3 female BMI                       29.0      28.6
#> 4 female BPDiaAve                  67.7      70.0
#> # … with 16 more rows
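
To see what gather() and spread() do in isolation, here is a small round trip on the built-in sleep dataset, assuming dplyr and tidyr are installed. The object names are illustrative. In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are the recommended replacements, though gather() and spread() still work.

```r
library(dplyr)
library(tidyr)

# Long to wide: one row per ID, one column per level of `group`,
# filled with the values of `extra`.
sleep_wide <- sleep %>%
    spread(group, extra)

# Wide to long: stack the `1` and `2` columns back into a key column
# (`group`) and a value column (`extra`), leaving ID untouched.
sleep_long <- sleep_wide %>%
    gather(group, extra, -ID)

dim(sleep_wide)
#> [1] 10  3
dim(sleep_long)
#> [1] 20  3
```

After the round trip, the group column is a character vector instead of a factor, since gather() builds the key column from column names.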