From 5cd006c4a04fc32b7e64fa011402c52cb96cab39 Mon Sep 17 00:00:00 2001
From: GitHub Actions
Date: Sat, 20 Jan 2024 22:12:57 +0000
Subject: [PATCH] differences for PR #152

---
 04-data-structures-part2.md     | 152 ++++++++++++++++----------------
 fig/06-rmd-generate-figures.sh  |   0
 fig/12-plyr-generate-figures.sh |   0
 md5sum.txt                      |   2 +-
 4 files changed, 77 insertions(+), 77 deletions(-)
 mode change 100755 => 100644 fig/06-rmd-generate-figures.sh
 mode change 100755 => 100644 fig/12-plyr-generate-figures.sh

diff --git a/04-data-structures-part2.md b/04-data-structures-part2.md
index aea558229..7a88710f5 100644
--- a/04-data-structures-part2.md
+++ b/04-data-structures-part2.md
@@ -36,10 +36,10 @@ So far, you have seen the basics of manipulating data frames with our nordic dat ::::::::::::::::::::::::::::::::::::::::: instructor -Pay attention to and explain the errors and warnings generated from the +Pay attention to and explain the errors and warnings generated from the examples in this episode. -::::::::::::::::::::::::::::::::::::::::: +::::::::::::::::::::::::::::::::::::::::: ```r @@ -72,12 +72,12 @@ gapminder <- read.csv("data/gapminder_data.csv") ```r -gapminder <- read.csv("https://datacarpentry.org/r-intro-geospatial/data/gapminder_data.csv", stringsAsFactors = TRUE) #in R version 4.0.0 the default stringsAsFactors changed from TRUE to FALSE. But because below we use some examples to show what is a factor, we need to add the stringAsFactors = TRUE to be able to perform the below examples with factor. +gapminder <- read.csv("https://datacarpentry.org/r-intro-geospatial/data/gapminder_data.csv") ``` - You can read directly from excel spreadsheets without converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package. 
- + :::::::::::::::::::::::::::::::::::::::::::::::::: @@ -266,11 +266,10 @@ gapminder[sample(nrow(gapminder), 5), ] ## Challenge 2 -Read the output of `str(gapminder)` again; this time, use what you've learned -about factors and vectors, as well as the output of functions like `colnames` -and `dim` to explain what everything that `str` prints out for `gapminder` -means. If there are any parts you can't interpret, discuss with your -neighbors! +Read the output of `str(gapminder)` again; this time, use what you've learned, +as well as the output of functions like `colnames` and `dim` to explain what +everything that `str` prints out for `gapminder` means. If there are any parts +you can't interpret, discuss with your neighbors! ::::::::::::::: solution @@ -293,17 +292,6 @@ We would like to create a new column to hold information on whether the life exp ```r below_average <- gapminder$lifeExp < 70.5 -head(gapminder) -``` - -```{.output} - country year pop continent lifeExp gdpPercap -1 Afghanistan 1952 8425333 Asia 28.801 779.4453 -2 Afghanistan 1957 9240934 Asia 30.332 820.8530 -3 Afghanistan 1962 10267083 Asia 31.997 853.1007 -4 Afghanistan 1967 11537966 Asia 34.020 836.1971 -5 Afghanistan 1972 13079460 Asia 36.088 739.9811 -6 Afghanistan 1977 14880372 Asia 38.438 786.1134 ``` We can then add this as a column via: @@ -313,17 +301,6 @@ We can then add this as a column via: cbind(gapminder, below_average) ``` - -```{.output} - country year pop continent lifeExp gdpPercap below_average -1 Afghanistan 1952 8425333 Asia 28.801 779.4453 TRUE -2 Afghanistan 1957 9240934 Asia 30.332 820.8530 TRUE -3 Afghanistan 1962 10267083 Asia 31.997 853.1007 TRUE -4 Afghanistan 1967 11537966 Asia 34.020 836.1971 TRUE -5 Afghanistan 1972 13079460 Asia 36.088 739.9811 TRUE -6 Afghanistan 1977 14880372 Asia 38.438 786.1134 TRUE -``` - We probably don't want to print the entire dataframe each time, so let's put our `cbind` command within a call to `head` to return only the first six lines 
of the output. @@ -399,7 +376,7 @@ Let's overwrite the content of gapminder with our new data frame. ```r -below_average <- as.logical(gapminder$lifeExp<70.5) +below_average <- as.logical(gapminder$lifeExp < 70.5) gapminder <- cbind(gapminder, below_average) ``` @@ -422,24 +399,34 @@ tail(gapminder_norway) 1705 Norway 2016 5000000 Nordic 80.300 49400.0000 FALSE ``` -To understand why R is giving us a warning when we try to add this row, let's learn a little more about factors. ## Factors Here is another thing to look out for: in a `factor`, each different value -represents what is called a `level`. In our case, the `factor` "continent" has 5 -levels: "Africa", "Americas", "Asia", "Europe" and "Oceania". R will only accept -values that match one of the levels. If you add a new value, it will become -`NA`. - -The warning is telling us that we unsuccessfully added "Nordic" to our -*continent* factor, but 2016 (a numeric), 5000000 (a numeric), 80.3 (a numeric), -49400\.0 (a numeric) and `FALSE` (a logical) were successfully added to -*country*, *year*, *pop*, *lifeExp*, *gdpPercap* and *below\_average* -respectively, since those variables are not factors. 'Norway' was also -successfully added since it corresponds to an existing level. To successfully -add a gapminder row with a "Nordic" *continent*, add "Nordic" as a *level* of -the factor: +represents what is called a `level`. + +Let's convert the columns continent and country into factors: + + +```r +gapminder$continent <- factor(gapminder$continent) +gapminder$country <- factor(gapminder$country) +str(gapminder) +``` + +```{.output} +'data.frame': 1704 obs. of 7 variables: + $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ... + $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... + $ pop : num 8425333 9240934 10267083 11537966 13079460 ... + $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ... + $ lifeExp : num 28.8 30.3 32 34 36.1 ... 
+ $ gdpPercap : num 779 821 853 836 740 ... + $ below_average: logi TRUE TRUE TRUE TRUE TRUE TRUE ... +``` + +In our case, the `factor` "continent" has 5 levels: "Africa", "Americas", +"Asia", "Europe" and "Oceania": ```r @@ -447,55 +434,69 @@ levels(gapminder$continent) ``` ```{.output} -NULL +[1] "Africa" "Americas" "Asia" "Europe" "Oceania" ``` +A factor is not a character. For example, if we try to add the same row from +above to our data.frame, some values will become `NA`. This is so because +"continent" and "country" are now factors and R will only accept new values +that match one of the factor's levels: + + ```r -levels(gapminder$continent) <- c(levels(gapminder$continent), "Nordic") -gapminder_norway <- rbind(gapminder, - list('Norway', 2016, 5000000, 'Nordic', 80.3,49400.0, FALSE)) +new_row <- list('Norway', 2016, 5000000, 'Nordic', 80.3, 49400.0, FALSE) +gapminder_norway <- rbind(gapminder, new_row) ``` ```{.warning} -Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c("Asia", "Asia", : -invalid factor level, NA generated +Warning in `[<-.factor`(`*tmp*`, ri, value = "Nordic"): invalid factor level, +NA generated ``` +This warning is telling us that we unsuccessfully added "Nordic" to our +*continent* factor (see below), but 2016 (a numeric), 5000000 (a numeric), 80.3 +(a numeric), 49400\.0 (a numeric) and `FALSE` (a logical) were successfully +added to *country*, *year*, *pop*, *lifeExp*, *gdpPercap* and *below\_average* +respectively, since those variables are not factors. 'Norway' was also +successfully added since it corresponds to an existing level. 
+ + ```r -tail(gapminder_norway) +tail(gapminder_norway, n = 1) ``` ```{.output} - country year pop continent lifeExp gdpPercap below_average -1700 Zimbabwe 1987 9216418 62.351 706.1573 TRUE -1701 Zimbabwe 1992 10704340 60.377 693.4208 TRUE -1702 Zimbabwe 1997 11404948 46.809 792.4500 TRUE -1703 Zimbabwe 2002 11926563 39.989 672.0386 TRUE -1704 Zimbabwe 2007 12311143 43.487 469.7093 TRUE -1705 Norway 2016 5000000 Nordic 80.300 49400.0000 FALSE + country year pop continent lifeExp gdpPercap below_average +1705 Norway 2016 5e+06 80.3 49400 FALSE ``` -Alternatively, we can change a factor into a character vector; we lose the handy -categories of the factor, but we can subsequently add any word we want to the -column without babysitting the factor levels: +To successfully add a row with a "Nordic" *continent*, add "Nordic" as a +*level* of the factor: ```r -str(gapminder) +levels(gapminder$continent) <- c(levels(gapminder$continent), "Nordic") +``` + +And then add the Norway row again: + + +```r +gapminder_norway <- rbind(gapminder, + list('Norway', 2016, 5000000, 'Nordic', 80.3,49400.0, FALSE)) +tail(gapminder_norway, n = 1) ``` ```{.output} -'data.frame': 1704 obs. of 7 variables: - $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ... - $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... - $ pop : num 8425333 9240934 10267083 11537966 13079460 ... - $ continent : chr "Asia" "Asia" "Asia" "Asia" ... - ..- attr(*, "levels")= chr "Nordic" - $ lifeExp : num 28.8 30.3 32 34 36.1 ... - $ gdpPercap : num 779 821 853 836 740 ... - $ below_average: logi TRUE TRUE TRUE TRUE TRUE TRUE ... + country year pop continent lifeExp gdpPercap below_average +1705 Norway 2016 5e+06 Nordic 80.3 49400 FALSE ``` +Alternatively, we can change the "continent" factor into a character vector. 
In +this way, we lose the handy categories of the factor, but we can subsequently +add any word we want to the column without babysitting the factor levels: + + ```r gapminder$continent <- as.character(gapminder$continent) str(gapminder) @@ -503,7 +504,7 @@ str(gapminder) ```{.output} 'data.frame': 1704 obs. of 7 variables: - $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ... + $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ... $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... $ pop : num 8425333 9240934 10267083 11537966 13079460 ... $ continent : chr "Asia" "Asia" "Asia" "Asia" ... @@ -521,7 +522,7 @@ vectors and rows are lists.* We can also glue two data frames together with ```r gapminder <- rbind(gapminder, gapminder) -tail(gapminder, n=3) +tail(gapminder, n = 3) ``` ```{.output} @@ -603,4 +604,3 @@ df <- cbind(df, coffeetime = c(TRUE, TRUE)) :::::::::::::::::::::::::::::::::::::::::::::::::: - diff --git a/fig/06-rmd-generate-figures.sh b/fig/06-rmd-generate-figures.sh old mode 100755 new mode 100644 diff --git a/fig/12-plyr-generate-figures.sh b/fig/12-plyr-generate-figures.sh old mode 100755 new mode 100644 diff --git a/md5sum.txt b/md5sum.txt index 36d683236..1e0293dc7 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -6,7 +6,7 @@ "episodes/01-rstudio-intro.Rmd" "f4e11815e378019213cd8bc32bd5d292" "site/built/01-rstudio-intro.md" "2023-11-21" "episodes/02-project-intro.Rmd" "00024461ca6e3ea1ec659cf9434377d4" "site/built/02-project-intro.md" "2023-11-21" "episodes/03-data-structures-part1.Rmd" "a83070b1d04789704c8173e6813aba66" "site/built/03-data-structures-part1.md" "2023-11-21" -"episodes/04-data-structures-part2.Rmd" "22100d1539c25cba0459d909f346f516" "site/built/04-data-structures-part2.md" "2023-11-21" +"episodes/04-data-structures-part2.Rmd" "1cdde317409584348e41142273f08428" "site/built/04-data-structures-part2.md" "2024-01-20" "episodes/05-data-subsetting.Rmd" 
"b673744f991a865b9996504197cc013e" "site/built/05-data-subsetting.md" "2023-11-21" "episodes/06-dplyr.Rmd" "5d6106566981f73f1e3dc6a5c011fa28" "site/built/06-dplyr.md" "2023-11-21" "episodes/07-plot-ggplot2.Rmd" "7cbd4da57c055ecbc3ee80bd2694497a" "site/built/07-plot-ggplot2.md" "2023-11-21"