# Simulation: When calculations are difficult
**Caution: in a highly developmental stage! See Section \@ref(caution).**
__Status__: Needs framing in the context of this book (as opposed to DSCI 551, its original context). Should prepare for the bootstrap.
So far, we've seen many quantities that help us communicate an uncertain outcome:
- probability
- probability mass function
- odds
- mode
- entropy
- mean
- variance / standard deviation
Sometimes, it's not easy to compute these things. In these situations, we can use __simulation__ to approximate these and other quantities. This is today's topic.
Let's set up the workspace for this lecture:
```{r, warning = FALSE}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(testthat))
```
### Learning Objectives
From this lecture, students are expected to be able to:
- Generate a random sample from a discrete distribution in R
- Reproduce the same random sample each time you re-run your code in R by setting the seed.
- Evaluate whether or not a set of observations are iid.
- Use simulation to approximate distribution properties (like mean and variance) using empirical quantities, especially for random variables involving multiple other random variables.
### Review Activity (15 min)
True or False?
1. In general, 9 parameters must be specified in order to fully describe a distribution with 9 outcomes.
2. A Binomial distribution only has one mean, but there are many Binomial distributions that have the same mean.
3. A Poisson distribution only has one mean, but there are many Poisson distributions that have the same mean.
4. A Binomial distribution is also a Bernoulli distribution, but a Bernoulli distribution is not a Binomial distribution.
### Random Samples: Terminology (5 min)
A __random sample__ is a collection of random outcomes/variables. Using symbols, a random sample of size $n$ is usually depicted as $X_1, \ldots, X_n$. We think of data as being a random sample.
Some examples of random samples:
- the first five items you get in a game of Mario Kart
- the outcomes of ten dice rolls
- the daily high temperature in Vancouver for each day in a year.
A random sample is said to be __independent and identically distributed__ (or __iid__) if
1. each pair of observations is independent, and
2. each observation comes from the same distribution.
We'll define "independent" next class, but for now, you can think of this as meaning "not influencing each other".
Sometimes, when an outcome is said to be __random__, this can either mean the outcome has some distribution (with non-zero entropy), or that it has the distribution with maximum entropy. To avoid confusion, the word __stochastic__ refers to the former (as having some uncertain outcome). For example, if a die is weighted so that "1" appears very often, would you call this die "random"? Whether or not you do, it's always _stochastic_.
The opposite of stochastic is __deterministic__: an outcome that will be known with 100% certainty.
### Seeds (5 min)
Computers can't actually generate truly random outcomes. Instead, they use something called [pseudorandom numbers](https://en.wikipedia.org/wiki/Pseudorandom_number_generator).
As an example of a basic algorithm that produces pseudo-random numbers between 0 and 1, consider starting with your choice of number $x_0$ between 0 and 1, and iterating the following equation: $$x_{i+1} = 4 x_i (1 - x_i).$$ The result will appear to be random numbers between 0 and 1. Here is the resulting sequence when we start with $x_0 = 0.3$ and iterate 1000 times:
```{r, echo = FALSE, fig.width = 8, fig.height = 2, fig.align = 'center'}
n <- 1000
x <- 0.3
for (i in 1:n) x[i+1] <- 4 * x[i] * (1 - x[i])
qplot(0:n, x) +
  labs(x = "Iteration number",
       y = "Function evaluation") +
  theme_bw()
```
Although this sequence is deterministic, it behaves like a random sample. But not entirely! All pseudorandom number generators have some pitfalls. In the case above, one pitfall is that neighbouring pairs are not independent of each other (by the very way the sequence is constructed!). There are more sophisticated algorithms that produce outcomes that more closely resemble a random sample, so most of the time, we don't have to worry about the sample not being truly random.
The __seed__ (or __random state__) in a pseudo-random number generator is some pre-specified initial value that determines the generated sequence. As long as the seed remains the same, the resulting sample will also be the same. In the case above, this is $x_0 = 0.3$. In R, if we don't explicitly set the seed, then the seed will be chosen for us.
In R, we can set the seed using the `set.seed()` function.
The seed gives us an added advantage over truly random numbers: it allows our analysis to be reproducible! If we explicitly set a seed, then someone who re-runs the analysis will get the same results.
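Here's a small sketch of this idea (using `sample()`, which is introduced in the next section): setting the same seed twice yields identical draws.
```{r}
# A minimal reproducibility sketch: the same seed gives the same sample.
set.seed(123)
roll1 <- sample(1:6, size = 3, replace = TRUE)
set.seed(123)
roll2 <- sample(1:6, size = 3, replace = TRUE)
identical(roll1, roll2)  # TRUE: same seed, same sequence of pseudorandom draws
```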
### Generating Random Samples: Code
Here, we'll look at some R functions that help us generate a random sample. We're still focusing on discrete distributions here.
#### From Finite Number of Categories (5 min)
In R, we can generate a random sample from a distribution with a finite number of outcomes using the [`sample()` function](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/sample):
- Put the outcomes as a vector in the first argument, `x`.
- Put the desired sample size in the argument `size`.
- Put `replace = TRUE` so that sampling can happen with replacement.
- Put the probabilities of the outcomes as a vector respective to `x` in the argument `prob`.
- Just a warning: if these probabilities do not add up to 1, R will not throw an error. Instead, R automatically adjusts the probabilities so that they add up to 1.
Here's an example of generating 10 items using the Mario Kart item distribution from Lecture 1. Notice that the seed is set, so that every time these lecture notes are rendered, the same results are obtained.
```{r}
set.seed(1)
outcomes <- c("banana", "bob-omb", "coin", "horn", "shell")
probs <- c(0.12, 0.05, 0.75, 0.03, 0.05)
n <- 10
sample(outcomes, size = n, replace = TRUE, prob = probs)
```
#### From Distribution Families (5 min)
In R, we can generate data from a distribution belonging to some parametric family using the `rdist()` function, where "`dist`" is replaced with a short-form of the distribution family's name. We can access the corresponding pmf with `ddist()`.
The following table summarizes the functions related to the distribution families we've seen so far:
| Family | R function |
|--------|------------|
| Binomial | `rbinom()` |
| Geometric | `rgeom()` |
| Negative Binomial | `rnbinom()` |
| Poisson | `rpois()` |
Here's how to use these functions:
- Put the desired sample size in the argument `n`, which comes first.
- Each parameter has its own argument. Sometimes, as in `rnbinom()`, there are more parameter arguments than needed, giving the option of different parameterizations. Be sure to specify only the exact number of parameters required to isolate a member of the distribution family!
__Example__: Generate 10 observations from a binomial distribution with probability of success 0.6 and 5 trials.
Using R:
```{r}
rbinom(10, size = 5, prob = 0.6)
```
The Negative Binomial family is an example of a function in R that allows for a different parameterization. Notice that specifying too many or too few parameters results in an error (remember, we need to specify two parameters):
```{r, error = TRUE}
rnbinom(10, size = 5)
rnbinom(10, size = 5, prob = 0.6, mu = 4)
```
### Running Simulations
So far, we've seen two ways to calculate quantities that help us communicate uncertainty (like means and probabilities):
1. The __distribution-based approach__ (using the distribution), resulting in _true values_.
2. The __empirical approach__ (using data), resulting in _approximate values_ that improve as the sample size increases.
For example, the true mean of a random variable $X$ can be calculated as $E(X) = \sum_x x P(X = x)$ using each pair of outcome and outcome's probability, or can be approximated using the empirical approach from a random sample $X_1, \ldots, X_n$ by $E(X) \approx (1/n) \sum_{i=1}^n X_i$.
This means that we can approximate these quantities by generating a sample! An analysis that uses a randomly generated data set is called a __simulation__.
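Before running a full simulation, here's a minimal sketch of the two approaches side by side for a fair six-sided die (an illustration added here, not a lecture example):
```{r}
# Distribution-based vs. empirical mean of a fair die.
set.seed(2)
die <- 1:6
sum(die * (1 / 6))                               # distribution-based: E(X) = sum of x * P(X = x)
mean(sample(die, size = 10000, replace = TRUE))  # empirical: average of a random sample
```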
#### Code for empirical quantities (0 min)
For your reference, here are some hints for calculating empirical quantities using R. We'll be going over these below in the "Basic Simulation" section.
- `mean()` calculates the sample average.
- `var()` calculates the sample variance (the $n-1$ version, not $n$), and `sd()` its square root for the standard deviation.
- For a single probability, remember that a probability is a long-run proportion: just calculate the mean of a logical condition (see the short sketch after this list).
- For an entire pmf, use the `table()` function, or more conveniently, the `janitor::tabyl()` function.
- For the mode, either get it manually using the `table()` or `janitor::tabyl()` function, or you can use `DescTools::Mode()`.
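Here's a tiny sketch of the first few hints on a made-up toy sample (the values are arbitrary, chosen only for illustration):
```{r}
# Empirical probability and empirical pmf on a toy sample.
x <- c(0, 1, 1, 2, 2, 2, 3)
mean(x == 2)       # empirical probability of seeing a 2: the mean of a condition
janitor::tabyl(x)  # empirical pmf: each outcome with its proportion
```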
#### Basic Simulation (10 min)
Consider playing games with probability of success $p=0.7$ until you experience $k=5$ successes, and counting the number of failures. This random variable (say $X$) has a Negative Binomial distribution.
You can find an R script containing the code for the basic simulation [in the students' repo](https://github.ubc.ca/MDS-2019-20/DSCI_551_stat-prob-dsci_students/blob/master/supplementary/lec3-basic.R).
Let's demonstrate both the distribution-based and empirical approaches to computing quantities like the mean, variance, and pmf. First, let's obtain our random sample (of, say, 10000 observations).
```{r}
set.seed(88)
k <- 5
p <- 0.7
n <- 10000
rand_sample <- rnbinom(n, size = k, prob = p)
head(rand_sample, 100)
```
Mean
```{r}
(1 - p) * k / p # True, distribution-based
mean(rand_sample) # Approximate, empirical
```
Variance
```{r}
(1 - p) * k / p^2 # True, distribution-based
var(rand_sample) # Approximate, empirical
```
Standard deviation
```{r}
sqrt((1 - p) * k / p^2) # True, distribution-based
sd(rand_sample) # Approximate, empirical
```
Probability of seeing 0
```{r}
mean(rand_sample == 0) # Approximate, empirical
dnbinom(0, size = k, prob = p) # True, distribution-based
```
pmf
```{r}
## Code without using the tidyverse:
pmf <- janitor::tabyl(rand_sample)                # Empirical
pmf$n <- NULL
pmf <- setNames(pmf, c("x", "empirical"))
pmf$actual <- dnbinom(pmf$x, size = k, prob = p)  # True
## Code using the tidyverse:
pmf <- janitor::tabyl(rand_sample) %>%
  select(x = rand_sample, empirical = percent) %>%
  mutate(actual = dnbinom(x, size = k, prob = p))
pmf %>%
  mutate(actual = round(actual, 4),            # True, distribution-based
         empirical = round(empirical, 4)) %>%  # Approximate, empirical
  DT::datatable(rownames = FALSE)
```
Here's a plot of the pmf:
```{r, fig.width = 5, fig.height = 2, fig.align = "center"}
pmf %>%
  gather(key = "method", value = "Probability", empirical, actual) %>%
  ggplot(aes(x, Probability)) +
  facet_wrap(~ method) +
  geom_col(fill = "maroon") +
  theme_bw()
```
Entropy
It turns out to be hard to calculate the actual entropy, so we will only compute the empirical:
```{r}
- sum(pmf$empirical * log(pmf$empirical))
```
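That said, as an aside (an addition here, not part of the original calculation), a rough distribution-based value can be sketched by truncating the infinite sum, since the tail probabilities beyond 50 failures are negligible for this distribution:
```{r}
# Truncated distribution-based entropy, summing only the first 51 terms.
probs_trunc <- dnbinom(0:50, size = k, prob = p)
- sum(probs_trunc * log(probs_trunc))
```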
Mode
```{r}
## Actual
pmf %>%
  filter(actual == max(actual)) %>%
  pull(x)
## Empirical
pmf %>%
  filter(empirical == max(empirical)) %>%
  pull(x)
```
Distribution-based calculations on empirical pmf
What do you think you'll get if you use the definition of mean, variance, etc. _on the empirical distribution_? You get the empirical values! Here's an example with the mean -- notice that they are identical.
```{r}
sum(pmf$x * pmf$empirical)
mean(rand_sample)
```
Law of Large Numbers
To demonstrate that a larger sample size improves the approximation of the empirical quantities, let's see how the sample average changes as we collect more and more data:
```{r, fig.width = 8, fig.height = 3, fig.align = "center"}
tibble(i = 1:n,
       mean = cumsum(rand_sample) / i) %>%
  ggplot(aes(i, mean)) +
  geom_hline(yintercept = (1 - p) * k / p,
             colour = "maroon") +
  geom_line() +
  labs(x = "Sample size",
       y = "Empirical mean") +
  theme_bw()
```
You can try this for yourself with [Chapter 1 ("Expectation") of Seeing Theory](https://seeing-theory.brown.edu/basic-probability/index.html).
### Multi-Step Simulations (10 min)
The simulation above was not all that useful, since we could already calculate basically anything directly from the distribution. Where simulation gets more interesting is when we want to calculate things for a random variable that transforms and/or combines multiple other random variables.
The idea is that some random variables will have a distribution that depends on other random variables, but in a way that's explicit. For example, consider a random variable $T$ that we can obtain as follows. Take $X \sim \text{Poisson}(5)$, and then $T = \sum_{i = 1}^{X} D_i$, where each $D_i$ are iid with some specified distribution. In this case, to generate $T$, you would first need to generate $X$, then generate $X$ values of $D_i$, then sum those up to get $T$. This is the example we'll see here, but in general, you can have any number of dependencies, each component of which you would have to generate.
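Before the port example below, here's a minimal sketch of this recipe for a single value of $T$, assuming (purely for illustration) that each $D_i$ follows a Poisson(2) distribution:
```{r}
# Generate one observation of T = sum of D_1, ..., D_X.
set.seed(10)
x <- rpois(1, lambda = 5)   # first generate X
d <- rpois(x, lambda = 2)   # then generate X values of D_i (Poisson(2) assumed here)
sum(d)                      # finally, sum them to get one observation of T
```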
Consider an example the Port of Vancouver faces with "gang demand". Whenever a ship arrives at the port, it requests a certain number of "gangs" (groups of people) to help unload the ship. Let's suppose the number of gangs requested by a ship has the following distribution:
```{r, fig.width = 3, fig.height = 2, fig.align = "center"}
gang <- 1:4
p <- c(0.2, 0.4, 0.3, 0.1)
tibble(
  gangs = gang,
  p = p
) %>%
  ggplot(aes(gangs, p)) +
  geom_col(fill = "maroon") +
  labs(x = "Number of Gangs",
       y = "Probability") +
  theme_bw()
```
The following function sums up simulated gangs requested by a certain number of ships, with the above probability distribution as a default. As an example, check out the simulated gang request from 10 ships:
```{r}
#' Generate gang demand
#'
#' Simulates the number of gangs requested, if each ship
#' requests a random number of gangs.
#'
#' @param n_ships Number of ships that are making demands
#' @param gangs Possible gang demands made by a ship.
#' @param prob Probabilities of gang demand corresponding to "gangs"
#'
#' @return Number representing the total gang demand
demand_gangs <- function(n_ships, gangs = gang, prob = p) {
  if (length(gangs) == 1) {
    # sample() treats a single number as 1:gangs, so duplicate the lone outcome
    gangs <- c(gangs, gangs)
    prob <- c(1, 1)
  }
  requests <- sample(
    gangs,
    size = n_ships,
    replace = TRUE,
    prob = prob
  )
  sum(requests)
}
test_that("demand_gangs output is as expected", {
  expect_identical(demand_gangs(0), 0L)
  expect_gte(demand_gangs(1), min(gang))
  expect_lte(demand_gangs(1), max(gang))
  expect_gte(demand_gangs(10), 10 * min(gang))
  expect_lte(demand_gangs(10), 10 * max(gang))
  expect_identical(length(demand_gangs(10)), 1L)
  expect_identical(demand_gangs(10, gangs = 2, prob = 1), 20)
})
demand_gangs(10)
```
Now suppose that the number of ships that arrive on a given day follows the Poisson distribution with a mean of 5. What's the distribution of total gang request on a given day? Let's simulate the process to find out:
1. Generate arrival quantities for many days from the Poisson(5) distribution.
2. For each day, simulate total gang request for the simulated number of ships.
3. You now have your random sample -- compute things as you normally would.
Let's try this, obtaining a sample of 10000 days:
```{r, fig.width = 6, fig.height = 2, fig.align = "center"}
n_days <- 10000
## Step 1: generate a bunch of ships arriving each day
arrivals <- rpois(n_days, lambda = 5)
## Step 2: Simulate total gang request on each day.
total_requests <- purrr::map_int(arrivals, demand_gangs)
## Step 3: Compute things like pmf, mean, variance:
tibble(x = total_requests) %>%
  ggplot(aes(x, y = ..prop..)) +
  geom_bar() +
  labs(x = "Total Gang Request",
       y = "Probability") +
  theme_bw()
tibble(
  mean = mean(total_requests),
  variance = var(total_requests)
) %>%
  knitr::kable()
```
### Generating Continuous Data
```{r, warning = FALSE, message = FALSE, echo = FALSE}
library(tidyverse)
knitr::opts_chunk$set(echo = FALSE, fig.align = "center")
```
Until now, we've sidestepped the procedure for how a random outcome is actually generated. For the discrete case, we could get by with the "drawing from a hat" analogy. But this won't get us far in the continuous case, because each outcome has 0 probability of occurring.
The idea is to convert a random number between 0 and 1 into an outcome. Going back to the discrete case, using the Mario Kart example, we can break the interval [0, 1] into sub-intervals with widths equal to their probabilities. Visually, this might look like the following:
```{r, fig.height = 3, fig.width = 8}
mario <- tibble(
  item = c("Banana", "Bob-omb", "Coin", "Horn", "Shell"),
  prob = c(0.12, 0.05, 0.75, 0.03, 0.05)
) %>%
  mutate(item = fct_reorder(item, prob)) %>%
  arrange(desc(item)) %>%
  mutate(right = cumsum(prob),
         left = lag(right) %>% replace_na(0)) %>%
  gather(key = "position", value = "step", left, right) %>%
  group_by(item) %>%
  mutate(middle = mean(step))
ggplot(mario, aes(x = step, y = item, group = item)) +
  geom_line() +
  geom_point() +
  geom_text(aes(label = prob, x = middle), position = position_nudge(y = 0.25)) +
  theme_minimal() +
  labs(x = "Random number", y = "") +
  scale_x_reverse(breaks = seq(0, 1, by = 0.25),
                  labels = seq(0, 1, by = 0.25) %>% rev())
```
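In code, this mapping might look something like the following sketch (an addition to these notes), which cuts [0, 1] at the cumulative probabilities using `findInterval()`:
```{r, echo = TRUE}
# Convert uniform random numbers into Mario Kart items by locating each number
# within sub-intervals of [0, 1] whose widths equal the item probabilities.
set.seed(5)
items <- c("coin", "banana", "bob-omb", "shell", "horn")
probs <- c(0.75, 0.12, 0.05, 0.05, 0.03)
u <- runif(10)                             # random numbers between 0 and 1
items[findInterval(u, cumsum(probs)) + 1]  # map each number to its sub-interval's item
```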
We can make a similar plot for a Poisson(3) random variable (the y-axis is truncated because we can't plot infinitely many outcomes):
```{r, fig.height = 3, fig.width = 8}
ggplot(tibble(x = 0:1), aes(x)) +
  stat_function(fun = function(x) qpois(x, lambda = 3), n = 1000) +
  theme_minimal() +
  scale_y_continuous("Outcome", breaks = 0:8, limits = c(0, 8)) +
  xlab("Random number")
```
Indeed, this plot is nothing other than the quantile function! This idea extends to all random variables. If we want to generate an observation of a random variable $Y$ with quantile function $Q_Y$, just follow these two steps:
1. Generate a number $U$ completely at random between 0 and 1.
2. Calculate the observation as $Y = Q_Y(U)$.
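Here's a small sketch of these two steps in R, using the Poisson(3) quantile function `qpois()` (you can compare the results with `rpois()`):
```{r, echo = TRUE}
# Inverse transform sampling, sketched for a Poisson(3) random variable.
set.seed(20)
u <- runif(5)          # Step 1: numbers completely at random between 0 and 1
qpois(u, lambda = 3)   # Step 2: Y = Q_Y(U) gives Poisson(3) observations
```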
For continuous random variables only, the opposite of this result also has important implications: if $Y$ is a continuous random variable with cdf $F_Y$, then $$F_Y(Y) \sim \text{Unif}(0,1).$$ This is important for p-values in hypothesis testing (DSCI 552+), transformations, and copulas (optional question on your lab assignment).
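As a quick check of this fact (an addition here, using an Exponential distribution purely for illustration), applying the cdf to a large continuous sample should produce values that look Uniform(0, 1):
```{r, echo = TRUE, fig.width = 4, fig.height = 2, fig.align = "center"}
# Probability integral transform: F_Y(Y) ~ Unif(0, 1) for continuous Y.
set.seed(30)
y <- rexp(10000, rate = 1)   # a continuous random sample (Exponential, for illustration)
u <- pexp(y, rate = 1)       # apply the cdf to each observation
ggplot(tibble(u = u), aes(u)) +
  geom_histogram(boundary = 0, binwidth = 0.05) +  # roughly flat histogram expected
  labs(x = "F_Y(Y)", y = "Count") +
  theme_minimal()
```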