-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathanalysis.Rmd
367 lines (319 loc) · 11.1 KB
/
analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
---
title: "Pet Box Subscription analysis"
author: "Michal Lauer"
date: "`r format(Sys.time(), format = '%d. %m. %Y')`"
output:
html_document:
theme: paper
css: "assets/css/style.css"
df_print: kable
highlight: zenburn
code_folding: hide
toc: true
toc_float: true
editor_options:
chunk_output_type: console
knit: (function(inputFile, encoding) {
rmarkdown::render(inputFile,
encoding = encoding,
output_dir = "output/",
output_format = c("html_document", "pdf_document"))})
---
```{r setup, include=F}
# Environment preparation
cat("\f")
rm(list = ls())
# Clear output
if (dir.exists("output") & !isTRUE(getOption("knitr.in.progress"))) {
unlink("output", recursive = T)
}
# RMarkdown requirements for renv
require("rmarkdown")
require("yaml")
# knitr setting
knitr::opts_chunk$set(
fig.align = "center",
fig.path = "output/imgs/",
fig.width = 9,
fig.height = 7,
dev = c("svg", "png")
)
# # gtsummary theme
# gtsummary::set_gtsummary_theme(list(
# "style_number-arg:big.mark" = " "
# ))
# Function for printing vector length
len <- function(fc) length(unique(fc))
diff_abs <- function(c) abs(c[2] - c[1])
diff_rel <- function(c) round(abs(c[2] - c[1])*100, 2)
```
# Preface
Purpose of this project was to get a Data Analyst Associate certification
by DataCamp. The data is artificial and do not represent a real customer.
The owner of the assignment and the data is DataCamp.
# Introduction
PetMind is a nationwide pet product retailer in the United States. With
inflation hitting 41-year highs, the company is planning to reduce the cost of
customer retention by improving brand loyalty. The first strategy is to launch
a monthly pet box subscription in three months.
The marketing team is preparing a list of popular products for the pet box
subscription. The chief marketing officer wants to know whether the list should
only include the products being purchased more than once.
The marketing team would like to answer the following questions to help
with the decision:
> How many products are being purchased more than once? </br>
> Do the products being purchased again have better sales than others? </br>
> What products are more likely to be purchased again for different types of pets?
# Libraries
This block loads all libraries.
```{r libs, message=F, warning=F}
# Data wrangling
library(readr)
library(dplyr)
library(forcats)
# Data visualization
library(skimr)
library(ggplot2)
# Text manipulation
library(glue)
library(stringr)
library(tidytext)
```
# Data preparation
The raw data set is first inspected before any further changes are made.
```{r data_raw}
data_raw <-
read_csv(
file = "input/pet_sales.csv",
col_types = cols(
product_id = col_double(),
product_category = col_character(),
sales = col_character(),
price = col_double(),
vendor_id = col_character(),
pet_size = col_character(),
pet_type = col_character(),
rating = col_double(),
re_buy = col_double()
)
)
head(data_raw)
```
The data needs to be first cleaned. In the clean data, the
- *product_id* is factorized,
- *product_category* is factorized,
- *sales* is parsed to numerical column *sales_usd*,
- *price* is renamed to *price_usd*,
- *vendor_id* is factorized,
- *pet_size* is ordered and factorized,
- *pet_type* is factorized and transformed to title case,
- *rating* is parsed as integer, and
- *re_buy* is factorized.
```{r data_unfiltered}
pet_size_levels <- c("extra_small", "small", "medium", "large", "extra_large")
data_unfiltered <-
data_raw |>
mutate(product_id = factor(product_id),
product_category = factor(product_category),
sales_usd = parse_number(sales),
price_usd = price,
vendor_id = factor(vendor_id),
pet_size = factor(pet_size,
levels = pet_size_levels, ordered = T),
pet_type = factor(str_to_title(pet_type)),
rating = as.integer(rating),
re_buy = factor(x = as.logical(re_buy),
labels = c("TRUE" = "Rebought",
"FALSE" = "Not rebought"))) |>
relocate(sales_usd, .before = sales) |>
select(-sales)
skim_without_charts(data_unfiltered)
```
There is `r nrow(data)` spanning across `r ncol(data)` different columns.
There are `r len(data_unfiltered$pet_size)` different **pet sizes**,
`r len(data_unfiltered$pet_type)` different **pet types**, and
`r len(data_unfiltered$product_category)` **product categories**. **Product id**
and **vendor id** is different for each product, which means that product id uniquely identifies each product and each product has exactly one vendor.
**Sales** and **price** are normally distributed.
From the assignment, only *some* pet types should be considered
```{r data}
# Considered pet types
considered_types <- c("Cat", "Dog", "Fish", "Bird")
# Data filtering
data <-
data_unfiltered |>
filter(pet_type %in% considered_types) |>
mutate(pet_type = fct_drop(pet_type))
fct_count(data$pet_type)
```
# Characteristics of pet size for each animal type
```{r size-vs-type}
size_type <-
data |>
mutate(pet_size = fct_rev(pet_size),
pet_type = fct_relevel(pet_type, "Fish", after = 2),
pet_size = fct_relabel(pet_size,
.fun = ~ str_to_sentence(str_replace_all(.x, "_", " "))
))
size_type |>
ggplot(aes(x = pet_size, fill = pet_type)) +
geom_bar(show.legend = F) +
facet_wrap(vars(pet_type), nrow = 2) +
coord_flip() +
theme_classic() +
labs(
title = "Cats tend to be larger than dogs and fish tend to be smaller than birds",
x = "Pet size",
y = "Absolute count",
caption = "Michal Lauer for DataCamp, laumi.me"
)
```
# Characteristics of Product category vs. Price
```{r product-vs-price}
pallette <- c("#12db43", "#2adf4a", "#39e252", "#45e659", "#50e960",
"#5aed67", "#63f16d", "#6bf474", "#73f87b", "#7bfb81", "#83ff88")
data |>
ggplot(aes(x = reorder(product_category, price_usd, FUN = median),
fill = reorder(product_category, price_usd, FUN = median),
y = price_usd)) +
geom_boxplot(show.legend = F) +
scale_y_continuous(labels = scales::dollar_format()) +
scale_fill_manual(values = rev(pallette)) +
theme_classic() +
coord_flip() +
labs(
title = "The most expensive product categories are toys, bedding, and clothes",
subtitle = "The cheapest are supplements, food, and accessory",
x = "Product cagetory",
y = "Price ($)",
caption = "Michal Lauer for DataCamp, laumi.me"
)
```
# Relationship bettwen Pet type vs. Total sales
```{r pet-vs-total-sales}
pet_total <-
data |>
group_by(pet_type) |>
summarise(sales_usd = sum(sales_usd)) |>
mutate(pet_type = fct_reorder(.f = pet_type, .x = sales_usd),
pet_type = fct_rev(pet_type))
pet_total |>
ggplot(aes(x = pet_type, y = sales_usd, fill = pet_type)) +
geom_col(show.legend = F) +
scale_y_continuous(labels = scales::dollar_format()) +
scale_x_discrete(limits = rev) +
scale_fill_brewer(type = "div") +
theme_classic() +
labs(
title = "Most sales are made for cats and dogs",
x = "Pet type",
y = "Sales ($)",
caption = "Michal Lauer for DataCamp, laumi.me"
)
```
# Business questions
## Rebought products
The first question asked by the marketing team is:
> How many products are being purchased more than once?
```{r rebought-plot}
# Compute data set for further analysis
rbt_data <-
data |>
count(re_buy) |>
mutate(p = n/sum(n))
# Get p-valuye
rbt_p <- prop.test(x = rbt_data$n,
n = rep(nrow(data), times = 2))
# Plot data
rbt_data |>
ggplot(aes(color = re_buy)) +
# Lollipop head
geom_point(aes(x = re_buy, y = n), size = 3) +
# Lollipop body
geom_segment(aes(x = re_buy, xend = re_buy, y = 0, yend = n), size = 1) +
# Lollipop description
geom_text(aes(x = re_buy, y = n, label = glue("{n} ({round(p*100, 2)} %)")),
nudge_y = 20, color = "black") +
scale_y_continuous(name = "Absolute count",
limits = c(0, 620),
breaks = seq(0, 600, by = 100),
expand = c(0, 0),
sec.axis = sec_axis(name = "Relative proportion",
trans = ~ ./nrow(data),
breaks = seq(0, 1, by = 0.1),
labels = scales::percent_format())) +
theme_classic() +
theme(legend.position = "none") +
labs(title = "Items are usually not rebought",
subtitle = glue("Absolute difference of {diff_abs(rbt_data$n)} ",
"({diff_rel(rbt_data$p)} %), ",
"p-value: {format.pval(rbt_p$p.value)}"),
x = "Was item rebought?",
caption = "Michal Lauer for DataCamp, laumi.me"
)
```
From the figure above, it can be seen that
`r filter(rbt_data, re_buy == "Rebought")$n` products
(`r round(filter(rbt_data, re_buy == "Rebought")$p*100, 2)` %) of
products are being rebought.
## Sales for rebought items
The second question that is being asked is regarding sales.
> Do the products being purchased again have better sales than others?
```{r purchased-again}
# Formatted sale means
sf <-
data |>
group_by(re_buy) |>
summarise(mean = round(mean(sales_usd)/100, 0),
mean = prettyNum(mean, big.mark = " "),
mean = glue("{mean}K"))
# Sales p-value
sp <- t.test(sales_usd ~ re_buy, data = data)
# Sales plot
data |>
ggplot(aes(x = re_buy, y = sales_usd, fill = re_buy)) +
geom_boxplot() +
scale_y_continuous(labels = scales::dollar_format()) +
scale_fill_brewer(type = "div", palette = 4) +
theme_classic() +
labs(
title =
glue("There is no significant difference between rebought and not ",
"rebought items."),
subtitle =
glue("Rebought mean: {filter(sf, re_buy == 'Rebought')$mean}, ",
"Not rebought mean: {filter(sf, re_buy == 'Not rebought')$mean}, ",
"p-value: {format.pval(sp$p.value)}"),
x = "Was item rebought?",
y = "Sales ($)",
caption = "Michal Lauer for DataCamp, laumi.me"
)
```
There is no significant difference between sales for rebought and Not rebought
items.
## Different products for different pets
The final question asked by the marketing team is:
> What products are more likely to be purchased again for different types of
pets?
```{r more-likely-again}
mtb <-
data |>
filter(re_buy == "Rebought") |>
count(pet_type, product_category) |>
mutate(product_category = reorder_within(product_category, n, pet_type),
pet_type = fct_relevel(pet_type))
mtb |>
ggplot(aes(x = product_category, y = n, fill = pet_type)) +
geom_col(show.legend = F) +
facet_wrap(vars(pet_type), ncol = 2, scales = "free") +
coord_flip() +
scale_x_reordered() +
theme_classic() +
labs(
title = "Most rebought products are for cats and dogs",
subtitle = "Top 3 products for all animals are toys, snacks, and equipment",
x = "Product category",
y = "Total number of sales",
caption = "Michal Lauer for DataCamp, laumi.me"
)
```