---
title: "README"
output:
  github_document:
    html_preview: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
devtools::load_all()
```
# dnddata
* [dnddata](#dnddata)
    * [Usage/installation](#usageinstallation)
    * [Examples](#examples)
    * [About the data](#about-the-data)
        * [Column/element description](#columnelement-description)
        * [Caveats](#caveats)
            * [Possible Issues with data fields](#possible-issues-with-data-fields)
            * [Possible issues with detection of unique characters](#possible-issues-with-detection-of-unique-characters)
            * [Possible issues with selection bias](#possible-issues-with-selection-bias)
This is a weekly updated dataset of characters that are submitted to my web
applications [printSheetApp](https://oganm.com/shiny/printSheetApp) and
[interactiveSheet](https://oganm.com/shiny/interactiveSheet). It is a superset
of the dataset I previously released under
[oganm/dndstats](https://oganm.github.io/dndstats) with a much larger sample
size (`r length(dnd_chars_unique_list)` characters) and more data fields. It was inspired
by the [FiveThirtyEight](https://fivethirtyeight.com/features/is-your-dd-character-rare/) article on race/class proportions, and the data seems to correlate
well with those results (see my [dndstats article](https://oganm.github.io/dndstats)).

Along with a simple table (an R `data.frame` in the package), the data is also
present in JSON format (an R `list` in the package). In the table version, some data
fields encode complex information that is represented in a more readable manner
in the JSON format. The data included is otherwise identical.
## Usage/installation
If you are an R user, you can simply install this package and load it to access the dataset:
```r
devtools::install_github('oganm/dnddata')
library(dnddata)
```
Try `?tables` and `?lists` to see the available objects and their descriptions.

If you are not an R user, access the files within the [data-raw](data-raw) directory. The files
are available as JSON and TSV. You can find the field descriptions
[below](#columnelement-description). `dnd_chars_all` files contain all characters
that are submitted, while `dnd_chars_unique` files are filtered to include only unique
characters.
## Examples
I will be using the list form of the dataset as a basis here.
Let's replicate that plot from [fivethirtyeight](https://fivethirtyeight.com/features/is-your-dd-character-rare/)
as I did in my [original article](https://oganm.github.io/dndstats/).
```{r,fig.width=9,message=FALSE}
library(purrr)
library(ggplot2)
library(magrittr)
library(dplyr)
library(reshape2)

# find all available races
races = dnd_chars_unique_list %>%
    purrr::map('race') %>%
    purrr::map_chr('processedRace') %>% trimws() %>%
    unique %>% {.[.!='']}

# find all available classes
classes = dnd_chars_unique_list %>%
    purrr::map('class') %>%
    unlist(recursive = FALSE) %>%
    purrr::map_chr('class') %>% trimws() %>% unique

# create an empty matrix
coOccurenceMatrix = matrix(0 , nrow=length(races),ncol = length(classes))
colnames(coOccurenceMatrix) = classes
rownames(coOccurenceMatrix) = races

# fill the matrix with co-occurences of race and classes
for(i in seq_along(races)){
    for(j in seq_along(classes)){
        # get characters with the right race
        raceSubset = dnd_chars_unique_list[dnd_chars_unique_list %>%
                                               purrr::map('race') %>%
                                               purrr::map_chr('processedRace') %>% {.==races[i]}]

        # get the characters with the right class. Weight multiclassed characters based on level
        raceSubset %>% purrr::map('class') %>%
            purrr::map_dbl(function(x){
                x %>% sapply(function(y){
                    (trimws(y$class) == classes[j])*y$level/(sum(map_int(x,'level')))
                }) %>% sum}) %>% sum -> coOcc

        coOccurenceMatrix[i,j] = coOcc
    }
}

# reorder the matrix a little bit
coOccurenceMatrix =
    coOccurenceMatrix[coOccurenceMatrix %>% apply(1,sum) %>% order(decreasing = FALSE),
                      coOccurenceMatrix %>% apply(2,sum) %>% order(decreasing = TRUE)]

# calculate percentages
coOccurenceMatrix = coOccurenceMatrix/(sum(coOccurenceMatrix))* 100

# remove the rows and columns if they are less than 1%
coOccurenceMatrixSubset = coOccurenceMatrix[,!(coOccurenceMatrix %>% apply(2,sum) %>% {.<1})]
coOccurenceMatrixSubset = coOccurenceMatrixSubset[!(coOccurenceMatrixSubset %>% apply(1,sum) %>% {.<1}),]

# add in class and race sums
classSums = coOccurenceMatrix %>% apply(2,sum) %>% {.[colnames(coOccurenceMatrixSubset)]}
raceSums = coOccurenceMatrix %>% apply(1,sum) %>% {.[rownames(coOccurenceMatrixSubset)]}
coOccurenceMatrixSubset = cbind(coOccurenceMatrixSubset,raceSums)
coOccurenceMatrixSubset = rbind(Total = c(classSums,NA), coOccurenceMatrixSubset)
colnames(coOccurenceMatrixSubset)[ncol(coOccurenceMatrixSubset)] = "Total"

# ggplot
coOccurenceFrame = coOccurenceMatrixSubset %>% reshape2::melt()
names(coOccurenceFrame)[1:2] = c('Race','Class')
coOccurenceFrame %<>% mutate(fillCol = value*(Race!='Total' & Class!='Total'))
coOccurenceFrame %>% ggplot(aes(x = Class,y = Race)) +
    geom_tile(aes(fill = fillCol),show.legend = FALSE)+
    scale_fill_continuous(low = 'white',high = '#46A948',na.value = 'white')+
    cowplot::theme_cowplot() +
    geom_text(aes(label = value %>% round(2) %>% format(nsmall=2))) +
    scale_x_discrete(position='top') + xlab('') + ylab('') +
    theme(axis.text.x = element_text(angle = 30,vjust = 0.5,hjust = 0))
```
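
The level weighting applied to multiclassed characters in the loop above can be illustrated on a single character. This is only a sketch: the character below is made up, and `classWeight` is a hypothetical helper that is not part of the package. It assumes the same shape the loop relies on, a list of class entries that each carry a `class` and a `level`.

```r
# Toy multiclassed character (made up for illustration): Fighter 3 / Rogue 1
toyClasses = list(
    list(class = 'Fighter', level = 3L),
    list(class = 'Rogue', level = 1L)
)

# hypothetical helper: share of a character's total level that belongs to a class
classWeight = function(classList, className){
    levels = vapply(classList, function(y) as.numeric(y$level), numeric(1))
    isClass = vapply(classList, function(y) trimws(y$class) == className, logical(1))
    sum(levels[isClass]) / sum(levels)
}

classWeight(toyClasses, 'Fighter') # 3 of 4 levels
classWeight(toyClasses, 'Rogue')   # 1 of 4 levels
classWeight(toyClasses, 'Wizard')  # no levels in this class
```

Each character therefore adds a total of 1 to the matrix, split between its classes in proportion to their levels.
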
Or try something new. Wondering which fighting style is the most popular?
```{r}
dnd_chars_unique_list %>% purrr::map('choices') %>%
    purrr::map('fighting style') %>%
    unlist %>%
    table %>%
    sort(decreasing = TRUE) %>%
    as.data.frame %>%
    ggplot(aes(x = ., y = Freq)) +
    geom_bar(stat= 'identity') +
    cowplot::theme_cowplot() +
    theme(axis.text.x= element_text(angle = 45,hjust = 1))
```
## About the data
### Column/element description
- **ip:** A shortened hash of the IP address of the submitter
- **finger:** A shortened hash of the browser fingerprint of the submitter
- **name:** A shortened hash of character names
- **race:** Race of the character as coded by the app. May be unclear as the app inconsistently codes race/subrace information. See processedRace.
- **background:** Background as it comes out of the application.
- **date:** Time & date of input. Dates before 2018-04-16 are unreliable as some were accidentally changed while moving files around.
- **class:** Class and level. Different classes are separated by | when needed.
- **justClass:** Class without level. Different classes are separated by | when needed.
- **subclass:** Subclass. Might be missing if the character is low level. Different classes are separated by | when needed.
- **level:** Total level
- **feats:** Feats chosen. Multiple feats are separated by | when needed.
- **HP:** Total HP
- **AC:** AC score
- **Str, Dex, Con, Int, Wis, Cha:** Ability score modifiers
- **alignment:** Alignment free text field. Since it's a free text field, it includes alignments written in many forms. See processedAlignment, good and lawful to get the standardized alignment data.
- **skills:** List of proficient skills. Skills are separated by |.
- **weapons:** List of weapons, separated by |. This is a free text field. See processedWeapons for the standardized version.
- **spells:** List of spells, separated by |. Each spell has its level next to it, separated by *s. This is a free text field. See processedSpells for the standardized version.
- **castingStat:** Casting stat as entered by the user. The format allows only one casting stat, so this is likely wrong if the character has different spellcasting classes. Also, due to the data format, every character has a casting stat even if they are not casters.
- **choices:** Character building choices. This field contains information about character properties such as fighting styles and skills chosen for expertise. Different choice types are separated by | when needed. The choice data is written as the name of the choice, followed by a /, followed by the choices, which are separated by *s.
- **country:** The origin of the submitter's IP
- **countryCode:** 2 letter country code
- **processedAlignment:** Standardized version of the alignment column. I have manually matched each non-standard spelling of alignment to its correct form. The first character represents lawfulness (L, N, C), the second one goodness (G, N, E). An empty string means the alignment wasn't written or was unclear.
- **good, lawful:** Isolated columns for goodness and lawfulness.
- **processedRace:** I have gone through the way the race column is filled by the app and assigned the entries to the correct races. Also includes some common races that are not natively supported, such as warforged and changelings. If empty, indicates a homebrew race not natively supported by the app.
- **processedSpells:** Formatting is same as spells. Standardized version of the spells column. Spells are matched to an official list using string similarity and some hardcoded rules.
- **processedWeapons:** Formatting is same as weapons. Standardized version of the weapons column. Created like the processedSpells column.
- **levelGroup:** Splits levels into groups. The groups represent the common ASI levels
- **alias:** A friendly alias that corresponds to each unique name.
The list version of this dataset contains all of these fields but they are organised
a little differently, keeping fields like `spells` and `processedSpells` together.
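
For readers working with the table version, the separators described above can be unpacked with base R. The following is a minimal sketch, not code from the package: the two field values are made up, `splitOn`, `parseSpells` and `parseChoices` are hypothetical helpers, and it assumes that the spell name comes before its level in each `spells` entry.

```r
# Made-up field values that follow the separators described above
spellsField  = 'Fire Bolt*0|Magic Missile*1|Shield*1'
choicesField = 'fighting style/Archery|expertise/Stealth*Perception'

# split a single string on a literal (non-regex) separator
splitOn = function(x, sep){
    strsplit(x, sep, fixed = TRUE)[[1]]
}

# spells: entries separated by |, name and level separated by *
# (assumes the order is name first, level second)
parseSpells = function(field){
    entries = splitOn(field, '|')
    parts = lapply(entries, splitOn, sep = '*')
    data.frame(
        spell = vapply(parts, function(p) p[1], character(1)),
        level = as.integer(vapply(parts, function(p) p[2], character(1))),
        stringsAsFactors = FALSE
    )
}

# choices: choice types separated by |, name and values separated by /,
# individual values separated by *
parseChoices = function(field){
    entries = splitOn(field, '|')
    parts = lapply(entries, splitOn, sep = '/')
    out = lapply(parts, function(p) splitOn(p[2], '*'))
    names(out) = vapply(parts, function(p) p[1], character(1))
    out
}

parseSpells(spellsField)
parseChoices(choicesField)
```

Since `spells` is a free text field, real entries may not split this cleanly. The list version of the dataset avoids this step because the fields are already structured.
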
### Caveats
#### Possible Issues with data fields
Some data fields are more reliable than others. Below is a summary of all potential problems with the data fields.
* **ip and browser fingerprints:** Both IP and browser fingerprints are represented as hashes.
I keep them to have an idea of individual users but have not made use of them so far. Note
that the same IP can be shared by an entire region in some cases.
* **processedAlignment:** Alignment is an optional free text field in the app, and many characters do not have an alignment entered. To create the standardized alignment fields, I went through every entry and manually assigned every alternative spelling to the standardized version. These include misspelled entries, abbreviations, entries in different languages etc. In cases where I wasn't able
to match an entry (eg. what the hell is "lawful cute"), this field was left blank. Between automatic updates,
new and exciting ways to describe alignment can come into play. Unless I have manually added these
new entries, they will also appear blank.
* **processedSpells:** The mobile app allows entering free text into the spell fields, which means I have
to deal with people writing spells in a non-standard way, with typos, abbreviations or additional information such as range or damage dice. I use some heuristics to match the entered text to a list of all published spells. In short, I look at the Levenshtein distance between the entry and the published spells and match the entry with the top result if
    * the spell level is correct and,
    * there are no more than 10 substitutions/deletions/insertions, or either the entry or the potential match includes all words that its counterpart includes.
    * In addition, there are special cases for Bigby's Hand, Tasha's Hideous Laughter and Melf's Acid Arrow as those spells are often written
in their SRD form and match to wrong spells.
```{r, echo = FALSE}
withSpells = which(dnd_chars_unique$spells !='')
withSpells %>% lapply(function(i){
    rawSpells = dnd_chars_unique$spells[i] %>% strsplit('\\|') %>% {.[[1]]}
    pSpells = dnd_chars_unique$processedSpells[i] %>% strsplit('\\|') %>% {.[[1]]}
    seq_along(rawSpells) %>% sapply(function(j){
        c(i,rawSpells[j],pSpells[j])
    }) %>% t
}) %>% do.call(rbind,.) -> spellProcessedPairs

spellCount = spellProcessedPairs %>% nrow
standardSpellCount = nrow(spellProcessedPairs[spellProcessedPairs[,3] !='*' & spellProcessedPairs[,2] == spellProcessedPairs[,3],])
nonStandardSpellCount = nrow(spellProcessedPairs[spellProcessedPairs[,3] !='*' & spellProcessedPairs[,2] != spellProcessedPairs[,3],])
mismatchCount = spellProcessedPairs[spellProcessedPairs[,3] =='*',-3] %>% nrow

nonStandardPercent = nonStandardSpellCount/spellCount * 100
mismatchPercent = mismatchCount/spellCount * 100
standardPercent = standardSpellCount/spellCount * 100
```
`r round(standardPercent)`% of all spells parsed did not
require any modification. `r round(nonStandardPercent)`% could only be
matched through the heuristics. A manual examination of a random selection of
these matches revealed 2 mistakes out of 200. `r round(mismatchPercent)`% of the spell
entries were not matched to an official spell. Manual observation of these
entries revealed that the common reasons for a failure to match are users
writing the spell under the wrong spell level, writing some class/race features
such as blindsight as spells, or adding/removing more than 10 characters when
writing the spells, either through abbreviation or by adding additional information
about the spell.
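
The matching rule described for spells can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to build the dataset: `officialSpells` is a tiny made-up stand-in for the published spell list, `spellWords` and `matchSpell` are hypothetical helpers, returning `NA` for an unmatched entry is a choice made here, and the special cases for Bigby's Hand, Tasha's Hideous Laughter and Melf's Acid Arrow are left out. The Levenshtein distance comes from `utils::adist`.

```r
# Tiny stand-in for the official spell list (assumption: the real list is much longer)
officialSpells = data.frame(
    spell = c('Fire Bolt', 'Fireball', 'Magic Missile', 'Cure Wounds'),
    level = c(0L, 3L, 1L, 1L),
    stringsAsFactors = FALSE
)

# split a spell name into lower case words, dropping anything that is not a letter
spellWords = function(x){
    strsplit(tolower(trimws(x)), '[^a-z]+')[[1]]
}

matchSpell = function(entry, entryLevel, official = officialSpells, maxEdits = 10){
    # nothing to match for an empty entry
    if(!nzchar(trimws(entry))) return(NA_character_)
    # Levenshtein distance between the entry and every official spell
    dists = as.vector(utils::adist(tolower(trimws(entry)), tolower(official$spell)))
    best = which.min(dists) # top result
    entryWords = spellWords(entry)
    bestWords = spellWords(official$spell[best])
    # either side includes all words of its counterpart
    wordsContained = all(entryWords %in% bestWords) || all(bestWords %in% entryWords)
    if(official$level[best] == entryLevel && (dists[best] <= maxEdits || wordsContained)){
        official$spell[best]
    } else{
        NA_character_ # no acceptable match
    }
}

matchSpell('magic misile', 1)            # typo, within the edit threshold
matchSpell('fire bolt (120ft, 1d10)', 0) # extra information, matched through shared words
matchSpell('Fire Bolt', 3)               # written under the wrong level, not matched
```

Setting `maxEdits = 2` would give a rule closer to the one described for weapons below.
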
```{r echo = FALSE}
withWeapons = which(dnd_chars_unique$weapons !='')
withWeapons %>% lapply(function(i){
    rawWeapons = dnd_chars_unique$weapons[i] %>% stringr::str_split('\\|') %>% {.[[1]]}
    pWeapons = dnd_chars_unique$processedWeapons[i] %>% stringr::str_split('\\|') %>% {.[[1]]}
    seq_along(rawWeapons) %>% sapply(function(j){
        c(i,rawWeapons[j],pWeapons[j])
    }) %>% t
}) %>% do.call(rbind,.) -> weaponProcessedPairs

weaponCount = weaponProcessedPairs %>% nrow
standardWeaponCount = nrow(weaponProcessedPairs[weaponProcessedPairs[,2] == weaponProcessedPairs[,3],])
nonStandardWeaponCount = nrow(weaponProcessedPairs[weaponProcessedPairs[,2] != weaponProcessedPairs[,3] & weaponProcessedPairs[,3] !='',])
mismatchCount = weaponProcessedPairs[weaponProcessedPairs[,3] =='',] %>% nrow

nonStandardPercent = nonStandardWeaponCount/weaponCount * 100
mismatchPercent = mismatchCount/weaponCount * 100
standardPercent = standardWeaponCount/weaponCount * 100
```
* **processedWeapons:** Weapon names are also free text fields, so a processing method similar to the one used for spells is used for weapon names. Instead of a threshold of 10 substitutions/deletions/insertions, 2 was used, since weapon names typically did not include additional information like spell names did. Special cases were written for hand crossbow and heavy crossbow, as entries for them typically failed to match their official names (eg. "crossbow, hand"). Here, the entries that weren't matched were spell names or homebrew weapons.
`r round(standardPercent)`% of all weapons parsed did not require any
modification. `r round(nonStandardPercent)`% could only be matched
through the heuristics. A manual examination of a random selection of these
matches revealed 1 mistake out of 200. `r round(mismatchPercent)`% of the weapon
entries were not matched to an official weapon.
#### Possible issues with detection of unique characters
Identification of unique characters relies on some heuristics. I assume any characters
with the same name and class are potentially the same character. In these cases I
pick the highest level character. Race and other properties are not considered, so
some unique characters may be lost along the way. I have chosen to be less exact
to reduce the number of possible test characters, since there were examples of people
submitting essentially the same character with different races, presumably to test
things out. For multiclassed characters, if a lower level character with the same name and a subset of the classes exists, it is removed, again leaving the character with the highest
level.
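
The filtering described above can be sketched on a toy table. This is an illustration, not the code used to produce `dnd_chars_unique`: the submissions are made up, `keepUnique` is a hypothetical helper, and it only relies on the `name`, `justClass` and `level` columns described earlier. A character is dropped when a higher level character with the same name covers all of its classes.

```r
# Made-up submissions; `name` stands in for the hashed character name
toyChars = data.frame(
    name = c('a1', 'a1', 'a1', 'b2', 'b2'),
    justClass = c('Fighter', 'Fighter', 'Fighter|Rogue', 'Wizard', 'Cleric'),
    level = c(1L, 3L, 5L, 2L, 2L),
    stringsAsFactors = FALSE
)

keepUnique = function(chars){
    # classes of each character as a character vector
    classSets = strsplit(chars$justClass, '|', fixed = TRUE)
    keep = vapply(seq_len(nrow(chars)), function(i){
        # is there a higher level character with the same name whose
        # classes include every class of character i?
        superseded = vapply(seq_len(nrow(chars)), function(j){
            j != i &&
                chars$name[j] == chars$name[i] &&
                all(classSets[[i]] %in% classSets[[j]]) &&
                chars$level[j] > chars$level[i]
        }, logical(1))
        !any(superseded)
    }, logical(1))
    chars[keep, ]
}

keepUnique(toyChars)
```

In this toy table both Fighter submissions named `a1` are dropped in favour of the level 5 Fighter/Rogue, while the two `b2` characters are kept because their classes differ. Exact duplicates at the same level are not handled by this sketch.
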
#### Possible issues with selection bias
This data comes from characters submitted to my web applications. The
applications are written to support a popular third party character sheet app
for mobile platforms. I have advertised my applications primarily on Reddit
r/dndnext and r/dnd. I have seen them mentioned on a few other platforms by word of
mouth. That means we are looking at subsamples of subsamples here, all of which
can cause some amount of selection bias. Some characters could be thought
experiments or for testing purposes and never see actual game play.