Authors: N'yoma Diamond, Grant Perkins
Code for the experiments and results analysis conducted in the paper of the same name. A preprint is available at https://doi.org/10.48550/arXiv.2212.08734.
This code depends on the following Python packages:

- NumPy (ver. 1.23.3)
- pandas (ver. 1.5.0)
- scikit-learn (ver. 1.1.2)
- statsmodels (ver. 0.13.5)
- more-itertools (ver. 8.14.0)
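These pinned versions can be installed with pip, for example:

```
pip install numpy==1.23.3 pandas==1.5.0 scikit-learn==1.1.2 statsmodels==0.13.5 more-itertools==8.14.0
```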
By default, `main.py` expects input data in the form of CSVs using the following directory structure: `./data/{asset type}/{asset name}.csv`
Expected asset type/name combinations are as follows:

- stock: `SPY.US`
- forex: `USDGBP.FOREX`, `USDEUR.FOREX`, `USDCAD.FOREX`, `USDJPY.FOREX`, `EURGBP.FOREX`
- bond: `US10Y.GBOND`, `US5Y.GBOND`, `UK5Y.GBOND`, `JP5Y.GBOND`
- future: `US.COMM`, `ES.COMM`, `NK.COMM`, `HSI.COMM`, `FESX.COMM`, `VIX.COMM`, `GC.COMM`, `NG.COMM`, `ZC.COMM`, `ZS.COMM`
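As a quick sanity check of this layout, a short script along these lines can verify that every expected CSV is present (this helper is illustrative and not part of the repository):

```python
from pathlib import Path

# Expected asset type/name combinations, mirroring the list above.
ASSETS = {
    "stock": ["SPY.US"],
    "forex": ["USDGBP.FOREX", "USDEUR.FOREX", "USDCAD.FOREX", "USDJPY.FOREX", "EURGBP.FOREX"],
    "bond": ["US10Y.GBOND", "US5Y.GBOND", "UK5Y.GBOND", "JP5Y.GBOND"],
    "future": ["US.COMM", "ES.COMM", "NK.COMM", "HSI.COMM", "FESX.COMM",
               "VIX.COMM", "GC.COMM", "NG.COMM", "ZC.COMM", "ZS.COMM"],
}

# Report any expected file missing from ./data/{asset type}/{asset name}.csv.
for asset_type, names in ASSETS.items():
    for name in names:
        path = Path("data") / asset_type / f"{name}.csv"
        if not path.exists():
            print(f"Missing expected data file: {path}")
```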
By default, `stats.ipynb` expects input data in the form of CSVs with the following directory/name structure: `./out/**/*results.csv`

Note that `**/*` enables the selection of any and all files ending in `results.csv` from arbitrary subdirectories of `./out`. If desired, this can be changed by editing the `data_dir` and `concat_results` variables.
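As a minimal sketch of how that glob pattern resolves (using Python's standard `glob` module; the variable names are illustrative, and the notebook's own loading logic may differ):

```python
import glob
import os

import pandas as pd

# recursive=True lets "**" match arbitrarily nested subdirectories of ./out,
# so this collects every file whose name ends in "results.csv".
result_files = glob.glob(os.path.join("./out", "**", "*results.csv"), recursive=True)

# One way the results could then be combined into a single table for analysis.
combined = pd.concat((pd.read_csv(f) for f in result_files), ignore_index=True)
```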
The files in this project are organized as follows:

- `main.py` contains code for running experiments, including model hyperparameter setting and initialization.
- `stats.ipynb` contains code for evaluating experiment results.
- `trainer.py` contains code for the model training and hyperparameter search pipelines.
- `dataset.py` contains code for initializing datasets.
- `utils.py` contains general-purpose utility code.
- `constants.py` contains useful constants, settings, and typing information.
The code can be run from the command line using the following command:

`python main.py <optional arguments>`

This will create an output directory (if one does not already exist) and save the experimental results into it. If no arguments are provided, this will run a single replication in which all models are trained serially, with results saved to a file named `results.csv`.

If desired, optional command-line arguments may be provided to `main.py`:
- `-p <value (optional)>`/`--processes=<value (optional)>`: Enable the use of multiple parallel processes when training models. Providing a number specifies the maximum number of processes to run concurrently per replication, e.g., `main.py -p 5` sets a maximum of 5 concurrent processes per replication. Unlimited processes are used if a numeric value is not provided (i.e., `main.py -p`).
  - NOTE: The `n_jobs` parameter for supported scikit-learn operations will be set to 1 due to incompatibility with multiprocessing. Otherwise, `n_jobs` is set to -1 (unlimited jobs) unless `-r`/`--replications` is used.
  - WARNING: Allowing unlimited processes may use large amounts of resources and cause systems to freeze or crash.
- `-r <value>`/`--replications=<value>`: Run the experiment using the provided number of replications.
  - NOTE: Each replication is given a dedicated process. As a result, the `n_jobs` parameter for supported scikit-learn operations will be set to 1 due to incompatibility with multiprocessing. Combining this argument with `-p`/`--processes` is therefore highly recommended.
  - NOTE: The results of each replication are saved to a dedicated file named `{replication number}_results.csv`.
- `-m <model name>`/`--model=<model name>`: Run the experiment using only the specified model. Acceptable inputs are as follows:
  - DecisionTree
  - RandomForest
  - LogisticRegression
  - LinearSVM
  - KNN
  - RandomBaseline
  - ConstantBaseline
  - PreviousBaseline
  - ConsensusBaseline

  NOTE: This changes the name of the `results.csv` output file to contain the specified model name (i.e., `{model name}_results.csv`).
- `-o <output path>`/`--out_dir=<output path>`: Save experiment results to the specified output directory.
  - NOTE: The provided directory and any parent directories will be created if not already present.
- `-u`/`--use-uuid`: Append a unique identifier (UUID) to the end of the output directory name. This argument can be used in conjunction with `-o`/`--out_dir` and is useful for preventing accidental overwrites of results files when running multiple experiments without `-r`/`--replications`.
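For example, a few representative invocations combining these options (the replication/process counts and output directory name here are arbitrary choices):

```
# 10 replications with at most 4 concurrent processes each,
# saved under ./out/my-experiment with a UUID appended
python main.py -r 10 -p 4 -o ./out/my-experiment -u

# Single serial replication using only the RandomForest model
# (results saved to RandomForest_results.csv)
python main.py -m RandomForest
```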
In order to analyze results, run the code in `stats.ipynb`. This can be done using any Jupyter Notebook environment (e.g., the Jupyter Notebook IDE, JupyterLab, PyCharm, Google Colab). This will perform the same analysis described in the paper.

NOTE: By default, this code assumes that the files to analyze end in `results.csv`, and it will not work otherwise.
`stats.ipynb` has a number of parameters which can be changed if desired:

- `data_dir`: The directory from which to load results data. `./out` by default.
- `concat_results`: Whether to combine any and all available results files in `data_dir`. `True` by default. If `False`, `<data_dir>/results.csv` is used.
  - NOTE: When `True`, the files to be joined must end in `results.csv`. Files in subdirectories of `data_dir` will also be used.
- `alpha`: The $\alpha$ threshold to use when computing reduced ANOVA models (1 - confidence level). `0.05` by default.
- `only_reduced`: Only display reduced ANOVA models. `False` by default.
- `latex_output`: Change table output to print $\LaTeX$-formatted tables instead of DataFrames. `False` by default.
- `combine_anova_latex`: Combine full and reduced ANOVA model printouts into one table.
  - NOTE: Requires `latex_output` to be `True`; does nothing otherwise. Overrides `only_reduced` if `True`.
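For instance, a configuration cell using these parameters might look like the following; the first five values are the documented defaults, while `combine_anova_latex = False` is an assumption, since its default is not stated above:

```python
data_dir = "./out"           # directory from which to load results data
concat_results = True        # combine all *results.csv files found under data_dir
alpha = 0.05                 # threshold for reduced ANOVA models (1 - confidence level)
only_reduced = False         # display only reduced ANOVA models
latex_output = False         # print LaTeX-formatted tables instead of DataFrames
combine_anova_latex = False  # combine full/reduced ANOVA printouts (assumed value)
```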