INNDiE: An Integrated Neural Network Development Environment
A Computer Science and Robotic Engineering Major Qualifying Project submitted to the Faculty of Worcester Polytechnic Institute in partial fulfillment of the requirements for the degree of Bachelor of Science.
The core project is responsible for understanding the operations the user wants to complete and executing them in the correct order. Typically, the output of the user's commands takes the form of code generation.
This project has a DSL which models the generated code. Project `dsl-interface` provides the interface for the components of the DSL. Project `dsl` provides the implementation. Other offshoot projects, such as `tasks-yolov3`, add model-specific implementation details that can be used with the DSL.
The flow of information through the DSL to the generated code is structured as follows:

- Information is input using the DSL via the `ScriptGenerator`. This forms a "program" configuration which completely describes the code that will be generated.
- This configuration is checked for correctness by the `ScriptGenerator` inside `ScriptGenerator.code`.
- The configuration is parsed into a `Code` dependency graph using `CodeGraph`, which performs further correctness checks to verify the dependency graph is a single DAG.
- That graph is then traversed and parsed into a program by appending `Code.code` segments in traversal order, predecessors first.
If at any point the `ScriptGenerator` or any of its dependencies determines that the configuration is invalid, an error type is returned.
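The traversal step above can be sketched in Python (INNDiE's actual implementation is Kotlin via `CodeGraph`; the function and data-structure names here are illustrative, not INNDiE's API):

```python
# Hypothetical sketch of flattening a Code dependency graph into a program:
# visit each node's predecessors first, then append the node's own code
# segment. A cycle means the configuration is not a DAG, so it is an error.

def generate_program(code_segments, dependencies):
    """code_segments: {name: source string};
    dependencies: {name: list of predecessor names}."""
    emitted = []
    visited = set()
    in_progress = set()

    def visit(name):
        if name in visited:
            return
        if name in in_progress:
            # The graph must be a singular DAG; reject cyclic configurations.
            raise ValueError(f"cycle detected at {name!r}")
        in_progress.add(name)
        for dep in dependencies.get(name, []):
            visit(dep)  # predecessors first
        in_progress.discard(name)
        visited.add(name)
        emitted.append(code_segments[name])

    for name in code_segments:
        visit(name)
    return "\n".join(emitted)

program = generate_program(
    {"load": "data = load()", "train": "model.fit(data)"},
    {"train": ["load"]},
)
# "load"'s segment precedes "train"'s because "train" depends on it
```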
To add a layer:

- Add it in `SealedLayer` as a sealed subclass. Any parameter validation should be tested.
- Add test cases that use it in `DefaultLayerToCodeTest` and then add it as a case in `DefaultLayerToCode`.
- Use it in `LoadLayersFromHDF5::parseLayer` and amend any tests that now fail because they previously did not understand the new layer type. If no tests fail, add a test that loads a model containing the new layer type.
To add an initializer:

- Add it in `Initializer` as a sealed subclass.
- Add a case for it in `DefaultInitializerToCode`. Add test cases in `DefaultInitializerToCodeTest`.
- Add a case for it in `LoadLayersFromHDF5::initializer`. Generate new models that use each variant of the new `Initializer` and add a test in `LoadLayersWithInitializersIntegrationTest` that loads each one.
INNDiE uses a simple plugin system to generalize over many different datasets and models.
You can use dataset plugins to control how datasets are processed before training. In the training script, after a dataset is loaded, it is given to a dataset plugin for processing, before being given to the model during training. This is when any model-specific processing or formatting should be done. Dataset plugins must implement this function:
```python
def process_dataset(x, y):
    # Do some data processing in here and return the results.
    return (processed_x, processed_y)
```
In that function, `x` is the input data and `y` is the target data.
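As a minimal sketch of such a plugin (the normalization performed here is illustrative, not something INNDiE prescribes), a dataset plugin that scales image pixel values into the range [0, 1] might look like:

```python
import numpy as np

# Illustrative dataset plugin: convert the input data to float32 and scale
# pixel values from [0, 255] to [0, 1]; the target data passes through.
def process_dataset(x, y):
    processed_x = np.asarray(x, dtype="float32") / 255.0
    processed_y = np.asarray(y)
    return (processed_x, processed_y)
```

Because the plugin runs after the dataset is loaded but before training, this is the place for any model-specific reshaping or encoding as well.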
At the start of a test run, the test data must be loaded by a plugin. You can control how the test data is loaded and processed before being given to the model during inference using a test data loading plugin. These plugins must implement this function:
```python
def load_test_data(path):
    # Load the dataset from `path`, process it, and return it
    # along with `steps`.
    return (data, steps)
```
In that function, `path` is the file path to the test data that the user requested be loaded. In the return statement, `data` is the final dataset that will be given directly to the model and `steps` is the number of steps to give to TensorFlow when running inference; consult the TensorFlow documentation for how to use `steps`.
After the inference step of a test run has completed, the input and output to/from the model are given to a test output processing plugin. This plugin is responsible for interpreting the output of the model and writing any test results to a folder in the current directory called `output`. Any files put into this directory will be presented to the user in INNDiE's test view UI. These plugins must implement this function:
```python
def process_model_output(model_input, model_output):
    from pathlib import Path
    # Write test result files into the `output` directory.
    Path("output/file1.txt").touch()
    Path("output/file2.txt").touch()
```
In that function, `model_input` is the data given to the model for inference and `model_output` is the output directly from the model.
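As an illustrative example (the classification task and the output file name are assumptions, not part of INNDiE's plugin contract), a plugin for a classifier might write each sample's predicted class index into the `output` directory:

```python
from pathlib import Path

import numpy as np

# Illustrative test output processing plugin: interpret the model output as
# class scores and write the highest-scoring class index per sample to a
# file in `output`, where INNDiE's test view will present it to the user.
def process_model_output(model_input, model_output):
    out_dir = Path("output")
    out_dir.mkdir(exist_ok=True)
    predictions = np.argmax(np.asarray(model_output), axis=-1)
    (out_dir / "predictions.txt").write_text(
        "\n".join(str(p) for p in predictions)
    )
```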
Inside INNDiE's autogenerated S3 bucket (named with the prefix `inndie-autogenerated-` followed by some random alphanumeric characters for uniqueness), INNDiE manages these directories:
- `inndie-untrained-models`
  - Contains "untrained" models that the user can use to create a new Job with
  - These models cannot be used for testing because they are assumed to not contain any weights (or at least not any meaningful weights)
- `inndie-training-results`
  - Contains all results from running a training script
- `inndie-test-data`
  - Contains test data files that can be used with the test view
- `inndie-datasets`
  - Contains the user's custom datasets
- `inndie-training-scripts`
  - Contains generated training scripts
- `inndie-training-progress`
  - Contains training progress files that INNDiE polls to get training progress updates
- `inndie-plugins`
  - Unofficial plugins are stored here
INNDiE also autogenerates these AWS resources:

- Security Group for ECS named `inndie-autogenerated-ecs-sg`
- Security Group for EC2 named `inndie-autogenerated-ec2-sg`
- Security Group for RDS named `inndie-autogenerated-rds-sg`
- Task role for ECS named `inndie-autogenerated-ecs-task-role`
- IAM role for EC2 named `inndie-autogenerated-ec2-role`
- Instance profile for EC2 named `inndie-autogenerated-ec2-instance-profile`
- ECS Cluster named `inndie-autogenerated-cluster`
- ECS Task Definition named `inndie-autogenerated-task-family`
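Because the bucket name ends in a random suffix, code that works with these directories has to search for the bucket by its prefix rather than hard-coding the name. A sketch of that lookup (the function names are hypothetical; the client is passed in boto3-style, e.g. from `boto3.client("s3")`, so the logic can be exercised without AWS access):

```python
BUCKET_PREFIX = "inndie-autogenerated-"

# Hypothetical helper: find INNDiE's autogenerated bucket by its name prefix.
def find_inndie_bucket(s3):
    for bucket in s3.list_buckets()["Buckets"]:
        if bucket["Name"].startswith(BUCKET_PREFIX):
            return bucket["Name"]
    return None

# Hypothetical helper: list the generated training scripts stored under the
# `inndie-training-scripts` directory (an S3 key prefix) in that bucket.
def list_training_scripts(s3, bucket_name):
    response = s3.list_objects_v2(
        Bucket=bucket_name, Prefix="inndie-training-scripts/"
    )
    return [obj["Key"] for obj in response.get("Contents", [])]
```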