diff --git a/README.md b/README.md
index 0773dbe..0b6cdfb 100644
--- a/README.md
+++ b/README.md
@@ -26,10 +26,18 @@ We have a [live demo](https://vidur.westus2.cloudapp.azure.com/) that captures t
 * __Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
 * All models support a maximum context length of 4k except `Llama3-8B` and `Llama3-70B` which support 16k context length by passing additional CLI params:
 
+For random forrest:
 ```text
---sklearn_execution_time_predictor_prediction_max_prefill_chunk_size 16384 \
---sklearn_execution_time_predictor_prediction_max_batch_size 512 \
---sklearn_execution_time_predictor_prediction_max_tokens_per_request 16384 \
+--random_forrest_execution_time_predictor_config_prediction_max_prefill_chunk_size 16384 \
+--random_forrest_execution_time_predictor_config_prediction_max_batch_size 512 \
+--random_forrest_execution_time_predictor_config_prediction_max_tokens_per_request 16384 \
+```
+
+For linear regression:
+```text
+--linear_regression_execution_time_predictor_config_prediction_max_prefill_chunk_size 16384 \
+--linear_regression_execution_time_predictor_config_prediction_max_batch_size 512 \
+--linear_regression_execution_time_predictor_config_prediction_max_tokens_per_request 16384 \
 ```
 
 * Pipeline parallelism is supported for all models. The PP dimension should divide the number of layers in the model.
@@ -97,26 +105,20 @@ or a big example with all the parameters,
 
 ```sh
 python -m vidur.main \
---replica_device a100 \
---replica_model_name meta-llama/Llama-2-7b-hf \
---cluster_num_replicas 1 \
---replica_num_tensor_parallel_workers 1 \
---replica_num_pipeline_stages 1 \
---request_generator_provider synthetic \
---synthetic_request_generator_length_provider trace \
---synthetic_request_generator_interval_provider static \
---request_generator_max_tokens 4096 \
---trace_request_length_generator_trace_file ./data/processed_traces/arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv \
---synthetic_request_generator_num_requests 128 \
---request_generator_provider synthetic \
---synthetic_request_generator_length_provider trace \
---synthetic_request_generator_interval_provider static \
---request_generator_max_tokens 4096 \
---trace_request_length_generator_trace_file ./data/processed_traces/arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv \
---synthetic_request_generator_num_requests 128 \
---replica_scheduler_provider vllm \
---replica_scheduler_batch_size_cap 256 \
---vllm_scheduler_max_tokens_in_batch 4096
+--replica_config_device a100 \
+--replica_config_model_name meta-llama/Llama-2-7b-hf \
+--cluster_config_num_replicas 1 \
+--replica_config_tensor_parallel_size 1 \
+--replica_config_num_pipeline_stages 1 \
+--request_generator_config_type synthetic \
+--length_generator_config_type trace \
+--interval_generator_config_type static \
+--[trace|zipf|uniform|fixed]_request_length_generator_config_max_tokens 4096 \
+--trace_request_length_generator_config_trace_file ./data/processed_traces/arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv \
+--synthetic_request_generator_config_num_requests 128 \
+--replica_scheduler_config_type vllm \
+--[vllm|lightllm|orca|faster_transformer|sarathi]_scheduler_config_batch_size_cap 256 \
+--[vllm|lightllm]_scheduler_config_max_tokens_in_batch 4096
 ```
 
 The simulator supports a plethora of parameters for the simulation description which can be found [here](docs/launch_parameters.md).
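The updated invocation above follows the new `<config>_<field>` flag naming. As a rough sketch of how these flags compose, a small driver script could assemble them and launch the simulator; only the flag names below come from the updated README, and the wrapper itself is hypothetical, not part of this change:

```python
# Hypothetical driver: builds new-style `--<config>_<field> <value>` flags from a
# dict and launches the simulator as a subprocess. Flag names are taken from the
# README snippet above; the helper function itself is illustrative.
import subprocess
import sys


def run_vidur(overrides: dict) -> None:
    cmd = [sys.executable, "-m", "vidur.main"]
    for name, value in overrides.items():
        cmd += [f"--{name}", str(value)]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_vidur(
        {
            "replica_config_device": "a100",
            "replica_config_model_name": "meta-llama/Llama-2-7b-hf",
            "cluster_config_num_replicas": 1,
            "replica_config_tensor_parallel_size": 1,
            "replica_config_num_pipeline_stages": 1,
            "replica_scheduler_config_type": "vllm",
        }
    )
```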
diff --git a/docs/launch_parameters.md b/docs/launch_parameters.md
deleted file mode 100644
index b42a8fc..0000000
--- a/docs/launch_parameters.md
+++ /dev/null
@@ -1,112 +0,0 @@
-# Understanding the parameters taken by the simulator
-
-The [default.yml](vidur/config/default.yml) is the comprehensive list of all parameters taken by the simulator. While invoking the simulator, any of these parameters can be overridden. Running only `python -m vidur.main` means that all the parameters are taken from the `default.yml` file with no overrides.
-The parameter descriptions are given below:
-
-1. `seed`: Random seed which is set in multiple random generators, notably the request length and inter-request time generators. This is useful for reproducibility.
-2. `log_level`: Logging level for the simulator. Not comprehensively supported currently.
-3. `output_dir`: The directory under which each invocation of the simulator creates its own directory. Eg: `./simulator_output/2023-11-20_11-31-40-523377`.
-All the output files corresponding to the invocation are stored under this directory eg. the chrome trace, cdf plots etc.
-4. `cache_dir`: The simulator has tiny models inside it to predict time taken by various model operations eg. `mlp_up_proj`. These model weights are cached in this directory.
-5. `write_json_trace`: Whether to write the requests sent to the simulator in a json file.
-6. `write_chrome_trace`: Whether to write the chrome trace. This is useful for debugging. Use `chrome://tracing` or `edge://tracing` to view the trace.
-7. `write_metrics`: This is a blanket flag to enable/disable writing of all metrics. Should be set to `true` as metrics are the only thing that the simulator gives, no LLM is actually running inside it.
-8. `cluster`:
-    1. `num_replicas`: Number of replicas in the cluster. Replicas are independent and identical.
-    Suppose you have a DGX box with 8 GPUs and you want to serve `meta-llama/Llama-2-70b-hf`.
-    One deployment strategy is to run 2 replicas each with 4 GPUs running the model in tensor parallel degree 4.
-    Another deployment strategy is to run 1 replica with all the 8 GPUs running the model in tensor parallel degree 8.
-9. `replica`: Configuration of each replica.
-    1. `block_size`: This is a concept from vLLM. Each request has a number of tokens each of whose KV value needs to be cached. The cache is divided into blocks of size `block_size`. The number of blocks each request needs is `num_blocks = ceil(num_tokens / block_size)`.
-    2. `memory_margin_fraction`: From vLLM. Fraction of memory that is left unused, typically for `nccl`, `cuBLAS` libraries. This is not a strict constraint. Actual deployments do go over this limit.
-    3. `num_pipeline_stages`: Pipeline parallel degree. This number must divide the number of layers in the model.
-    4. `num_tensor_parallel_workers`: Tensor parallel degree.
-    5. Model specs: Refer huggingface `config.json` for the model eg.
-        1. `model_name`: Typically huggingface id of the model. Eg: `meta-llama/Llama-2-70b-hf`. Custom model architectures can be used but please take the results with a large grain of salt.
-        2. `num_layers`
-        3. `num_q_heads`
-        4. `num_kv_heads`
-        5. `embedding_dim`
-        6. `mlp_hidden_dim`
-        7. `use_gated_mlp`
-        8. `vocab_size`
-    6. GPU specs:
-        1. `fp16_tflops`: TFLOPS of the GPU in FP16. This is used to predict the execution time of the model.
-        2. `total_memory_gb`: Total memory of the GPU in GB. This is used in memory calculations of the model weights, KV cache etc.
-        3. For `a100`: `fp16_tflops: 312`, `total_memory_gb: 80`
-10. `request_generator`: The simulator contains a comprehensive request generator. See [here](vidur/request_generator)
-    1. `provider`: The request generator to use. Currently supported are `synthetic`, `trace`. `synthetic` generates requests from a synthetic distribution. `trace` generates requests from a real-world trace.
-    2. `max_tokens`: Maximum number of tokens in a request. Requests generated from the trace are capped / clipped at this number. `P:D ratio` is preserved in case of clipping.
-11. `synthetic_request_generator`: This section is used to further define the synthetic request generator. Only required if `request_generator_provider` is set to `synthetic`.
-    1. `length_provider`: The distribution of the request length. Currently supported are `uniform`, `trace` and `zipf`.
-    2. `interval_provider`: The distribution of the inter-request time. Currently supported are `static`, `trace`, `poisson` and `gamma`.
-    3. `min_tokens`: Minimum number of tokens in a request when `uniform`, `zipf` is used as the `length_provider`. TODO: Verify for `trace` as well.
-    4. `prefill_to_decode_ratio`: Ratio of prefill tokens to decode tokens in a request. This is used in the `uniform` length provider. TODO: Verify for `zipf` as well.
-    5. `num_requests`: Number of requests to generate / select from the trace.
-12. `trace_request_generator`: This section is used to further define the trace request generator. Only required if `request_generator_provider` is set to `trace`.
-    1. `trace_file`: Path to the trace file.
-    2. `date`: Date of the trace to use.
-    3. `prefill_scale_factor`: Scale factor to apply to the prefill tokens in the trace. Recommend to leave this value at 1.
-    4. `decode_scale_factor`: Scale factor to apply to the decode tokens in the trace. Recommend to leave this value at 1.
-    5. `time_scale_factor`: Scale factor to apply to the window time in the trace. This can be used to speed up / slow down the trace. Example, to compress a 24h trace to 1h. Scale factors drastically change the workload. Scaled traces typically cannot be directly compared to the original trace.
-13. `trace_request_length_generator`: Only required if `request_generator_provider` is set to `synthetic` and `synthetic_request_length_provider` is set to `trace`.
-    1. `trace_file`: Path to the trace file. This trace file is a csv like [cnn_dailymail_stats_llama2_tokenizer.csv](data/processed_traces/cnn_dailymail_stats_llama2_tokenizer.csv)
-    2. `prefill_scale_factor`: See `trace_request_generator` section above.
-    3. `decode_scale_factor`: See `trace_request_generator` section above.
-14. `trace_request_interval_generator`: Only required if `request_generator_provider` is set to `synthetic` and `synthetic_request_interval_provider` is set to `trace`.
-    1. `trace_file`: Path to the trace file.
-    2. `start_time`: Start time of the trace to use.
-    3. `end_time`: End time of the trace to use.
-    4. `time_scale_factor`: See `trace_request_generator` section above.
-15. `poisson_request_interval_generator`: Only required if `request_generator_provider` is set to `synthetic` and `synthetic_request_interval_provider` is set to `poisson`.
-    1. `qps`: Requests per second to hit the system with.
-16. `gamma_request_interval_generator`: Only required if `request_generator_provider` is set to `synthetic` and `synthetic_request_interval_provider` is set to `gamma`.
-    1. `cv`: Coefficient of variation of the gamma distribution.
-    2. `qps`: Requests per second to hit the system with.
-17. `zipf_request_length_generator`: Only required if `request_generator_provider` is set to `synthetic` and `synthetic_request_length_provider` is set to `zipf`.
-    1. `theta`: Shape parameter of the zipf distribution.
-    2. `scramble`: Whether to scramble the zipf distribution. This is useful to avoid the zipf distribution being skewed towards the start of the vocabulary.
-18. `execution_time_predictor`: Type of the tiny models inside the simulator to predict the execution time of the model.
-    1. `provider`: `sklearn`, `random_forrest` or `linear_regression`.
-19. `sklearn_execution_time_predictor`:
-    1. `compute_input_file`: `./data/profiling/a100/mlp.csv`
-    2. `attention_input_file`: `./data/profiling/a100/mixed_attention.csv`
-    3. `all_reduce_input_file`: `./data/profiling/a100/all_reduce.csv`
-    4. `send_recv_input_file`: `./data/profiling/a100/p2p_intra_node.csv`
-    5. `cpu_overhead_input_file`: `./data/profiling/a100/cpu_overheads.csv`
-    6. `k_fold_cv_splits`: `10`
-    7. `no_cache`: `false`
-    8. `kv_cache_prediction_granularity`: `8`
-    9. `prediction_max_prefill_chunk_size`: `4096`
-    10. `prediction_max_batch_size`: `100`
-    11. `prediction_max_tokens_per_request`: `4096`
-    12. `attention_decode_overhead_percentage`: `0.0`
-    13. `nccl_cpu_launch_overhead_ms`: `0.020`
-20. `random_forrest_execution_time_predictor`: TODO. Recommend to use the `sklearn_execution_time_predictor` instead.
-21. `linear_regression_execution_time_predictor`: TODO. Recommend to use the `sklearn_execution_time_predictor` instead.
-22. `simulator`:
-    1. `time_limit`: Time limit for the simulator to run. This is useful to run the simulator for a fixed amount of time. The simulator will stop after this time limit is reached. Default is no limit. TODO: Verify the functionality of this parameter.
-23. `global_scheduler`: This is the scheduler which determines which replica to send the request to.
-    1. `provider`: `round_robin`, `random`, `lor`. See [here](vidur/schedulers/global_schedulers) for more details.
-24. `replica_scheduler`: This is the scheduler which determines how to schedule the requests on a replica.
-    1. `provider`: `orca`, `sarathi`, and `vllm`. See [here](vidur/schedulers/replica_schedulers) for more details.
-    2. `batch_size_cap`: Maximum permissible batch size. Set carefully for `orca`. Have a high limit for other schedulers. They will auto-adjust.
-    3. `num_blocks`: TODO. Ignore this parameter for now.
-25. `orca_scheduler`: Only required if `replica_scheduler_provider` is set to `orca`.
-    1. `use_single_prefill_per_batch`: Whether to use a single prefill per batch. This is a non-standard param that, if true, makes the `orca` scheduler quite competitive.
-26. `sarathi_scheduler`: Only required if `replica_scheduler_provider` is set to `sarathi`.
-    1. `chunk_size`: The maximum number of tokens (prefill / decode) to process in a batch. Prefills are done progressively if the number of prefill tokens in a request is greater than this number.
-    2. `enable_rolling_prefills`: Multiple prefills are done in a batch provided the sum of the prefills is less than `chunk_size`.
-    3. `prefill_fitting_tolerance`: Ignore this parameter. Leave it at 0.0.
-27. `vllm_scheduler`: Only required if `replica_scheduler_provider` is set to `vllm`.
-    1. `watermark_blocks_fraction`: If this param is 0.01, then we consider the cache full when 99% of the blocks are full. Prevents unnecessary swaps.
-    2. `max_tokens_in_batch`: Maximum number of tokens in a batch. This is an additional limit on top of `batch_size_cap`.
-    3. `max_batch_size_amplification_factor`: Ignore this parameter, leave it at `1`.
-28. `metrics_store`: Configuration of the metrics store. The metrics store is a central store which stores the metrics of the simulator. At simulation end, it dumps the metrics to various files, typically `csv`, `png` and `json`. The metrics store is also responsible for uploading the metrics to `wandb`.
-    1. `wandb_project`: Wandb project to upload to eg. `llm-simulator`
-    2. `wandb_group`
-    3. `wandb_run_name`: Leave empty string to auto-generate the run name. Recommend to have a run name to identify the run.
-    4. `subsamples`: Number of subsamples to take from the metrics. This is useful to limit the number of datapoints of a metric.
-    5. `save_table_to_wandb`: Whether to upload the csvs corresponding to the plots uploaded to wandb. Set it to true.
-    6. `min_batch_idx`: Ignore this parameter.
-    7. `max_batch_idx`: Ignore this parameter.
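Two of the quantities described in the removed document are easy to make concrete: the vLLM-style block accounting `num_blocks = ceil(num_tokens / block_size)` and the gamma inter-arrival generator parameterized by `qps` and `cv`. The sketch below is purely illustrative (it is not the simulator's implementation) and assumes the gamma distribution is parameterized to have mean `1/qps` and coefficient of variation `cv`:

```python
# Illustrative sketch only, not vidur's implementation.
import math

import numpy as np


def num_kv_blocks(num_tokens: int, block_size: int) -> int:
    # num_blocks = ceil(num_tokens / block_size), as described for `block_size` above.
    return math.ceil(num_tokens / block_size)


def gamma_intervals(qps: float, cv: float, num_requests: int, seed: int = 0) -> np.ndarray:
    # A gamma distribution with mean 1/qps and coefficient of variation cv
    # has shape k = 1 / cv**2 and scale = cv**2 / qps.
    rng = np.random.default_rng(seed)
    shape = 1.0 / cv**2
    scale = cv**2 / qps
    return rng.gamma(shape, scale, size=num_requests)


print(num_kv_blocks(num_tokens=4096, block_size=16))  # 256 blocks
print(gamma_intervals(qps=2.0, cv=0.5, num_requests=5).sum())  # ~2.5 s of arrivals on average
```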
diff --git a/vidur/config/flat_dataclass.py b/vidur/config/flat_dataclass.py
index fe23dac..cee68c5 100644
--- a/vidur/config/flat_dataclass.py
+++ b/vidur/config/flat_dataclass.py
@@ -90,7 +90,7 @@ def create_from_cli_args(cls) -> Any:
         nargs = None
         action = None
         field_type = field.type
-        help_text = field.metadata.get("help", None)
+        help_text = cls.metadata_mapping[field.name].get("help", None)
 
         if is_list(field.type):
             assert is_composed_of_primitives(field.type)
@@ -113,7 +113,10 @@
 
         # handle cases with default and default factory args
        if field.default is not MISSING:
-            arg_params["default"] = field.default
+            value = field.default
+            if callable(value):
+                value = value()
+            arg_params["default"] = value
         elif field.default_factory is not MISSING:
             arg_params["default"] = field.default_factory()
         else:
@@ -121,7 +124,6 @@
 
         if nargs:
             arg_params["nargs"] = nargs
-
         parser.add_argument(f"--{field.name}", **arg_params)
 
     args = parser.parse_args()
@@ -139,6 +141,7 @@ def create_flat_dataclass(input_dataclass: Any) -> Any:
     processed_classes = set()
     dataclass_args = defaultdict(list)
     dataclass_dependencies = defaultdict(set)
+    metadata_mapping = {}
 
     def process_dataclass(_input_dataclass, prefix=""):
         if _input_dataclass in processed_classes:
@@ -165,6 +168,7 @@ def process_dataclass(_input_dataclass, prefix=""):
                 meta_fields_with_defaults.append(
                     (type_field_name, type(default_value), default_value)
                 )
+                metadata_mapping[type_field_name] = field.metadata
 
                 assert hasattr(field_type, "__dataclass_fields__")
                 for subclass in get_all_subclasses(field_type):
@@ -202,6 +206,7 @@ def process_dataclass(_input_dataclass, prefix=""):
             dataclass_args[_input_dataclass].append(
                 (prefixed_name, field.name, field_type)
            )
+            metadata_mapping[prefixed_name] = field.metadata
 
     process_dataclass(input_dataclass)
 
@@ -211,6 +216,7 @@ def process_dataclass(_input_dataclass, prefix=""):
     # Metadata fields
     FlatClass.dataclass_args = dataclass_args
     FlatClass.dataclass_dependencies = dataclass_dependencies
+    FlatClass.metadata_mapping = metadata_mapping
 
     # Helper methods
     FlatClass.reconstruct_original_dataclass = reconstruct_original_dataclass
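The `flat_dataclass.py` changes route argparse help text through a new `metadata_mapping`, populated while the nested config dataclasses are flattened, and resolve callable defaults before handing them to argparse. Below is a condensed, self-contained illustration of that pattern; `ReplicaConfig` and `build_parser` are illustrative stand-ins under stated assumptions, not the actual vidur classes:

```python
# Minimal sketch of the pattern in this diff: keep each flattened field's
# dataclass `metadata` in a mapping keyed by the prefixed field name so the
# argparse builder can look up help text, and call callable defaults before
# passing them to argparse. Not the actual vidur/config/flat_dataclass.py code.
import argparse
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class ReplicaConfig:  # hypothetical stand-in config
    device: str = field(default="a100", metadata={"help": "Device SKU to simulate."})
    num_pipeline_stages: int = field(default=1, metadata={"help": "Pipeline parallel degree."})


def build_parser(config_cls, prefix: str, metadata_mapping: dict) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for f in fields(config_cls):
        prefixed_name = f"{prefix}{f.name}"
        # analogous to `metadata_mapping[prefixed_name] = field.metadata` in the diff
        metadata_mapping[prefixed_name] = f.metadata

        default = f.default if f.default is not MISSING else f.default_factory()
        if callable(default):  # mirror the diff: resolve callable defaults to concrete values
            default = default()

        parser.add_argument(
            f"--{prefixed_name}",
            type=f.type,
            default=default,
            help=metadata_mapping[prefixed_name].get("help", None),
        )
    return parser


metadata_mapping: dict = {}
parser = build_parser(ReplicaConfig, "replica_config_", metadata_mapping)
print(parser.parse_args(["--replica_config_device", "h100"]))
```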