diff --git a/README.md b/README.md
index e95a772..ee8a00b 100644
--- a/README.md
+++ b/README.md
@@ -48,9 +48,10 @@ To get started, clone this repository to your local machine and navigate to the
 ### 📦 Generate wheel package to share with others
 1. *activate venv*
 2. Update version inside setup.py if needed.
-3. ```python setup.py bdist_wheel```
-4. Fetch from dir dist/ the .whl
-5. This file can be installed via `pip install eureka_ml_insights.whl`
+3. Install the wheel package via ```pip install wheel```
+4. ```python setup.py bdist_wheel```
+5. Fetch the .whl file from the dist/ directory.
+6. This file can be installed via `pip install eureka_ml_insights.whl`
 
 ## 🚀 Quick start
 To reproduce the results of a pre-defined experiment pipeline, you can run the following command:
@@ -148,4 +149,4 @@ A cross-cutting dimension for all capability evaluations is the evaluation of se
 
 A general rising concern on responsible AI evaluations is that there is a quick turnaround between new benchmarks being released and then included in content safety filters or in post training datasets. Because of this, scores on benchmarks focused on responsible and safe deployment may appear to be unusually high for most capable models. While the quick reaction is a positive development, from an evaluation and understanding perspective, the high scores indicate that the benchmarks are not sensitive enough to capture differences in alignment and safety processes followed for different models. At the same time, it is also the case that fielding thresholds for responsible AI measurements can be inherently higher and as such these evaluations will require a different interpretation lens. For example, a 5 percent error rate in instruction following for content length should not be weighed in the same way as a 5 percent error rate in detecting toxic content, or even a 5 percent success rates in jailbreak attacks. Therefore, successful and timely evaluations to this end depend on collaborative efforts that integrate red teaming, quantified evaluations, and human studies in the context of real-world applications.
 
-Finally, Eureka and the set of associated benchmarks are only the initial snapshot of an effort that aims at reliably measuring progress in AI. Our team is excited about further collaborations with the open-source community and research, with the goal of sharing and extending current measurements for new capabilities and models. Our current roadmap involves enriching Eureka with more measurements around planning, reasoning, fairness, reliability and safety, and advanced multimodal capabilities for video and audio.
\ No newline at end of file
+Finally, Eureka and the set of associated benchmarks are only the initial snapshot of an effort that aims at reliably measuring progress in AI. Our team is excited about further collaborations with the open-source and research communities, with the goal of sharing and extending current measurements for new capabilities and models. Our current roadmap involves enriching Eureka with more measurements around planning, reasoning, fairness, reliability and safety, and advanced multimodal capabilities for video and audio.
diff --git a/requirements.txt b/requirements.txt
index 6b7863f..ad44f66 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,4 +3,3 @@ isort==5.9.3
 black==24.3.0
 autoflake==1.7.5
 mypy==1.10.1
-wheel
\ No newline at end of file
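For reference, the packaging steps updated in the README hunk above boil down to the shell sequence below. This is a minimal sketch, not part of the diff itself: the venv location (`.venv`) and the versioned wheel filename glob are assumptions, since the actual names depend on the local setup and the version set in setup.py.

```sh
# Sketch of the wheel-packaging workflow from the README hunk above.
# Assumptions: the venv lives at .venv, and the built wheel's exact
# filename (which embeds the version from setup.py) matches the glob.
source .venv/bin/activate                    # step 1: activate venv
pip install wheel                            # step 3: required by bdist_wheel
python setup.py bdist_wheel                  # step 4: build the wheel into dist/
pip install dist/eureka_ml_insights-*.whl    # steps 5-6: install the built wheel
```

Pinning `wheel` in requirements.txt was dropped in the second file of the diff, which is why the README now tells packagers to install it explicitly before building.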