# Data Science

Python distributed programming mainly serves fields with heavy high-performance computing demands, such as data science and artificial intelligence. This chapter introduces the relevant data science background; readers who already have a solid grounding in these topics may skip this chapter.
(parallel-program-design)=
# Parallel Program Design Methods

## PCAM

How can we design software and algorithms so that programs run in parallel on multiple cores or across a cluster? As early as 1995, Ian Foster proposed the PCAM method in his book {cite}`foster1995designing`, and its ideas can guide the design of parallel algorithms. PCAM consists of four steps: Partitioning, Communication, Agglomeration, and Mapping; {numref}`pcam-img` illustrates them.
```{figure} ../img/ch-intro/pcam.png
---
width: 400px
name: pcam-img
---
The PCAM method
```
* Partitioning: decompose the whole problem into multiple subproblems or subtasks, partitioning both the computation and the data.
* Communication: define how the subtasks communicate, including the data structures and algorithms used for communication.
* Agglomeration: taking the available hardware and the programming effort into account, combine the fine-grained tasks from the previous steps into coarser, more efficient tasks.
* Mapping: distribute the agglomerated tasks across multiple processors.

For example, suppose we have a huge $M \times M$ matrix, too large to fit on a single compute node, and we want to find its maximum value. A parallel algorithm could proceed as follows:

* Partition the matrix into $m \times m$ submatrices and run `max()` on each compute node to find the maximum of its submatrix.
* Gather the per-submatrix maxima onto a single compute node and run `max()` once more there to obtain the maximum of the whole matrix.
* Each $m \times m$ submatrix fits on a single compute node, and the final reduction over the gathered maxima also runs on a single node.
* Map these computations onto multiple nodes.
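The steps above can be sketched in Python. This is a minimal single-machine sketch, assuming the matrix is a plain list of rows; a thread pool stands in for the cluster's compute nodes, and the function names are illustrative rather than taken from any particular library.

```python
from concurrent.futures import ThreadPoolExecutor

def submatrix_max(block):
    # Local reduction: the maximum of one m x m submatrix.
    return max(max(row) for row in block)

def split_blocks(matrix, m):
    # Partitioning: cut an M x M matrix (a list of rows) into m x m blocks.
    M = len(matrix)
    return [
        [row[j:j + m] for row in matrix[i:i + m]]
        for i in range(0, M, m)
        for j in range(0, M, m)
    ]

def parallel_max(matrix, m):
    # Mapping: each block is handed to a worker; the gathered partial
    # maxima are then reduced with one more max() call.
    with ThreadPoolExecutor() as pool:
        partial_maxima = list(pool.map(submatrix_max, split_blocks(matrix, m)))
    return max(partial_maxima)
```

On a real cluster, each block would live on a separate machine and only the scalar maxima would travel back, keeping communication costs small.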

## Partitioning Strategies

The hardest and most critical part of designing a parallel program is deciding how to partition it. Common partitioning strategies include:

* Task parallelism: a complex program usually contains multiple tasks that can be assigned to different Workers; if the tasks have few dependencies on one another, they can execute concurrently very effectively.
* Geometric decomposition: the data is structured, e.g., a matrix can be split along one or more dimensions and the pieces assigned to different Workers; the matrix-maximum example above is one instance.
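As a rough illustration of task parallelism, the sketch below runs two hypothetical independent tasks (`load_data` and `build_lookup` are made-up names) on different Workers; because neither depends on the other, they can execute concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def load_data():
    # One independent task, e.g., reading the input.
    return list(range(10))

def build_lookup():
    # Another task with no dependency on load_data.
    return {i: i * i for i in range(10)}

# Hand each independent task to a different Worker in the pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    data_future = pool.submit(load_data)
    lookup_future = pool.submit(build_lookup)
    data = data_future.result()
    lookup = lookup_future.result()
```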

## Case Study: MapReduce

In 2004, Google proposed MapReduce {cite}`dean2004MapReduce`, a classic paradigm for parallel big data processing. {numref}`map-reduce` shows how MapReduce performs word frequency counting.
```{figure} ../img/ch-intro/map-reduce.png | ||
--- | ||
width: 600px | ||
name: map-reduce | ||
--- | ||
Word frequency counting with MapReduce
``` | ||

MapReduce involves four main stages:
* Split: divide the large dataset into many small chunks, each small enough to be processed on a single Worker.
* Map: apply a Map operation to each chunk. Map is a function mapping that the programmer defines; it outputs key-value pairs. In the word-count example, each occurrence of a word counts once: the key is the word and the value is 1, meaning one occurrence.
* Shuffle: route records with the same key to the same Worker. This stage involves data exchange. In the word-count example, identical words are sent to the same Worker.
* Reduce: aggregate all records sharing the same key with a Reduce function the programmer defines. In the word-count example, the Shuffle stage has already grouped the data, so Reduce only needs to sum the counts for each word.
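The four stages can be mimicked in a few lines of pure Python. This is a single-process sketch of word counting, not a real distributed implementation; the stage function names are illustrative.

```python
from collections import defaultdict
from itertools import chain

def split_stage(document, n_chunks=2):
    # Split: cut the document into roughly equal chunks of lines.
    lines = document.splitlines()
    size = max(1, len(lines) // n_chunks)
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]

def map_stage(chunk):
    # Map: emit a (word, 1) pair for every word occurrence.
    return [(word, 1) for word in chunk.split()]

def shuffle_stage(pairs):
    # Shuffle: group values by key, mimicking the routing of
    # identical keys to the same Worker.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(document):
    pairs = chain.from_iterable(map_stage(c) for c in split_stage(document))
    return reduce_stage(shuffle_stage(pairs))
```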

The MapReduce programming paradigm has deeply influenced open-source projects such as Apache Hadoop, Apache Spark, and Dask.
(performance-metrics)=
# Performance Metrics

Evaluating the performance of a parallel program objectively requires standard metrics. FLOPS and speedup are the most general; for specific problems, such as artificial intelligence and big data, dedicated benchmarks also exist.
## FLOPS

Traditional high-performance computing often measures software and hardware performance in FLOPS (Floating Point Operations per Second).
:::{note}
A floating-point number represents a real number in a computer using a fixed number of bits. More bits give higher precision at a higher computational cost. The Institute of Electrical and Electronics Engineers (IEEE) has standardized the representation of 16-bit (FP16), 32-bit (FP32), and 64-bit (FP64) floating-point numbers. Most scientific computing tasks require FP64, while tasks such as deep learning need only FP32, FP16, or even lower precision. Strictly speaking, a FLOPS figure should state whether it refers to FP32 or FP64, because the FP32 and FP64 throughput of different hardware can differ greatly.
:::

FLOPS is the number of floating-point operations completed per second. For example, adding two $n$-dimensional vectors, $a + b$, requires $n$ floating-point operations; dividing the number of operations by the elapsed time gives the FLOPS.
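As a rough illustration, the sketch below times a vector addition and divides the operation count by the elapsed time. Pure Python carries heavy interpreter overhead, so the resulting figure is far below the hardware's peak, which itself shows that achieved FLOPS depends on the software as much as on the hardware.

```python
import time

def vector_add(a, b):
    # Adding two n-dimensional vectors costs n floating-point additions.
    return [x + y for x, y in zip(a, b)]

n = 1_000_000
a = [1.0] * n
b = [2.0] * n

start = time.perf_counter()
c = vector_add(a, b)
elapsed = time.perf_counter() - start

# FLOPS = number of floating-point operations / elapsed time.
flops = n / elapsed
print(f"about {flops:.2e} FLOPS")
```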
FLOPS depends not only on the hardware but also, to a large extent, on the software and the algorithm. {numref}`thread-process` discusses thread-safety issues, and {numref}`serial-parallel` covers task distribution; if the software or algorithm is poorly designed and large amounts of compute sit idle, an application's FLOPS can be very low.

## Speedup

In theory, a parallel program should be faster than its serial counterpart, i.e., take less time to run. The reduction in execution time is measured by the **speedup**:

$$
\textrm{speedup} = \frac{t_s}{t_p}
$$

where $t_s$ is the execution time of the serial program and $t_p$ is the execution time of the parallel program.

Building on speedup, there is a further metric called **efficiency**:

$$
\textrm{efficiency} = \frac{\textrm{speedup}}{N} = \frac{t_s}{N \cdot t_p}
$$

where $N$ is the number of compute cores the parallel program uses. When the speedup equals $N$, the serial program scales linearly onto the compute cores, and the parallel program is said to achieve *linear speedup*.
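The two formulas translate directly into code; the timings below are hypothetical numbers, chosen only to show the arithmetic.

```python
def speedup(t_s, t_p):
    # speedup = t_s / t_p
    return t_s / t_p

def efficiency(t_s, t_p, n):
    # efficiency = speedup / N = t_s / (N * t_p)
    return speedup(t_s, t_p) / n

# Hypothetical timings: a 100 s serial run and a 25 s parallel run on 8 cores.
print(speedup(100.0, 25.0))        # 4.0
print(efficiency(100.0, 25.0, 8))  # 0.5, i.e., half of linear speedup
```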

Linear speedup is the ideal case and is hard to achieve in practice. As the diagrams in {numref}`serial-parallel` show, a parallel program needs a Scheduler to distribute tasks to multiple Workers, and the Workers must communicate with one another.

Moreover, when GPUs are used, there is some debate over what value $N$ should take in the efficiency metric. A GPU contains thousands of compute cores, so it is hard to measure $t_s$ on a single GPU core; and since GPUs work in tandem with CPUs, should the CPU be counted when computing speedup or efficiency?
(serial-parallel)=
# Serial and Parallel Execution

Without parallel acceleration, most computational tasks (Tasks) execute serially, as shown in {numref}`serial-timeline`. Here, a Worker can be a compute core or a node in a cluster.
```{figure} ../img/ch-intro/serial-timeline.svg | ||
--- | ||
width: 600px | ||
name: serial-timeline | ||
--- | ||
Timeline of serial execution
``` | ||

Clusters and heterogeneous computing provide many more compute cores to exploit; parallel computing distributes tasks across multiple Workers, as shown in {numref}`distributed-timeline`. Whether on a single multi-core machine or on a multi-node cluster, a Scheduler is needed to distribute the computational tasks to the Workers. As more Workers participate, the total time shrinks, and the time saved can be used for other tasks.

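The contrast between the two timelines can be reproduced on a single machine. The sketch below simulates tasks with `time.sleep` and uses a thread pool as a stand-in for the Scheduler and Workers; with four Workers, the four tasks overlap instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(seconds):
    # A stand-in for a computational task; sleeping simulates the work.
    time.sleep(seconds)
    return seconds

durations = [0.1] * 4

# Serial execution: a single Worker runs the tasks one after another.
start = time.perf_counter()
serial_results = [task(d) for d in durations]
serial_time = time.perf_counter() - start

# Parallel execution: the pool distributes the tasks to four Workers.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(task, durations))
parallel_time = time.perf_counter() - start
```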
```{figure} ../img/ch-intro/distributed-timeline.svg | ||
--- | ||
width: 600px | ||
name: distributed-timeline | ||
--- | ||
Timeline of distributed execution
```