Advanced NumPy (work in progress)

2018-09-17 Machine Learning

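In machine learning and related data work, the NumPy type you will use most often is ndarray. The class comes with many convenient functions (a method is essentially a function bound to an object). This post walks through them in detail; use full-text search to find the one you need.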

Loading

The post "Getting Started with NumPy and Pandas" already covered how to install NumPy, so here we simply start Python's interactive interpreter in a terminal and import the library directly.

$ python
Python 3.6.5 |Anaconda custom (64-bit)| (default, Apr 26 2018, 08:42:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np

Common methods of the ndarray type

## shape - the size of the array along each dimension (compare the shape output of the examples below)
>>> tensor_0 = np.array(5)
>>> tensor_0.shape
()     ## zero-dimensional tensor, i.e. a scalar

>>> tensor_1 = np.array([1, 2, 3, 4])
>>> tensor_1.shape
(4,)   ## one-dimensional tensor, i.e. a vector. Note: Python does not treat (4) as a one-element tuple, so the trailing comma is required, much as 1 is an int while 1. is a float

>>> tensor_2 = np.array([[1], [2], [3], [4]])
>>> tensor_2.shape
(4, 1) ## two-dimensional tensor, i.e. a matrix

>>> tensor_4 = np.array([[[[1],[2]],[[3],[4]],[[5],[6]]],[[[7],[8]],\
...     [[9],[10]],[[11],[12]]],[[[13],[14]],[[15],[16]],[[17],[18]]]])
>>> tensor_4.shape
(3, 3, 2, 1)  ## four-dimensional tensor; anything with more than 2 dimensions is simply called a tensor

## Note: assigning an ndarray to a new name passes a reference, not a value; both names point to the same data
## Use numpy.copy() to get an independent copy
>>> tensor_1_addr = tensor_1
>>> tensor_1_value = np.copy(tensor_1)
>>> tensor_1[3]=1
>>> tensor_1[0]=4
>>> tensor_1
array([4, 2, 3, 1])
>>> tensor_1_addr     ## changes along with tensor_1
array([4, 2, 3, 1])
>>> tensor_1_value    ## unaffected by the change to tensor_1
array([1, 2, 3, 4])
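
The reference-versus-copy behaviour above can be checked directly. The following is a minimal sketch; the variable names `original`, `alias`, and `independent` are made up for illustration, and `np.shares_memory` is used only to confirm whether two arrays sit on the same buffer.

```python
import numpy as np

original = np.array([1, 2, 3, 4])
alias = original               # second name for the same array object
independent = original.copy()  # separate array with its own data buffer

original[0] = 99               # modify through the first name

print(alias is original)                        # True: one object, two names
print(np.shares_memory(original, alias))        # True
print(np.shares_memory(original, independent))  # False
print(alias[0], independent[0])                 # 99 1
```

`original.copy()` and `np.copy(original)` should be interchangeable here; either one gives an array that later writes to `original` do not touch.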

## Mean
>>> np.mean(tensor_1)
2.5

## Maximum
>>> np.amax(tensor_1)
4

## Index of the maximum (tensor_1 is now [4, 2, 3, 1], so the maximum 4 sits at tensor_1[0])
>>> np.argmax(tensor_1)
0

## Minimum
>>> np.amin(tensor_1)
1

## Median
>>> np.median(tensor_1)
2.5

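## Standard deviation. The ddof parameter implements Bessel's correction: when a sample standard deviation is used to estimate the population standard deviation, the divisor of the average is adjusted (usually to N - 1)
## ddof defaults to 0. The divisor used in the calculation is (N - ddof), where N is the number of elements, so ddof = 0 divides by N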
>>> np.std(tensor_1, ddof=0)
1.118033988749895
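
To make the role of `ddof` concrete, the sketch below compares `np.std` with a hand-written version of the (N - ddof) divisor. The helper name `manual_std` is hypothetical and exists only for this comparison.

```python
import numpy as np

sample = np.array([4, 2, 3, 1])

def manual_std(values, ddof=0):
    """Standard deviation with divisor (N - ddof), written out by hand."""
    deviations = values - values.mean()
    return np.sqrt(np.sum(deviations ** 2) / (values.size - ddof))

population_std = np.std(sample, ddof=0)  # divisor N
sample_std = np.std(sample, ddof=1)      # divisor N - 1 (Bessel's correction)

print(np.isclose(population_std, manual_std(sample, ddof=0)))  # True
print(np.isclose(sample_std, manual_std(sample, ddof=1)))      # True
print(sample_std > population_std)                             # True
```

Because the divisor shrinks from N to N - 1, the corrected estimate is always slightly larger than the uncorrected one for the same data.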

## Exponential: e (2.71828...) raised to a power
>>> items = [1, 2, 3, 4, 5]
>>> np.exp(1)
2.718281828459045
>>> np.exp(items)  # the input is a list, but the result is an ndarray
array([  2.71828183,   7.3890561 ,  20.08553692,  54.59815003,
       148.4131591 ])

## Natural logarithm, i.e. the logarithm to base e (2.71828...)
>>> np.e
2.718281828459045
>>> np.log(np.e)
1.0

## Join two ndarrays end to end (np.append flattens its inputs unless axis is given)
>>> np.append(tensor_1_addr, tensor_1_value)
array([4, 2, 3, 1, 1, 2, 3, 4])

Matrix computations with the ndarray type

>>> matrx_a = np.array([[1,2,3,4],[5,6,7,8]])
>>> matrx_a
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

>>> matrx_b = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
>>> matrx_b
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

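## Transpose - returns a view that shares the same data (same memory) as the original, so modify it with care
## NumPy does not move any data in memory when transposing; it only changes how the original matrix is indexed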
>>> matrx_b.T
array([[ 1,  4,  7, 10],
       [ 2,  5,  8, 11],
       [ 3,  6,  9, 12]])
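
The shared-memory warning above is easy to demonstrate. This is a small sketch with made-up names (`m`, `t`, `t_copy`): writing through the transpose changes the original, while an explicit `.copy()` breaks the link.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])
t = m.T                        # a view: same data, different indexing

print(np.shares_memory(m, t))  # True

t[0, 1] = 40                   # t[0, 1] is the same element as m[1, 0]
print(m[1, 0])                 # 40

t_copy = m.T.copy()            # independent transposed copy
t_copy[0, 0] = -1
print(m[0, 0])                 # 1, the original is untouched
```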

## Element-wise arithmetic (not to be confused with matrix multiplication)
>>> matrx_a * 0.25
array([[0.25, 0.5 , 0.75, 1.  ],
       [1.25, 1.5 , 1.75, 2.  ]])

## Maximum over all elements of the matrix, ignoring NaN (without NaN values this equals np.amax)
>>> np.nanmax(matrx_a)
8

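## Dot product - for two 2-D inputs np.dot performs matrix multiplication (a is the identity here, so the result equals b)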
>>> a = [[1, 0], [0, 1]]
>>> b = [[4, 1], [2, 2]]
>>> np.dot(a, b)
array([[4, 1],
       [2, 2]])

## Matrix multiplication
>>> np.matmul(matrx_a, matrx_b)
array([[ 70,  80,  90],
       [158, 184, 210]])
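
Matrix multiplication only works when the inner dimensions agree: an (n, k) matrix times a (k, m) matrix gives an (n, m) result. The sketch below rebuilds the same two matrices under the illustrative names `a` and `b`, uses the `@` operator (equivalent to `np.matmul` for 2-D arrays), and shows the error raised when the shapes do not line up.

```python
import numpy as np

a = np.arange(1, 9).reshape(2, 4)    # same values as matrx_a, shape (2, 4)
b = np.arange(1, 13).reshape(4, 3)   # same values as matrx_b, shape (4, 3)

product = a @ b                      # (2, 4) @ (4, 3) -> (2, 3)
print(product.shape)                 # (2, 3)
print(np.array_equal(product, np.matmul(a, b)))  # True

# Reversing the order gives (4, 3) @ (2, 4): inner dimensions 3 and 2 differ
try:
    b @ a
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(shapes_compatible)             # False
```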

## Use reshape() to change a matrix's shape, e.g. to make it compatible for matrix multiplication
>>> matrx_a.reshape([1,8])
array([[1, 2, 3, 4, 5, 6, 7, 8]])
>>> matrx_a.reshape([8,1])
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8]])
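
As a sketch of why reshaping matters for multiplication: a 1-D vector can be turned into an explicit column or row so that the shapes line up. The names `v`, `column`, and `row` are illustrative; passing `-1` asks NumPy to infer that dimension from the number of elements.

```python
import numpy as np

v = np.array([1, 2, 3, 4])
column = v.reshape(-1, 1)   # shape (4, 1); -1 is inferred as 4
row = v.reshape(1, -1)      # shape (1, 4)

a = np.arange(1, 9).reshape(2, 4)   # same values as matrx_a

print((a @ column).shape)   # (2, 1)
print((column @ row).shape) # (4, 4), an outer product
```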

## Create a new matrix with the same shape (and dtype) as matrx_a, filled with 0
>>> np.zeros_like(matrx_a)
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

Neat tricks with the ndarray type

## Select elements of one list according to the True/False values in another list
## The selector list may also hold 0 and 1 (in Python 0 is falsy and 1 is truthy)
>>> from itertools import compress   # import the compress function from the itertools module
>>> list_a = [1, 2, 4, 6]
>>> fil = [True, False, True, False]
>>> list(compress(list_a, fil))
[1, 4]
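
With NumPy arrays, the same kind of selection is usually written as boolean-mask indexing. The sketch below uses illustrative names (`arr`, `mask`, `flags`) and points out one difference from `compress`: an integer array of 0s and 1s is treated as positions, not as a mask, so it has to be converted to bool first.

```python
import numpy as np

arr = np.array([1, 2, 4, 6])
mask = np.array([True, False, True, False])

print(arr[mask])              # [1 4]
print(arr[arr > 2])           # [4 6], mask built from a condition

flags = np.array([1, 0, 1, 0])
print(arr[flags])             # [2 1 2 1], integers are used as indices
print(arr[flags.astype(bool)])  # [1 4], converted to a boolean mask
```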

The axis parameter

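NumPy's array type and pandas' Series and DataFrame types usually accept an axis parameter that sets the direction of an operation (across rows or across columns). It is easy to mix up which direction 0 and 1 refer to, so it is worth checking on a small example before relying on it.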

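For the Array and Series types:

axis = 0
along the first dimension (for a 2-D array: down the rows, vertically); a Series has only this axis
axis = 1
along the second dimension (for a 2-D array: across the columns, horizontally)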

For the DataFrame type:

axis = 0 or axis = "index"
along the index (vertically, down the rows)
axis = 1 or axis = "columns"
along the columns (horizontally, across the columns)
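
A quick way to run the check suggested above is to try a reduction and a concatenation on a tiny 2-D array. This sketch uses NumPy only, with an illustrative array named `grid`; DataFrame reductions such as `sum` follow the same 0/1 convention.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])         # shape (2, 3)

# Reductions collapse the axis they run along
col_sums = grid.sum(axis=0)          # moves down the rows: one sum per column
row_sums = grid.sum(axis=1)          # moves across the columns: one sum per row
print(col_sums)                      # [5 7 9]
print(row_sums)                      # [ 6 15]

# Concatenation grows the axis it is given
print(np.concatenate([grid, grid], axis=0).shape)  # (4, 3)
print(np.concatenate([grid, grid], axis=1).shape)  # (2, 6)
```

One way to remember it: the axis you name is the one that disappears in a reduction and the one that gets longer in a concatenation.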
