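Common Data Preprocessing Methods

Data Acquisition

Letters and Digits

The string module provides constants for the lowercase and uppercase ASCII letters, digits, punctuation, and so on. For example: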

>>> import string
>>> string.ascii_lowercase
# lowercase letters
'abcdefghijklmnopqrstuvwxyz'
>>> string.ascii_uppercase
# uppercase letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.ascii_letters
# lowercase and uppercase letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
# decimal digits
'0123456789'
>>> string.hexdigits
# hexadecimal digits
'0123456789abcdefABCDEF'
>>> string.punctuation
# punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.whitespace
# whitespace characters
' \t\n\r\x0b\x0c'
>>> string.printable
# all printable ASCII characters (digits, letters, punctuation, and whitespace)
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
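
These constants are handy for filtering text or building a character vocabulary. The following is a minimal sketch; the names ALLOWED, clean_text, and char_to_index are made up for illustration and are not part of the string module.

```python
import string

# Hypothetical whitelist: ASCII letters, digits, and the space character.
ALLOWED = set(string.ascii_letters + string.digits + " ")

def clean_text(text):
    """Drop every character that is not in ALLOWED."""
    return "".join(ch for ch in text if ch in ALLOWED)

# Map each lowercase letter to an integer index, e.g. for a character-level model.
char_to_index = {ch: i for i, ch in enumerate(string.ascii_lowercase)}

print(clean_text("Hello, World! 123"))  # Hello World 123
print(char_to_index["c"])               # 2
```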

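Dataset Processing

Min-Max Scaling

When the data range is known, scaling can be done by hand. For example, image pixel values range from 0 to 255, so dividing by 255.0 maps them into [0, 1]. Alternatively, use MinMaxScaler from sklearn.preprocessing, which expects a 2D array of shape (n_samples, n_features) and scales each column independently. For example:

>>> import numpy as np
>>> from sklearn.preprocessing import MinMaxScaler
>>> x = np.array([ -20,    1,    5,  100,  300, -200])
>>> x
array([ -20,    1,    5,  100,  300, -200])
>>> min_max_scaler = MinMaxScaler()
>>> min_max_scaler.fit_transform(x.reshape(-1, 1)).ravel()
# reshape to a single column for the scaler, then flatten again; the data is now scaled to [0, 1]
array([0.36 , 0.402, 0.41 , 0.6  , 1.   , 0.   ])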
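
The scaling itself is just (x - min) / (max - min) per column. Below is a small NumPy sketch of that formula; min_max_scale is an illustrative helper, not the scikit-learn implementation, and the handling of constant columns is an assumption made here to avoid division by zero.

```python
import numpy as np

def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Scale each column of a 2D array into feature_range (illustrative helper)."""
    x = np.asarray(x, dtype=np.float64)
    lo, hi = feature_range
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Assumption: constant columns use a span of 1 so they map to `lo`.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span * (hi - lo) + lo

x = np.array([-20, 1, 5, 100, 300, -200]).reshape(-1, 1)
scaled = min_max_scale(x).ravel()
print(scaled)
```

With the same input as above, the values should match the MinMaxScaler output up to floating-point rounding.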

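Normalization

Use normalize from sklearn.preprocessing, which rescales each sample (row) to unit norm and uses the L2 norm by default. The input must be 2D, so the 1D array is reshaped into a single row first. For example:

>>> import numpy as np
>>> from sklearn.preprocessing import normalize
>>> x = np.array([ -20,    1,    5,  100,  300, -200])
>>> normalize(x.reshape(1, -1), norm='l1')
# use the L1 norm
array([[-0.03194888,  0.00159744,  0.00798722,  0.15974441,  0.47923323,
        -0.31948882]])
>>> x / np.sum(abs(x))
# dividing by the sum of absolute values gives the same result
array([-0.03194888,  0.00159744,  0.00798722,  0.15974441,  0.47923323,
       -0.31948882])
>>> np.sum(abs(normalize(x.reshape(1, -1), norm='l1')))
# the absolute values sum to 1 (up to floating-point rounding)
1.0
>>> normalize(x.reshape(1, -1))
# L2 norm by default
array([[-0.05337111,  0.00266856,  0.01334278,  0.26685555,  0.80056665,
        -0.5337111 ]])
>>> x / np.sqrt(np.sum(np.square(x)))
# dividing by the Euclidean length gives the same result
array([-0.05337111,  0.00266856,  0.01334278,  0.26685555,  0.80056665,
       -0.5337111 ])
>>> np.sum(np.square(normalize(x.reshape(1, -1))))
# the squares sum to 1 (up to floating-point rounding)
1.0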
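
For a matrix with several samples, the same idea applies row by row. The sketch below uses plain NumPy; normalize_rows is a hypothetical helper written for this note, and leaving all-zero rows unchanged is a choice made here rather than something taken from the text above.

```python
import numpy as np

def normalize_rows(x, norm="l2"):
    """Rescale every row of a 2D array to unit L1 or L2 norm."""
    x = np.asarray(x, dtype=np.float64)
    if norm == "l1":
        lengths = np.abs(x).sum(axis=1, keepdims=True)
    elif norm == "l2":
        lengths = np.sqrt(np.square(x).sum(axis=1, keepdims=True))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    lengths[lengths == 0] = 1.0  # leave all-zero rows unchanged
    return x / lengths

m = np.array([[3.0, 4.0], [1.0, -1.0], [0.0, 0.0]])
l2 = normalize_rows(m)             # first row becomes [0.6, 0.8]
l1 = normalize_rows(m, norm="l1")  # second row becomes [0.5, -0.5]
```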

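Standardization

Standardization shifts and rescales the data so that each feature has zero mean and unit variance. It does not change the shape of the distribution, so the result is not necessarily Gaussian. Use scale from sklearn.preprocessing, which standardizes each column (axis=0) by default. For example:

>>> import numpy as np
>>> from sklearn.preprocessing import scale
>>> x = np.random.rand(2, 5)
>>> x
array([[0.13235262, 0.39446947, 0.24545936, 0.71063946, 0.84120157],
       [0.44020666, 0.93713114, 0.71579456, 0.35743245, 0.85335586]])
>>> scale(x)
# with only two rows, each column becomes exactly -1 and 1
array([[-1., -1., -1.,  1., -1.],
       [ 1.,  1.,  1., -1.,  1.]])
>>> scale(x).mean(), scale(x).std()
# the mean is close to 0 and the standard deviation is 1 (std is the standard deviation, the square root of the variance)
(2.2204460492503132e-17, 1.0)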
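
In practice the mean and standard deviation are usually computed on the training data only and then reused for the test data. The following NumPy sketch shows that pattern with synthetic data; the array names and the random data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # synthetic training features
test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))    # synthetic test features

# Per-column statistics from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_std = (train - mean) / std
test_std = (test - mean) / std  # reuse the training statistics
```

After this, each column of train_std has mean 0 and standard deviation 1, while test_std is only approximately standardized because it uses the training statistics.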

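Random Proportional Train/Test Split

Use train_test_split from sklearn.model_selection. For example:

>>> from sklearn.model_selection import train_test_split
>>> x = list(range(10))
>>> train, test = train_test_split(x, test_size=0.2)
# split into 80% training set and 20% test set (the split is random, so your output will differ; pass random_state for a reproducible split)
>>> train
[4, 5, 0, 3, 1, 9, 8, 7]
>>> test
[2, 6]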
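
Conceptually, a random split is a permutation of the sample indices cut at the chosen proportion. The sketch below shows this with NumPy; split_indices is a hypothetical helper, and its rounding rule for the test size is a simplification that may differ from scikit-learn's.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) from one seeded random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return order[n_test:], order[:n_test]

x = np.arange(10) * 10  # features
y = np.arange(10)       # labels, aligned with x
train_idx, test_idx = split_indices(len(x), test_size=0.2, seed=42)
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Indexing x and y with the same index arrays keeps features and labels aligned.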

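Shuffling x and y Together

Use shuffle from sklearn.utils. For example:

>>> from sklearn.utils import shuffle
>>> x = [1, 2, 3]
>>> y = [10, 20, 30]
>>> shuffle(x, y)
# x and y are shuffled with the same permutation, so the pairs stay aligned (the order is random and will differ between runs)
[[2, 3, 1], [20, 30, 10]]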
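
The same effect can be had with NumPy alone by drawing one permutation and applying it to both arrays. This is a sketch with made-up data and a fixed seed chosen for reproducibility.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])

rng = np.random.default_rng(0)
perm = rng.permutation(len(x))  # one shared random ordering

x_shuffled, y_shuffled = x[perm], y[perm]
```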

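Encoding

One-Hot Encoding

Use to_categorical from keras.utils. When the number of classes is not given, it is taken as the largest label plus one, so labels up to 5 produce 6 columns. For example:

>>> from keras.utils import to_categorical
>>> to_categorical([1,2,5])
array([[0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.]], dtype=float32)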
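
When Keras is not available, one-hot encoding is easy to sketch with NumPy by indexing the rows of an identity matrix. The helper name one_hot is an assumption for this example; argmax recovers the original labels.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode non-negative integer labels (illustrative helper)."""
    labels = np.asarray(labels, dtype=np.int64)
    if num_classes is None:
        num_classes = int(labels.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([1, 2, 5])
decoded = encoded.argmax(axis=1)  # back to integer labels
```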

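Image Processing

Converting Color to Grayscale

A color image is stored as a three-channel array of shape (height, width, 3). To get a grayscale image, reduce the three channels to one. There are several ways to do this; one is to request grayscale directly when reading the file with cv2 (filepath below is the path to an image file). For example:

>>> import cv2
>>> img = cv2.imread(filepath)
>>> img.shape
# three channels are read by default (cv2 stores them in BGR order)
(127, 109, 3)
>>> img = cv2.imread(filepath, cv2.IMREAD_GRAYSCALE)
# read a single channel, i.e. grayscale (cv2.IMREAD_GRAYSCALE equals 0)
>>> img.shape
(127, 109)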

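Other options are to keep only one of the three channels, or to average the three channels into one. For example:

>>> import numpy as np
>>> img = cv2.imread(filepath)
>>> img.shape
(127, 109, 3)
>>> img[:,:,0].shape
# take the first channel only
(127, 109)
>>> x = ((img[:,:,0] + img[:,:,1] + img[:,:,2]) // 3).astype(np.uint8)
# average the three channels by hand; the result is cast back to uint8 because that is the image dtype
# this version is likely to give wrong values on image arrays and is not recommended; the reason is given at the end of this example
>>> x.shape
# the result has one channel (but the values are not necessarily the mean)
(127, 109)
>>> y = np.mean(img, axis=2).astype(np.uint8)
# average over the channel axis; this is the recommended way, for the reason given at the end of this example
>>> y.shape
# the shape is correct as well
(127, 109)
>>> x == y
# comparing x and y shows that many positions differ
array([[ True, False, False, ..., False,  True, False],
       [False,  True, False, ..., False,  True,  True],
       [ True,  True,  True, ..., False, False,  True],
       ...,
       [False, False, False, ..., False, False,  True],
       [False, False, False, ..., False, False,  True],
       [False, False,  True, ..., False, False, False]])
>>> img[0,1,:], x[0,1], y[0,1]
# inspect the first False position; the second method is the correct one, and the reason is simple:
# the array dtype is uint8, whose maximum is 255, so a sum above 255 wraps around modulo 256
# the first method therefore computes (135+121+115-256) // 3 = 115 // 3 = 38
# the second method accumulates in floating point and takes the true mean, 371 / 3 = 123.67, truncated to 123
(array([135, 121, 115], dtype=uint8), 38, 123)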
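
The wraparound can be reproduced on a single pixel without loading an image. The sketch below also shows one way to average by hand safely, by widening the dtype before adding; the variable names are illustrative.

```python
import numpy as np

# One pixel with the channel values from the example above, shape (1, 1, 3).
img = np.array([[[135, 121, 115]]], dtype=np.uint8)

# uint8 addition wraps modulo 256: 135 + 121 + 115 = 371 -> 115, and 115 // 3 = 38.
wrapped = (img[:, :, 0] + img[:, :, 1] + img[:, :, 2]) // 3

# Widen to uint16 first so the sum (at most 765) fits, then cast back.
widened = img.astype(np.uint16)
safe = ((widened[:, :, 0] + widened[:, :, 1] + widened[:, :, 2]) // 3).astype(np.uint8)

# np.mean accumulates in floating point, so it does not overflow either.
via_mean = np.mean(img, axis=2).astype(np.uint8)

print(wrapped[0, 0], safe[0, 0], via_mean[0, 0])  # 38 123 123
```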

Alternatively, use cvtColor from cv2 or the convert method of a PIL image. For example:

>>> import cv2
>>> import numpy as np
>>> img = cv2.imread(filepath)
>>> img.shape
(127, 109, 3)
>>> cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).shape
# convert to grayscale with cv2
(127, 109)
>>> from PIL import Image
>>> img = Image.open(filepath)
>>> np.array(img).shape
(127, 109, 3)
>>> np.array(img.convert("L")).shape
# convert to grayscale with PIL
(127, 109)
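
Both library conversions use a weighted sum of the channels rather than a plain mean, with the standard luma weights 0.299 R + 0.587 G + 0.114 B. The following NumPy sketch applies those weights to an RGB array; rgb_to_gray is a hypothetical helper, and its rounding may differ slightly from what cv2 or PIL produce.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted grayscale for an (H, W, 3) uint8 array in RGB order."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights  # weighted sum over the channel axis
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

# A tiny 2x2 test image: red, green, blue, white.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
gray = rgb_to_gray(rgb)
```

For an array read with cv2, which is in BGR order, reverse the last axis first (bgr[:, :, ::-1]) before calling this helper.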

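Resizing

Use resize from cv2. For example:

>>> import cv2
>>> import numpy as np
>>> x = np.random.rand(10, 20)
# create a grayscale image of height 10 and width 20 (for a color image, add a three-channel last dimension)
>>> x.shape
(10, 20)
>>> cv2.resize(x, (6, 10)).shape
# resize to height 10 and width 6 (note that resize takes the target size as (width, height))
(10, 6)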
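
To make the (width, height) convention concrete, here is a nearest-neighbour resize sketched in NumPy. resize_nearest is a hypothetical helper; cv2.resize uses bilinear interpolation by default, so its pixel values will differ from this sketch even though the output shape is the same.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize; size is (width, height), mirroring cv2.resize."""
    new_w, new_h = size
    old_h, old_w = img.shape[:2]
    rows = np.arange(new_h) * old_h // new_h  # source row for each output row
    cols = np.arange(new_w) * old_w // new_w  # source column for each output column
    return img[rows[:, None], cols]

x = np.arange(200).reshape(10, 20)    # grayscale-like array, height 10, width 20
small = resize_nearest(x, (6, 10))    # height 10, width 6
color = resize_nearest(np.zeros((10, 20, 3)), (6, 10))  # channels are preserved
```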

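Merging Images

An image is just an array, so merging images comes down to concatenating arrays.

Vertical Merge

>>> import cv2
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> img = cv2.imread(filepath)
>>> img.shape
(576, 1024, 3)
>>> np.concatenate((img, img)).shape
# concatenates along axis 0 (vertically) by default
(1152, 1024, 3)
>>> plt.imshow(np.concatenate((img, img)))
<matplotlib.image.AxesImage object at 0x00000198E5DDF240>
>>> plt.show()
# displays the merged image (cv2 arrays are in BGR order, so red and blue appear swapped unless the channels are reversed first)

Horizontal Merge

This is almost the same as the vertical case, except that the axis must be specified: change the np.concatenate call to np.concatenate((img, img), 1).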
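
Both directions can be checked on small synthetic arrays without an image file. This sketch uses two made-up 2x3 color blocks; note that all dimensions other than the concatenation axis must match.

```python
import numpy as np

a = np.zeros((2, 3, 3), dtype=np.uint8)      # black block, height 2, width 3
b = np.full((2, 3, 3), 255, dtype=np.uint8)  # white block, same size

vertical = np.concatenate((a, b), axis=0)    # stacked top to bottom: (4, 3, 3)
horizontal = np.concatenate((a, b), axis=1)  # placed side by side: (2, 6, 3)

print(vertical.shape, horizontal.shape)
```

np.vstack and np.hstack give the same results for these inputs.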

Flipping

Reverse the corresponding axis with a slice. For example:

>>> import cv2
>>> import matplotlib.pyplot as plt
>>> img = cv2.imread(filepath)
>>> plt.imshow(img[::-1,:,:])
# flip vertically (upside down)
<matplotlib.image.AxesImage object at 0x00000198E8E719B0>
>>> plt.show()
>>> plt.imshow(img[:,::-1,:])
# flip horizontally (mirror left and right)
<matplotlib.image.AxesImage object at 0x00000198E5968AC8>
>>> plt.show()
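
The slices are equivalent to NumPy's flip helpers, which can be verified on a tiny array. This is a sketch with synthetic data; the slices return views, so no pixel data is copied.

```python
import numpy as np

img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)  # tiny stand-in image

upside_down = img[::-1, :, :]  # reverse the row axis
mirrored = img[:, ::-1, :]     # reverse the column axis

same_as_flipud = np.array_equal(upside_down, np.flipud(img))
same_as_fliplr = np.array_equal(mirrored, np.fliplr(img))
is_view = np.shares_memory(upside_down, img)
```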

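Blurring

Use the filter method of a Pillow image. For example:

>>> import numpy as np
>>> from PIL import Image, ImageFilter
>>> img = Image.open(filepath)
>>> np.array(img.filter(ImageFilter.BLUR)).shape
# simple blur with Pillow's built-in BLUR kernel
(235, 209, 3)
>>> np.array(img.filter(ImageFilter.GaussianBlur(radius=2))).shape
# Gaussian blur (radius=2 is the default)
(235, 209, 3)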
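
To illustrate what a blur does, here is a plain mean (box) filter sketched in NumPy for a single-channel image. box_blur is a hypothetical helper and does not reproduce Pillow's BLUR or GaussianBlur kernels; replicating the edge pixels is an assumption made here.

```python
import numpy as np

def box_blur(gray, k=3):
    """Mean filter with a k x k window (k odd) on a 2D array, edges replicated."""
    pad = k // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum the k*k shifted copies of the image, then divide by the window size.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

spot = np.zeros((5, 5))
spot[2, 2] = 9.0            # a single bright pixel
blurred = box_blur(spot)    # the brightness spreads over the 3x3 neighbourhood
```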

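Converting Between cv2 and Pillow

For example:

>>> import numpy as np
>>> from PIL import Image
>>> import cv2
>>> img = Image.open(filepath)
# read the image with Pillow
>>> img
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=209x235 at 0x165BA6CD5C0>
>>> np.array(img).shape
# a Pillow image converts directly to an array that cv2 functions accept; it is in RGB order, so convert it with cv2.cvtColor(arr, cv2.COLOR_RGB2BGR) before cv2 calls that assume BGR
(235, 209, 3)
>>> img2 = cv2.imread(filepath)
>>> img2.shape
(235, 209, 3)
>>> Image.fromarray(cv2.cvtColor(img2, cv2.COLOR_BGR2RGB))
# build a Pillow image from a cv2 array; cv2 reads images in BGR order, so convert to RGB first
<PIL.Image.Image image mode=RGB size=209x235 at 0x165BBC485C0>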
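
The only difference between the two array layouts is the channel order, so the conversion can also be sketched by reversing the last axis with NumPy. The arrays below are synthetic; np.ascontiguousarray is used because some library functions expect contiguous memory rather than a reversed view.

```python
import numpy as np

# A 2x2 image that is pure red when interpreted in BGR order (red is channel 2).
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 2] = 255

# Reverse the channel axis to get RGB order, and make the result contiguous.
rgb = np.ascontiguousarray(bgr[:, :, ::-1])

print(rgb[0, 0])  # red is now channel 0
```

Reversing the axis a second time restores the original BGR array.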

Text Processing

Padding Sequences to a Uniform Length

Use pad_sequences from keras.preprocessing.sequence. For example:

>>> from keras.preprocessing.sequence import pad_sequences
>>> pad_sequences([[1,2,3], [1,2,3,4], [1,2,3,4,5]], maxlen=5)
# pad every sequence to length 5; shorter sequences are filled with 0 at the front
array([[0, 0, 1, 2, 3],
       [0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5]])
>>> pad_sequences([[1,2,3], [1,2,3,4], [1,2,3,4,5]], maxlen=5, dtype='float', padding='post', value=-1)
# as above, but convert the data to float and fill the missing positions at the end with -1 (padding defaults to 'pre', i.e. fill at the front)
array([[ 1.,  2.,  3., -1., -1.],
       [ 1.,  2.,  3.,  4., -1.],
       [ 1.,  2.,  3.,  4.,  5.]])
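
The padding logic is simple enough to sketch with NumPy, which also makes the truncation of over-long sequences explicit. The helper pad below is written for this note and is not the Keras implementation; it mirrors the 'pre' defaults for both padding and truncating and only handles integer data.

```python
import numpy as np

def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate integer sequences to exactly maxlen items."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays fully padded
        # 'pre' truncation keeps the last maxlen items, 'post' keeps the first.
        trimmed = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        if padding == "pre":
            out[i, maxlen - len(trimmed):] = trimmed
        else:
            out[i, :len(trimmed)] = trimmed
    return out

padded = pad([[1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5]], maxlen=5)
truncated = pad([[1, 2, 3, 4, 5, 6]], maxlen=4)  # keeps [3, 4, 5, 6]
post = pad([[1, 2]], maxlen=4, padding="post", value=-1)
```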