Python-3. Data Analysis with Pandas

The official pandas definition: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

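Pandas is an essential Python tool for data analysis. It covers the five key stages of a data analysis workflow:

Loading data
Tidying data
Manipulating data
Building data models
Analyzing data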

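Introduction to Pandas

Key Features of Pandas

Provides the DataFrame object.
Loads data from different sources (Excel/CSV/SQL) and converts it into objects that can be processed.
Groups data by row or column labels, then aggregates and transforms the grouped objects.
Supports data normalization and missing-value handling.
Adds, deletes, and modifies the columns of a DataFrame.
Handles datasets of different shapes, such as matrix data, heterogeneous tables, and time series.
Supports subsetting, slicing, filtering, grouping, and reordering.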

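Built-in Data Structures

Data type	Dimensions	Description
Series	1-D	Can hold any data type, such as strings, integers, floats, and objects. A Series describes its data through its index and values attributes.
DataFrame	2-D	A two-dimensional tabular structure with a row index (index) and a column index (columns); both can be specified at creation time.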

[Figure: Series]
[Figure: DataFrame]

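The Series Object

Creating a Series

'''
data    : input data; can be a list, a scalar, an ndarray, and so on.
index   : index labels, the same length as data; duplicate labels are allowed. If omitted, defaults to 0..n-1 (a RangeIndex).
dtype   : data type; inferred from the data if not given.
copy    : whether to copy the input data; False by default in older pandas releases.
'''
import pandas as pd
# Signature (placeholders, not runnable as written): pd.Series(data, index, dtype, copy)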

import pandas as pd
import numpy as np

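# Create from a scalar: the value is repeated for every index label
s1 = pd.Series(5, np.arange(5))

# Create from an ndarray
data = np.array(['a','b','c','d'])
# Default index
s2 = pd.Series(data)
# Custom index labels
s3 = pd.Series(data, index=[100,101,102,103])

# Create from a dict: the keys become the index labels
data_dict = {'a' : 0., 'b' : 1., 'c' : 2.}
s4 = pd.Series(data_dict)
# Labels missing from the dict ('d' here) are filled with NaN
s5 = pd.Series(data_dict, index=['b','c','d','a'])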

print("----------s1----------")
print(s1)
print("----------s1----------")
print("----------s2----------")
print(s2)
print("----------s2----------")
print("----------s3----------")
print(s3)
print("----------s3----------")
print("----------s4----------")
print(s4)
print("----------s4----------")
print("----------s5----------")
print(s5)
print("----------s5----------")

# output
'''
----------s1----------
0    5
1    5
2    5
3    5
4    5
dtype: int64
----------s1----------
----------s2----------
0    a
1    b
2    c
3    d
dtype: object
----------s2----------
----------s3----------
100    a
101    b
102    c
103    d
dtype: object
----------s3----------
----------s4----------
a    0.0
b    1.0
c    2.0
dtype: float64
----------s4----------
----------s5----------
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
----------s5----------
'''

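Reading Series Data

Series Attributes

attribute	description
axes	Returns a list containing the row index labels.
dtype	Returns the data type of the object.
empty	Returns a boolean indicating whether the object is empty.
ndim	Returns the number of dimensions; always 1 for a Series.
size	Returns the number of elements in the Series.
values	Returns the Series data as an ndarray.
index	Returns the index object; with the default index this is a RangeIndex describing the range of index values.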

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5))

print("----------s----------")
print(s)
print("---------s[0]---------")
print(s[0])
print("-----s[[0, 1, 2]]-----")
print(s[[0, 1, 2]])
print("--------s.axes--------")
print(s.axes)
print("--------s.dtype--------")
print(s.dtype)
print("--------s.empty--------")
print(s.empty)
print("--------s.ndim--------")
print(s.ndim)
print("--------s.size--------")
print(s.size)
print("--------s.values--------")
print(s.values)
print("--------s.index--------")
print(s.index)

# output
'''
----------s----------
0    0.160836
1   -0.289639
2    0.260978
3    0.194951
4   -0.012629
dtype: float64
---------s[0]---------
0.16083581207268774
-----s[[0, 1, 2]]-----
0    0.160836
1   -0.289639
2    0.260978
dtype: float64
--------s.axes--------
[RangeIndex(start=0, stop=5, step=1)]
--------s.dtype--------
float64
--------s.empty--------
False
--------s.ndim--------
1
--------s.size--------
5
--------s.values--------
[ 0.16083581 -0.28963936  0.26097825  0.1949514  -0.012629  ]
--------s.index--------
RangeIndex(start=0, stop=5, step=1)
'''

method	description	method	description
head()	Returns the first n rows (5 by default)	isnull()	Returns True where a value is missing
tail()	Returns the last n rows (5 by default)	notnull()	Returns False where a value is missing

import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5))
print("--------s--------")
print (s)
# Return the first three rows
print("----s.head()----")
print (s.head(3))
print("----s.tail()----")
print (s.tail(3))

s=pd.Series([1,2,3,None])
print("----pd.isnull()----")
print(pd.isnull(s))
print("----pd.notnull()----")
print(pd.notnull(s))

# output
'''
--------s--------
0   -0.863094
1    0.298486
2    1.344080
3   -0.420814
4   -1.042683
dtype: float64
----s.head()----
0   -0.863094
1    0.298486
2    1.344080
dtype: float64
----s.tail()----
2    1.344080
3   -0.420814
4   -1.042683
dtype: float64
----pd.isnull()----
0    False
1    False
2    False
3     True
dtype: bool
----pd.notnull()----
0     True
1     True
2     True
3    False
dtype: bool
'''

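The DataFrame Object

Creating a DataFrame

'''
data    : input data; can be an ndarray, Series, list, dict, scalar, or another DataFrame.
index   : row labels; if omitted, defaults to 0..n-1 (a RangeIndex), where n is the number of rows in data.
columns : column labels; if omitted, defaults to 0..n-1, where n is the number of columns.
dtype   : a single data type to force on all columns; inferred per column if not given.
copy    : whether to copy the input data; False by default in older pandas releases.
'''
import pandas as pd
# Signature (placeholders, not runnable as written): pd.DataFrame(data, index, columns, dtype, copy)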

import pandas as pd

list1 = [1, 2, 3, 4]
list2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
print("---------from list---------")
print("---------1-D list---------")
df1 = pd.DataFrame(list1)
print(df1)
print("---------2-D list---------")
df2 = pd.DataFrame(list2)
print(df2)
df3 = pd.DataFrame(list2, columns = ['a', 'b', 'c', 'd'])
print("---------list with columns---------")
print(df3)

print("---------from dict---------")
dict1 = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28, 34, 29, 42]}
df4 = pd.DataFrame(dict1)
print(df4)
print("---------dict with index---------")
df5 = pd.DataFrame(dict1, index = ['a', 'b', 'c', 'd'])
print(df5)

print("---------from list of dicts---------")
list_dict = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df6 = pd.DataFrame(list_dict)
print(df6)
print("---------list of dicts with index---------")
df7 = pd.DataFrame(list_dict, index = ['first', 'second'])
print(df7)

print("---------from Series---------")
dict_series = {
    'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df8 = pd.DataFrame(dict_series)
print(df8)

# output
'''
---------from list---------
---------1-D list---------
   0
0  1
1  2
2  3
3  4
---------2-D list---------
   0  1  2  3
0  1  2  3  4
1  5  6  7  8
---------list with columns---------
   a  b  c  d
0  1  2  3  4
1  5  6  7  8
---------from dict---------
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
---------dict with index---------
    Name  Age
a    Tom   28
b   Jack   34
c  Steve   29
d  Ricky   42
---------from list of dicts---------
   a   b     c
0  1   2   NaN
1  5  10  20.0
---------list of dicts with index---------
        a   b     c
first   1   2   NaN
second  5  10  20.0
---------from Series---------
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
'''

Reading DataFrame Data

import pandas as pd

dict_series = {
    'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(dict_series)
print("--------print--------")
print(df)
print("--------read column by label--------")
print(df['one'])

df['three'] = pd.Series([10, 11, 12, 13], index=['a', 'b', 'c', 'd'])
df['four'] = df['two'] + df['three']
print("--------add columns--------")
print(df)

print("--------read row by label--------")
print(df.loc['a'])
print("--------read row by position--------")
print(df.iloc[1])

print("--------slice multiple rows--------")
print(df[2:4])

# output
'''
--------print--------
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
--------read column by label--------
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
--------add columns--------
   one  two  three  four
a  1.0    1     10    11
b  2.0    2     11    13
c  3.0    3     12    15
d  NaN    4     13    17
--------read row by label--------
one       1.0
two       1.0
three    10.0
four     11.0
Name: a, dtype: float64
--------read row by position--------
one       2.0
two       2.0
three    11.0
four     13.0
Name: b, dtype: float64
--------slice multiple rows--------
   one  two  three  four
c  3.0    3     12    15
d  NaN    4     13    17
'''

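DataFrame Attributes and Methods

attr&method	desc	attr&method	desc
T	Transposes rows and columns	head()	Returns the first n rows
axes	Returns a list of the row and column labels	tail()	Returns the last n rows
dtypes	Returns the data type of each column	shift()	Shifts rows/columns by a given number of periods
empty	Returns True if the object is empty	insert()	Inserts a new column
ndim	Number of dimensions; always 2 for a DataFrame	del / pop()	Deletes a column
shape	Returns the shape as a (rows, columns) tuple	append()	Adds new rows (removed in pandas 2.0; use pd.concat() instead)
size	Returns the number of elements	drop()	Deletes rows
values	Returns the elements as a NumPy array	-	-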
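
The table above has no accompanying example, so here is a minimal sketch of several of those attributes and methods. The frame and its column names (`one`, `two`, `mid`) are made up for illustration. It uses `pd.concat()` to add a row, on the assumption that a recent pandas release without `DataFrame.append()` is installed.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6]}, index=["a", "b", "c"])

# Attributes
print(df.T)        # rows and columns swapped
print(df.shape)    # (rows, columns)
print(df.size)     # number of elements
print(df.dtypes)   # dtype of each column

# insert(): add a column at position 1 (modifies df in place)
df.insert(1, "mid", [10, 20, 30])

# pop(): remove a column and return it as a Series
popped = df.pop("two")

# drop(): return a copy without the row labelled "b"
without_b = df.drop("b")

# shift(): move values down by one row; the first row becomes NaN
shifted = df.shift(1)

# pd.concat(): add a new row labelled "d"
new_row = pd.DataFrame({"one": [4], "mid": [40]}, index=["d"])
extended = pd.concat([df, new_row])
print(extended)
```

Note that `insert()` and `pop()` change the DataFrame in place, while `drop()`, `shift()`, and `pd.concat()` return new objects.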

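Reading CSV Files with Pandas

pd.read_csv(filepath_or_buffer, sep, header, names, index_col)

parameter	desc
filepath_or_buffer	File path (or a file-like object).
sep	Column separator. The default for read_csv() is ','.
header	By default the first row is used as the column names. With `header=None`, supply the column names yourself.
names	List of column names to use when `header=None`.
index_col	Column to use as the row index. Pass a list to build a multi-level index.
usecols	Subset of columns to read, given as a list of column positions or names.
skiprows	Rows to skip: either the first n rows or a list of row positions.
encoding	File encoding.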
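
As a rough sketch of how these parameters combine, the snippet below reads a small in-memory CSV rather than a file. The data, the `;` separator, and the column names are all invented for illustration; `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical input: one junk line, then header-less rows separated by ';'
raw = "exported by some tool\n1;Tom;28;NY\n2;Jack;34;LA\n3;Steve;29;SF\n"

df = pd.read_csv(
    io.StringIO(raw),                      # any file-like object or path works here
    sep=";",                               # non-default separator
    skiprows=1,                            # skip the junk first line
    header=None,                           # the data has no header row
    names=["id", "name", "age", "city"],   # so supply the column names
    usecols=["id", "name", "age"],         # read only these columns
    index_col="id",                        # use "id" as the row index
)
print(df)
```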

import pandas as pd

df = pd.read_csv('nba.csv')
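# Printing a large DataFrame shows only its first 5 and last 5 rows
print(df)
# First n rows (5 by default)
print(df.head())
# Last n rows (5 by default)
print(df.tail())
# Basic information about the table; info() prints directly and returns None
df.info()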

Reference: www.runoob.com, using nba.csv as the example.

A CSV (comma-separated values) file separates its fields with commas:

[Figure: CSV data]
Reading the CSV file with pandas:

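Writing CSV Files with Pandas

df.to_csv()

parameter	desc
path_or_buf	File path (or a file-like object).
sep	Column separator. Defaults to ','.
na_rep	Text written in place of missing values. Defaults to an empty string.
columns	Columns to write to the file.
header	Whether to write the header row. Defaults to True.
index	Whether to write the row index. Defaults to True.
encoding	File encoding.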
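
A minimal sketch of these options, using a made-up three-column frame. When no path is given, `to_csv()` returns the CSV text as a string, which is convenient for checking the result without touching the disk.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Tom", "Jack"],
    "age": [28, np.nan],      # one missing value
    "city": ["NY", "LA"],
})

# No path argument: the CSV text is returned instead of written to a file
text = df.to_csv(
    sep=";",                  # non-default separator
    na_rep="NULL",            # text written for missing values
    columns=["name", "age"],  # write only these columns
    header=True,              # include the header row
    index=False,              # leave out the row index
)
print(text)
```

Passing a path as the first argument, for example `df.to_csv("data.csv", index=False)`, writes the same content to a file.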

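Common Statistical Methods

Series Statistical Functions

func	desc	func	desc
pct_change()	Percentage change	rank()	Ranking
cov()	Covariance	corr()	Correlation coefficient; method can be pearson (default), spearman, or kendall

DataFrame Statistical Functions

func	desc	func	desc
count()	Counts the non-null values	min()	Minimum
sum()	Sum	max()	Maximum
mean()	Mean	np.average()	Weighted average (a NumPy function; DataFrame has no average() method)
median()	Median	prod()	Product of all values
mode()	Mode	cumsum()	Cumulative sum
std()	Standard deviation	cumprod()	Cumulative product
corr()	Correlation coefficients between columns	abs()	Absolute value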
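
To make the Series functions above concrete, here is a small sketch on two invented series, `prices` and `volume`. The Spearman coefficient is computed as a Pearson correlation on ranks, which is its definition; calling `corr(method="spearman")` directly should give the same value but may need SciPy installed.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 120.0])
volume = pd.Series([10.0, 12.0, 9.0, 15.0])

# Change relative to the previous element; the first entry has no predecessor
changes = prices.pct_change()

# Rank of each value, 1 being the smallest
ranks = prices.rank()

# Sample covariance and Pearson correlation between the two series
covariance = prices.cov(volume)
pearson = prices.corr(volume)

# Spearman correlation expressed as Pearson correlation of the ranks
spearman = prices.rank().corr(volume.rank())

print(changes, ranks, covariance, pearson, spearman, sep="\n")
```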

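Aggregation methods take an axis parameter, which can be passed in two ways:

axis=0 or "index" (the default) computes down the rows, in the vertical direction, giving one result per column;
axis=1 or "columns" computes across the columns, in the horizontal direction, giving one result per row.

[Figure: axis diagram]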
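
The two axis values are easy to mix up, so a tiny sketch may help. The frame below is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (same as axis="index"): collapse the rows, one result per column
col_sums = df.sum(axis=0)

# axis=1 (same as axis="columns"): collapse the columns, one result per row
row_sums = df.sum(axis="columns")

print(col_sums)
print(row_sums)
```

Here `col_sums` is indexed by the column names and `row_sums` by the row labels.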

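Merging Data

'''
left/right  : the two DataFrame objects to merge.
on          : column label(s) to join on; must exist in both DataFrames. If no key is given, the intersection of the column names is used.
how         : join type, one of left/right/outer/inner, analogous to SQL joins. Defaults to inner.
left_on     : column name(s) in the left DataFrame to use as join keys.
right_on    : column name(s) in the right DataFrame to use as join keys.
left_index  : use the left DataFrame's row index as the join key. Defaults to False.
right_index : use the right DataFrame's row index as the join key. Defaults to False.
sort        : sort the result by the join keys. Defaults to False, in which case the key order depends on the join type (how).
suffixes    : tuple of strings appended to overlapping column names from the left and right sides. Defaults to ('_x', '_y').
copy        : whether to copy the data; True by default in older pandas releases.
'''
# Signature: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True)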

import pandas as pd 
left = pd.DataFrame({
    'id':[1,2,3,4], 
    'Name': ['AAA', 'BBB', 'CCC', 'DDD'], 
    'Age':[ 20, 30, 40, 50]}) 

right = pd.DataFrame({ 
    'id':[1,2,3,4], 
    'Name': ['AA', 'BB', 'CC', 'DDD'], 
    'Age':[ 20, 30, 40, 50]}) 
print("------left------")
print (left) 
print("------right------")
print (right)
print("------merge------")
print(pd.merge(left, right))
print("------merge_on_single------")
print(pd.merge(left, right, on="id"))
print("------merge_on_multi------")
print(pd.merge(left, right, on=["id", "Age"]))

print("------how------")
print(pd.merge(left, right, on="Age", how="left"))

# output
'''
------left------
   id Name  Age
0   1  AAA   20
1   2  BBB   30
2   3  CCC   40
3   4  DDD   50
------right------
   id Name  Age
0   1   AA   20
1   2   BB   30
2   3   CC   40
3   4  DDD   50
------merge------
   id Name  Age
0   4  DDD   50
------merge_on_single------
   id Name_x  Age_x Name_y  Age_y
0   1    AAA     20     AA     20
1   2    BBB     30     BB     30
2   3    CCC     40     CC     40
3   4    DDD     50    DDD     50
------merge_on_multi------
   id Name_x  Age Name_y
0   1    AAA   20     AA
1   2    BBB   30     BB
2   3    CCC   40     CC
3   4    DDD   50    DDD
------how------
   id_x Name_x  Age  id_y Name_y
0     1    AAA   20     1     AA
1     2    BBB   30     2     BB
2     3    CCC   40     3     CC
3     4    DDD   50     4    DDD
'''
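
The examples above only use `on` and `how`. As a sketch of the remaining key parameters, the snippet below joins two invented frames whose key columns have different names (`emp_id` and `id`), keeps unmatched rows from both sides with an outer join, and sets custom suffixes.

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "Name": ["AAA", "BBB", "CCC"]})
salaries = pd.DataFrame({"id": [2, 3, 4], "Name": ["BB", "CC", "DD"], "pay": [300, 400, 500]})

# Key columns are named differently on each side, so use left_on/right_on.
# how="outer" keeps rows that appear in only one of the frames.
merged = pd.merge(
    employees,
    salaries,
    how="outer",
    left_on="emp_id",
    right_on="id",
    suffixes=("_emp", "_sal"),   # applied to the overlapping "Name" column
)
print(merged)

# Joining on the row index of both frames instead of on columns
by_index = pd.merge(
    employees.set_index("emp_id"),
    salaries.set_index("id"),
    left_index=True,
    right_index=True,
    suffixes=("_emp", "_sal"),
)
print(by_index)
```

Rows without a match on one side are filled with NaN in the outer join, while the index-based call uses the default inner join and keeps only keys 2 and 3.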

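Grouping and Aggregation

Similar to SQL GROUP BY: groups the rows of a DataFrame.

import pandas as pd

# Small illustrative frame; the column names are placeholders
df = pd.DataFrame({"key": ["a", "b", "a"], "key1": ["x", "x", "y"], "key2": [1, 2, 1]})
df.groupby("key")               # group rows by one column
df.groupby(["key1", "key2"])    # group rows by several columns
# groupby(..., axis=1) groups columns instead of rows; it is deprecated in recent pandas releases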

Grouping nba.csv by Team splits the data into 30 teams.

[Figure: groupby result]

Iterating over the grouped data:

[Figure: iterating over groupby groups]
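
Since the screenshots are not reproduced here, the following sketch shows grouping, iteration, and aggregation on a tiny invented frame. The `Team` and `Age` values are placeholders and are not taken from nba.csv.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Celtics", "Celtics", "Nets", "Nets", "Nets"],
    "Age": [25, 27, 22, 30, 29],
})

grouped = df.groupby("Team")
print(grouped.ngroups)            # number of distinct groups

# Iterating yields (group key, sub-DataFrame) pairs
for team, group in grouped:
    print(team, len(group))

# Aggregating one column per group
mean_age = grouped["Age"].mean()
summary = grouped["Age"].agg(["min", "max", "count"])
print(mean_age)
print(summary)
```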

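Data Cleaning

Empty data comes in four forms:

'''
axis     : axis=0 (default) drops rows; axis=1 drops columns.
how      : 'any' (default) drops a row/column if any of its values is missing; 'all' drops it only if all of its values are missing.
thresh   : minimum number of non-null values a row/column needs in order to be kept.
subset   : labels to check for missing values; pass a list of column names to check several columns.
inplace  : if True, modifies the source data in place and returns None.
'''
# Signature: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)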
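
The parameters described above can be compared side by side on a small frame. This is only a sketch with invented data in which the last row is entirely empty.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

any_rows = df.dropna()                # drop rows containing any missing value
all_rows = df.dropna(how="all")       # drop rows where every value is missing
at_least_two = df.dropna(thresh=2)    # keep rows with at least 2 non-null values
subset_a = df.dropna(subset=["a"])    # only look at column "a"
any_cols = df.dropna(axis=1)          # drop columns containing any missing value

print(any_rows, all_rows, at_least_two, subset_a, any_cols, sep="\n\n")
```

Each call returns a new DataFrame; `df` itself is unchanged because `inplace` is left at its default of False.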

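Time Series

A time series is a sequence indexed by time.

Time Series Frequencies

alias	desc	alias	desc	alias	desc	alias	desc
B	Business day	MS	Month start	BQS	Business quarter start	T,min	Minutely
D	Calendar day	SMS	Semi-month start	A	Year end	S	Secondly
W	Weekly	BMS	Business month start	BA	Business year end	L,ms	Milliseconds
M	Month end	Q	Quarter end	BAS	Business year start	U,us	Microseconds
SM	Semi-month end	BQ	Business quarter end	BH	Business hour	N	Nanoseconds
BM	Business month end	QS	Quarter start	H	Hourly	-	-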

Creating a Date Range

import pandas as pd

'''
start : start time
end   : end time
freq  : frequency, "D" (daily) by default
date_range(start, end, freq)
'''
# freq is the step between values; here one value every 30 minutes
print(pd.date_range("9:00", "18:10", freq="30min").time)

# output
'''
[datetime.time(9, 0) datetime.time(9, 30) datetime.time(10, 0)
 datetime.time(10, 30) datetime.time(11, 0) datetime.time(11, 30)
 datetime.time(12, 0) datetime.time(12, 30) datetime.time(13, 0)
 datetime.time(13, 30) datetime.time(14, 0) datetime.time(14, 30)
 datetime.time(15, 0) datetime.time(15, 30) datetime.time(16, 0)
 datetime.time(16, 30) datetime.time(17, 0) datetime.time(17, 30)
 datetime.time(18, 0)]
'''
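
The same function works with other aliases from the frequency table. The sketch below uses `D`, `B`, and `MS` with arbitrary 2020 dates chosen for illustration, and uses `periods` in place of `end` where a fixed number of values is wanted.

```python
import pandas as pd

# Three consecutive calendar days
days = pd.date_range("2020-01-01", periods=3, freq="D")

# Three business days starting on Friday 2020-01-03: the weekend is skipped
bdays = pd.date_range("2020-01-03", periods=3, freq="B")

# First day of each month between the two dates, inclusive
month_starts = pd.date_range("2020-01-01", "2020-04-01", freq="MS")

print(days)
print(bdays)
print(month_starts)
```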

Converting to Timestamps

import pandas as pd

# Note: on pandas >= 2.0, strings in different formats may need format="mixed"
print(pd.to_datetime(pd.Series(['Jun 3, 2020','2020-12-10', None])))

# output
'''
0   2020-06-03
1   2020-12-10
2          NaT
dtype: datetime64[ns]
'''

Creating a Period Range

import pandas as pd

# "Y" means yearly frequency
p = pd.period_range('2015','2022', freq='Y')
print(p)

# output
'''
PeriodIndex(['2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'], dtype='period[A-DEC]')
'''

Core Pandas Operations

Operation	Usage
Read a CSV dataset	`pd.read_csv("csv_file")`
Read an Excel dataset	`pd.read_excel("excel_file")`
Write a DataFrame to a CSV file	`df.to_csv("data.csv", sep=",", index=False)`
Basic information about the dataset	`df.info()`
Basic statistics of the dataset	`print(df.describe())`
Print a DataFrame as a table (third-party tabulate package)	`print(tabulate(print_table, headers=headers))`
List all column names	`df.columns`
Drop missing data	`df.dropna(axis=0, how='any')`
Replace missing data	`df.fillna(value)`
Check for NaN values	`pd.isnull(object)`
Drop a feature (column)	`df.drop('feature_variable_name', axis=1)`
Convert a column to a numeric type	`pd.to_numeric(df["feature_name"], errors='coerce')`
Convert a DataFrame to a NumPy array	`df.to_numpy()`
Take the first n rows of a DataFrame	`df.head(n)`
Select rows by label	`df.loc[feature_name]`
Apply a function to a column	`df["height"].apply(lambda height: 2 * height)`
Rename a column	`df.rename(columns = {df.columns[2]:'size'}, inplace=True)`
Get the unique values of a column	`df["name"].unique()`
Select a sub-DataFrame	`new_df = df[["name", "size"]]`
Summarize the data	`df.sum()/df.min()/df.max()/df.mean()/df.median()`
Sort the data	`df.sort_values(by="size", ascending=False)`
Boolean indexing	`df[df["size"] == 5]`
Select specific values	`df.loc[[0], ['size']]`
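
Several of the one-liners above can be chained into one short, runnable sketch. The frame is invented; its column names (`name`, `height`, `size`) simply mirror the ones used in the table, and `double_height` and `h2` are hypothetical names added for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "a"],
    "height": [1.5, 2.0, 1.0],
    "size": ["5", "7", "x"],     # stored as strings, one of them invalid
})

# Convert to numbers; values that cannot be parsed become NaN
df["size"] = pd.to_numeric(df["size"], errors="coerce")

# Apply a function to a column, then rename the new column
df["double_height"] = df["height"].apply(lambda height: 2 * height)
df = df.rename(columns={"double_height": "h2"})

names = df["name"].unique()                              # unique values of a column
five = df[df["size"] == 5]                               # boolean indexing
ordered = df.sort_values(by="height", ascending=False)   # sort by one column
cell = df.loc[[0], ["size"]]                             # rows and columns by label
arr = df[["height", "h2"]].to_numpy()                    # sub-DataFrame as a NumPy array

print(df.describe())
```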