
Pandas is an open-source [[Python]] library for data analysis. It makes it convenient to process, compute on, analyze, store, and visualize data.

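==Introduction==
===Timeline===
*2008: Developer Wes McKinney began building pandas at AQR Capital Management to meet the need for a high-performance, flexible tool for quantitative analysis of financial data. Before leaving AQR, he persuaded management to let him open-source the library.
*2012: Another AQR employee, Chang She, joined the effort and became the library's second major contributor.
*2015: Pandas became a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.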

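===Installation and import===
Install Pandas with pip:
 pip install pandas
Scientific computing distributions such as Anaconda already ship with pandas, so no separate installation is needed.

Import Pandas at the top of a script. The conventional form is:
 import pandas as pd

Check the Pandas version:
 pd.__version__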

==Data structures==
pandas defines two main data structures, Series and DataFrame, and most operations are performed on these two types.

{{了解更多
|[https://pandas.pydata.org/docs/user_guide/dsintro.html Pandas User Guide: Data structures]
}}
===Series===
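A Series is a one-dimensional array with axis labels (an index) that can hold any data type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the <code>index</code>. Because values are looked up by label, a Series behaves much like a Python dictionary.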

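====Creating a Series====
The basic way to create a Series is to instantiate the [[Pandas/pandas.Series|pandas.Series]] class. Its signature is:
 pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
The <code>index</code> argument is optional. If it is omitted, the labels default to integers starting from 0. Some examples: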
<syntaxhighlight lang="python" >
import pandas as pd

# Default integer index: 0, 1, 2
s = pd.Series(["foo", "bar", "foba"])
print(type(s))   # <class 'pandas.core.series.Series'>

# Explicit string labels
s2 = pd.Series(["foo", "bar", "foba"], index=['b','d','c'])

# Create a date index
date_index = pd.date_range("2020-01-01", periods=3, freq="D")
s3 = pd.Series(["foo", "bar", "foba"], index=date_index)
</syntaxhighlight>

====Series data operations====
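A minimal sketch of a few common operations on a Series: element-wise arithmetic, boolean filtering, reduction, label lookup, and sorting. The values and labels are made up for illustration.

```python
import pandas as pd

s_ops = pd.Series([3, 1, 2], index=["a", "b", "c"])

doubled = s_ops * 2            # element-wise arithmetic
large = s_ops[s_ops > 1]       # boolean filtering keeps labels "a" and "c"
total = s_ops.sum()            # reduction to a single value
by_label = s_ops["b"]          # label lookup, similar to a dictionary key
ordered = s_ops.sort_values()  # sort by value; labels travel with their values

print(doubled.tolist())        # [6, 2, 4]
print(large.index.tolist())    # ['a', 'c']
print(total, by_label)         # 6 1
print(ordered.index.tolist())  # ['b', 'c', 'a']
```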

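===DataFrame===
A DataFrame is a labeled two-dimensional data structure whose columns may have different types. It consists of data, row labels (the index), and column labels (columns). You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object.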

{{了解更多|[https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe Pandas User Guide: DataFrame]}}
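====Creating a DataFrame====
There are several ways to create a DataFrame object:
* Use the <code>pandas.DataFrame()</code> constructor.
* Use the <code>pandas.DataFrame.from_dict()</code> method, which works much like the constructor.
* Use the <code>pandas.DataFrame.from_records()</code> method, which also works much like the constructor.
* Use a function that reads a file, such as <code>pandas.read_csv()</code>, which imports a CSV file and returns a DataFrame.

The signature of the <code>pandas.DataFrame()</code> constructor is: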
 pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
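The four creation routes listed above can be sketched side by side. This is an illustrative example: the sample rows are made up, and an in-memory <code>io.StringIO</code> buffer stands in for a real CSV file.

```python
import io
import pandas as pd

rows = [["foo", 22], ["bar", 25], ["test", 18]]

# 1. Constructor: a list of rows plus explicit column labels.
df1 = pd.DataFrame(rows, columns=["name", "age"])

# 2. from_dict: dictionary keys become column labels by default.
df2 = pd.DataFrame.from_dict({"name": ["foo", "bar", "test"], "age": [22, 25, 18]})

# 3. from_records: a sequence of tuples treated as records.
df3 = pd.DataFrame.from_records([tuple(r) for r in rows], columns=["name", "age"])

# 4. read_csv: StringIO stands in for a CSV file on disk.
csv_text = "name,age\nfoo,22\nbar,25\ntest,18\n"
df4 = pd.read_csv(io.StringIO(csv_text))

for frame in (df1, df2, df3, df4):
    print(frame.shape, list(frame.columns))
```

All four calls build a table with the same three rows and two columns, so the choice mostly depends on the shape the input data already has.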


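===Attributes and methods===
The tables below list the attributes and methods of Series and DataFrame, grouped by purpose.

In the table examples, s is a Series object and df is a DataFrame object: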
<syntaxhighlight lang="python" >
>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object

>>> df = pd.DataFrame([['foo', 22], ['bar', 25], ['test', 18]],columns=['name', 'age'])
>>> df
   name  age
0   foo   22
1   bar   25
2  test   18
</syntaxhighlight>

{{了解更多
|[https://pandas.pydata.org/docs/reference/frame.html  Pandas API: DataFrame]
|[https://pandas.pydata.org/docs/reference/series.html Pandas API: Series]}}
====Constructors====
{| class="wikitable" 
|-
!Method
!Description
!Series
!DataFrame
!Example
|-
|Constructor
|Creates a Series or DataFrame object
|pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
|pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
|<code>s = pd.Series(["a", "b", "c"])</code>  <br \><br \><code>df = pd.DataFrame([['foo', 22], ['bar', 25], ['test', 18]],columns=['name', 'age'])</code>
|-
|}

====Attributes and basic information====
{| class="wikitable" 
|-
!Attribute/method
!Description
!Series
!DataFrame
!Example
|-
| index
| The index (row labels)
|Series.index
|DataFrame.index
| <code>s.index</code> returns RangeIndex(start=0, stop=3, step=1) <br \> <code>df.index</code>
|-
| columns
| The column labels; not available on Series
| &minus;
|DataFrame.columns
| <code>df.columns</code>
|-
| axes
| Returns a list of the axis labels (row labels and column labels).<br \>Series returns [index] <br \>DataFrame returns [index, columns]
| Series.axes
| DataFrame.axes
| <code>s.axes</code> returns [RangeIndex(start=0, stop=3, step=1)]
|-
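| dtypes
| Returns the data type of the data. A Series returns a single dtype object; a DataFrame returns a Series holding the dtype of each column.
|Series.dtypes
|DataFrame.dtypes
| <code>s.dtypes</code><br \> <code>df.dtypes</code>
|-
| dtype
| Returns the NumPy data type (dtype object) of the data
| Series.dtype
| &minus;
| <code>s.dtype</code>
|-
| array
| Returns the array backing the Series or Index data. The array is a pandas ExtensionArray.
| Series.array
| &minus;
| <code>s.array</code> <br \>returns: <PandasArray><br \>['a', 'b', 'c']<br \>Length: 3, dtype: object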
|-
| attrs
| Dictionary of global attributes of this object.
| Series.attrs
| DataFrame.attrs
| <code>s.attrs</code> returns {}
|-
| hasnans
| Returns True if there are any missing values (such as Python's None or np.nan), otherwise False.
| Series.hasnans
| &minus;
| <code>s.hasnans</code> <br \>returns False
|-
| values
| Returns an ndarray (NumPy's multidimensional array) or an ndarray-like object.
| Series.values
| DataFrame.values
| <code>s.values</code> returns array(['a', 'b', 'c'], dtype=object)
|-
| ndim
| Returns the number of dimensions of the data: 1 for a Series, 2 for a DataFrame
| Series.ndim
| DataFrame.ndim
| <code>s.ndim</code> returns 1 <br \><code>df.ndim</code> returns 2
|-
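| size
| Returns the number of elements in the data
| Series.size
| DataFrame.size
| <code>s.size</code> returns 3 <br \><code>df.size</code> returns 6
|-
| shape
| Returns a tuple describing the shape of the data (number of rows and number of columns)
| Series.shape
| DataFrame.shape
| <code>s.shape</code> returns (3,) <br \><code>df.shape</code> returns (3, 2)
|-
| empty
| Returns whether the object is empty; True if it is
| Series.empty
| DataFrame.empty
| <code>s.empty</code> returns False <br \><code>df.empty</code> returns False 
|-
| name
| Returns the name of the Series.
| Series.name
| &minus;
| <code>s.name</code> returns None when no name has been set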
|-
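| memory_usage()
| Returns the memory usage of the Series or DataFrame in bytes. The index parameter defaults to True, which includes the index in the result.<br \>The deep parameter defaults to False; when set to True, object dtypes are introspected to measure system-level memory consumption.
| Series.memory_usage(index=True, deep=False)
| DataFrame.memory_usage(index=True, deep=False)
| <code>s.memory_usage()</code> returns an integer such as 152 (the exact value depends on the pandas version and platform) <br \><code>df.memory_usage(index=False)</code>
|-
| info()
| Prints a concise summary of the DataFrame.
| &minus;
| DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)
| <code>df.info()</code>
|-
| select_dtypes()
| Returns the subset of the DataFrame's columns whose dtypes match the given criteria
| &minus;
| DataFrame.select_dtypes(include=None, exclude=None)
| <code>df.select_dtypes(include=['float64'])</code>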
|-
|}
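As a quick check of several attributes from the table above, the following sketch rebuilds the s and df objects used in the table examples and prints a few of their properties. The memory figures are printed without an expected value because they vary by pandas version and platform.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])
df = pd.DataFrame([["foo", 22], ["bar", 25], ["test", 18]], columns=["name", "age"])

print(s.shape, s.size, s.ndim)      # (3,) 3 1
print(df.shape, df.size, df.ndim)   # (3, 2) 6 2
print(list(df.columns))             # ['name', 'age']
print(s.hasnans, df.empty)          # False False

# select_dtypes keeps only the columns whose dtype matches.
numeric = df.select_dtypes(include=["number"])
print(list(numeric.columns))        # ['age']

# memory_usage returns bytes per column for a DataFrame.
print(df.memory_usage(index=False))
```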

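====Selection, index labels, and iteration====
{| class="wikitable"
|-
!Attribute/method
!Description
!Series
!DataFrame
!Example
|-
| head()
| Returns the first n rows; the default is 5
| Series.head(n=5)
| DataFrame.head(n=5)
| <code>df.head()</code> returns the first 5 rows of df<br \><code>df.head(10)</code> returns the first 10 rows of df.
|-
| tail()
| Returns the last n rows; the default is 5
| Series.tail(n=5)
| DataFrame.tail(n=5)
| <code>df.tail()</code> returns the last 5 rows of df<br \><code>df.tail(10)</code> returns the last 10 rows of df.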
|-
| at
| Gets or sets a single value by a row label and column label pair.
| Series.at
| DataFrame.at
| <code>s.at[1]</code> returns 'b'<br \><code>s.at[2]='d'</code> sets the value at index label 2 to 'd' <br \><code>df.at[2, 'name']</code> gets the value at index=2, columns='name'
|-
| iat
| Gets or sets a single value by integer row and column position.
| Series.iat
| DataFrame.iat
| <code>s.iat[1]</code><br \><code>s.iat[2]='d'</code>
|-
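| loc
| Accesses a group of rows and columns by label or by a boolean array.
| [https://pandas.pydata.org/docs/reference/api/pandas.Series.loc.html#pandas.Series.loc Series.loc]
| [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc DataFrame.loc]
|<code>df.loc[2]</code> selects the row whose index (row label) is 2 <br \><code>df.loc[1:2]</code> selects the rows with index values 1 through 2 (both endpoints included) <br \><code><nowiki>df.loc[[1,2]]</nowiki></code> selects the rows with index values 1 and 2 <br \><code>df.loc[1,'name']</code> selects the single value at row label 1, column label 'name'<br \><code>df.loc[1:2,'name']</code> selects the data at row labels 1 through 2, column label 'name'
|-
| iloc
| Accesses a group of rows and columns by integer position or by a boolean array.
| Series.iloc
| DataFrame.iloc
|<code>s.iloc[2]</code> selects the value at position 2 <br \><code>s.iloc[:2]</code> selects the values at positions 0 up to but not including 2 <br \><code><nowiki>s.iloc[[True,False,True]]</nowiki></code> selects the values at positions marked True <br \><code>s.iloc[lambda x: x.index % 2 == 0]</code> selects the values whose index label is even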
|-
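| insert
| Inserts a column at the specified position.
| &minus;
| DataFrame.insert(loc, column, value, allow_duplicates=False)
|
|-
| __iter__()
| Series: returns an iterator over the values <br \>DataFrame: returns an iterator over the info axis (the column labels)
| Series.__iter__()
| DataFrame.__iter__()
| <code>s.__iter__()</code>
|-
| items()
| Series: iterates over the data, returning an iterator of (index, value) pairs <br \>DataFrame: iterates column by column, returning an iterator of (column label, column Series) pairs.
| Series.items()
| DataFrame.items()
| <code>s.items()</code> <br \> <code>df.items()</code> <br \> <code>for label, content in df.items():</code>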
|-
|}
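The difference between label-based and position-based access in the table above is easiest to see side by side. A minimal sketch follows, using the same df as the table examples; the "city" column and its values are hypothetical and exist only to show insert().

```python
import pandas as pd

df = pd.DataFrame([["foo", 22], ["bar", 25], ["test", 18]], columns=["name", "age"])

print(df.at[2, "name"])                         # test
print(df.iat[0, 1])                             # 22
print(df.loc[1:2, "name"].tolist())             # ['bar', 'test']  label slice includes 2
print(df.iloc[:2, 0].tolist())                  # ['foo', 'bar']   position slice excludes 2
print(df.loc[df["age"] > 20, "name"].tolist())  # ['foo', 'bar']   boolean mask

# insert modifies df in place; "city" is a made-up column for illustration.
df.insert(1, "city", ["X", "Y", "Z"])
print(list(df.columns))                         # ['name', 'city', 'age']

# items() yields (column label, column Series) pairs.
for label, column in df.items():
    print(label, column.tolist())
```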
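====Computation and descriptive statistics====
{| class="wikitable"
|-
!Attribute/method
!Description
!Series
!DataFrame
!Example
|-
| abs()
| Returns the absolute value of each element of the Series/DataFrame. All elements must be numeric.
| Series.abs()
| DataFrame.abs()
| <code>s.abs()</code> <br \> <code>df.abs()</code>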
|-
|}
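Because abs() only applies to numeric data, it cannot be tried on the string-valued s from the earlier examples. A small sketch with made-up numeric data (s_num and df_num are illustrative names):

```python
import pandas as pd

s_num = pd.Series([-1.5, 2.0, -3.0])
df_num = pd.DataFrame({"x": [-1, 2], "y": [3, -4]})

print(s_num.abs().tolist())          # [1.5, 2.0, 3.0]
print(df_num.abs().values.tolist())  # [[1, 3], [2, 4]]
```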

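==Plotting with Pandas==
pandas plotting is built on [[Matplotlib]]. Both DataFrame and Series come with a plot method that generates many kinds of charts quickly and conveniently.

{{了解更多
|[https://pandas.pydata.org/docs/user_guide/visualization.html pandas documentation: User Guide - Visualization]
}}
===Basic charts===
====Line charts====
The plot method draws a line chart by default. For example, if prices is a DataFrame with a closing-price column named close, the closing prices can be drawn as a line chart: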
<syntaxhighlight lang="python" >
s = prices['close']
s.plot() 

# Set the figure size with the figsize parameter
s.plot(figsize=(20,10)) 
</syntaxhighlight>

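====Bar charts====
For data with discrete labels rather than a time series, a bar chart can be drawn in either of two ways:
*Call plot() with the kind parameter set to 'bar' or 'barh'.
*Call plot.bar() or plot.barh().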

<syntaxhighlight lang="python" >
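df.plot(kind='bar')    # assume df holds daily stock data
df.plot.bar()          
# Resample to the yearly mean volume and draw a bar chart (volume is the trading-volume column of df).
# This requires a DatetimeIndex; newer pandas versions spell the 'A-DEC' alias as 'YE-DEC'.
df.resample('A-DEC').mean().volume.plot(kind='bar')

df.plot.bar(stacked=True)    # stacked=True draws a stacked bar chart
df.plot.barh(stacked=True)   # barh draws horizontal bars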
</syntaxhighlight>
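====Histograms====
Histograms are drawn with the plot.hist() method. They usually show a frequency distribution: the x-axis is divided into bins and the y-axis shows the count in each bin. The number of bins is controlled by the bins parameter, for example bins=20 for 20 bins.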
<syntaxhighlight lang="python" >
df.volume.plot.hist()    # frequency histogram of the volume column in the stock data df
df.plot.hist(alpha=0.5)    # alpha=0.5 sets the bar transparency to 0.5
df.plot.hist(stacked=True, bins=20)    # stacked=True stacks the columns; bins=20 uses 20 bins
df.plot.hist(orientation='horizontal')    # orientation='horizontal' draws a horizontal histogram
df.plot.hist(cumulative=True)    # draws a cumulative histogram

df['close'].diff().hist()    # apply diff to the closing price, then draw a histogram
df.hist(color='k', bins=50)     # DataFrame.hist draws each column on its own subplot
</syntaxhighlight>

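====Box plots====
Box plots can be drawn with plot.box() or with DataFrame's boxplot().
Parameters:
*color sets the colors through a color dictionary, for example color={'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue', 'caps': 'Gray'}
*sym sets the style of the outlier markers; for example, sym='r+' draws outliers as red plus signs.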
<syntaxhighlight lang="python" >
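df.plot.box()
df[['close','open', 'high']].plot.box()
# Change the box colors by passing a color dictionary
color={'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue', 'caps': 'Gray'}
df.plot.box(color=color, sym='r+')    # sym sets the outlier marker style; 'r+' means red plus signs
df.plot.box(positions=[1, 4, 5, 6, 8])    # positions sets where each box is drawn: with 5 columns in df, the first sits at x=1, the second at x=4, and so on
df.plot.box(vert=False)    # draw a horizontal box plot
df.boxplot()   

# Stratified box plots: the by keyword creates groups, and each column is then drawn per group.
# In the example below, every column is drawn separately for groups A and B.
import numpy as np
df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])
df['x'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
df.boxplot(by='x')

# A second grouping column can be passed to split the groups further. For example:
df['y'] = pd.Series(['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'])
df.boxplot(column=['Col1', 'Col2'], by=['x', 'y'])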
</syntaxhighlight>

====Scatter plots====
Scatter plots are drawn with the DataFrame.plot.scatter() method. The x and y parameters name the columns used for the x-axis and y-axis.
<syntaxhighlight lang="python" >
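df.plot.scatter(x='close', y='volume')    # if df holds daily stock data, this plots closing price against volume

# Draw two groups of points on one chart by passing the returned axes back in through the ax parameter
ax = df.plot.scatter(x='close', y='volume', color='DarkBlue', label='Group 1')    # label sets the legend label
df.plot.scatter(x='open', y='value', color='DarkGreen', label='Group 2', ax=ax)

# The c parameter colors the points on a gradient according to the volume column.
df.plot.scatter(x='close', y='open', c='volume', s=50)    # s sets the marker size
df.plot.scatter(x='close', y='open', s=df['volume']/50000)  # the marker size can also be scaled from a column's values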
</syntaxhighlight>

====Pie charts====
Pie charts are drawn with DataFrame.plot.pie() or Series.plot.pie(). If the data contains missing values, they are automatically filled with 0.
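A minimal sketch of Series.plot.pie(), including one missing value to show the fill-with-0 behavior described above. The labels and numbers are made up, and the Agg backend is selected only so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd

shares = pd.Series([40, 35, np.nan, 25], index=["A", "B", "C", "D"], name="share")

# The missing value for "C" is treated as 0 and drawn as a zero-width wedge.
ax = shares.plot.pie(autopct="%.0f%%", figsize=(5, 5))
```

Calling <code>ax.figure.savefig("pie.png")</code> afterwards would write the chart to a file.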

===Other plotting functions===
These plotting functions come from the [https://pandas.pydata.org/pandas-docs/stable/reference/plotting.html pandas.plotting] module.

====Scatter matrix plot====
A scatter matrix plot is drawn with the scatter_matrix() function.
<syntaxhighlight lang="python" >
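from pandas.plotting import scatter_matrix     # the function must be imported from the module before use
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')    # assuming df holds daily stock data, draws a scatter plot of every column against every other column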
</syntaxhighlight>

====Density plot====
Density plots are drawn with Series.plot.kde() and DataFrame.plot.kde().
 df.plot.kde()

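====Andrews curves====
Andrews curves are drawn with the andrews_curves() function, which draws one curve per row and colors the curves by class.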
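A minimal sketch of andrews_curves() from pandas.plotting. The feature columns f1 to f3 and the "group" class column are random, made-up data; the function needs one column that names the class of each row.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 3)), columns=["f1", "f2", "f3"])
data["group"] = ["A"] * 10 + ["B"] * 10

# One curve is drawn per row, colored by the value in the "group" column.
ax = andrews_curves(data, "group", samples=50)
```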
 
====Parallel coordinates====
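A minimal sketch of parallel_coordinates() from pandas.plotting. Each row becomes one line that crosses a vertical axis per feature column. The data and the "group" class column are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(20, 3)), columns=["f1", "f2", "f3"])
data["group"] = ["A"] * 10 + ["B"] * 10

# The second argument names the column that holds the class of each row.
ax = parallel_coordinates(data, "group")
```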

====Lag plot====
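A lag plot draws each value of a series against the value that follows it a fixed number of steps later, which helps to check whether the series is random. A minimal sketch with lag_plot() from pandas.plotting, using a made-up random walk:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import lag_plot

rng = np.random.default_rng(2)
walk = pd.Series(np.cumsum(rng.normal(size=200)))

# Scatter of y(t) against y(t + 1); a random walk shows a clear diagonal pattern.
ax = lag_plot(walk, lag=1)
```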

====Autocorrelation plot====
Autocorrelation plots are drawn with the autocorrelation_plot() function.
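A minimal sketch of autocorrelation_plot() from pandas.plotting, applied to a made-up sine wave so that the periodic structure shows up as oscillating autocorrelation values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot

wave = pd.Series(np.sin(np.linspace(0, 20 * np.pi, 500)))

# The x-axis is the lag; the horizontal lines mark confidence bands.
ax = autocorrelation_plot(wave)
```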

====Bootstrap plot====

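===Plot formatting===
====Preset plot styles====
Since version 1.5, matplotlib supports preset styles, selected before plotting with matplotlib.style.use(my_plot_style). For example, matplotlib.style.use('ggplot') produces ggplot-style plots.
====Style parameters====
Most plotting functions accept a set of parameters for setting colors.

====Legend====
The legend can be hidden by setting the legend parameter to False, for example: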
 df.plot(legend=False)

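====Scales====
*The logy parameter puts the y-axis on a logarithmic scale.
*The logx parameter puts the x-axis on a logarithmic scale.
*The loglog parameter puts both the x-axis and the y-axis on a logarithmic scale.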
 ts.plot(logy=True)

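====Secondary y-axis====
When two series share the same x-axis but have different y-axis scales, pass secondary_y=True when plotting the second series to draw it against a second y-axis.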
<syntaxhighlight lang="python" >
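# For example, to show the cci indicator on top of the closing-price chart:
prices['close'].plot()
prices['cci'].plot(secondary_y=True)

# To put several columns on the second axis, pass the column names directly
ax = df.plot(secondary_y=['cci', 'RSI'], mark_right=False)    # legend entries for right-axis columns get a "(right)" suffix by default; mark_right=False turns it off
ax.set_ylabel('CD scale')     # set the label of the left y-axis
ax.right_ax.set_ylabel('AB scale')    # set the label of the right y-axis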
</syntaxhighlight>

====Subplots====
Each column of a DataFrame can be drawn on its own axes by setting the subplots parameter, for example:
 df.plot(subplots=True, figsize=(6, 6))

====Subplot layout====
Subplot layout is set with the layout keyword, given as a (rows, columns) tuple.
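A minimal sketch of the layout keyword: four made-up columns are drawn on a 2 x 2 grid of subplots. The Agg backend is selected only so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
grid_df = pd.DataFrame(rng.normal(size=(50, 4)).cumsum(axis=0), columns=list("ABCD"))

# layout=(rows, columns); the grid must have room for every column of the DataFrame.
axes = grid_df.plot(subplots=True, layout=(2, 2), figsize=(6, 6), sharex=True)
print(axes.shape)  # (2, 2)
```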
==Resources==
===Official sites===
*[https://pandas.pydata.org/ Pandas website]
*[https://pandas.pydata.org/docs/ Pandas documentation]
*[https://pandas.pydata.org/docs/user_guide/10min.html Pandas User Guide - 10 minutes to pandas]
*[https://pandas.pydata.org/docs/user_guide/index.html Pandas User Guide]
*[https://pandas.pydata.org/docs/reference/index.html Pandas API reference]
*[https://github.com/pandas-dev/pandas Pandas on GitHub]

===Related sites===
*[https://quant.itiger.com/tquant/research/hub/classroom/detail?nid=4 Tiger Quant (老虎量化): Introduction to pandas]
*[https://www.pypandas.cn/docs/ pypandas.cn: Pandas documentation]
*[https://www.yiibai.com/pandas Yiibai (易百教程): Pandas]

===Books===
''Python for Data Analysis, 2nd Edition'' - Wes McKinney

==References==
*[https://zh.wikipedia.org/wiki/Pandas Wikipedia: Pandas (Chinese)]
*[https://en.wikipedia.org/wiki/Pandas_(software) Wikipedia: Pandas (English)]

[[分类:数据分析]]
[[分类:数据可视化]]