Pandas
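Pandas is an open-source Python library for data analysis that makes it easy to process, compute, analyze, store, and visualize data.
Introduction
Timeline
- In 2008, developer Wes McKinney began building pandas at AQR Capital Management to meet the need for a high-performance, flexible tool for quantitative analysis of financial data. Before leaving AQR, he persuaded management to let him open-source the library.
- In 2012, another AQR employee, Chang She, joined the effort and became the library's second major contributor.
- In 2015, pandas became a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.
Installation and import
Install pandas with pip: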
pip install pandas
If you use a scientific computing distribution such as Anaconda, pandas is already installed.
Import pandas at the top of your script. The usual convention is:
import pandas as pd
Check the pandas version:
pd.__version__
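Data structures
pandas defines two main data structures, Series and DataFrame, and most operations work on these two.
Learn more >> pandas User Guide: Data structures
Series
A Series is a one-dimensional array with axis labels (an index) that can hold any data type (integers, strings, floats, Python objects, and so on). The axis labels are collectively called the index. In this respect a Series resembles a Python dictionary.
Creating a Series
The basic way to create a Series is to instantiate the pandas.Series class, whose signature is:
pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
The index is optional. If it is omitted, the axis labels default to an integer range starting at 0. Some examples: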
s = pd.Series(["foo", "bar", "foba"])
print(type(s)) #<class 'pandas.core.series.Series'>
s2 = pd.Series(["foo", "bar", "foba"], index=['b','d','c'])
# create a date index
date_index = pd.date_range("2020-01-01", periods=3, freq="D")
s3 = pd.Series(["foo", "bar", "foba"], index=date_index)
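The dictionary comparison above can be made concrete with a minimal sketch. The labels and values are illustrative only and not part of the original examples; the sketch shows label lookup, membership tests on the index, construction from a dict, and the default integer index.

```python
import pandas as pd

# Illustrative data: a Series with explicit string labels.
s2 = pd.Series(["foo", "bar", "foba"], index=["b", "d", "c"])

first = s2["b"]      # look up a value by its label, much like a dict key
has_d = "d" in s2    # membership tests check the index labels, not the values

# A dict can be passed as data; its keys become the index.
from_dict = pd.Series({"x": 1, "y": 2})

# Without an index argument the labels default to 0, 1, 2, ...
default_labels = list(pd.Series(["foo", "bar"]).index)

print(first, has_d, list(from_dict.index), default_labels)
```

Unlike a dict, a Series keeps its values in order and also supports positional access, which the attribute table further down covers.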
Series data operations
Series attributes
In the examples in the table below, s is the following Series object:
>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0 a
1 b
2 c
dtype: object
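Attribute | Description | Example | Result |
---|---|---|---|
T | Returns the transpose; by definition, the transpose of a Series is itself. | s.T | the Series itself |
array | Returns the array backing the data of this Series or Index; it is a pandas extension array. | s.array | <PandasArray> ['a', 'b', 'c'] Length: 3, dtype: object |
at | Gets or sets a single value by label (a row/column label pair for a DataFrame). | s.at[1] s.at[2]='d' | 'b' |
attrs | Dictionary of global attributes of this object. | s.attrs | {} |
axes | Returns a list of the row axis labels. | s.axes | [RangeIndex(start=0, stop=3, step=1)] |
dtype | Returns the NumPy dtype of the underlying data. | s.dtype | dtype('O') |
dtypes | Returns the NumPy dtype of the underlying data. | s.dtypes | dtype('O') |
hasnans | Returns True if there are any missing values (such as Python None or np.nan), otherwise False. | s2 = pd.Series(['a', None, 'c']) s2.hasnans | True |
iat | Gets or sets a single value by integer position (a row/column position pair for a DataFrame). | s.iat[1] s.iat[2]='d' | 'b' |
iloc | Gets or sets values by integer position on the index (row axis). | 1. s.iloc[2] 2. s.iloc[:2] 3. s.iloc[[True,False,True]] 4. s.iloc[lambda x: x.index % 2 == 0] | 1. 'c' 2. the values at positions 0 up to, but not including, 2 3. the values at positions marked True 4. the values whose index is even |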
index | The index (axis labels) of the Series. | ||
is_monotonic | Return boolean if values in the object are monotonic_increasing. | ||
is_monotonic_decreasing | Return boolean if values in the object are monotonic_decreasing. | ||
is_monotonic_increasing | Alias for is_monotonic. | ||
is_unique | Return boolean if values in the object are unique. | ||
loc | Access a group of rows and columns by label(s) or a boolean array. | ||
name | Return the name of the Series. | ||
nbytes | Return the number of bytes in the underlying data. | ||
ndim | Number of dimensions of the underlying data, by definition 1. | ||
shape | Return a tuple of the shape of the underlying data. | ||
size | Return the number of elements in the underlying data. | ||
values | Return Series as ndarray or ndarray-like depending on the dtype. |
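As a rough illustration of the accessors in the table, the following sketch runs them against the same three-element Series. It assumes the default integer index, so labels and positions happen to coincide here.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

by_label = s.at[1]       # single value by label
by_position = s.iat[1]   # single value by integer position
third = s.iloc[2]        # position 2 is the last element
first_two = s.iloc[:2].tolist()                  # positions 0 and 1
masked = s.iloc[[True, False, True]].tolist()    # boolean mask by position
even = s.iloc[lambda x: x.index % 2 == 0].tolist()  # callable returning a mask

has_missing = pd.Series(["a", None, "c"]).hasnans   # None counts as missing

s.at[2] = "d"            # at can also assign a single value
print(by_label, by_position, third, first_two, masked, even, has_missing, s.tolist())
```

With a non-default index (for example string labels), at and loc take the labels while iat and iloc keep using positions.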
Learn more >> pandas API: pandas.Series
Series methods
Method | Description | Example | Result |
---|---|---|---|
abs() | Return a Series/DataFrame with the absolute value of each element. | s.abs() | |
add(other[, level, fill_value, axis]) | Return Addition of series and other, element-wise (binary operator add). | ||
add_prefix(prefix) | Prefix labels with string prefix. | ||
add_suffix(suffix) | Suffix labels with string suffix. | ||
agg([func, axis]) | Aggregate using one or more operations over the specified axis. | ||
aggregate([func, axis]) | Aggregate using one or more operations over the specified axis. | ||
align(other[, join, axis, level, copy, …]) | Align two objects on their axes with the specified join method. | ||
all([axis, bool_only, skipna, level]) | Return whether all elements are True, potentially over an axis. | ||
any([axis, bool_only, skipna, level]) | Return whether any element is True, potentially over an axis. | ||
append(to_append[, ignore_index, …]) | Concatenate two or more Series. | ||
apply(func[, convert_dtype, args]) | Invoke function on values of Series. | ||
argmax([axis, skipna]) | Return int position of the largest value in the Series. | ||
argmin([axis, skipna]) | Return int position of the smallest value in the Series. | ||
argsort([axis, kind, order]) | Return the integer indices that would sort the Series values. | ||
asfreq(freq[, method, how, normalize, …]) | Convert TimeSeries to specified frequency. | ||
asof(where[, subset]) | Return the last row(s) without any NaNs before where. | ||
astype(dtype[, copy, errors]) | Cast a pandas object to a specified dtype dtype. | ||
at_time(time[, asof, axis]) | Select values at particular time of day (e.g., 9:30AM). | ||
autocorr([lag]) | Compute the lag-N autocorrelation. | ||
backfill([axis, inplace, limit, downcast]) | Synonym for DataFrame.fillna() with method='bfill'. | ||
between(left, right[, inclusive]) | Return boolean Series equivalent to left <= series <= right. | ||
between_time(start_time, end_time[, …]) | Select values between particular times of the day (e.g., 9:00-9:30 AM). | ||
bfill([axis, inplace, limit, downcast]) | Synonym for DataFrame.fillna() with method='bfill'. | ||
bool() | Return the bool of a single element Series or DataFrame. | ||
cat | alias of pandas.core.arrays.categorical.CategoricalAccessor | ||
clip([lower, upper, axis, inplace]) | Trim values at input threshold(s). | ||
combine(other, func[, fill_value]) | Combine the Series with a Series or scalar according to func. | ||
combine_first(other) | Combine Series values, choosing the calling Series’s values first. | ||
compare(other[, align_axis, keep_shape, …]) | Compare to another Series and show the differences. | ||
convert_dtypes([infer_objects, …]) | Convert columns to best possible dtypes using dtypes supporting pd.NA. | ||
copy([deep]) | Make a copy of this object’s indices and data. | ||
corr(other[, method, min_periods]) | Compute correlation with other Series, excluding missing values. | ||
count([level]) | Return number of non-NA/null observations in the Series. | ||
cov(other[, min_periods, ddof]) | Compute covariance with Series, excluding missing values. | ||
cummax([axis, skipna]) | Return cumulative maximum over a DataFrame or Series axis. | ||
cummin([axis, skipna]) | Return cumulative minimum over a DataFrame or Series axis. | ||
cumprod([axis, skipna]) | Return cumulative product over a DataFrame or Series axis. | ||
cumsum([axis, skipna]) | Return cumulative sum over a DataFrame or Series axis. | ||
describe([percentiles, include, exclude, …]) | Generate descriptive statistics. | ||
diff([periods]) | First discrete difference of element. | ||
div(other[, level, fill_value, axis]) | Return Floating division of series and other, element-wise (binary operator truediv). | ||
divide(other[, level, fill_value, axis]) | Return Floating division of series and other, element-wise (binary operator truediv). | ||
divmod(other[, level, fill_value, axis]) | Return Integer division and modulo of series and other, element-wise (binary operator divmod). | ||
dot(other) | Compute the dot product between the Series and the columns of other. | ||
drop([labels, axis, index, columns, level, …]) | Return Series with specified index labels removed. | ||
drop_duplicates([keep, inplace]) | Return Series with duplicate values removed. | ||
droplevel(level[, axis]) | Return DataFrame with requested index / column level(s) removed. | ||
dropna([axis, inplace, how]) | Return a new Series with missing values removed. | ||
dt | alias of pandas.core.indexes.accessors.CombinedDatetimelikeProperties | ||
duplicated([keep]) | Indicate duplicate Series values. | ||
eq(other[, level, fill_value, axis]) | Return Equal to of series and other, element-wise (binary operator eq). | ||
equals(other) | Test whether two objects contain the same elements. | ||
ewm([com, span, halflife, alpha, …]) | Provide exponential weighted (EW) functions. | ||
expanding([min_periods, center, axis]) | Provide expanding transformations. | ||
explode([ignore_index]) | Transform each element of a list-like to a row. | ||
factorize([sort, na_sentinel]) | Encode the object as an enumerated type or categorical variable. | ||
ffill([axis, inplace, limit, downcast]) | Synonym for DataFrame.fillna() with method='ffill'. | ||
fillna([value, method, axis, inplace, …]) | Fill NA/NaN values using the specified method. | ||
filter([items, like, regex, axis]) | Subset the dataframe rows or columns according to the specified index labels. | ||
first(offset) | Select initial periods of time series data based on a date offset. | ||
first_valid_index() | Return index for first non-NA/null value. | ||
floordiv(other[, level, fill_value, axis]) | Return Integer division of series and other, element-wise (binary operator floordiv). | ||
ge(other[, level, fill_value, axis]) | Return Greater than or equal to of series and other, element-wise (binary operator ge). | ||
get(key[, default]) | Get item from object for given key (ex: DataFrame column). | ||
groupby([by, axis, level, as_index, sort, …]) | Group Series using a mapper or by a Series of columns. | ||
gt(other[, level, fill_value, axis]) | Return Greater than of series and other, element-wise (binary operator gt). | ||
head([n]) | Return the first n rows. | ||
hist([by, ax, grid, xlabelsize, xrot, …]) | Draw histogram of the input series using matplotlib. | ||
idxmax([axis, skipna]) | Return the row label of the maximum value. | ||
idxmin([axis, skipna]) | Return the row label of the minimum value. | ||
infer_objects() | Attempt to infer better dtypes for object columns. | ||
interpolate([method, axis, limit, inplace, …]) | Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex. | ||
isin(values) | Whether elements in Series are contained in values. | ||
isna() | Detect missing values. | ||
isnull() | Detect missing values. | ||
item() | Return the first element of the underlying data as a python scalar. | ||
items() | Lazily iterate over (index, value) tuples. | ||
iteritems() | Lazily iterate over (index, value) tuples. | ||
keys() | Return alias for index. | ||
kurt([axis, skipna, level, numeric_only]) | Return unbiased kurtosis over requested axis. | ||
kurtosis([axis, skipna, level, numeric_only]) | Return unbiased kurtosis over requested axis. | ||
last(offset) | Select final periods of time series data based on a date offset. | ||
last_valid_index() | Return index for last non-NA/null value. | ||
le(other[, level, fill_value, axis]) | Return Less than or equal to of series and other, element-wise (binary operator le). | ||
lt(other[, level, fill_value, axis]) | Return Less than of series and other, element-wise (binary operator lt). | ||
mad([axis, skipna, level]) | Return the mean absolute deviation of the values for the requested axis. | ||
map(arg[, na_action]) | Map values of Series according to input correspondence. | ||
mask(cond[, other, inplace, axis, level, …]) | Replace values where the condition is True. | ||
max([axis, skipna, level, numeric_only]) | Return the maximum of the values for the requested axis. | ||
mean([axis, skipna, level, numeric_only]) | Return the mean of the values for the requested axis. | ||
median([axis, skipna, level, numeric_only]) | Return the median of the values for the requested axis. | ||
memory_usage([index, deep]) | Return the memory usage of the Series. | ||
min([axis, skipna, level, numeric_only]) | Return the minimum of the values for the requested axis. | ||
mod(other[, level, fill_value, axis]) | Return Modulo of series and other, element-wise (binary operator mod). | ||
mode([dropna]) | Return the mode(s) of the dataset. | ||
mul(other[, level, fill_value, axis]) | Return Multiplication of series and other, element-wise (binary operator mul). | ||
multiply(other[, level, fill_value, axis]) | Return Multiplication of series and other, element-wise (binary operator mul). | ||
ne(other[, level, fill_value, axis]) | Return Not equal to of series and other, element-wise (binary operator ne). | ||
nlargest([n, keep]) | Return the largest n elements. | ||
notna() | Detect existing (non-missing) values. | ||
notnull() | Detect existing (non-missing) values. | ||
nsmallest([n, keep]) | Return the smallest n elements. | ||
nunique([dropna]) | Return number of unique elements in the object. | ||
pad([axis, inplace, limit, downcast]) | Synonym for DataFrame.fillna() with method='ffill'. | ||
pct_change([periods, fill_method, limit, freq]) | Percentage change between the current and a prior element. | ||
pipe(func, *args, **kwargs) | Apply func(self, *args, **kwargs). | ||
plot | alias of pandas.plotting._core.PlotAccessor | ||
pop(item) | Return item and drops from series. | ||
pow(other[, level, fill_value, axis]) | Return Exponential power of series and other, element-wise (binary operator pow). | ||
prod([axis, skipna, level, numeric_only, …]) | Return the product of the values for the requested axis. | ||
product([axis, skipna, level, numeric_only, …]) | Return the product of the values for the requested axis. | ||
quantile([q, interpolation]) | Return value at the given quantile. | ||
radd(other[, level, fill_value, axis]) | Return Addition of series and other, element-wise (binary operator radd). | ||
rank([axis, method, numeric_only, …]) | Compute numerical data ranks (1 through n) along axis. | ||
ravel([order]) | Return the flattened underlying data as an ndarray. | ||
rdiv(other[, level, fill_value, axis]) | Return Floating division of series and other, element-wise (binary operator rtruediv). | ||
rdivmod(other[, level, fill_value, axis]) | Return Integer division and modulo of series and other, element-wise (binary operator rdivmod). | ||
reindex([index]) | Conform Series to new index with optional filling logic. | ||
reindex_like(other[, method, copy, limit, …]) | Return an object with matching indices as other object. | ||
rename([index, axis, copy, inplace, level, …]) | Alter Series index labels or name. | ||
rename_axis(**kwargs) | Set the name of the axis for the index or columns. | ||
reorder_levels(order) | Rearrange index levels using input order. | ||
repeat(repeats[, axis]) | Repeat elements of a Series. | ||
replace([to_replace, value, inplace, limit, …]) | Replace values given in to_replace with value. | ||
resample(rule[, axis, closed, label, …]) | Resample time-series data. | ||
reset_index([level, drop, name, inplace]) | Generate a new DataFrame or Series with the index reset. | ||
rfloordiv(other[, level, fill_value, axis]) | Return Integer division of series and other, element-wise (binary operator rfloordiv). | ||
rmod(other[, level, fill_value, axis]) | Return Modulo of series and other, element-wise (binary operator rmod). | ||
rmul(other[, level, fill_value, axis]) | Return Multiplication of series and other, element-wise (binary operator rmul). | ||
rolling(window[, min_periods, center, …]) | Provide rolling window calculations. | ||
round([decimals]) | Round each value in a Series to the given number of decimals. | ||
rpow(other[, level, fill_value, axis]) | Return Exponential power of series and other, element-wise (binary operator rpow). | ||
rsub(other[, level, fill_value, axis]) | Return Subtraction of series and other, element-wise (binary operator rsub). | ||
rtruediv(other[, level, fill_value, axis]) | Return Floating division of series and other, element-wise (binary operator rtruediv). | ||
sample([n, frac, replace, weights, …]) | Return a random sample of items from an axis of object. | ||
searchsorted(value[, side, sorter]) | Find indices where elements should be inserted to maintain order. | ||
sem([axis, skipna, level, ddof, numeric_only]) | Return unbiased standard error of the mean over requested axis. | ||
set_axis(labels[, axis, inplace]) | Assign desired index to given axis. | ||
shift([periods, freq, axis, fill_value]) | Shift index by desired number of periods with an optional time freq. | ||
skew([axis, skipna, level, numeric_only]) | Return unbiased skew over requested axis. | ||
slice_shift([periods, axis]) | Equivalent to shift without copying data. | ||
sort_index([axis, level, ascending, …]) | Sort Series by index labels. | ||
sort_values([axis, ascending, inplace, …]) | Sort by the values. | ||
sparse | alias of pandas.core.arrays.sparse.accessor.SparseAccessor | ||
squeeze([axis]) | Squeeze 1 dimensional axis objects into scalars. | ||
std([axis, skipna, level, ddof, numeric_only]) | Return sample standard deviation over requested axis. | ||
str | alias of pandas.core.strings.StringMethods | ||
sub(other[, level, fill_value, axis]) | Return Subtraction of series and other, element-wise (binary operator sub). | ||
subtract(other[, level, fill_value, axis]) | Return Subtraction of series and other, element-wise (binary operator sub). | ||
sum([axis, skipna, level, numeric_only, …]) | Return the sum of the values for the requested axis. | ||
swapaxes(axis1, axis2[, copy]) | Interchange axes and swap values axes appropriately. | ||
swaplevel([i, j, copy]) | Swap levels i and j in a MultiIndex. | ||
tail([n]) | Return the last n rows. | ||
take(indices[, axis, is_copy]) | Return the elements in the given positional indices along an axis. | ||
to_clipboard([excel, sep]) | Copy object to the system clipboard. | ||
to_csv([path_or_buf, sep, na_rep, …]) | Write object to a comma-separated values (csv) file. | ||
to_dict([into]) | Convert Series to {label -> value} dict or dict-like object. | ||
to_excel(excel_writer[, sheet_name, na_rep, …]) | Write object to an Excel sheet. | ||
to_frame([name]) | Convert Series to DataFrame. | ||
to_hdf(path_or_buf, key[, mode, complevel, …]) | Write the contained data to an HDF5 file using HDFStore. | ||
to_json([path_or_buf, orient, date_format, …]) | Convert the object to a JSON string. | ||
to_latex([buf, columns, col_space, header, …]) | Render object to a LaTeX tabular, longtable, or nested table/tabular. | ||
to_list() | Return a list of the values. | ||
to_markdown([buf, mode, index]) | Print Series in Markdown-friendly format. | ||
to_numpy([dtype, copy, na_value]) | A NumPy ndarray representing the values in this Series or Index. | ||
to_period([freq, copy]) | Convert Series from DatetimeIndex to PeriodIndex. | ||
to_pickle(path[, compression, protocol]) | Pickle (serialize) object to file. | ||
to_sql(name, con[, schema, if_exists, …]) | Write records stored in a DataFrame to a SQL database. | ||
to_string([buf, na_rep, float_format, …]) | Render a string representation of the Series. | ||
to_timestamp([freq, how, copy]) | Cast to DatetimeIndex of Timestamps, at beginning of period. | ||
to_xarray() | Return an xarray object from the pandas object. | ||
tolist() | Return a list of the values. | ||
transform(func[, axis]) | Call func on self producing a Series with transformed values. | ||
transpose(*args, **kwargs) | Return the transpose, which is by definition self. | ||
truediv(other[, level, fill_value, axis]) | Return Floating division of series and other, element-wise (binary operator truediv). | ||
truncate([before, after, axis, copy]) | Truncate a Series or DataFrame before and after some index value. | ||
tshift([periods, freq, axis]) | (DEPRECATED) Shift the time index, using the index’s frequency if available. | ||
tz_convert(tz[, axis, level, copy]) | Convert tz-aware axis to target time zone. | ||
tz_localize(tz[, axis, level, copy, …]) | Localize tz-naive index of a Series or DataFrame to target time zone. | ||
unique() | Return unique values of Series object. | ||
unstack([level, fill_value]) | Unstack, also known as pivot, Series with MultiIndex to produce DataFrame. | ||
update(other) | Modify Series in place using values from passed Series. | ||
value_counts([normalize, sort, ascending, …]) | Return a Series containing counts of unique values. | ||
var([axis, skipna, level, ddof, numeric_only]) | Return unbiased variance over requested axis. | ||
view([dtype]) | Create a new view of the Series. | ||
where(cond[, other, inplace, axis, level, …]) | Replace values where the condition is False. | ||
xs(key[, axis, level, drop_level]) | Return cross-section from the Series/DataFrame. |
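A handful of the methods listed above, applied to a small numeric Series, might look like the sketch below. The data is made up for illustration, and only a few methods are shown.

```python
import pandas as pd

# Illustrative data with one missing value.
nums = pd.Series([-2.0, 3.0, None, 3.0], index=["a", "b", "c", "d"])

absolute = nums.abs()             # element-wise absolute value
filled = nums.fillna(0.0)         # replace the missing value with 0.0
running_total = filled.cumsum()   # cumulative sum over the filled values
counts = nums.value_counts()      # counts per distinct value, missing values excluded
in_range = nums.between(0, 5)     # True where 0 <= value <= 5; missing compares as False
top = nums.nlargest(1)            # the largest element, keeping the first on ties
non_missing = nums.count()        # number of non-missing observations

print(running_total.tolist(), counts.to_dict(), in_range.tolist(), top.to_dict(), non_missing)
```

Most of these methods return a new Series rather than modifying nums in place.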
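Learn more >> pandas API: pandas.Series
DataFrame
A DataFrame is a labeled, two-dimensional data structure whose columns may have different types. It consists of data, row labels (the index), and column labels (the columns). You can think of it as a spreadsheet or SQL table, or as a dictionary of Series objects. It is generally the most commonly used pandas object.
Learn more >> pandas documentation: User Guide - DataFrame
Creating a DataFrame
There are several ways to create a DataFrame object:
- Use the pandas.DataFrame() constructor.
- Use the pandas.DataFrame.from_dict() method, which works much like the constructor.
- Use the pandas.DataFrame.from_records() method, which works much like the constructor.
- Create one from an imported file with a reader function, for example pandas.read_csv() to load a CSV file into a DataFrame.
The pandas.DataFrame() constructor has the following signature: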
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
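The four creation routes listed above can be sketched side by side. The column names, row labels, and numbers are invented for illustration, and the CSV text is read from an in-memory buffer rather than a real file.

```python
import io
import pandas as pd

# 1. Constructor: a dict of columns plus explicit row labels.
df = pd.DataFrame(
    {"open": [10.0, 10.5], "close": [10.4, 10.2]},
    index=["d1", "d2"],
)

# 2. from_dict with orient="index": the outer keys become row labels.
by_row = pd.DataFrame.from_dict(
    {"d1": {"open": 10.0, "close": 10.4}, "d2": {"open": 10.5, "close": 10.2}},
    orient="index",
)

# 3. from_records: a list of tuples, with one field used as the index.
records = pd.DataFrame.from_records(
    [("d1", 10.0, 10.4), ("d2", 10.5, 10.2)],
    columns=["day", "open", "close"],
    index="day",
)

# 4. read_csv: parse CSV text (here an in-memory stand-in for a file).
csv_text = "day,open,close\nd1,10.0,10.4\nd2,10.5,10.2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="day")

print(df, by_row, records, from_csv, sep="\n\n")
```

All four objects hold the same values; they differ only in how the input was laid out.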
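Plotting with pandas
pandas plotting is built on Matplotlib. Both DataFrame and Series come with a plot method that can quickly generate many kinds of charts.
Learn more >> pandas documentation: User Guide - Visualization
Basic plots
Line plot
The plot method draws a line plot by default. For example, if prices is a DataFrame with a closing-price column close, plot the closing prices as a line: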
s = prices['close']
s.plot()
# set the figure size with the figsize parameter
s.plot(figsize=(20,10))
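For a self-contained version of the line plot, the sketch below builds a small prices DataFrame first. The dates and closing prices are made up, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# Hypothetical closing prices on five consecutive days.
dates = pd.date_range("2020-01-01", periods=5, freq="D")
prices = pd.DataFrame({"close": [10.0, 10.4, 10.2, 10.8, 11.0]}, index=dates)

# plot() returns a Matplotlib Axes, which can be customized further or saved.
ax = prices["close"].plot(figsize=(8, 4), title="Close")
fig = ax.get_figure()
line_count = len(ax.get_lines())
print(line_count, fig.get_size_inches())
```

Calling fig.savefig(...) with a file name of your choice writes the chart to disk.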
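Bar plot
For data with discrete labels rather than a time series, you can draw a bar plot in either of two ways:
- Call plot() with the kind parameter set to 'bar' or 'barh'.
- Call plot.bar() or plot.barh().
df.plot(kind='bar') # assume df holds daily stock data
df.plot.bar()
df.resample('A-DEC').mean().volume.plot(kind='bar') # resample to the yearly mean volume and draw a bar plot (volume is the trading-volume column of df); pandas 2.2 and later spell this alias 'YE-DEC'
df.plot.bar(stacked=True) # stacked=True draws a stacked bar plot
df.plot.barh(stacked=True) # barh draws horizontal bars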
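Histogram
Histograms are drawn with plot.hist(). They usually show a frequency distribution, with bins along the x-axis and counts on the y-axis. The number of bins is controlled by the bins parameter, for example bins=20 for 20 bins.
df.volume.plot.hist() # frequency histogram of the volume column in the stock data df
df.plot.hist(alpha=0.5) # alpha=0.5 sets the bar opacity to 0.5
df.plot.hist(stacked=True, bins=20) # stacked=True stacks the columns; bins=20 uses 20 bins
df.plot.hist(orientation='horizontal') # orientation='horizontal' draws a horizontal histogram
df.plot.hist(cumulative=True) # cumulative histogram
df['close'].diff().hist() # apply diff to the closing price, then draw a histogram
df.hist(color='k', bins=50) # DataFrame.hist draws each column in its own subplot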
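To see what the bins parameter does, here is a small sketch on synthetic data. The volume values are randomly generated stand-ins, not real stock data, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import numpy as np
import pandas as pd

# Synthetic stand-in for a volume column: 200 random integers.
rng = np.random.default_rng(0)
df = pd.DataFrame({"volume": rng.integers(1_000, 5_000, size=200)})

ax = df["volume"].plot.hist(bins=20, alpha=0.5)

# Each bin is drawn as one rectangle whose height is the count in that bin.
bar_count = len(ax.patches)
total = sum(p.get_height() for p in ax.patches)
print(bar_count, total)
```

The bar heights add up to the number of rows, which is a quick sanity check that no values were dropped.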
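Box plot
Box plots can be drawn with plot.box() or with DataFrame.boxplot(). Parameters:
- color sets the colors through a dictionary, for example color={'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue', 'caps': 'Gray'}
- sym sets the marker style for outliers, for example sym='r+' draws outliers as red plus signs.
df.plot.box()
df[['close','open', 'high']].plot.box()
# change the box colors by passing a color dictionary
color={'boxes': 'DarkGreen', 'whiskers': 'DarkOrange', 'medians': 'DarkBlue', 'caps': 'Gray'}
df.plot.box(color=color, sym='r+') # sym sets the outlier style; 'r+' means a red plus sign
df.plot.box(positions=[1, 4, 5, 6, 8]) # positions sets where each box sits: df has 5 columns, so the first column is drawn at x=1, the second at x=4, and so on
df.plot.box(vert=False) # draw a horizontal box plot
df.boxplot()
# Draw a grouped box plot: the by keyword creates groups, and a box plot is drawn per group. In the example below, each column is drawn separately for group A and group B.
import numpy as np
df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])
df['x'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
df.boxplot(by='x')
# You can also pass a second grouping column to subdivide the groups further (this assumes df has grouping columns 'X' and 'Y'):
df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'])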
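Scatter plot
Scatter plots are drawn with DataFrame.plot.scatter(). The x and y parameters name the columns used for the x-axis and y-axis.
df.plot.scatter(x='close', y='volume') # if df holds daily stock data, this plots closing price against volume
# to draw two groups on one chart, reuse the returned axes through the ax parameter:
ax = df.plot.scatter(x='close', y='volume', color='DarkBlue', label='Group 1') # label sets the legend label
df.plot.scatter(x='open', y='volume', color='DarkGreen', label='Group 2', ax=ax)
# the c parameter colors the points on a gradient according to the volume column
df.plot.scatter(x='close', y='open', c='volume', s=50) # s sets the marker area
df.plot.scatter(x='close', y='open', s=df['volume']/50000) # the marker size can also follow the values of a column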
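The overlay technique with the ax parameter can be checked with a short sketch. The three rows of stock-like data are invented for illustration, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# Hypothetical daily data with open, close, and volume columns.
df = pd.DataFrame({
    "open": [10.0, 10.5, 10.2],
    "close": [10.4, 10.2, 10.8],
    "volume": [1200, 900, 1500],
})

# The first call creates the axes; the second draws onto the same axes via ax=.
ax = df.plot.scatter(x="close", y="volume", color="DarkBlue", label="Group 1")
same_ax = df.plot.scatter(x="open", y="volume", color="DarkGreen", label="Group 2", ax=ax)

# Each scatter call adds one collection of points to the axes.
collection_count = len(ax.collections)
print(same_ax is ax, collection_count)
```

Because both calls share one Axes object, the legend lists both groups.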
Pie chart
Pie charts are drawn with DataFrame.plot.pie() or Series.plot.pie(). If the data contains missing values, they are automatically filled with 0.
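A minimal pie chart sketch follows. The Series name, labels, and values are hypothetical, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# Hypothetical shares; the index labels become the wedge labels.
shares = pd.Series([3, 2, 1], index=["a", "b", "c"], name="share")

# autopct is forwarded to Matplotlib and prints each wedge's percentage.
ax = shares.plot.pie(figsize=(4, 4), autopct="%.1f%%")

# One wedge is drawn per value.
wedge_count = len(ax.patches)
print(wedge_count)
```

For a DataFrame, DataFrame.plot.pie() needs either a y column or subplots=True so that each column gets its own pie.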
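Other plotting functions
These plotting functions come from the pandas.plotting module.
Scatter matrix plot
A scatter matrix plot is drawn with the scatter_matrix() function.
from pandas.plotting import scatter_matrix # import the function from the module before use
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde') # assuming df holds daily stock data, this draws a scatter plot of every column against every other column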
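The sketch below runs scatter_matrix() on random data to show the shape of what it returns. The column names and values are placeholders, and diagonal='hist' is used here instead of 'kde' so that the example does not depend on SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random placeholder data with three columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["open", "close", "volume"])

# Returns a 2-D array of Axes: one row and one column per DataFrame column.
axes = scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="hist")
grid_shape = axes.shape
print(grid_shape)
```

The off-diagonal cells hold the pairwise scatter plots, and the diagonal cells show each column's own distribution.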
Density plot
Density plots are drawn with Series.plot.kde() and DataFrame.plot.kde().
df.plot.kde()
Andrews curves
Parallel coordinates
Lag plot
Autocorrelation plot
Bootstrap plot
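Plot formatting
Setting a plot style in advance
Since matplotlib 1.5, you can choose a style before plotting by calling matplotlib.style.use(my_plot_style). For example, matplotlib.style.use('ggplot') produces ggplot-style plots.
Style parameters
Most plotting functions accept a set of keyword arguments for setting colors.
Legend
Set the legend parameter to False to hide the legend, for example: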
df.plot(legend=False)
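Scales
- logy puts the y-axis on a logarithmic scale.
- logx puts the x-axis on a logarithmic scale.
- loglog puts both the x-axis and the y-axis on a logarithmic scale.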
ts.plot(logy=True)
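The effect of these flags can be confirmed by inspecting the axis scales afterwards. In this sketch ts and the second DataFrame are made-up data chosen to span several orders of magnitude, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# Hypothetical series growing by a factor of 10 per day.
ts = pd.Series(
    [1.0, 10.0, 100.0, 1000.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)
ax = ts.plot(logy=True)
y_scale = ax.get_yscale()   # only the y-axis becomes logarithmic
x_scale = ax.get_xscale()

# loglog needs positive numeric x values, so use a numeric index here.
both = pd.DataFrame({"y": [1.0, 10.0, 100.0]}, index=[1, 10, 100])
ax2 = both.plot(loglog=True)
scales = (ax2.get_xscale(), ax2.get_yscale())
print(y_scale, x_scale, scales)
```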
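Plotting on a secondary y-axis
When two series share the x-axis but have different y-axis data, pass secondary_y=True when plotting the second series to give it a second y-axis.
# for example, to show the CCI indicator on top of the closing-price plot: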
prices['close'].plot()
prices['cci'].plot(secondary_y=True)
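# to put several columns on the secondary axis, pass their column names directly
ax = df.plot(secondary_y=['cci', 'RSI'], mark_right=False) # by default the legend labels of right-axis columns are marked with "(right)"; mark_right=False turns that off
ax.set_ylabel('CD scale') # set the label of the left y-axis
ax.right_ax.set_ylabel('AB scale') # set the label of the right y-axis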
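A runnable variant of the secondary-axis example is sketched below. The close and cci values are invented, the axis label texts are arbitrary, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import pandas as pd

# Hypothetical closing prices and CCI indicator values on very different scales.
dates = pd.date_range("2020-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {"close": [10.0, 10.4, 10.2, 10.8], "cci": [-120.0, 35.0, 80.0, 150.0]},
    index=dates,
)

# close stays on the left axis; cci moves to the right axis.
ax = df.plot(secondary_y=["cci"], mark_right=False)
ax.set_ylabel("price scale")
ax.right_ax.set_ylabel("cci scale")

left_lines = len(ax.get_lines())
right_lines = len(ax.right_ax.get_lines())
print(left_lines, right_lines, ax.right_ax.get_ylabel())
```

The returned object is the left-hand Axes; the right-hand one is reachable through its right_ax attribute.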
Subplots
Each column of a DataFrame can be drawn on its own axes by setting the subplots parameter, for example:
df.plot(subplots=True, figsize=(6, 6))
Subplot layout
The subplot layout is set with the layout keyword.
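As a sketch of the layout keyword, the example below arranges four columns in a 2 by 2 grid. The column names and random data are placeholders, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import numpy as np
import pandas as pd

# Placeholder data: four columns of cumulative random values.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(30, 4)).cumsum(axis=0),
    columns=["open", "high", "low", "close"],
)

# layout=(rows, columns) controls how the per-column subplots are arranged.
axes = df.plot(subplots=True, layout=(2, 2), figsize=(6, 6))
grid_shape = axes.shape
print(grid_shape)
```

The layout must provide at least as many cells as there are columns to plot.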
Resources
Official website
Related websites
Books
Python for Data Analysis, 2nd Edition - Wes McKinney