Data Loading
pd.read_csv()/read_table()
- sep: defaults to ',' and '\t' respectively; can also be a regular expression
- header: the first row is the header by default; None means no header (columns display as range(n)); an integer n uses the row at position n as the header; a list of ints produces a multi-index
- names: array-like (list of column names to use); supply custom column names
- index_col: int or name of column(s); the column(s) to use as the index, multiple columns form a multi-index
- skiprows: list-like or integer; rows to skip
- comment: character marking comment lines to ignore, e.g. comment='#'
- na_values: scalar, str, list-like, or dict; values to treat as NaN
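A minimal sketch combining these options; the file name 'data.csv', the column names, and the NA markers below are assumptions for illustration only.

import pandas as pd

df = pd.read_csv('data.csv',                    # hypothetical file
                 sep=',',                       # field delimiter
                 header=0,                      # row 0 is the header
                 names=['id', 'city', 'temp'],  # override the header names
                 index_col='id',                # use the 'id' column as the index
                 skiprows=[2],                  # skip one line by position
                 comment='#',                   # ignore text after '#'
                 na_values=['NA', '-999'])      # extra values treated as NaN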
Data Saving
df.to_csv()
- path
- sep
- index/header=True: whether to write out the index and the column names
- na_rep: string; how missing values are written, defaults to ''
- columns: list of columns to write
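A short to_csv sketch; the output path and the column subset are assumptions.

df.to_csv('out.csv',
          sep='\t',
          index=False,               # do not write the index
          na_rep='NULL',             # representation for missing values
          columns=['city', 'temp'])  # write only these columns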
Basic attributes
- shape (returns a tuple)
- dtypes
- index
- columns
- values
- index/columns.name (for a plain index; the type is str)
- index/columns.names (for a MultiIndex; also works on a plain index, but the type is pandas.core.indexes.frozen.FrozenList)
- index/columns.is_unique: whether the labels are unique (a boolean)
- iloc/loc: slicing, indexing

(pd.Index(), pd.Series(), etc. take a name parameter; pd.MultiIndex() takes a names parameter, and also accepts name, but only for compatibility.)
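A quick tour of these attributes on a toy frame; the data is an assumption.

df = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]}, index=['a', 'b'])
df.shape                  # (2, 2)
df.dtypes                 # x: int64, y: float64
df.index.name = 'row'     # name of a plain index (a str)
df.columns.is_unique      # True: the column labels are unique
df.loc['a', 'x']          # label-based indexing -> 1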
Data viewing
df.head(<num>) # defaults to 5
df.tail(<num>)
Index reconstruction
reindex(): conform the data to a new index, which can change both the composition and the order of the index
- index/columns: list of labels; rebuild in the given order with the given labels
- method: {'ffill', 'bfill'}; how to fill holes introduced by reindexing
- fill_value: scalar, default np.NaN; value used to fill holes introduced by reindexing
In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
....: index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])
In [99]: frame
Out[99]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [101]: frame2
Out[101]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
set_index(<column name or list of column names>): select column(s) as the new index; multiple columns form a multi-index
- drop=True: whether to remove these columns from the values of the new object (set to False to keep them)
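A small set_index sketch; the frame and column names are assumptions.

df = pd.DataFrame({'state': ['Ohio', 'Texas'], 'year': [2000, 2001], 'pop': [1.5, 1.7]})
df.set_index(['state', 'year'])        # two columns become a multi-index
df.set_index('state', drop=False)      # keep 'state' among the values as well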
reset_index(): by default moves all index levels back into columns, placed at the front
- level: int, str, tuple, or list; only remove the specified index level(s)
- drop=False: if True, the original index is simply discarded and replaced by an integer index
- inplace=False: modify the original object or return a new one
- col_level: int or str, default 0; if the columns are a MultiIndex, which level to insert the removed index into (defaults to the first level)
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
... ('bird', 'parrot'),
... ('mammal', 'lion'),
... ('mammal', 'monkey')],
... names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
... ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
... ( 24.0, 'fly'),
... ( 80.5, 'run'),
... (np.nan, 'jump')],
... index=index,
... columns=columns)
>>> df
speed species
max type
class name
bird falcon 389.0 fly
parrot 24.0 fly
mammal lion 80.5 run
monkey NaN jump
>>> df.reset_index(level='class') # remove 'class' from the index; by default it joins the first column level, alongside speed and species
class speed species
max type
name
falcon bird 389.0 fly
parrot bird 24.0 fly
lion mammal 80.5 run
monkey mammal NaN jump
>>> df.reset_index(level='class', col_level=1) # with col_level specified, the removed index joins the level of max and type
speed species
class max type
name
falcon bird 389.0 fly
parrot bird 24.0 fly
lion mammal 80.5 run
monkey mammal NaN jump
sort_index(): sort by index or columns
- axis=0
- level: int or level name or list of ints or list of level names; for a multi-index, sort by the given level(s)
- ascending=True
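A short sort_index sketch, reusing frame2 from the reindex example above; the level name in the last line is an assumption.

frame2.sort_index(ascending=False)   # rows sorted d, c, b, a
frame2.sort_index(axis=1)            # columns sorted alphabetically
# for a MultiIndex, sort by a single level, e.g. df.sort_index(level='class')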
drop(): discard the specified index entries or columns
- labels: single label or list-like; the row or column labels to drop
- axis: which axis to look for the given labels on
- level: for a multi-index, which level to look for the labels in
- inplace=False
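A drop sketch; frame2 is from the reindex example, and the MultiIndex line reuses the df from the reset_index example above.

frame2.drop(['a', 'b'])              # drop rows by index label
frame2.drop('Ohio', axis=1)          # drop a column
df.drop('falcon', level='name')      # drop by label within one index level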
rename(): rename the index or column labels; can also apply a function as a map over the index or columns
- index/columns: a function, or a dict whose keys are the old names and values the new names
- inplace=False
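A rename sketch on frame2; the new names are assumptions.

frame2.rename(columns={'Ohio': 'OH', 'Texas': 'TX'})   # dict: old name -> new name
frame2.rename(index=str.upper)                         # function applied to every label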
Value reconstruction
sort_values()
- by: str or list of str; the key(s) to sort by
- axis: the axis to sort along
- ascending=True
- na_position: {'first', 'last'}; missing values go last by default
- inplace=False
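A sort_values sketch on frame2 from the reindex example; note the NaN row stays last unless na_position is changed.

frame2.sort_values(by='Ohio')                               # sort rows by one column
frame2.sort_values(by=['Ohio', 'Texas'], ascending=False)   # several keys, descending
frame2.sort_values(by='Ohio', na_position='first')          # put the NaN row first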
duplicated(): returns a boolean Series marking whether each row is a duplicate
- subset: column label or sequence of labels; judge duplicates using only these columns (defaults to all columns)
- keep:
  - 'first': mark duplicates as True except for the first occurrence (default)
  - 'last': mark duplicates as True except for the last occurrence
  - False: mark all duplicates as True

drop_duplicates(): remove duplicate rows, equivalent to df[~df.duplicated()]
- subset
- keep
- inplace
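A small duplicated/drop_duplicates sketch; the frame is an assumption.

data = pd.DataFrame({'k1': ['a', 'a', 'b'], 'k2': [1, 1, 2]})
data.duplicated()                                  # False, True, False
data.drop_duplicates()                             # same as data[~data.duplicated()]
data.drop_duplicates(subset=['k1'], keep='last')   # judge duplicates by 'k1' only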
replace(): replace values. Pass the old value(s) (a list for several) and the new value; to replace different values with different new values, pass a pair of lists or a dict.
data.replace(-999, np.nan)
data.replace([-999, -1000], np.nan)
data.replace([-999, -1000], [np.nan, 0])
data.replace({-999: np.nan, -1000: 0})
Missing value
dropna(): drop rows or columns according to their missing values
- axis: {0 or 'index', 1 or 'columns'}, default 0; whether rows or columns with missing values are removed
- how: {'all', 'any'}, default 'any'; with 'any', a row/column is dropped as soon as it contains any missing value
- thresh: int; drop the row or column if it has fewer than this many non-NA values
- inplace
fillna()
- value: scalar (every missing value is replaced with it) or dict (a different replacement per column; it seems this only works along axis=0?)
- method: {'ffill', ...}; the method used to fill missing values
- limit: int; if method is specified, this is the maximum number of consecutive NaN values to forward/backward fill
- inplace
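A fillna sketch reusing data from the dropna sketch above; the per-column values are assumptions.

data.fillna(0)                          # one scalar for every NaN
data.fillna({0: 0.0, 2: -1.0})          # a different value per column (keys are column labels)
data.fillna(method='ffill', limit=1)    # forward-fill at most one consecutive NaN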
Merge, join and concatenate
pd.merge() (the original indexes are generally discarded, except when left_index and right_index are both True)
- left/right: DataFrame
- how: {'inner', 'outer', 'left', 'right'}, default 'inner'
- on: label or list; the label(s) of the column(s) used as the merge key
- left_on/right_on (the columns chosen as keys all appear in the values, so you may need drop() to remove the redundant ones)
- left_index/right_index=False
- suffixes: 2-length sequence, default ('_x', '_y')
- sort=False: whether to sort the result lexicographically by the merge keys

(df.merge() is similar; df.join() merges on the index by default and defaults to a left join)
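A merge sketch; the frames and the key column are assumptions.

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'x': [4, 5, 6]})
pd.merge(left, right, on='key')                                   # inner join; overlapping 'x' gets '_x'/'_y'
pd.merge(left, right, on='key', how='outer', suffixes=('_l', '_r'))
pd.merge(left.set_index('key'), right.set_index('key'),
         left_index=True, right_index=True)                       # join on the indexes instead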
pd.concat()
- objs: a sequence of Series or DataFrames
- axis
- join: {'outer', 'inner'}, default 'outer'
- join_axes: list of labels; the labels to keep in the result, on the axis other than the one being concatenated along
- keys: sequence with the same length as objs; builds a multi-index with these as the outermost level (also a way to avoid confusion when the original objects' indexes overlap)
- names: list; names for the levels of the resulting multi-index
- ignore_index=False: whether to ignore the original index and use 0, ..., n-1 instead
- verify_integrity=False: check whether the resulting index contains duplicates, and raise an exception if it does
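A concat sketch; the frames and keys are assumptions.

a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [3, 4]}, index=['r1', 'r2'])
pd.concat([a, b], keys=['first', 'second'], names=['source', 'row'])  # keys become the outer level
pd.concat([a, b], ignore_index=True)        # throw the old index away, use 0..n-1
pd.concat([a, b], axis=1, join='inner')     # align on the index, column-wise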
Reshaping
stack(): Pivot a level of the (possibly hierarchical) column labels, returning a DataFrame (or Series in the case of an object with a single level of column labels) having a hierarchical index with a new inner-most level of row labels. The level involved will automatically get sorted.
- level: int, string, or list of these, default last level; the level(s) to stack
- dropna=True: Whether to drop rows in the resulting Frame/Series with no valid values

unstack(): Pivot a level of the (necessarily hierarchical) index labels, returning a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels. If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex). The level involved will automatically get sorted.
- level
- fill_value: replace NaN with this value if the unstack produces missing values
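A stack/unstack sketch; the frame is an assumption.

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])
stacked = df.stack()          # Series with a MultiIndex (row label, column label)
stacked.unstack()             # back to the original wide shape
stacked.unstack(level=0)      # pivot the row-label level into the columns instead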
pivot(): Reshape data based on column values. Uses unique values from index / columns to form the axes of the resulting DataFrame.
- index=None: string or object; column name to use to make the new frame's index. Defaults to the existing index; the chosen values must not contain duplicates (this is the difference from pivot_table())
- columns=None: string or object; column name to use to make the new frame's columns
- values=None: string or object; column name to use for populating the new frame's values. By default, all remaining columns are used and the result has hierarchically indexed columns: the columns not chosen as index or columns become the outermost level of the column MultiIndex

>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'bac': [-1, -2, -3, -4, -5, -6]})
>>> df.pivot(index='foo', columns='bar', values='baz')  # same as df.pivot(index='foo', columns='bar')['baz']
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar')
    bac        baz
bar   A  B  C    A  B  C
foo
one  -1 -2 -3    1  2  3
two  -4 -5 -6    4  5  6
pd.melt(): merges multiple columns into one, producing a DataFrame that is longer than the input
- frame: DataFrame
- id_vars: tuple, list, or ndarray; column(s) to use as identifier variables, i.e. the group-indicator columns that are not melted. Even a single column must be passed as a list; it is also fine to pass none at all
- value_vars: tuple, list, or ndarray; column(s) to unpivot, i.e. the columns to merge. Defaults to all columns that are not set as id_vars; columns selected as neither are dropped
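A melt sketch; the frame is an assumption.

df = pd.DataFrame({'key': ['k1', 'k2'], 'A': [1, 2], 'B': [3, 4]})
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
#   key variable  value
# 0  k1        A      1
# 1  k2        A      2
# 2  k1        B      3
# 3  k2        B      4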
Groupby
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
- by: mapping, function, str, or iterable
  - A str or list of strs may be passed to group by the columns
  - If a function, it is called on each value of the object's index (values whose index yields the same return value fall into one group)
  - If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned): index entries mapped to the same VALUE form one group; e.g. with {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}, B and C end up in one group
  - If an ndarray is passed, the values are used as-is to determine the groups
- as_index: boolean, default True. For aggregated output, return an object with the group labels as the index; otherwise the group labels become the leading column(s) of the values
- sort: boolean, default True. Sort group keys. Note this does not influence the order of observations within each group; groupby preserves the order of rows within each group.

grouped.aggregate() or grouped.agg(): apply functions per column in a flexible way (see the sketch below)
- a single function, or a string naming a function
- a list of functions, to compute several statistics per column
- a dict of column name: function (or list of functions); different columns can get different functions, and the result contains only those columns, the rest are dropped. Several columns cannot be combined into a single dict key, as that is treated as a KeyError.
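An aggregate/agg sketch; the frame and the chosen statistics are assumptions.

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'x': [1, 2, 3, 4],
                   'y': [5, 6, 7, 8]})
grouped = df.groupby('key')
grouped.agg('mean')                               # one function for every column
grouped.agg(['min', 'max'])                       # several statistics per column
grouped.agg({'x': 'sum', 'y': ['mean', 'std']})   # different functions per column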
grouped.filter(func, dropna=True, *args, **kwargs): Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func. That is, func is applied to each group, and the resulting booleans are used to slice the original DataFrame, keeping only the groups for which func returned True.
- func: Function to apply to each subframe. Should return True or False
- dropna: Drop groups that do not pass the filter; if False, groups that evaluate False are filled with NaN

>>> rng = np.random.RandomState(0)
>>> df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'data1': range(6),
...                    'data2': rng.randint(0, 10, 6)},
...                   columns=['key', 'data1', 'data2'])
>>> def filter_func(x):
...     return x['data2'].std() > 4
>>> print(df.groupby('key').filter(filter_func))
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9
grouped.transform(func, *args, **kwargs): Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values

>>> df.groupby('key').transform(lambda x: x - x.mean())
   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0
grouped.apply(func, *args, **kwargs): apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned
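A short apply sketch; the frame is an assumption.

df = pd.DataFrame({'key': ['A', 'A', 'B', 'B'], 'val': [1, 2, 3, 4]})
df.groupby('key').apply(lambda g: g['val'].max() - g['val'].min())           # scalar out -> Series
df.groupby('key').apply(lambda g: g.assign(frac=g['val'] / g['val'].sum()))  # DataFrame out -> DataFrame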
Selection
selection by label .loc
The .loc attribute accepts the following valid inputs:
- A single label, e.g. 5 or 'a'
- A list or array of labels, e.g. ['a', 'b', 'c']
- A slice object with labels, e.g. 'a':'f' (both the start and the stop are included)
- A boolean array
- A callable

# Mind how labels are aligned, especially when indexing with a boolean array
>>> df1 = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
>>> df1
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
c  1.024180  0.569605  0.875906 -2.211372
d  0.974466 -2.006747 -0.410001 -0.078638
e  0.545952 -1.219217 -1.226825  0.769804
f -1.281247 -0.727707 -0.121306 -0.097883
>>> df1.loc['a'] > 0
A     True
B    False
C    False
D    False
Name: a, dtype: bool
>>> df1.loc[:, df1.loc['a'] > 0]   # the boolean Series aligns with the columns
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247
>>> df1.loc[df1.loc['a'] > 0]      # used as a row indexer, its labels do not align
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
selection by position .iloc
The .iloc attribute accepts the following valid inputs:
- An integer, e.g. 5
- A list or array of integers, e.g. [4, 3, 0]
- A slice object with ints, e.g. 1:7
- A boolean array
- A callable

# A boolean Series carries an index, so it is normally not used directly as input to iloc;
# if you do want to use one, pass its values attribute instead
>>> df2 = pd.DataFrame(np.random.randn(4, 4), index=list('abcd'), columns=list('ABCD'))
>>> df2
          A         B         C         D
a  1.561034  1.004562 -0.404225 -1.232107
b -0.804208 -1.746975  0.643511  0.156833
c -0.827571  0.480988 -0.977209  1.430038
d -0.395301  0.974563  1.375785 -1.065290
>>> df2.iloc[df2['A'] > 0]
ValueError: iLocation based boolean indexing cannot use an indexable as a mask
>>> df2.iloc[(df2['A'] > 0).values]
          A         B         C         D
a  1.561034  1.004562 -0.404225 -1.232107
>>> df2.iloc[:, (df2['A'] > 0).values]
          A
a  1.561034
b -0.804208
c -0.827571
d -0.395301
Boolean indexing
The operators are:
- | for or
- & for and
- ~ for not

(wrap each expression in parentheses to avoid precedence surprises, e.g. s[~(s > 0)])
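A boolean indexing sketch; the frame is an assumption.

df = pd.DataFrame({'A': [1, -2, 3], 'B': [-1, 2, -3]})
df[(df['A'] > 0) & (df['B'] < 0)]    # and: both conditions, each parenthesized
df[(df['A'] > 0) | (df['B'] > 0)]    # or
df[~(df['A'] > 0)]                   # not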