
Using the category Data Type in Pandas

Date: 2022-06-15 16:13:42

Creating a category

Creating with a Series

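Passing dtype='category' when creating a Series produces a categorical Series. A categorical has two parts: the set of category values (the literals) and whether those values are ordered. The examples assume import pandas as pd and import numpy as np: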

In [1]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

A Series in an existing DataFrame can be converted to a category:

In [3]: df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})

In [4]: df['B'] = df['A'].astype('category')

In [5]: df['B']
Out[5]:
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to Series:

In [10]: raw_cat = pd.Categorical(
   ....:     ['a', 'b', 'c', 'a'], categories=['b', 'c', 'd'], ordered=False
   ....: )
   ....:

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']

Creating with a DataFrame

dtype='category' can also be passed when creating a DataFrame:

In [17]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype='category')

In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object

Both A and B in the DataFrame are categories, each with categories inferred from its own column:

In [19]: df['A']
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df['B']
Out[20]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype('category') converts every Series in the DataFrame to a category:

In [21]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [22]: df_cat = df.astype('category')

In [23]: df_cat.dtypes
Out[23]:
A    category
B    category
dtype: object

Controlling creation

By default, a category created with dtype='category' behaves as follows:

1. Categories are inferred from the data.

2. Categories are unordered.
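The two defaults listed above can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original examples.

```python
import pandas as pd

# Create a categorical Series without specifying categories or order.
s = pd.Series(["b", "a", "c", "a"], dtype="category")

# The categories are inferred from the unique values in the data.
categories = list(s.cat.categories)

# No order is assumed between the categories.
is_ordered = s.cat.ordered

print(categories)
print(is_ordered)
```

For sortable values such as these strings, the inferred categories come out in sorted order rather than in order of appearance.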

Both defaults can be overridden by explicitly creating a CategoricalDtype:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(['a', 'b', 'c', 'a'])

In [28]: cat_type = CategoricalDtype(categories=['b', 'c', 'd'], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can also be applied to a DataFrame:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})

In [33]: cat_type = CategoricalDtype(categories=list('abcd'), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat['A']
Out[35]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat['B']
Out[36]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

Converting back to the original type

Series.astype(original_dtype) or np.asarray(categorical) converts a category back to its original type:

In [39]: s = pd.Series(['a', 'b', 'c', 'a'])

In [40]: s
Out[40]:
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype('category')

In [42]: s2
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]:
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)

Operating on categories

Getting the category properties

Categorical data has two properties, categories and ordered, available through s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

The order of the categories can be set explicitly by passing categories to pd.Categorical:

In [60]: s = pd.Series(pd.Categorical(['a', 'b', 'c', 'a'], categories=['c', 'b', 'a']))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

Renaming categories

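In older pandas versions, categories can be renamed by assigning to s.cat.categories (this setter was deprecated in pandas 1.5 and removed in 2.0):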

In [67]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

In [68]: s
Out[68]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ['Group %s' % g for g in s.cat.categories]

In [70]: s
Out[70]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories achieves the same effect and returns a new Series:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]
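Besides a list, rename_categories also accepts a callable that is applied to each category. The sketch below reproduces the 'Group ...' renaming without assigning to s.cat.categories; the names letters and renamed are illustrative only.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "a"], dtype="category")

# The callable receives each existing category and returns its new name.
renamed = letters.cat.rename_categories(lambda c: f"Group {c}")

print(list(renamed.cat.categories))
# The original Series is left untouched; a new Series is returned.
print(list(letters.cat.categories))
```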

Alternatively, pass a dict-like object:

# You can also pass a dict-like object to map the renaming
In [73]: s = s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})

In [74]: s
Out[74]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Adding categories with add_categories

add_categories appends new categories:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]:
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']

Removing unused categories
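The two removal methods behave differently, which the following minimal sketch tries to illustrate: remove_categories drops the named categories and replaces any values that used them with NaN, while remove_unused_categories only drops categories that no value uses. The variable names are illustrative, not from the original examples.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

# Removing a category that is in use turns the affected values into NaN.
without_b = s.cat.remove_categories(["b"])
print(without_b.isna().tolist())
print(list(without_b.cat.categories))

# Removing unused categories never changes the values themselves.
trimmed = s.cat.remove_unused_categories()
print(trimmed.tolist())
print(list(trimmed.cat.categories))
```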

In [82]: s = pd.Series(pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c', 'd']))

In [83]: s
Out[83]:
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

Resetting categories

set_categories() adds and removes categories in a single step; values whose category is dropped become NaN:

In [85]: s = pd.Series(['one', 'two', 'four', '-'], dtype='category')

In [86]: s
Out[86]:
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(['one', 'two', 'three', 'four'])

In [88]: s
Out[88]:
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']

Sorting categories

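If a category is created with ordered=True, the order of its categories is meaningful: sorting follows the category order, and min() and max() are available: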

In [91]: s = pd.Series(['a', 'b', 'c', 'a']).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')
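To see why the ordered flag matters for min() and max(), the sketch below calls min() on an unordered category, which raises a TypeError, and then on an ordered one built the same way as in the transcript. The variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Unordered category: min()/max() are not defined and raise TypeError.
unordered = pd.Series(["a", "b", "c", "a"], dtype="category")
try:
    unordered.min()
    min_failed = False
except TypeError:
    min_failed = True

# Ordered category: the category order defines the smallest and largest value.
ordered = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
smallest, largest = ordered.min(), ordered.max()

print(min_failed, smallest, largest)
```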

as_ordered() and as_unordered() mark an existing category as ordered or unordered; they change the ordered flag, not the position of the values:

In [95]: s.cat.as_ordered()
Out[95]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Reordering

Categorical.reorder_categories() changes the order of the existing categories:
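As a minimal sketch of the constraint involved: reorder_categories expects exactly the existing categories in a new order and raises a ValueError otherwise, whereas set_categories accepts a different set. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Same categories in a new order: allowed.
reordered = s.cat.reorder_categories([3, 2, 1], ordered=True)

# A different set of categories is rejected with a ValueError.
try:
    s.cat.reorder_categories([1, 2], ordered=True)
    rejected = False
except ValueError:
    rejected = True

# set_categories accepts a different set; values outside it become NaN.
reset = s.cat.set_categories([1, 2], ordered=True)

print(list(reordered.cat.categories), rejected, reset.isna().tolist())
```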

In [103]: s = pd.Series([1, 2, 3, 1], dtype='category')

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

Sorting by multiple columns

sort_values can sort by several columns at once; a categorical column is sorted by its category order:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         'A': pd.Categorical(
   .....:             list('bbeebbaa'),
   .....:             categories=['e', 'a', 'b'],
   .....:             ordered=True,
   .....:         ),
   .....:         'B': [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....:

In [110]: dfs.sort_values(by=['A', 'B'])
Out[110]:
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparisons

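Equality comparisons (== and !=) work on any category. If ordered=True was set at creation, the ordering comparisons >, >=, < and <= are supported as well. Two categoricals can only be compared when they have the same categories.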
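The rules above can be sketched as follows. The example assumes the same category order 3 < 2 < 1 used in this section; the variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Unordered category: equality works, ordering comparisons raise TypeError.
unordered = pd.Series(["a", "b", "a"], dtype="category")
equal_a = (unordered == "a").tolist()
try:
    unordered > "a"
    unordered_gt_raised = False
except TypeError:
    unordered_gt_raised = True

# Ordered category with the order 3 < 2 < 1, so 1 is the largest value.
cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
greater_than_two = (cat > 2).tolist()

# A categorical with different categories (inferred here as [2]) cannot be compared.
other = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
try:
    cat > other
    mismatch_raised = False
except TypeError:
    mismatch_raised = True

print(equal_a, unordered_gt_raised, greater_than_two, mismatch_raised)
```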

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]:
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]:
0     True
1    False
2    False
dtype: bool

Other operations

A categorical Series is still a Series, so most Series operations work on it, for example Series.min(), Series.max() and Series.mode().

value_counts:

In [131]: s = pd.Series(pd.Categorical(['a', 'b', 'c', 'c'], categories=['c', 'a', 'b', 'd']))

In [132]: s.value_counts()
Out[132]:
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum() (the level argument used here was removed in pandas 2.0):

In [133]: columns = pd.Categorical(
   .....:     ['One', 'One', 'Two'], categories=['One', 'Two', 'Three'], ordered=True
   .....: )
   .....:

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([['A', 'B', 'B'], columns]),
   .....: )
   .....:

In [135]: df.sum(axis=1, level=1)
Out[135]:
   One  Two  Three
0    3    3      0
1    9    6      0

Groupby:

In [136]: cats = pd.Categorical(
   .....:     ['a', 'b', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c', 'd']
   .....: )
   .....:

In [137]: df = pd.DataFrame({'cats': cats, 'values': [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby('cats').mean()
Out[138]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'c'])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         'cats': cats2,
   .....:         'B': ['c', 'd', 'c', 'd'],
   .....:         'values': [1, 2, 3, 4],
   .....:     }
   .....: )
   .....:

In [141]: df2.groupby(['cats', 'B']).mean()
Out[141]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
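Whether categories with no rows (such as 'd' above) appear in a groupby result is controlled by the observed parameter of groupby. Its default has differed between pandas versions, so this sketch passes it explicitly; the variable names are illustrative only.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every category, including the unused 'd' (mean is NaN).
all_groups = df.groupby("cats", observed=False).mean()

# observed=True keeps only the categories that actually occur in the data.
seen_groups = df.groupby("cats", observed=True).mean()

print(list(all_groups.index))
print(list(seen_groups.index))
```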

Pivot tables:

In [142]: raw_cat = pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b', 'c'])

In [143]: df = pd.DataFrame({'A': raw_cat, 'B': ['c', 'd', 'c', 'd'], 'values': [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values='values', index=['A', 'B'])
Out[144]:
     values
A B
a c       1
  d       2
b c       3
  d       4
