熊猫数据框根据条件更改列中的值

米玛

我在下面有一个大的数据框:

可在此处https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv中找到用作示例“ education_val.csv”的数据

import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

Education的值是:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

我想通过以下方式替换“教育”列中的值:

  1. 如果an在列中的年份中ID具有值higher educationEducation该列的所有未来年份ID也将higher educationEducation列中。

  2. 如果anID具有intermediate qualifications一年中的值,则该年份的所有未来年份IDintermediate qualifications在相应的Education列中。但是,如果该值higher education出现在ID此后的任何年份中,则在以后的年份中higher education替换intermediate qualifications该值,而不管是否为OtherNo qualifications occur

例如,在下面的DataFrame中,ID22445具有higher education年份中的1991值,Educationfor的所有后续值22445都应higher education在以后的年份(直至year)中替换2017

edu.loc[edu['ID'] == 22445]

ID  Year    Education
22445   1991    higher education
22445   1992    higher education
22445   1993    higher education
22445   1994    higher education
22445   1995    higher education
22445   1996    intermediate qualifications
22445   1997    intermediate qualifications
22445   1998    Other
22445   1999    No qualifications
22445   2000    intermediate qualifications
22445   2001    intermediate qualifications
22445   2002    intermediate qualifications
22445   2003    intermediate qualifications
22445   2004    intermediate qualifications
22445   2005    intermediate qualifications
22445   2006    intermediate qualifications
22445   2007    intermediate qualifications
22445   2008    intermediate qualifications
22445   2010    intermediate qualifications
22445   2011    intermediate qualifications
22445   2012    intermediate qualifications
22445   2013    intermediate qualifications
22445   2014    intermediate qualifications
22445   2015    intermediate qualifications
22445   2016    intermediate qualifications
22445   2017    intermediate qualifications

同样,ID1587125在数据框下面的值是intermediate qualifications在今年1991,和改变higher education1993Education未来年份(从1993年开始)中该列中的所有后续值1587125都应为higher education

edu.loc[edu['ID'] == 1587125]

ID  Year    Education
1587125 1991    intermediate qualifications
1587125 1992    intermediate qualifications
1587125 1993    higher education
1587125 1994    higher education
1587125 1995    higher education
1587125 1996    higher education
1587125 1997    higher education
1587125 1998    higher education
1587125 1999    higher education
1587125 2000    higher education
1587125 2001    higher education
1587125 2002    higher education
1587125 2003    higher education
1587125 2004    Other
1587125 2005    No qualifications
1587125 2006    intermediate qualifications
1587125 2007    intermediate qualifications
1587125 2008    intermediate qualifications
1587125 2010    intermediate qualifications
1587125 2011    higher education
1587125 2012    higher education
1587125 2013    higher education
1587125 2014    higher education
1587125 2015    higher education
1587125 2016    higher education
1587125 2017    higher education

数据中有12,057个唯一值ID,并且该列的Year范围是1991年至2017年。如何Education根据上述条件更改所有12、057个我不确定如何针对所有唯一的IDs以统一的方式执行此操作此处用作示例的示例数据附在上面的Github链接中。提前谢谢了。

斯科特·波士顿

您可以使用如下分类数据来做到这一点:

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

eddtype = pd.CategoricalDtype(['No qualifications', 
                               'Other',
                               'intermediate qualifications',
                               'higher education'], 
                               ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )

它被明确地分解了,因此您可以看到我正在使用的数据操作。

  1. 按顺序创建Education类别dtype
  2. 接下来,将“教育”列的dtype更改为使用该分类dtype(EducationCat)
  3. 使用分类代码执行cummax计算
  4. 通过索引返回由cummax计算(EduMax)定义的类别

输出:

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education

本文收集自互联网,转载请注明来源。

如有侵权,请联系 [email protected] 删除。

编辑于
0

我来说两句

0 条评论
登录 后参与评论

相关文章

根据熊猫数据框中的条件为列分配值

根据条件替换熊猫数据框列中的值

根据熊猫数据框中的特定ID更改列的值

根据条件更改 r 中的值列数据框

根据条件更改熊猫中的列值

根据数据框中的其他列更改熊猫数据框的列值

如何根据多种条件更改熊猫数据框列系列中的特定单元格值?

根据熊猫的条件更改数据框的所有值

根据条件在数据框列中更改值,在另一列中更改值

更改熊猫数据框列中的值

如何根据熊猫数据框数据透视表中的条件获取列中的值?

根据条件在熊猫数据框中添加新列,并替换不同列中的Nan值

熊猫-根据条件更改数据框中的单元格值

熊猫数据框 - 根据条件设置列值

根据熊猫数据框中其他列的条件和值创建新列

根据条件在熊猫数据框中的多行中设置值

根据重复项更改熊猫数据框列的值

如何根据熊猫数据框中的条件移动列中的值

根据熊猫中的另一个数据框更改列中的值

根据多种条件替换熊猫数据框中的值

根据条件屏蔽熊猫数据框中的值

如何根据条件替换熊猫数据框中的值?

如何根据熊猫中的条件为数据框的子集的列分配值?

如何根据“标识符列”和熊猫数据框中的附加条件替换值?

根据value_counts()更改熊猫数据框中的值

根据熊猫数据框中的列值获取更改日期

根据条件在熊猫数据框中创建一列

在熊猫数据框中根据条件为组创建列

根据条件在熊猫数据框中删除列