我在下面有一个大的数据框:
可在此处https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv中找到用作示例“ education_val.csv”的数据
import pandas as pd
edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)
ID Year Education
22445 1991 higher education
29925 1991 No qualifications
76165 1991 No qualifications
223725 1991 Other
280165 1991 intermediate qualifications
333205 1991 No qualifications
387605 1991 higher education
541285 1991 No qualifications
541965 1991 No qualifications
599765 1991 No qualifications
列Education
中的值是:
edu.Education.value_counts()
intermediate qualifications 153705
higher education 67020
No qualifications 55842
Other 36915
我想通过以下方式替换“教育”列中的值:
如果an在列中的年份中ID
具有值higher education
,Education
则该列的所有未来年份ID
也将higher education
在Education
列中。
如果anID
具有intermediate qualifications
一年中的值,则该年份的所有未来年份ID
将intermediate qualifications
在相应的Education
列中。但是,如果该值higher education
出现在ID
此后的任何年份中,则在以后的年份中higher education
替换intermediate qualifications
该值,而不管是否为Other
或No qualifications occur
。
例如,在下面的DataFrame中,ID
22445具有higher education
年份中的1991
值,Education
for的所有后续值22445
都应higher education
在以后的年份(直至year)中替换为2017
。
edu.loc[edu['ID'] == 22445]
ID Year Education
22445 1991 higher education
22445 1992 higher education
22445 1993 higher education
22445 1994 higher education
22445 1995 higher education
22445 1996 intermediate qualifications
22445 1997 intermediate qualifications
22445 1998 Other
22445 1999 No qualifications
22445 2000 intermediate qualifications
22445 2001 intermediate qualifications
22445 2002 intermediate qualifications
22445 2003 intermediate qualifications
22445 2004 intermediate qualifications
22445 2005 intermediate qualifications
22445 2006 intermediate qualifications
22445 2007 intermediate qualifications
22445 2008 intermediate qualifications
22445 2010 intermediate qualifications
22445 2011 intermediate qualifications
22445 2012 intermediate qualifications
22445 2013 intermediate qualifications
22445 2014 intermediate qualifications
22445 2015 intermediate qualifications
22445 2016 intermediate qualifications
22445 2017 intermediate qualifications
同样,ID
1587125在数据框下面的值是intermediate qualifications
在今年1991
,和改变higher education
的1993
。Education
未来年份(从1993年开始)中该列中的所有后续值1587125
都应为higher education
。
edu.loc[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 Other
1587125 2005 No qualifications
1587125 2006 intermediate qualifications
1587125 2007 intermediate qualifications
1587125 2008 intermediate qualifications
1587125 2010 intermediate qualifications
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
数据中有12,057个唯一值ID
,并且该列的Year
范围是1991年至2017年。如何Education
根据上述条件更改所有12、057个值?我不确定如何针对所有唯一的ID
s以统一的方式执行此操作。此处用作示例的示例数据附在上面的Github链接中。提前谢谢了。
您可以使用如下分类数据来做到这一点:
df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
eddtype = pd.CategoricalDtype(['No qualifications',
'Other',
'intermediate qualifications',
'higher education'],
ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
.transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )
它被明确地分解了,因此您可以看到我正在使用的数据操作。
输出:
df[df['ID'] == 1587125]
ID Year Education EducationCat EduMax
18 1587125 1991 intermediate qualifications intermediate qualifications intermediate qualifications
12075 1587125 1992 intermediate qualifications intermediate qualifications intermediate qualifications
24132 1587125 1993 higher education higher education higher education
36189 1587125 1994 higher education higher education higher education
48246 1587125 1995 higher education higher education higher education
60303 1587125 1996 higher education higher education higher education
72360 1587125 1997 higher education higher education higher education
84417 1587125 1998 higher education higher education higher education
96474 1587125 1999 higher education higher education higher education
108531 1587125 2000 higher education higher education higher education
120588 1587125 2001 higher education higher education higher education
132645 1587125 2002 higher education higher education higher education
144702 1587125 2003 higher education higher education higher education
156759 1587125 2004 Other Other higher education
168816 1587125 2005 No qualifications No qualifications higher education
180873 1587125 2006 intermediate qualifications intermediate qualifications higher education
192930 1587125 2007 intermediate qualifications intermediate qualifications higher education
204987 1587125 2008 intermediate qualifications intermediate qualifications higher education
217044 1587125 2010 intermediate qualifications intermediate qualifications higher education
229101 1587125 2011 higher education higher education higher education
241158 1587125 2012 higher education higher education higher education
253215 1587125 2013 higher education higher education higher education
265272 1587125 2014 higher education higher education higher education
277329 1587125 2015 higher education higher education higher education
289386 1587125 2016 higher education higher education higher education
301443 1587125 2017 higher education higher education higher education
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句