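This question is related to an earlier post here: [Search for and return the rows below in a Python dataframe, then transpose]
I have a dataframe in which every row holds text scraped from the web containing sports selection information (all in a single column). The solution in the linked post worked well, but the text has no consistent pattern, so I have run into more trouble. Here is my DF: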
print(df):
Col A
Race 1 - Handicap
14 - NAME
3 - NAME
5 - NAME
6 - NAME
4 - NAME
Race Overview: lorem ipsum etc etc
Race 2 - Sprint
12 - NAME
10 - NAME
8 - NAME
11 - NAME
Race Overview: Second lorem ipsum etc etc
Race 3 - Sprint
1 - NAME
14 - NAME
8 - NAME
6 - NAME
Race 4 - Handicap
1 - NAME
14 - NAME
8 - NAME
#Race numbers may run up to 15-20
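For anyone who wants to reproduce the problem, the printed frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the frame has a default integer index and contains nothing but the rows shown, which may not match the real scraped data.

```python
import pandas as pd

# Rows copied from the printed frame above; "NAME" is the placeholder used in the post
rows = [
    "Race 1 - Handicap",
    "14 - NAME", "3 - NAME", "5 - NAME", "6 - NAME", "4 - NAME",
    "Race Overview: lorem ipsum etc etc",
    "Race 2 - Sprint",
    "12 - NAME", "10 - NAME", "8 - NAME", "11 - NAME",
    "Race Overview: Second lorem ipsum etc etc",
    "Race 3 - Sprint",
    "1 - NAME", "14 - NAME", "8 - NAME", "6 - NAME",
    "Race 4 - Handicap",
    "1 - NAME", "14 - NAME", "8 - NAME",
]

# Everything lives in one text column, as described in the question
df = pd.DataFrame({"Col A": rows})
print(df.shape)
```

Note that the four races contribute 7, 6, 5 and 4 rows, so the blocks are not all the same length.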
I am trying to convert it to:
print(df):
Race Name | Selection No | Selection | Race Overview
Race 1 - Handicap | 1 | 14 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 2 | 3 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 3 | 5 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 4 | 6 - Name | Race Overview: lorem ipsum etc etc
Race 1 - Handicap | 5 | 4 - Name | Race Overview: lorem ipsum etc etc
Race 2 - Sprint | 1 | 12 - Name | Race Overview: Second lorem ipsum etc etc
Race 2 - Sprint | 2 | 10 - Name | Race Overview: Second lorem ipsum etc etc
Race 2 - Sprint | 3 | 8 - Name | Race Overview: Second lorem ipsum etc etc
Race 2 - Sprint | 4 | 11 - Name | Race Overview: Second lorem ipsum etc etc
Race 3 - Sprint | 1 | 1 - Name |
Race 3 - Sprint | 2 | 14 - Name |
Race 3 - Sprint | 3 | 8 - Name |
Race 3 - Sprint | 4 | 6 - Name |
Race 4 - Handicap | 1 | 1 - Name |
Race 4 - Handicap | 2 | 14 - Name |
Race 4 - Handicap | 3 | 8 - Name |
If the pattern were a repeating block of 6 rows, this would do the transpose:
df2 = (
pd.DataFrame(data = df['Col A'].values.reshape(-1, 6))
.set_index([0, 5])
.stack()
.rename_axis(index=['Race Name','Race Overview','Selection No'])
.to_frame('Selection')
.reset_index()
)
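To illustrate why the fixed 6-row reshape cannot be applied directly, here is a small sketch. It uses 22 stand-in values, on the assumption that the frame contains only the 22 rows printed earlier (7 + 6 + 5 + 4), rather than the real data.

```python
import numpy as np

# 22 stand-in values: one per row of the sample (7 + 6 + 5 + 4 rows)
values = np.arange(22)

try:
    values.reshape(-1, 6)
    reshaped = True
except ValueError as exc:
    # 22 is not a multiple of 6, so NumPy refuses to reshape
    reshaped = False
    print(exc)

# Data made only of complete 6-row blocks reshapes without trouble
ok = np.arange(12).reshape(-1, 6)
print(ok.shape)
```

Even when the total length happens to be a multiple of 6, blocks of different sizes would shift the race names and overviews into the wrong columns, so the rows need to be classified by content rather than by position.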
Do I need to find the rows between each "Race [0-9] -" row and then run the df2 lines above for each pattern?
Any help would be much appreciated. Thanks!
Use:
# Race header rows: keep values matching "Race <number> -"
df['Race Name'] = df['Col A'].where(df['Col A'].str.contains(r'Race [0-9]+ -'))
# Selection rows: keep values that start with a number
df['Selection'] = df['Col A'].where(df['Col A'].str.contains(r'^[0-9]+'))
# Everything else is treated as the Race Overview text
df['Race Overview'] = df['Col A'].where(df['Race Name'].isna() & df['Selection'].isna())
# Helper groups for filling: s1 starts a new group at every non-selection row,
# s2 counts the overview rows from the bottom up
s1 = df['Selection'].isna().cumsum()
s2 = df['Race Overview'].notna().iloc[::-1].cumsum()
# Forward fill the race name onto its selections, back fill the overview
df['Race Name'] = df.groupby(s1)['Race Name'].ffill()
df['Race Overview'] = df.groupby(s2)['Race Overview'].bfill()
# Drop rows lacking a race name or a selection, then drop the original column
df = df.dropna(subset=['Race Name', 'Selection']).drop('Col A', axis=1)
# Add a per-race selection counter
df.insert(1, 'Selection No', df.groupby('Race Name').cumcount().add(1))
print(df)
Race Name Selection No Selection \
4 Race 1 - Handicap 1 14 - NAME
5 Race 1 - Handicap 2 3 - NAME
6 Race 1 - Handicap 3 5 - NAME
7 Race 1 - Handicap 4 6 - NAME
8 Race 1 - Handicap 5 4 - NAME
11 Race 2 - Sprint 1 12 - NAME
12 Race 2 - Sprint 2 10 - NAME
13 Race 2 - Sprint 3 8 - NAME
14 Race 2 - Sprint 4 11 - NAME
17 Race 3 - Sprint 1 1 - NAME
18 Race 3 - Sprint 2 14 - NAME
19 Race 3 - Sprint 3 8 - NAME
20 Race 3 - Sprint 4 6 - NAME
22 Race 4 - Handicap 1 1 - NAME
23 Race 4 - Handicap 2 14 - NAME
24 Race 4 - Handicap 3 8 - NAME
Race Overview
4 Race Overview: lorem ipsum etc etc
5 Race Overview: lorem ipsum etc etc
6 Race Overview: lorem ipsum etc etc
7 Race Overview: lorem ipsum etc etc
8 Race Overview: lorem ipsum etc etc
11 Race Overview: Second lorem ipsum etc etc
12 Race Overview: Second lorem ipsum etc etc
13 Race Overview: Second lorem ipsum etc etc
14 Race Overview: Second lorem ipsum etc etc
17 NaN
18 NaN
19 NaN
20 NaN
22 NaN
23 NaN
24 NaN
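One caveat with the `s2` helper: it groups every row up to the next overview line, so a race without an overview that is followed by a race with one would pick up the later race's text. The sample data does not hit this case because the races without overviews come last. A possible variant, sketched below, groups by race block instead. The sample frame here is hypothetical and chosen to trigger that case, and `race_id`, `is_race`, `is_selection` and `out` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample: Race 1 has no overview, Race 2 does
df = pd.DataFrame({"Col A": [
    "Race 1 - Sprint", "1 - NAME", "14 - NAME",
    "Race 2 - Handicap", "7 - NAME", "2 - NAME",
    "Race Overview: lorem ipsum etc etc",
]})

col = df["Col A"]
is_race = col.str.contains(r"^Race [0-9]+ -", na=False)
is_selection = col.str.contains(r"^[0-9]+", na=False)

# One id per race block: the counter increments on every race header row
race_id = is_race.cumsum()

out = pd.DataFrame({
    # Forward fill the header onto the rows of its own block
    "Race Name": col.where(is_race).groupby(race_id).ffill(),
    "Selection": col.where(is_selection),
    # Rows that are neither header nor selection are assumed to be the overview;
    # back fill it only within the same block
    "Race Overview": col.where(~is_race & ~is_selection).groupby(race_id).bfill(),
})

# Keep only selection rows that belong to a race, then number them per race
out = out.dropna(subset=["Race Name", "Selection"])
out.insert(1, "Selection No", out.groupby("Race Name").cumcount().add(1))
print(out)
```

Because both fills are restricted to `race_id`, Race 1 keeps a missing overview while Race 2 gets its own. Like the original, the counter assumes race names are unique within the frame.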