I am looking to compare two dataframes (df-a and df-b): for a given ID and date in df-b, find the row in df-a with the same ID whose date range contains that date. I then want to take all the columns from that df-a row and join them onto the matching df-b row. For example,
if I have a dataframe df-a in the format:
ID Start_Date End_Date A B C D E
0 cd2 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
1 cd2 2020-06-24 2020-07-21
2 cd56 2020-06-10 2020-07-03
3 cd915 2020-04-28 2020-07-21
4 cd103 2020-04-13 2020-04-24
and df-b as:
ID Date
0 cd2 2020-05-12
1 cd2 2020-04-12
2 cd2 2020-06-10
3 cd15 2020-04-28
4 cd193 2020-04-13
I would like an output df-c like this:
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 - - - - - - -
1 cd2 2020-04-12 - - - - - - -
2 cd2 2020-06-10 2020-06-01 2020-06-24 'a' 'b' 'c' 10 20
3 cd15 2020-04-28 - - - - - - -
4 cd193 2020-04-13 - - - - - - -
In a previous post I got a brilliant answer that allowed me to compare the dataframes and drop rows wherever a condition was met, but I have been struggling to work out how to extract the matching information from df-a appropriately. My current attempt is below:
df_c = df_b.copy()
ar = []
for i in range(df_c.shape[0]):
    currentID = df_c.ID[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.ID == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].End_Date
        if startDate <= currentDate <= endDate:
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j, :])
            #df_d = pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j, :]], axis=0)
            #df_fin_2 = df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)
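To make the snippets here runnable end-to-end, the question's example frames can be reconstructed as below. This is a hypothetical reconstruction: the empty cells in df-a are assumed to be missing values (None), and the date columns are parsed with pd.to_datetime up front so that the range comparisons behave correctly.

```python
import pandas as pd

# df-a: one row per (ID, date range), with payload columns A-E
df_a = pd.DataFrame({
    'ID': ['cd2', 'cd2', 'cd56', 'cd915', 'cd103'],
    'Start_Date': pd.to_datetime(['2020-06-01', '2020-06-24', '2020-06-10',
                                  '2020-04-28', '2020-04-13']),
    'End_Date': pd.to_datetime(['2020-06-24', '2020-07-21', '2020-07-03',
                                '2020-07-21', '2020-04-24']),
    'A': ['a', None, None, None, None],
    'B': ['b', None, None, None, None],
    'C': ['c', None, None, None, None],
    'D': [10, None, None, None, None],
    'E': [20, None, None, None, None],
})

# df-b: one row per (ID, date) to be looked up in df-a's ranges
df_b = pd.DataFrame({
    'ID': ['cd2', 'cd2', 'cd2', 'cd15', 'cd193'],
    'Date': pd.to_datetime(['2020-05-12', '2020-04-12', '2020-06-10',
                            '2020-04-28', '2020-04-13']),
})
```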
So you want to do a kind of "soft" match. Here is a solution that tries to vectorise the date-range matching.
import numpy as np
import pandas as pd

# Note: comparing dates as strings with <=/>= only works if they are in
# y-m-d format; otherwise it is safer to parse all date columns first,
# e.g. `df_b.Date = pd.to_datetime(df_b.Date)`.
# Create the groupby object once so we can efficiently filter df_a inside
# the loop -- a good idea if df_a is large and has many different IDs.
gdf_a = df_a.groupby('ID')
a_IDs = gdf_a.indices  # dict mapping each ID to an array of integer row positions
matched = []  # collect the matched rows from df_a here
# iterate over rows with `.itertuples()`, more efficient than range(len(df_b))
for i, ID, date in df_b.itertuples():
    if ID in a_IDs:
        gID = gdf_a.get_group(ID)  # the rows of df_a for this ID
        inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)
        if inrange.any():
            matched.append(
                gID.loc[inrange.idxmax()]  # first row whose range contains date
                   .values[1:]             # as a plain array, with `ID` sliced off
            )
        else:
            matched.append([np.nan] * (df_a.shape[1] - 1))  # no date in range
    else:
        matched.append([np.nan] * (df_a.shape[1] - 1))  # no matching ID
df_c = df_b.join(pd.DataFrame(matched, columns=df_a.columns[1:]))
print(df_c)
Output
ID Date Start_Date End_Date A B C D E
0 cd2 2020-05-12 NaN NaN NaN NaN NaN NaN NaN
1 cd2 2020-04-12 NaN NaN NaN NaN NaN NaN NaN
2 cd2 2020-06-10 2020-06-01 2020-06-24 a b c 10.0 20.0
3 cd15 2020-04-28 NaN NaN NaN NaN NaN NaN NaN
4 cd193 2020-04-13 NaN NaN NaN NaN NaN NaN NaN
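The same "soft" match can also be expressed without an explicit Python loop, as a plain left merge on ID followed by a range filter. This is a sketch, not the answer's method: it assumes the date columns are already parsed with pd.to_datetime, and when a date falls inside several overlapping ranges it keeps the range with the earliest Start_Date. It uses small hypothetical versions of the question's frames (columns trimmed to A and D for brevity).

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frames, dates parsed up front
df_a = pd.DataFrame({
    'ID': ['cd2', 'cd2', 'cd56', 'cd915', 'cd103'],
    'Start_Date': pd.to_datetime(['2020-06-01', '2020-06-24', '2020-06-10',
                                  '2020-04-28', '2020-04-13']),
    'End_Date': pd.to_datetime(['2020-06-24', '2020-07-21', '2020-07-03',
                                '2020-07-21', '2020-04-24']),
    'A': ['a', None, None, None, None],
    'D': [10, None, None, None, None],
})
df_b = pd.DataFrame({
    'ID': ['cd2', 'cd2', 'cd2', 'cd15', 'cd193'],
    'Date': pd.to_datetime(['2020-05-12', '2020-04-12', '2020-06-10',
                            '2020-04-28', '2020-04-13']),
})

# Left-merge on ID, then blank out pairings where Date falls outside the range
df_c = df_b.reset_index().merge(df_a, on='ID', how='left')
outside = df_c.Start_Date.notna() & ~df_c.Date.between(df_c.Start_Date, df_c.End_Date)
df_c.loc[outside, df_a.columns.drop('ID')] = np.nan
# Keep one row per original df_b row, preferring a matched range over the NaN rows
df_c = (df_c.sort_values('Start_Date', na_position='last')
            .drop_duplicates('index')
            .sort_values('index')
            .drop(columns='index')
            .reset_index(drop=True))
print(df_c)
```

The merge temporarily multiplies rows (one per candidate range per ID), so for very large frames with many ranges per ID the grouped loop above may use less memory.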