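I have a large data frame with 2000 rows and two columns, where every cell holds a list of roughly 1000 points. I want to drop the negative values from both columns at the same time and then compute the minimum and maximum. Right now I do this with a for loop, and it takes 30 minutes to finish. Can I do the same thing with vectorized operations?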
Expected result:
df = pd.DataFrame({'x':[[-1,0,1,2,10],[1.5,2,4,5]],'y':[[2.5,2.4,2.3,1.5,0.1],[5,4.5,3,-0.1]]})
df =
                   x                          y
0  [-1, 0, 1, 2, 10]  [2.5, 2.4, 2.3, 1.5, 0.1]
1     [1.5, 2, 4, 5]          [5, 4.5, 3, -0.1]
### x, y are paired data coming from field. Ex, (-1,2.5), (0,2.4)
# First step: drop negative values in both x and y columns.
# Find a negative x or y and drop the pair.
# Ex, in first row, drop (-1,2.5) pair. That is, -1 in x and 2.5 in y.
# After dropping negative values
df =
               x                     y
0  [0, 1, 2, 10]  [2.4, 2.3, 1.5, 0.1]
1    [1.5, 2, 4]           [5, 4.5, 3]
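The pair-dropping rule of the first step can be sketched for a single row with a boolean mask. This is only an illustration: `drop_negative_pairs` is a hypothetical helper name, and it assumes the x and y lists of a row always have the same length.

```python
import numpy as np

def drop_negative_pairs(x, y):
    """Hypothetical helper: drop every (x, y) pair in which either value is negative."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # True where both members of the pair are non-negative (zeros are kept)
    keep = (x >= 0) & (y >= 0)
    return x[keep], y[keep]

# The two rows of the example data frame
x0, y0 = drop_negative_pairs([-1, 0, 1, 2, 10], [2.5, 2.4, 2.3, 1.5, 0.1])
x1, y1 = drop_negative_pairs([1.5, 2, 4, 5], [5, 4.5, 3, -0.1])
print(x0, y0)
print(x1, y1)
```

Note that the mask uses `>= 0`, because the expected result keeps the pair (0, 2.4).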
### Step2: Find Max in each column
df =
               x                     y  xmax  ymax
0  [0, 1, 2, 10]  [2.4, 2.3, 1.5, 0.1]    10   2.4
1    [1.5, 2, 4]           [5, 4.5, 3]     4     5
### Step3: Find y@xmax, x@ymax in each column
df =
               x                     y  xmax  ymax  y@xmax  x@ymax
0  [0, 1, 2, 10]  [2.4, 2.3, 1.5, 0.1]    10   2.4     0.1       0
1    [1.5, 2, 4]           [5, 4.5, 3]     4     5       3     1.5
Current solution: it works, but it takes a long time.
results = []
for i in range(len(df)):
    ### create an auxiliary dataframe
    auxdf = pd.DataFrame({'x': df['x'].loc[i], 'y': df['y'].loc[i]})
    ## Step1: drop negative values (>= 0 keeps zeros, as in the expected result)
    auxdf = auxdf[(auxdf['x'] >= 0) & (auxdf['y'] >= 0)]
    ### Step2: Max in x and y
    xmax = auxdf['x'].max()
    ymax = auxdf['y'].max()
    ### Step3: x@ymax, y@xmax
    xatymax = auxdf['x'].loc[auxdf['y'].idxmax()]
    yatxmax = auxdf['y'].loc[auxdf['x'].idxmax()]
    ### finally I append xmax,ymax,xatymax,yatxmax to the df
    results.append((xmax, ymax, yatxmax, xatymax))
df[['xmax', 'ymax', 'y@xmax', 'x@ymax']] = results
Would vectorizing this operation cut the run time?
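With numpy (this still loops over the rows, but each row is handled with array operations instead of an auxiliary DataFrame):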
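import numpy as np
import pandas as pd
def fast():
    for v in df[['x', 'y']].to_numpy():
        # stack this row's x and y lists into a 2 x n array
        a = np.array([*v])
        # keep only the pairs (columns) where both x and y are non-negative
        a = a[:, (a >= 0).all(axis=0)]
        # position of the maximum in each row: i[0] for x, i[1] for y
        i = a.argmax(1)
        # yields (xmax, ymax, y@xmax, x@ymax)
        yield (*a[[0, 1], i], *a[[1, 0], i])
df[['xmax', 'ymax', 'y@xmax', 'x@ymax']] = list(fast())
print(df)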
                   x                          y  xmax  ymax  y@xmax  x@ymax
0  [-1, 0, 1, 2, 10]  [2.5, 2.4, 2.3, 1.5, 0.1]  10.0   2.4     0.1     0.0
1     [1.5, 2, 4, 5]          [5, 4.5, 3, -0.1]   4.0   5.0     3.0     1.5
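The indexing in the `yield` line is compact, so here is a small worked example of what it picks out. It uses row 0 of the example after the negative pair has been dropped; the variable names `maxima` and `partners` are only illustrative.

```python
import numpy as np

# Row 0 after dropping the (-1, 2.5) pair: first row is x, second row is y
a = np.array([[0, 1, 2, 10],
              [2.4, 2.3, 1.5, 0.1]])

# Position of the maximum within each row: i[0] for x, i[1] for y
i = a.argmax(1)

# Paired integer indexing: selects a[0, i[0]] and a[1, i[1]], i.e. xmax and ymax
maxima = a[[0, 1], i]

# Swapping the row indices selects a[1, i[0]] and a[0, i[1]], i.e. y@xmax and x@ymax
partners = a[[1, 0], i]

print(i, maxima, partners)
```

If several pairs share the same maximum, `argmax` returns the first one.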
On a sample data frame built by repeating the two-row example 20000 times (40000 rows):
df = pd.concat([df] * 20000, ignore_index=True)
%%timeit
_ = list(fast())
# 1.10 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
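For comparison, a variant without a Python-level loop over the rows could flatten the lists first and then group by the original row. This is a sketch, not part of the answer above, and it has not been timed here. It assumes pandas 1.3 or later (for multi-column `explode`) and that the x and y lists in each row have equal length; the names `pairs`, `ix`, `iy` and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [[-1, 0, 1, 2, 10], [1.5, 2, 4, 5]],
                   'y': [[2.5, 2.4, 2.3, 1.5, 0.1], [5, 4.5, 3, -0.1]]})

# One line per (x, y) pair; the index still refers to the original row
pairs = df[['x', 'y']].explode(['x', 'y']).astype(float)

# Drop every pair that contains a negative value
pairs = pairs[(pairs['x'] >= 0) & (pairs['y'] >= 0)]

# Move the original row label into a column so the new index is unique
pairs = pairs.rename_axis('row').reset_index()

# Label of the pair holding the maximum x (and y) within each original row
ix = pairs.groupby('row')['x'].idxmax()
iy = pairs.groupby('row')['y'].idxmax()

out = pd.DataFrame({
    'xmax': pairs.loc[ix, 'x'].to_numpy(),
    'ymax': pairs.loc[iy, 'y'].to_numpy(),
    'y@xmax': pairs.loc[ix, 'y'].to_numpy(),
    'x@ymax': pairs.loc[iy, 'x'].to_numpy(),
}, index=ix.index)

print(df.join(out))
```

Rows in which every pair is dropped do not appear in `out`, so after the join they would show NaN in the four new columns. Whether this is actually faster than `fast()` on the real data would need to be measured, since `explode` creates one line per point.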