I'm trying to find an efficient split/apply/combine approach for the following situation. Consider the pandas DataFrame demoAll defined below:
import datetime
import pandas as pd
demoA = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['A', 'A', 'A'],
                      'x1': [10, 20, 30],
                      'close': [120, 133, 129]}).set_index('date', drop=True)
demoB = pd.DataFrame({'date': [datetime.date(2010,1,1), datetime.date(2010,1,2), datetime.date(2010,1,3)],
                      'ticker': ['B', 'B', 'B'],
                      'x1': [18, 11, 45],
                      'close': [50, 49, 51]}).set_index('date', drop=True)
demoAll = pd.concat([demoA, demoB])
print(demoAll)
This gives:
           ticker  x1  close
date
2010-01-01      A  10    120
2010-01-02      A  20    133
2010-01-03      A  30    129
2010-01-01      B  18     50
2010-01-02      B  11     49
2010-01-03      B  45     51
I also have a dictionary mapping ticker symbols to model objects:
ticker2model = {'A':model_A, 'B':model_B,...}
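where each model has a self.predict(df) method that takes a whole DataFrame and returns a Series of the same length.
I now want to create a new column, demoAll['predictions'], holding these predictions. What is the cleanest/most efficient way to do this? A couple of notes:
demoAll is a concatenation of per-ticker DataFrames, each indexed by date only, so the index of demoAll is not unique. (The date/ticker combination is unique, however.)
My idea has been to do something like the example below, but I run into problems with indexing, dtype coercion and slow run time. The real dataset is very large (in both rows and columns).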
demoAll['predictions'] = demoAll.groupby('ticker').apply(
    lambda x: ticker2model[x.name].predict(x)
)
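The index caveat in the notes above can be checked directly. The following is a minimal sketch that rebuilds a frame shaped like demoAll (the names demo, date_only_unique and pair_unique are illustrative, not from the original post) and confirms that dates alone repeat while the date/ticker pair does not:

```python
import datetime
import pandas as pd

# Rebuild a small frame with the same layout as demoAll in the question.
dates = [datetime.date(2010, 1, d) for d in (1, 2, 3)]
demo = pd.concat([
    pd.DataFrame({'date': dates, 'ticker': 'A',
                  'x1': [10, 20, 30], 'close': [120, 133, 129]}),
    pd.DataFrame({'date': dates, 'ticker': 'B',
                  'x1': [18, 11, 45], 'close': [50, 49, 51]}),
]).set_index('date')

# Each date appears once per ticker, so the date-only index has duplicates.
date_only_unique = demo.index.is_unique

# Adding ticker as a second index level makes every row label distinct.
pair_unique = demo.set_index('ticker', append=True).index.is_unique

print(date_only_unique, pair_unique)
```

Because the date index repeats, any approach that relies on index alignment to write results back needs care; selecting rows by a boolean mask or by position sidesteps that.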
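I may have misunderstood what you pass to the models for prediction, but if I have understood correctly, I would do the following: create and pre-populate a predictions column in demoAll, loop over the unique tickers and filter demoAll down to each ticker's rows, then predict on each filtered subset and assign the result to the matching rows of demoAll['predictions'].
An example using your code: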
# get non 'ticker' columns ('!=' compares values; 'is not' compares identity)
non_ticker_cols = [col for col in demoAll.columns if col != 'ticker']
# get unique set of tickers
tickers = demoAll.ticker.unique()
# create and prepopulate the predictions column (float, to match the predictions)
demoAll['predictions'] = 0.0
for ticker in tickers:
    # get boolean Series to filter the DataFrame by
    filter_by_ticker = demoAll.ticker == ticker
    # filter, predict and allocate
    demoAll.loc[filter_by_ticker, 'predictions'] = ticker2model[ticker].predict(
        demoAll.loc[filter_by_ticker, non_ticker_cols]
    )
The output looks like this:
           ticker  x1  close  predictions
date
2010-01-01      A  10    120         10.0
2010-01-02      A  20    133         10.0
2010-01-03      A  30    129         10.0
2010-01-01      B  18     50        100.0
2010-01-02      B  11     49        100.0
2010-01-03      B  45     51        100.0
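As a variation on the same per-ticker idea, the rows of each ticker can be located through the groupby object's indices attribute, which maps each group key to the integer positions of its rows. The sketch below is only illustrative: ConstantModel is a hypothetical stand-in for whatever objects live in ticker2model, and predict_by_group_positions is a made-up helper name. Whether this beats the boolean-mask loop on real data would need to be measured.

```python
import numpy as np
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for a fitted model: one constant per input row."""

    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return np.full(len(features), self.value, dtype=float)


def predict_by_group_positions(df, model_dict, feature_cols):
    """Return an array of predictions in the same row order as df."""
    out = np.empty(len(df), dtype=float)
    # .indices maps each ticker to the integer positions of its rows, so the
    # non-unique date index never takes part in the assignment.
    for ticker, positions in df.groupby('ticker').indices.items():
        subset = df.iloc[positions][feature_cols]
        out[positions] = model_dict[ticker].predict(subset)
    return out


frame = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x1': [10, 20, 30, 18, 11, 45],
    'close': [120, 133, 129, 50, 49, 51],
})
models = {'A': ConstantModel(10.0), 'B': ConstantModel(100.0)}
frame['predictions'] = predict_by_group_positions(frame, models, ['x1', 'close'])
print(frame)
```

Writing into a preallocated NumPy array by position means the result can be attached as a column in one step, regardless of how the frame is indexed.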
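Comparison with apply
We could use apply on every row instead, but as you mentioned, it will be slow. I'll compare the two approaches to give an idea of the speed-up.
Setup
I'll use DummyRegressor from sklearn so that I have something to call predict on, and build the dictionary you mention in your question.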
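import numpy as np
from sklearn.dummy import DummyRegressor

# DummyRegressor ignores the input features and predicts a constant derived from y
model_a = DummyRegressor(strategy='mean')
model_b = DummyRegressor(strategy='median')
model_a.fit([[10, 14]], y=np.array([10]))
model_b.fit([[200, 200]], [100])
ticker2model = {'A': model_a, 'B': model_b}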
Define both approaches as functions
def predict_by_ticker_filter(df, model_dict):
    # get non 'ticker' columns ('!=' compares values; 'is not' compares identity)
    non_ticker_cols = [col for col in df.columns if col != 'ticker']
    # get unique set of tickers
    tickers = df.ticker.unique()
    # create and prepopulate the predictions column (float, to match the predictions)
    df['predictions'] = 0.0
    for ticker in tickers:
        # get boolean Series to filter the DataFrame by
        filter_by_ticker = df.ticker == ticker
        # filter, predict and allocate
        df.loc[filter_by_ticker, 'predictions'] = model_dict[ticker].predict(
            df.loc[filter_by_ticker, non_ticker_cols]
        )
    return df

def model_apply_by_row(df_row, model_dict):
    # includes some conversions to list to allow the predict method to run
    return model_dict[df_row['ticker']].predict([df_row[['x1', 'close']].tolist()])[0]
Timing the function calls with timeit gives the following results.
On your example demoAll:
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.78 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.24 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (606, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
320 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
6.1 ms ± 512 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the size of demoAll to (6006, 3):
model_apply_by_row
%timeit demoAll.apply(model_apply_by_row,model_dict=ticker2model, axis=1)
3.15 s ± 371 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
predict_by_ticker_filter
%timeit predict_by_ticker_filter(demoAll, ticker2model)
9.1 ms ± 767 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
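The post does not say how the larger frames were built. One way to get frames of shape (606, 3) and (6006, 3), assuming they are simply the six-row example stacked repeatedly, is sketched below together with a small timing helper based on the standard timeit module for use outside IPython. The names repeat_rows and best_time are illustrative, and the timed function here is only a placeholder filter, not one of the two prediction functions.

```python
import timeit

import pandas as pd


def repeat_rows(df, copies):
    """Stack `copies` copies of df on top of each other (the index repeats too)."""
    return pd.concat([df] * copies)


def best_time(func, *args, number=5, repeat=3):
    """Best average seconds per call of func(*args) across several repeats."""
    runs = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(runs) / number


small = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B', 'B', 'B'],
    'x1': [10, 20, 30, 18, 11, 45],
    'close': [120, 133, 129, 50, 49, 51],
})

# Assumption: 6 rows * 101 copies = 606 rows, 6 rows * 1001 copies = 6006 rows.
medium = repeat_rows(small, 101)
large = repeat_rows(small, 1001)

# Placeholder workload; swap in the function you actually want to time.
seconds = best_time(lambda df: df[df.ticker == 'A'], large)
print(medium.shape, large.shape, seconds)
```

Note that a function which adds a column to the frame it is given changes that frame for subsequent calls, so passing a fresh copy to each timed call keeps the runs comparable.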