Converting a pandas DataFrame to a frequency matrix

Josh Kidd

I am trying to convert a pandas DataFrame with three columns (Date, Start, End) into a frequency matrix. My input DataFrame looks like this:

Date,                Start, End
2016-09-02 09:16:00  18     16
2016-09-02 16:14:10  16      1
2016-09-02 06:17:21  18     17
2016-09-02 05:51:07  23     17
2016-09-02 18:34:44  18     17
2016-09-02 05:44:44  20      4
2016-09-02 09:25:22  18     17
2016-09-02 22:27:44  18     17
2016-09-02 16:02:46   0     18
2016-09-02 15:35:07  17     17
2016-09-02 16:06:42   8     17
2016-09-02 14:47:04  16     23
2016-09-02 07:47:24  20      1
...

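The Start and End values are integers between 0 and 23, and Date is a datetime. The frequency matrix I am trying to create is a 24 x 24 CSV in which the entry at row i, column j is the number of times End = i and Start = j occur together in the input. For example, the data above would produce: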

    0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0
 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0
 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 0, 0, 0, 0, 1
18, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0

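As a bonus question, could this also be done so that a separate matrix is created for every 15-minute interval? That would be 672 matrices, since the date range covers one week. I am a self-taught beginner and cannot work out a Pythonic way to solve this, so any solutions or suggestions would be greatly appreciated.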

Hips

Build the matrix with a simple count, then unstack one of the index levels into the columns:

mat = df.groupby(['Start', 'End']).count().unstack(level=0)

Drop the leftover Date level from the columns:

mat.columns = mat.columns.droplevel(0)

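Now reindex the rows and columns to cover all 24 hours, fill the gaps with 0 and convert to integers:

mat = mat.reindex(index=range(24), columns=range(24)).fillna(0).astype(int)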
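
Putting the three steps together, here is a minimal end-to-end sketch. It assumes the sample rows from the question are loaded from an inline CSV string; the variable names `raw`, `hours` and `csv_text` are only illustrative.

```python
import io
import pandas as pd

# Sample rows copied from the question; the real data covers a full week.
raw = """Date,Start,End
2016-09-02 09:16:00,18,16
2016-09-02 16:14:10,16,1
2016-09-02 06:17:21,18,17
2016-09-02 05:51:07,23,17
2016-09-02 18:34:44,18,17
2016-09-02 05:44:44,20,4
2016-09-02 09:25:22,18,17
2016-09-02 22:27:44,18,17
2016-09-02 16:02:46,0,18
2016-09-02 15:35:07,17,17
2016-09-02 16:06:42,8,17
2016-09-02 14:47:04,16,23
2016-09-02 07:47:24,20,1
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["Date"])

# Count each (Start, End) pair, then move Start into the columns.
mat = df.groupby(["Start", "End"]).count().unstack(level=0)

# Remove the leftover 'Date' level so the columns are plain Start values.
mat.columns = mat.columns.droplevel(0)

# Cover every hour 0..23 on both axes; missing pairs become 0.
hours = range(24)
mat = mat.reindex(index=hours, columns=hours).fillna(0).astype(int)

# Render as CSV text; pass a file path to to_csv() to write a file instead.
csv_text = mat.to_csv()
```

Rows are End and columns are Start, matching the layout requested in the question.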

Detailed explanation

First, count how many times each (Start, End) pair occurs. Grouping by these two columns returns a result indexed by a MultiIndex.

df.groupby(['Start', 'End']).count()
Out[134]: 
           Date
Start End      
0     18      1
8     17      1
16    1       1
      23      1
17    17      1
18    16      1
      17      4
20    1       1
      4       1
23    17      1

What we want from this result is to have the Start index in the columns. Unstacking does exactly that:

df.groupby(['Start', 'End']).count().unstack(level=0)
Out[135]: 
      Date                              
Start   0    8    16   17   18   20   23
End                                     
1      NaN  NaN  1.0  NaN  NaN  1.0  NaN
4      NaN  NaN  NaN  NaN  NaN  1.0  NaN
16     NaN  NaN  NaN  NaN  1.0  NaN  NaN
17     NaN  1.0  NaN  1.0  4.0  NaN  1.0
18     1.0  NaN  NaN  NaN  NaN  NaN  NaN
23     NaN  NaN  1.0  NaN  NaN  NaN  NaN

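Unstacking moves Start out of the row index and adds it as an extra column index level beneath the existing Date level (see the `_.columns` output below). That is why we drop level 0 afterwards. Another approach, depending on your actual code, would be to select the Date column beforehand, so that unstacking produces only a single column level.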
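
The alternative mentioned above can be sketched as follows. This is only an illustration on a tiny made-up frame with the same column layout as the question; selecting `Date` before counting yields a Series, so `unstack` gives single-level columns and no `droplevel` is needed.

```python
import pandas as pd

# Tiny hypothetical frame with the same columns as in the question.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00",
        "2016-09-02 06:17:21",
        "2016-09-02 16:14:10",
    ]),
    "Start": [18, 18, 16],
    "End": [16, 16, 1],
})

# Selecting one column first makes count() return a Series, so the
# unstacked result has a single column level (the Start values).
counts = df.groupby(["Start", "End"])["Date"].count().unstack(level=0, fill_value=0)

# Expand to the full 24 x 24 grid; fill_value keeps the integer dtype.
hours = range(24)
mat = counts.reindex(index=hours, columns=hours, fill_value=0)
```

Using `.size()` in place of `["Date"].count()` should give the same counts as long as Date has no missing values, since `count()` skips missing entries and `size()` does not.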

_.columns
Out[136]: 
MultiIndex(levels=[['Date'], [0, 8, 16, 17, 18, 20, 23]],
           labels=[[0, 0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5, 6]],
           names=[None, 'Start'])
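
For the bonus part of the question (one matrix per 15-minute interval), one possible sketch is to floor each timestamp to its 15-minute window and build a matrix per group. The helper `frequency_matrix`, the dictionary `matrices` and the sample rows are hypothetical, and `pd.crosstab` is swapped in here for the groupby/unstack steps above as a shorter way to get the same counts.

```python
import pandas as pd

def frequency_matrix(frame):
    """Return a 24 x 24 count matrix: rows are End, columns are Start."""
    hours = range(24)
    counts = pd.crosstab(frame["End"], frame["Start"])
    return counts.reindex(index=hours, columns=hours, fill_value=0)

# Hypothetical sample spanning two 15-minute windows.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-09-02 09:16:00",
        "2016-09-02 09:25:22",
        "2016-09-02 09:31:00",
    ]),
    "Start": [18, 18, 20],
    "End": [16, 17, 4],
})

# Floor each timestamp to the start of its 15-minute window and
# build one matrix per window that actually contains rows.
window_start = df["Date"].dt.floor("15min")
matrices = {
    window: frequency_matrix(group)
    for window, group in df.groupby(window_start)
}
```

Windows with no rows do not appear in `matrices`; to get all 672 windows for a week, the missing ones would have to be filled in with all-zero matrices, for example by looping over a `pd.date_range` with a 15-minute frequency.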
