Using TimeGrouper '1M' to group and sum by column is messing up my date index (pandas / Python)

迈克尔·珀杜(Michael Perdue)

The bug has been found: the snippet posted below is the solution. The problem with my results was rooted in the data source (FEC.GOV). I tracked it down and am now moving on. Thanks to the community for all the patience and help with this question!

Since a solution has been proposed and it was suggested that it be tested against the code snippet on my GitHub site, here is a link to the original files (http://fec.gov/finance/disclosure/ftpdet.shtml#a2011_2012). I used the data files from 2008 to 2014: pas212.zip, data name: "Contributions to Candidates (and other expenditures) from Committees". The code below can also be found at [https://github.com/Michae108/python-coding.git]. Thanks in advance for any help in solving this problem. I have been working on this for three days, and it should be a very simple task. I import and concatenate four "|"-separated value files, read them into a pandas DataFrame, and set the date column as a datetime index. That gives me the following output:

              cmte_id trans_typ entity_typ state  amount     fec_id    cand_id
date                                                                          
2007-08-15  C00112250       24K        ORG    DC    2000  C00431569  P00003392
2007-09-26  C00119040       24K        CCM    FL    1000  C00367680  H2FL05127
2007-09-26  C00119040       24K        CCM    MD    1000  C00140715  H2MD05155
2007-07-20  C00346296       24K        CCM    CA    1000  C00434571  H8CA37137

Next, I want to group the index by one-month frequency, and then sum [amount] by [trans_typ] and [cand_id].

Here is my code for doing that:

import numpy as np
import pandas as pd
import glob

df = pd.concat((pd.read_csv(f, sep='|', header=None, low_memory=False, \
    names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', \
    '12', '13', 'date', '15', '16', '17', '18', '19', '20', \
    '21', '22'], index_col=None, dtype={'date':str}) for f in \
    glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))

df.dropna(subset=['17'], inplace=True)  
df.dropna(subset=['date'], inplace=True)  
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df1 = df.set_index('date')
df2 = df1[['1', '6', '7', '10', '15', '16', '17']].copy() 
df2.columns = ['cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount',\
               'fec_id','cand_id']

df2['amount'] = df2['amount'].astype(float)

grouper = df2.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])

df3 = grouper['amount'].sum().unstack().fillna(0)
print(df3.head())

Here is the output from running the code:

trans_typ             24A  24C  24E  24F   24K  24N  24R  24Z
date       cand_id
1954-07-31 S8AK00090    0    0    0    0  1000    0    0    0
1985-09-30 H8OH18088    0    0   36    0     0    0    0    0
1997-04-30 S6ND00058    0    0    0    0  1000    0    0    0

As you can see, the date column gets messed up after running the groupby. I am certain my dates do not go back before 2007. I have been trying to accomplish this simple task: group by one-month periods, then sum [amount] by [trans_typ] and [cand_id]. It seems like it should be simple, but I have not found a solution. I have read many questions on Stack Overflow and tried different techniques to fix it. Does anyone have an idea about this?
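One way to confirm that the stray dates come from the source data rather than from the groupby itself is to screen the parsed dates against the expected filing window before grouping. A minimal sketch with toy rows (the 1954 date stands in for a bad source record, and the window cutoffs are assumptions for the 2007-2014 files):

```python
import pandas as pd

# Toy rows mimicking the parsed data; the 1954 date mimics a bad record.
df = pd.DataFrame({
    'date': pd.to_datetime(['08152007', '07311954', '09262007'],
                           format='%m%d%Y'),
    'amount': [2000.0, 1000.0, 1000.0],
})

# Assumed valid window for the 2007-2014 filing files
bad = df[(df['date'] < '2007-01-01') | (df['date'] > '2014-12-31')]
print(len(bad))  # one suspicious row (1954) to inspect or drop
```

Any rows flagged this way can then be traced back to the raw "|"-separated lines they came from.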

If it helps, here is a sample of my raw data:

C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||4072320091116608455
C00409409|N|Q2|P|29992447807|24K|CCM|JOHN BOCCIERI FOR CONGRESS|ALLIANCE|OH|44601|||06262009|500|C00435065|H8OH16058|D310244|424490|||4072320091116608452
C00409409|N|Q2|P|29992447807|24K|CCM|MIKE MCMAHON FOR CONGRESS|STATEN ISLAND|NY|10301|||06262009|500|C00451138|H8NY13077|D310245|424490|||4072320091116608453
C00409409|N|Q2|P|29992447808|24K|CCM|MINNICK FOR CONGRESS|BOISE|ID|83701|||06262009|500|C00441105|H8ID01090|D310243|424490|||4072320091116608454
C00409409|N|Q2|P|29992447807|24K|CCM|ADLER FOR CONGRESS|MARLTON|NJ|08053|||06262009|500|C00439067|H8NJ03156|D310247|424490|||4072320091116608451
C00435164|N|Q2|P|29992448007|24K|CCM|ALEXI FOR ILLINOIS EXPLORATORY COMMITTEE||||||06292009|1500|C00459586|S0IL00204|SB21.4124|424495|||4071620091116385529
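For reference, in these raw lines the date is the 14th pipe-separated field (index 13, in MMDDYYYY form) and the amount is the 15th (index 14). A quick sketch against the first sample line above:

```python
import pandas as pd

line = ("C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA"
        "|22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||"
        "4072320091116608455")
fields = line.split('|')

date = pd.to_datetime(fields[13], format='%m%d%Y')  # field 14: MMDDYYYY date
amount = float(fields[14])                          # field 15: amount
print(date.date(), amount)  # 2009-06-26 500.0
```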
jezrael

This is quite complicated. date_parser returned an error, so the date column was first read as string in read_csv; the date column was then converted with to_datetime and all NaN values were dropped. Finally, you can use groupby with unstack:
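As an aside, pd.to_datetime can also absorb malformed date strings directly with errors='coerce', which turns them into NaT so the bad rows can be dropped afterwards (a sketch with made-up values):

```python
import pandas as pd

s = pd.Series(['06262009', 'not-a-date', '06292009'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(s, format='%m%d%Y', errors='coerce')
clean = parsed.dropna()
print(parsed.isna().sum(), len(clean))  # 1 2
```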

import pandas as pd
import glob



# change the path to your data
df = pd.concat((pd.read_csv(f, 
                            sep='|', 
                            header=None, 
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'], 
                            usecols= ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id'],
                            dtype={'date': str}
                           ) for f in glob.glob('test/itpas2_data/itpas2**.txt')), ignore_index=True)


#parse column date to datetime
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')

#remove rows, where date is NaN
df = df[df['date'].notnull()]

#set column date to index
df = df.set_index('date')

g = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])['amount'].sum()
print(g.unstack().fillna(0))

trans_typ                24A  24C   24E  24F     24K  24N  24R  24Z
date       cand_id                                                 
2001-09-30 H2HI02110       0    0     0    0    2500    0    0    0
2007-03-31 S6TN00216       0    0     0    0    2000    0    0    0
2007-10-31 H8IL21021       0    0     0    0   -1000    0    0    0
2008-03-31 S6TN00216       0    0     0    0    1000    0    0    0
2008-07-31 H2PA11098       0    0     0    0    1000    0    0    0
           H4KS03105       0    0     0    0   49664    0    0    0
           H6KS03183       0    0     0    0    1000    0    0    0
2008-10-31 H8KS02090       0    0     0    0    1000    0    0    0
           S6TN00216       0    0     0    0    1500    0    0    0
2008-12-31 H6KS01146       0    0     0    0    2000    0    0    0
2009-02-28 S6OH00163       0    0     0    0   -1000    0    0    0
2009-03-31 S2KY00012       0    0     0    0    2000    0    0    0
           S6WY00068       0    0     0    0   -2500    0    0    0
2009-06-30 S6TN00216       0    0     0    0   -1000    0    0    0
2009-08-31 S0MO00183       0    0     0    0    1000    0    0    0
2009-09-30 S0NY00410       0    0     0    0    1000    0    0    0
2009-10-31 S6OH00163       0    0     0    0   -2500    0    0    0
           S6WY00068       0    0     0    0   -1000    0    0    0
2009-11-30 H8MO09153       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -1000    0    0    0
           S6OH00163       0    0     0    0    -500    0    0    0
2009-12-31 H0MO00019       0    0     0    0     500    0    0    0
           S6TN00216       0    0     0    0   -1000    0    0    0
2010-01-31 H0CT03072       0    0     0    0     250    0    0    0
           S0MA00109       0    0     0    0    5000    0    0    0
2010-02-28 S6TN00216       0    0     0    0   -1500    0    0    0
2010-03-31 H0MO00019       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -2500    0    0    0
2010-05-31 H0MO06149       0    0     0    0     530    0    0    0
           S6OH00163       0    0     0    0   -1000    0    0    0
...                      ...  ...   ...  ...     ...  ...  ...  ...
2012-12-31 S6UT00063       0    0     0    0    5000    0    0    0
           S6VA00093       0    0     0    0   97250    0    0    0
           S6WY00068       0    0     0    0    1500    0    0    0
           S6WY00126       0    0     0    0   11000    0    0    0
           S8AK00090       0    0     0    0  132350    0    0    0
           S8CO00172       0    0     0    0   88500    0    0    0
           S8DE00079       0    0     0    0    6000    0    0    0
           S8FL00166       0    0     0    0    -932    0    0  651
           S8ID00027       0    0     0    0   13000    0    0  326
           S8ID00092       0    0     0    0    2500    0    0    0
           S8MI00158       0    0     0    0    7500    0    0    0
           S8MI00281     110    0     0    0    3000    0    0    0
           S8MN00438       0    0     0    0   65500    0    0    0
           S8MS00055       0    0     0    0   21500    0    0    0
           S8MS00196       0    0     0    0     500    0    0  650
           S8MT00010       0    0     0    0  185350    0    0    0
           S8NC00239       0    0     0    0   67000    0    0    0
           S8NE00067       0   40     0    0       0    0    0    0
           S8NE00117       0    0     0    0   13000    0    0    0
           S8NJ00392       0    0     0    0   -5000    0    0    0
           S8NM00168       0    0     0    0   -2000    0    0    0
           S8NM00184       0    0     0    0   51000    0    0    0
           S8NY00082       0    0     0    0    1000    0    0    0
           S8OR00207       0    0     0    0   23500    0    0    0
           S8VA00214       0    0   120    0   -2000    0    0    0
           S8WA00194       0    0     0    0   -4500    0    0    0
2013-10-31 P80003338  314379    0     0    0       0    0    0    0
           S8VA00214   14063    0     0    0       0    0    0    0
2013-11-30 H2NJ03183       0    0  2333    0       0    0    0    0
2014-10-31 S6PA00217       0    0     0    0    1500    0    0    0
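Note that pd.TimeGrouper was later deprecated and removed from pandas; in recent versions the same month-end grouping is written with pd.Grouper. A sketch with toy data (the rows are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2009-06-26', '2009-06-29', '2009-07-01']),
    'cand_id': ['H8VA05106', 'S0IL00204', 'H8VA05106'],
    'trans_typ': ['24K', '24K', '24K'],
    'amount': [500.0, 1500.0, 250.0],
}).set_index('date')

# pd.Grouper(freq='M') replaces pd.TimeGrouper('1M'): month-end bins
g = df.groupby([pd.Grouper(freq='M'), 'cand_id', 'trans_typ'])['amount'].sum()
print(g.unstack().fillna(0))
```

The two June rows land in the 2009-06-30 bin and the July row in 2009-07-31, exactly as TimeGrouper('1M') binned them above.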
