我探索了一种基于XML的API,用于处理与工作相关的事情,它来自仓库数据。理想情况下,我想使用python在python中进行一些分析。
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) , None, StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE') ], quantity=127) ,
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC )) , StringAggregateDimensionValue(value=u'PPTransMergeNonCon') , StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=15)
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= UTC )) , StringAggregateDimensionValue(value=u'PPTransFRA1') , StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=8) ,
在我在VIM中做了一些查找和替换后,数据看起来像上面的流(我知道我可以在python中编写脚本)。如何最好地将这种奇怪的格式导入Pandas?理想情况下,我希望有日期时间,String聚合维度值和数量。但是,在此解析所需的数据中有很多“无”。在一个数据帧中,进行一些分析将很容易,但是我在这里有些困惑(感觉很像n00b)。
编辑:这是我得到并想要解析的未过正则和未替换的数据。它不是真正的XML,所以XML不起作用。
[<DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 20, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransCGN1')>, <
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=992)>, <
StringAggregateDimensionValue(value=u'PPTransLEJ1')>, <StringAggregateDimensionValue(
value=u'PRIME_BIN_RANDOM_STOW')>], quantity=945)>, <Aggregate(
aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.datetime(2013
, 8, 23, 19, 30, tzinfo=<UTC>))>, None, <StringAggregateDimensionValue(value=u'TOTE')>],
quantity=87)>, <Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(
value=datetime.datetime(2013, 8, 27, 17, 30, tzinfo=<UTC>))>, <
StringAggregateDimensionValue(value=u'PPTransMUC3')>, <StringAggregateDimensionValue(
value=u'TOTE')>], quantity=14)>, <Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 20, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransEUK5')>, <
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=339)>, <
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.
datetime(2013, 8, 26, 20, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u
'PPTransCGN1')>, <StringAggregateDimensionValue(value=u'TOTE')>], quantity=1731)>, <
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.
datetime(2013, 8, 26, 19, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u
'PPTransEUK5')>, quantity=444)>, <Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransEUK5')>, <
StringAggregateDimensionValue(value=u'TOTE')>], quantity=28)>, <Aggregate(
aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.datetime(2013
, 8, 28, 19, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u'PPTransORY1')>,
<StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=69)>, <
Aggregate(aggregate_dimension_value_list=<Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<UTC
>))>, <StringAggregateDimensionValue(value=u'PPTransMAD4')>, <
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=47)>, <
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.
datetime(2013, 8, 26, 21, 0, tzinfo=<UTC>))>, None, None], quantity=78)>
如果您更喜欢解析器的功能,这里有一个针对您的问题的pyparsing刺:
from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional
import datetime
# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
UTC = None
_ = Suppress
evaltokens = lambda s,l,t: eval(''.join(t))
timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False)
timevalue.setParseAction(evaltokens)
strvalue = 'u' + QuotedString("'", unquoteResults=False)
strvalue.setParseAction(evaltokens)
nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None])
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0]))
COMMA = Optional(_(","))
valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4])
lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") +
valuedexpr(timevalue)("timestamp") + COMMA +
(nonevalue | valuedexpr(strvalue))("s1") + COMMA +
(nonevalue | valuedexpr(strvalue))("s2") + COMMA +
"]" + COMMA +
"quantity=" + intvalue("qty"))
用于lineexpr.searchString
从每个聚合中提取数据:
for data in lineexpr.searchString(sample):
print data.dump()
print data.qty
print
给予:
[datetime.datetime(2013, 8, 28, 19, 30), None, u'VIRTUALLY_LABELED_CASE', ']', 'quantity=', 127]
- qty: 127
- s1: None
- s2: VIRTUALLY_LABELED_CASE
- timestamp: 2013-08-28 19:30:00
127
[datetime.datetime(2013, 8, 28, 19, 30), u'PPTransMergeNonCon', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 15]
- qty: 15
- s1: PPTransMergeNonCon
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-28 19:30:00
15
[datetime.datetime(2013, 8, 27, 21, 0), u'PPTransFRA1', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 8]
- qty: 8
- s1: PPTransFRA1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-27 21:00:00
8
dump()
将显示所有可用的命名结果值-请注意如何使用可以直接访问数量属性data.qty
。在中为您设置了结果名称“ qty”的定义"quantity=" + intvalue("qty")
。timestamp
,s1
和s2
可以类似地访问。(其中还有一点eval
不足之处,清理工作留给读者练习。)
编辑:
这是经过修改的pyparsing解析器,用于处理原始的原始XML风格的东西。所做的更改确实很小:
from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional, ungroup
import datetime
# define UTC timezone for sake of eval
if hasattr(datetime,"timezone"):
UTC = datetime.timezone(datetime.timedelta(0),"UTC")
else:
UTC = None
_ = Suppress
evaltokens = lambda s,l,t: eval(''.join(t))
timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False)
replUTC = lambda s,l,t: ''.join(t).replace("< UTC>","UTC").replace("<UTC >","UTC").replace("<UTC>","UTC")
timevalue.setParseAction(replUTC, evaltokens)
strvalue = 'u' + QuotedString("'", unquoteResults=False)
strvalue.setParseAction(evaltokens)
nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None])
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0]))
COMMA = Optional(_(","))
LT,GT,LPAR,RPAR,LBRACK,RBRACK = map(Suppress,"<>()[]")
#~ valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4])
valuedexpr = lambda expr: ungroup(LT + (Word(alphas) + "(" + "value" + "=" + expr("value") + ")" + GT).setParseAction(lambda t: t.value))
#~ lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") +
#~ valuedexpr(timevalue)("timestamp") + COMMA +
#~ (nonevalue | valuedexpr(strvalue))("s1") + COMMA +
#~ (nonevalue | valuedexpr(strvalue))("s2") + COMMA +
#~ "]" + COMMA +
#~ "quantity=" + intvalue("qty"))
lineexpr = (LT + "Aggregate" + LPAR + "aggregate_dimension_value_list" + "=" + LBRACK +
valuedexpr(timevalue)("timestamp") + COMMA +
(nonevalue | valuedexpr(strvalue))("s1") + COMMA +
(nonevalue | valuedexpr(strvalue))("s2") +
RBRACK + COMMA +
"quantity=" + intvalue("qty") + RPAR + GT)
从您粘贴的文本(其中某些文本格式不正确)中给出:
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 992]
- qty: 992
- s1: PPTransCGN1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-26 20:30:00
992
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 23, 19, 30), None, u'TOTE', 'quantity=', 87]
- qty: 87
- s1: None
- s2: TOTE
- timestamp: 2013-08-23 19:30:00
87
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 17, 30), u'PPTransMUC3', u'TOTE', 'quantity=', 14]
- qty: 14
- s1: PPTransMUC3
- s2: TOTE
- timestamp: 2013-08-27 17:30:00
14
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 20, 30), u'PPTransEUK5', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 339]
- qty: 339
- s1: PPTransEUK5
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-27 20:30:00
339
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'TOTE', 'quantity=', 1731]
- qty: 1731
- s1: PPTransCGN1
- s2: TOTE
- timestamp: 2013-08-26 20:30:00
1731
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransEUK5', u'TOTE', 'quantity=', 28]
- qty: 28
- s1: PPTransEUK5
- s2: TOTE
- timestamp: 2013-08-26 19:30:00
28
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 28, 19, 30), u'PPTransORY1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 69]
- qty: 69
- s1: PPTransORY1
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-28 19:30:00
69
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransMAD4', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 47]
- qty: 47
- s1: PPTransMAD4
- s2: PRIME_BIN_RANDOM_STOW
- timestamp: 2013-08-26 19:30:00
47
['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 21, 0), None, None, 'quantity=', 78]
- qty: 78
- s1: None
- s2: None
- timestamp: 2013-08-26 21:00:00
78
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句