我在 Pandas 中使用来自 UCI 机器学习存储库的在线零售数据集,我正在设置一个包含CustomerID
第一级和InvoiceNo
第二级的多索引。这是代码:
import pandas as pd
customers = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
customers.set_index(['CustomerID', 'InvoiceNo'], inplace=True)
StockCode Description Quantity
CustomerID InvoiceNo
17850.0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
536365 71053 WHITE METAL LANTERN 6
536365 84406B CREAM CUPID HEARTS COAT HANGER 8
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
536365 22752 SET 7 BABUSHKA NESTING BOXES 2
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6
536366 22633 HAND WARMER UNION JACK 6
536366 22632 HAND WARMER RED POLKA DOT 6
13047.0 536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32
536367 22745 POPPY'S PLAYHOUSE BEDROOM 6
536367 22748 POPPY'S PLAYHOUSE KITCHEN 6
536367 22749 FELTCRAFT PRINCESS CHARLOTTE DOLL 8
536367 22310 IVORY KNITTED MUG COSY 6
536367 84969 BOX OF 6 ASSORTED COLOUR TEASPOONS 6
536367 22623 BOX OF VINTAGE JIGSAW BLOCKS 3
536367 22622 BOX OF VINTAGE ALPHABET BLOCKS 2
536367 21754 HOME BUILDING BLOCK WORD 3
536367 21755 LOVE BUILDING BLOCK WORD 3
536367 21777 RECIPE BOX WITH METAL HEART 4
536367 48187 DOORMAT NEW ENGLAND 4
536368 22960 JAM MAKING SET WITH JARS 6
536368 22913 RED COAT RACK PARIS FASHION 3
536368 22912 YELLOW COAT RACK PARIS FASHION 3
536368 22914 BLUE COAT RACK PARIS FASHION 3
536369 21756 BATH BUILDING BLOCK WORD 3
12583.0 536370 22728 ALARM CLOCK BAKELIKE PINK 24
536370 22727 ALARM CLOCK BAKELIKE RED 24
536370 22726 ALARM CLOCK BAKELIKE GREEN 12
536370 21724 PANDA AND BUNNIES STICKER SHEET 12
536370 21883 STARS GIFT TAPE 24
536370 10002 INFLATABLE POLITICAL GLOBE 48
536370 21791 VINTAGE HEADS AND TAILS CARD GAME 24
536370 21035 SET/2 RED RETROSPOT TEA TOWELS 18
536370 22326 ROUND SNACK BOXES SET OF4 WOODLAND 24
536370 22629 SPACEBOY LUNCH BOX 24
536370 22659 LUNCH BOX I LOVE LONDON 24
536370 22631 CIRCUS PARADE LUNCH BOX 24
536370 22661 CHARLOTTE BAG DOLLY GIRL DESIGN 20
536370 21731 RED TOADSTOOL LED NIGHT LIGHT 24
536370 22900 SET 2 TEA TOWELS I LOVE LONDON 24
536370 21913 VINTAGE SEASIDE JIGSAW PUZZLES 12
536370 22540 MINI JIGSAW CIRCUS PARADE 24
536370 22544 MINI JIGSAW SPACEBOY 24
536370 22492 MINI PAINT SET VINTAGE 36
536370 POST POSTAGE 3
13748.0 536371 22086 PAPER CHAIN KIT 50'S CHRISTMAS 80
17850.0 536372 22632 HAND WARMER RED POLKA DOT 6
536372 22633 HAND WARMER UNION JACK 6
536373 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
在这种情况下,我如何获得每个客户的第一张发票?即我如何只返回536367
客户的发票号码13047
?
编辑:更清楚地说,我不希望每个客户只第一次出现发票。我想保持的列如StockCode
,Description
和Quantity
。
换句话说,我只想保留每个客户的第一个发票编号中的交易信息。以客户 ID13047
为例,第一张发票是536367
这样我想保留该发票的记录,但我想删除发票536368
,536369
因为它不是它的第一张发票。结果如下所示:
StockCode Description Quantity
CustomerID InvoiceNo
17850.0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
536365 71053 WHITE METAL LANTERN 6
536365 84406B CREAM CUPID HEARTS COAT HANGER 8
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
536365 22752 SET 7 BABUSHKA NESTING BOXES 2
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6
536366 22633 HAND WARMER UNION JACK 6
536366 22632 HAND WARMER RED POLKA DOT 6
13047.0 536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32
536367 22745 POPPY'S PLAYHOUSE BEDROOM 6
536367 22748 POPPY'S PLAYHOUSE KITCHEN 6
536367 22749 FELTCRAFT PRINCESS CHARLOTTE DOLL 8
536367 22310 IVORY KNITTED MUG COSY 6
536367 84969 BOX OF 6 ASSORTED COLOUR TEASPOONS 6
536367 22623 BOX OF VINTAGE JIGSAW BLOCKS 3
536367 22622 BOX OF VINTAGE ALPHABET BLOCKS 2
536367 21754 HOME BUILDING BLOCK WORD 3
536367 21755 LOVE BUILDING BLOCK WORD 3
536367 21777 RECIPE BOX WITH METAL HEART 4
536367 48187 DOORMAT NEW ENGLAND 4
12583.0 536370 22728 ALARM CLOCK BAKELIKE PINK 24
536370 22727 ALARM CLOCK BAKELIKE RED 24
536370 22726 ALARM CLOCK BAKELIKE GREEN 12
536370 21724 PANDA AND BUNNIES STICKER SHEET 12
536370 21883 STARS GIFT TAPE 24
536370 10002 INFLATABLE POLITICAL GLOBE 48
536370 21791 VINTAGE HEADS AND TAILS CARD GAME 24
536370 21035 SET/2 RED RETROSPOT TEA TOWELS 18
536370 22326 ROUND SNACK BOXES SET OF4 WOODLAND 24
536370 22629 SPACEBOY LUNCH BOX 24
536370 22659 LUNCH BOX I LOVE LONDON 24
536370 22631 CIRCUS PARADE LUNCH BOX 24
536370 22661 CHARLOTTE BAG DOLLY GIRL DESIGN 20
536370 21731 RED TOADSTOOL LED NIGHT LIGHT 24
536370 22900 SET 2 TEA TOWELS I LOVE LONDON 24
536370 21913 VINTAGE SEASIDE JIGSAW PUZZLES 12
536370 22540 MINI JIGSAW CIRCUS PARADE 24
536370 22544 MINI JIGSAW SPACEBOY 24
536370 22492 MINI PAINT SET VINTAGE 36
536370 POST POSTAGE 3
13748.0 536371 22086 PAPER CHAIN KIT 50'S CHRISTMAS 80
17850.0 536372 22632 HAND WARMER RED POLKA DOT 6
536372 22633 HAND WARMER UNION JACK 6
感觉有点短,但似乎有效。拉出发票编号,groupby
客户 ID,选择每组中的第一个发票:
customer_ids = customers.index.get_level_values(0)
invoices = pd.Series(customers.index.get_level_values(1))
first_invoices = invoices.groupby(customer_ids, sort=False).first()
结果:
CustomerID
17850.0 536365
13047.0 536367
12583.0 536370
13748.0 536371
15100.0 536374
13436.0 581233
15520.0 581241
13298.0 581385
14569.0 581416
12713.0 581578
Name: InvoiceNo, Length: 4372, dtype: object
然后你就可以做出list
的tuple
对(客户ID,发票编号),并使用该索引的原始数据:
ids_invoices = list(zip(first_invoices.index, first_invoices))
customers = customers[customers.index.isin(ids_invoices)]
结果:
StockCode ... Country
CustomerID InvoiceNo ...
17850.0 536365 85123A ... United Kingdom
536365 71053 ... United Kingdom
536365 84406B ... United Kingdom
536365 84029G ... United Kingdom
536365 84029E ... United Kingdom
536365 22752 ... United Kingdom
536365 21730 ... United Kingdom
13047.0 536367 84879 ... United Kingdom
536367 22745 ... United Kingdom
536367 22748 ... United Kingdom
536367 22749 ... United Kingdom
536367 22310 ... United Kingdom
536367 84969 ... United Kingdom
536367 22623 ... United Kingdom
536367 22622 ... United Kingdom
536367 21754 ... United Kingdom
536367 21755 ... United Kingdom
536367 21777 ... United Kingdom
536367 48187 ... United Kingdom
12583.0 536370 22728 ... France
536370 22727 ... France
536370 22726 ... France
536370 21724 ... France
536370 21883 ... France
536370 10002 ... France
536370 21791 ... France
536370 21035 ... France
536370 22326 ... France
536370 22629 ... France
536370 22659 ... France
536370 22631 ... France
536370 22661 ... France
536370 21731 ... France
536370 22900 ... France
536370 21913 ... France
536370 22540 ... France
536370 22544 ... France
536370 22492 ... France
536370 POST ... France
13748.0 536371 22086 ... United Kingdom
15100.0 536374 21258 ... United Kingdom
15291.0 536376 22114 ... United Kingdom
以上是转到groupby
客户ID,不管它在数据中的哪里重复。. 我强调这一点是因为我看到在您的预期输出中,一些客户有多个购买块(即17850.0
出现在开头和结尾)。我假设您确实希望将所有客户 ID 组合在一起......保留重复 ID 的块并按它们分组将是一个更复杂的groupby
操作。
本文收集自互联网,转载请注明来源。
如有侵权,请联系 [email protected] 删除。
我来说两句