I am trying to split the strings in a column on the newline character "\n". Here is a data frame, sample_data:
sample_data <- data.frame(ID=c('[email protected]', '[email protected]'),
Changes=c('3 max cost changes
productxyz > pb100 > a : Max cost decreased from $0.98 to $0.83
productxyz > pb2 > a : Max cost decreased from $1.07 to $0.91
productxyz > pb2 > b : Max cost decreased from $0.65 to $0.55',
'2 max cost changes
productabc > pb1000 > d : Max cost decreased from $1.07 to $0.91
productabc > pb1000 > x : Max cost decreased from $1.44 to $1.22'), stringsAsFactors=FALSE)
My goal is to extract the prices into separate columns and get a result set like this:
ID Prev_Price New_Price
[email protected] $0.98 $0.83
[email protected] $1.07 $0.91
[email protected] $0.65 $0.55
[email protected] $1.07 $0.91
[email protected] $1.44 $1.22
I have tried the tidyr package, but my results are full of NAs.
vars <- c("Prev_Price","New_Price")
separate(sample_data, Changes, into = vars, sep = "[A-Za-z]+from", extra= "drop")
Any help would be greatly appreciated.
Thanks!
Try
df1$ID <- df1$ID[df1$ID!=''][cumsum(df1$ID!='')]
library(stringi)
setNames(data.frame(df1$ID, do.call(rbind,stri_extract_all(df1$Changes,
regex='\\$\\d*'))), c('ID', 'Prev_Price', 'New_Price'))
# ID Prev_Price New_Price
#1 A $20 $10
#2 A $11 $10
#3 B $13 $12
#4 B $15 $12
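The first line of the snippet above fills the blank 'ID' entries with the most recent non-blank value. A minimal sketch of that idiom on a stand-alone vector (the names `id`, `non_empty`, `group_idx` and `filled` are illustrative, and it assumes the first entry is not blank):

```r
# Hypothetical ID column where "" means "same ID as the row above"
id <- c("A", "", "B", "")

non_empty <- id != ""            # TRUE FALSE TRUE FALSE
group_idx <- cumsum(non_empty)   # 1 1 2 2: position of the latest non-blank ID
filled <- id[non_empty][group_idx]

print(filled)
```

If the vector started with a blank, `cumsum` would yield a 0 index there and that element would be dropped, so this sketch only applies when the first ID is present.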
Or
library(tidyr)
extract(df1, Changes, into=c('Prev_Price', 'New_Price'),
'[^$]*(\\$\\d*)[^$]*(\\$\\d*)')
# ID Prev_Price New_Price
#1 A $20 $10
#2 A $11 $10
#3 B $13 $12
#4 B $15 $12
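The patterns above use `\\$\\d*`, which fits whole-dollar amounts such as those in 'df1'. For prices with decimals, as in the question, the digit class presumably needs to allow a dot. A base R sketch of the same two-capture-group idea using `sub` (the vector `changes` and the variable names are illustrative assumptions, not part of the original answer):

```r
# Two sample strings: one whole-dollar, one with decimals
changes <- c("down from $20 to $10",
             "productxyz > pb100 > a : Max cost decreased from $0.98 to $0.83")

# [^$]* skips everything up to the next "$";
# each (\\$[0-9.]+) captures a "$" followed by digits and dots
pattern <- "[^$]*(\\$[0-9.]+)[^$]*(\\$[0-9.]+).*"

prev_price <- sub(pattern, "\\1", changes)  # first captured price
new_price  <- sub(pattern, "\\2", changes)  # second captured price

print(data.frame(prev_price, new_price))
```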
Or
library(data.table)#v1.9.5+
setDT(df1)[, c('Prev_Price', 'New_Price') := tstrsplit(Changes,
'[A-Za-z ]+')[-1]][]
# ID Changes Prev_Price New_Price
#1: A down from $20 to $10 $20 $10
#2: A down from $11 to $10 $11 $10
#3: B down from $13 to $12 $13 $12
#4: B down from $15 to $12 $15 $12
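The `[-1]` in the `tstrsplit` call drops the empty first piece that appears because the string begins with a match of the split pattern. Base R's `strsplit` shows the same effect on a single string (a small illustration, not the data.table code itself):

```r
x <- "down from $20 to $10"

# Split on runs of letters and spaces; the leading "down from " is itself a
# match, so the first piece is an empty string
pieces <- strsplit(x, "[A-Za-z ]+")[[1]]
print(pieces)

# Dropping the first element leaves only the two prices
prices <- pieces[-1]
print(prices)
```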
NOTE: The 'Changes' column can be removed.
Or using only base R methods
data.frame(ID=df1$ID, read.table(text=gsub('[^$]*(\\$\\d+)', ' \\1 ',
df1$Changes),col.names=c('Prev_Price', 'New_Price'),
stringsAsFactors=FALSE))
# ID Prev_Price New_Price
#1 A $20 $10
#2 A $11 $10
#3 B $13 $12
#4 B $15 $12
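To see why the base R version works, it may help to look at the intermediate string that `gsub` produces before `read.table` parses it. A short sketch on two of the 'df1' strings (the names `changes`, `spaced` and `parsed` are illustrative):

```r
changes <- c("down from $20 to $10", "down from $11 to $10")

# Replace "anything up to and including a price" with just the price,
# padded by spaces, so only whitespace-separated prices remain
spaced <- gsub("[^$]*(\\$\\d+)", " \\1 ", changes)
print(spaced)

# read.table then treats each string as one row with two fields
parsed <- read.table(text = spaced,
                     col.names = c("Prev_Price", "New_Price"),
                     stringsAsFactors = FALSE)
print(parsed)
```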
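If the elements are in the same cell, one option is to use the development version of data.table, i.e. v1.9.5+.
Here, we use the same code to split 'Changes' (tstrsplit(Changes,..)), then melt the output into long format by specifying measure.vars as a list and, if needed, order by 'ID' and remove the unwanted column ('variable').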
melt(
setDT(df2)[, paste0('V',1:4) := tstrsplit(Changes,
'[A-Za-z ]+')[-1]][,-2, with=FALSE],
id.var='ID', measure=list(c('V1', 'V3'), c('V2', 'V4')),
value.name=c('Prev_Price', 'New_Price'))[order(ID)][, variable:=NULL]
# ID Prev_Price New_Price
#1: A $20 $10
#2: A $11 $10
#3: B $13 $12
#4: B $15 $12
Or we can use gsub as before and then convert to 'long' format using reshape from base R
d1 <- data.frame(ID=df2$ID,read.table(text=gsub('[^$]*(\\$\\d+)',
' \\1 ', df2$Changes)))
colnames(d1)[-1] <- paste0(c('Prev_Price.', 'New_Price.'),
rep(1:2,each=2))
reshape(d1, idvar='ID', varying=2:ncol(d1), sep=".", direction='long')
# ID time Prev_Price New_Price
#A.1 A 1 $20 $10
#B.1 B 1 $13 $12
#A.2 A 2 $11 $10
#B.2 B 2 $15 $12
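For the new dataset ('df3', i.e. the data frame from the question), we can use stri_extract_all_regex to extract the $ followed by numbers, including decimals ('\\$[0-9.]+'), from the 'Changes' column. We then use Map to combine the first column with the list output of stri_extract_all_regex, change each list element to a matrix with two columns (because the alternating elements need to go into different columns), and rbind the resulting list (do.call(rbind, ...)).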
library(stringi)
res <- do.call(rbind,
Map(function(x,y) data.frame(x,matrix(y, ncol=2, byrow=TRUE,
dimnames=list(NULL, c("Prev_Price", "New_Price")))),
df3$ID, stri_extract_all_regex(df3$Changes, '\\$[0-9.]+')))
row.names(res) <- NULL
res
# x Prev_Price New_Price
#1 [email protected] $0.98 $0.83
#2 [email protected] $1.07 $0.91
#3 [email protected] $0.65 $0.55
#4 [email protected] $1.07 $0.91
#5 [email protected] $1.44 $1.22
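The same extraction can also be sketched without stringi, using base R's `gregexpr` and `regmatches`. This is an alternative illustration rather than the approach shown above. The 'df3' built below is a stand-in for the question's data, and its IDs ("user1", "user2") are placeholders:

```r
# Stand-in for the question's data; the IDs are hypothetical placeholders
df3 <- data.frame(
  ID = c("user1", "user2"),
  Changes = c(
    paste("3 max cost changes",
          "productxyz > pb100 > a : Max cost decreased from $0.98 to $0.83",
          "productxyz > pb2 > a : Max cost decreased from $1.07 to $0.91",
          "productxyz > pb2 > b : Max cost decreased from $0.65 to $0.55",
          sep = "\n"),
    paste("2 max cost changes",
          "productabc > pb1000 > d : Max cost decreased from $1.07 to $0.91",
          "productabc > pb1000 > x : Max cost decreased from $1.44 to $1.22",
          sep = "\n")),
  stringsAsFactors = FALSE)

# One character vector of prices per row, in order of appearance
prices <- regmatches(df3$Changes, gregexpr("\\$[0-9.]+", df3$Changes))

# Prices alternate previous/new, so fill a two-column matrix row by row
# and repeat each ID once per price pair
res <- data.frame(
  ID = rep(df3$ID, lengths(prices) / 2),
  matrix(unlist(prices), ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("Prev_Price", "New_Price"))),
  stringsAsFactors = FALSE)

print(res)
```

This assumes every row contains an even number of prices; a row with an unmatched price would misalign the pairs.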
Data
df1 <- structure(list(ID = c("A", "", "B", ""),
Changes = c("down from $20 to $10",
"down from $11 to $10", "down from $13 to $12", "down from $15 to $12"
)), .Names = c("ID", "Changes"), class = "data.frame",
row.names = c(NA, -4L))
df2 <- data.frame(ID=c('A', 'B'),
Changes=c('down from $20 to $10 down from $11 to $10',
'down from $13 to $12 down from $15 to $12'), stringsAsFactors=FALSE)