Extracting data between patterns from a text file

Posted by Asmi in Dev

Asmi

I am working with a text file whose data looks like this:

*******************************
Sensor 1028 at site 101
SID = 16384
Tag = AI.1028.BT.VOLT
04/07/16 05:00:00  12.65
04/07/16 06:00:00  12.64
04/07/16 07:00:00  12.68
04/07/16 08:00:00  13.08
04/07/16 09:00:00  13.76
*******************************
Sensor 1171 at well 102
SID = 20062
Tag = AI.1171.WT.LEV
04/07/16 05:00:00  0.95
04/07/16 06:00:00  0.90
04/07/16 07:00:00  0.82
04/07/16 08:00:00  0.71
04/07/16 09:00:00  0.59
04/07/16 10:00:00  0.48

I would like to extract the data for each tag and create a data frame like the following:

Tag  Timestamp          Value
1028 04/07/16 05:00:00  12.65
1028 04/07/16 06:00:00  12.64
1028 04/07/16 07:00:00  12.68
1028 04/07/16 08:00:00  13.08
1028 04/07/16 09:00:00  13.76
1171 04/07/16 05:00:00  0.95
1171 04/07/16 06:00:00  0.90
1171 04/07/16 07:00:00  0.82
1171 04/07/16 08:00:00  0.71
1171 04/07/16 09:00:00  0.59
1171 04/07/16 10:00:00  0.48

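The tag is the numeric part of the pattern, e.g. 1028 in "Tag = AI.1028.BT.VOLT" and 1171 in "Tag = AI.1171.WT.LEV".

I have looked at other questions along similar lines, but I am fairly new to R. Apart from importing the text file with readLines and extracting patterns with grep, I have not been able to get any further.

Any help would be much appreciated. Thanks!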

夏普

Using the data.table package, I would approach it as follows:

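library(data.table)

# read the raw lines of the file
sensortext <- readLines('sensors.txt')

# one-column data.table without the '*****' separator lines
DT <- data.table(txt = sensortext[!grepl(pattern = '\\*+', sensortext)])

# 1. grp: running block number, increased at every 'Sensor' line
# 2. tag, sid, type, type.nr: metadata taken from the three header lines of each block
# 3. .SD[4:.N]: keep only the data rows of each block
# 4. split the remaining text into datetime and value, then drop the helper columns
# 5. convert datetime to POSIXct
DT <- DT[, grp := cumsum(grepl('Sensor', txt))
         ][, `:=` (tag = as.numeric(gsub('^.*(\\d{4}).*', '\\1', grep('Tag =', txt, value = TRUE))),
                   sid = as.numeric(gsub('^.*(\\d{5}).*', '\\1', grep('SID = ', txt, value = TRUE))),
                   type = strsplit(grep('Sensor ', txt, value = TRUE), ' ')[[1]][4],
                   type.nr = as.numeric(gsub('^.*(\\d{3}).*', '\\1', grep('Sensor ', txt, value = TRUE)))),
           by = grp
           ][, .SD[4:.N], by = grp
             ][, c('datetime','value') := tstrsplit(txt, '\\s{2,}', type.convert = TRUE)
               ][, c('grp','txt') := NULL
                 ][, datetime := as.POSIXct(strptime(datetime, "%d/%m/%y %H:%M:%S"))]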

This gives:

> DT
     tag   sid type type.nr            datetime value
 1: 1028 16384 site     101 2016-07-04 05:00:00 12.65
 2: 1028 16384 site     101 2016-07-04 06:00:00 12.64
 3: 1028 16384 site     101 2016-07-04 07:00:00 12.68
 4: 1028 16384 site     101 2016-07-04 08:00:00 13.08
 5: 1028 16384 site     101 2016-07-04 09:00:00 13.76
 6: 1171 20062 well     102 2016-07-04 05:00:00  0.95
 7: 1171 20062 well     102 2016-07-04 06:00:00  0.90
 8: 1171 20062 well     102 2016-07-04 07:00:00  0.82
 9: 1171 20062 well     102 2016-07-04 08:00:00  0.71
10: 1171 20062 well     102 2016-07-04 09:00:00  0.59
11: 1171 20062 well     102 2016-07-04 10:00:00  0.48

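Explanation:

With the readLines function you read the text file. After that, you convert it to a one-column data.table with data.table(txt = sensortext[!grepl(pattern = '\\*+', sensortext)]), which also drops the separator lines made of asterisks.
With [, grp := cumsum(grepl('Sensor', txt))] you create a grouping variable that separates the different data sections. grepl('Sensor', txt) returns a logical vector that is TRUE for the lines containing Sensor (here, the first line of each data section). Applying cumsum to it turns those markers into a running group number.
With tag = as.numeric(gsub('^.*(\\d{4}).*','\\1', grep('Tag =', txt, value = TRUE))) you extract the tag number (and, in the same way, sid, type and type.nr).
With [, .SD[4:.N], by = grp] you remove the first three lines of each group (they contain no data, and the information needed from them was already extracted in the previous step).
With [, c('datetime','value') := tstrsplit(txt, '\\s{2,}', type.convert = TRUE)] you split the data that is still held as text in the txt column into two data columns. type.convert = TRUE makes sure the value column gets the right type (numeric in this case).
With [, c('grp','txt') := NULL] you remove the grp and txt columns (they are no longer needed).
Finally, the datetime column is converted to POSIXct with as.POSIXct(strptime(datetime, "%d/%m/%y %H:%M:%S")).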
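
The grouping and extraction steps can be tried in isolation with base R only. The following is a minimal sketch, assuming a hypothetical in-memory vector x that holds a few lines in the same layout as the file (with the separator lines already removed); it is not part of the original answer.

```r
# Hypothetical in-memory sample in the same layout as the file
x <- c("Sensor 1028 at site 101",
       "SID = 16384",
       "Tag = AI.1028.BT.VOLT",
       "04/07/16 05:00:00  12.65",
       "Sensor 1171 at well 102",
       "SID = 20062",
       "Tag = AI.1171.WT.LEV",
       "04/07/16 05:00:00  0.95",
       "04/07/16 06:00:00  0.90")

# grepl() is TRUE on every line that opens a new section;
# cumsum() turns those markers into a running section number
grp <- cumsum(grepl("Sensor", x))
grp
# 1 1 1 1 2 2 2 2 2

# keep the four digits captured from each "Tag =" line
tag <- as.numeric(gsub("^.*(\\d{4}).*", "\\1", grep("Tag =", x, value = TRUE)))
tag
# 1028 1171
```

Every line between two Sensor lines gets the same number, which is what makes by = grp work in the data.table code.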

To see what each step does, you can also use the following code:

# start from the one-column DT created with data.table(txt = ...) above
DT[, grp := cumsum(grepl('Sensor', txt))][]
DT[, `:=` (tag = as.numeric(gsub('^.*(\\d{4}).*', '\\1', grep('Tag =', txt, value = TRUE))),
           sid = as.numeric(gsub('^.*(\\d{5}).*', '\\1', grep('SID = ', txt, value = TRUE))),
           type = strsplit(grep('Sensor ', txt, value = TRUE), ' ')[[1]][4],
           type.nr = as.numeric(gsub('^.*(\\d{3}).*', '\\1', grep('Sensor ', txt, value = TRUE)))),
   by = grp][]
# this step builds a new table, so the result has to be assigned back to DT
DT <- DT[, .SD[4:.N], by = grp][]
DT[, c('datetime','value') := tstrsplit(txt, '\\s{2,}', type.convert = TRUE)][]
DT[, c('grp','txt') := NULL][]
DT[, datetime := as.POSIXct(strptime(datetime, "%d/%m/%y %H:%M:%S"))][]

Adding [] to each line makes sure that the result is printed to the console.
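
The effect of the trailing [] is easiest to see on a tiny table. A minimal sketch, assuming the data.table package is installed and using a hypothetical toy table named demo instead of the sensor data:

```r
library(data.table)

# hypothetical toy table, not the sensor data
demo <- data.table(x = 1:3)

demo[, y := x * 2]       # adds y by reference; nothing is printed at the console
demo[, z := x + y][]     # adds z by reference; the trailing [] prints the updated table
```

In both calls demo itself is modified in place, so no assignment with <- is needed; the [] only affects what is shown.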

An alternative in base R:

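sensortext <- readLines('sensors.txt')

# split the lines into one chunk per sensor; every '*****' line starts a new chunk
rawlist <- split(sensortext, cumsum(grepl(pattern = '\\*+', sensortext)))
# drop the four header lines of each chunk and read the rest as fixed-width fields
l <- lapply(rawlist, function(x) read.fwf(textConnection(x[-c(1:4)]), widths = c(17,7), header = FALSE))
# number of data rows per sensor, used below to repeat the metadata
reps <- sapply(l, nrow)

df <- do.call(rbind, l)
# store the timestamps as POSIXct, as in the data.table version
df$V1 <- as.POSIXct(strptime(df$V1, '%d/%m/%y %H:%M:%S'))
names(df) <- c('datetime','value')

# metadata: one value per sensor, repeated for each of its data rows
df$tag <- rep(as.numeric(gsub('^.*(\\d{4}).*', '\\1', grep('Tag =', sensortext, value = TRUE))), reps)
df$sid <- rep(as.numeric(gsub('^.*(\\d{5}).*', '\\1', grep('SID = ', sensortext, value = TRUE))), reps)
df$type <- rep(sapply(strsplit(grep('Sensor ', sensortext, value = TRUE), ' '), '[', 4), reps)
df$type.nr <- rep(as.numeric(gsub('^.*(\\d{3}).*', '\\1', grep('Sensor ', sensortext, value = TRUE))), reps)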

which gives the same result:

> df
               datetime value  tag   sid type type.nr
1.1 2016-07-04 05:00:00 12.65 1028 16384 site     101
1.2 2016-07-04 06:00:00 12.64 1028 16384 site     101
1.3 2016-07-04 07:00:00 12.68 1028 16384 site     101
1.4 2016-07-04 08:00:00 13.08 1028 16384 site     101
1.5 2016-07-04 09:00:00 13.76 1028 16384 site     101
2.1 2016-07-04 05:00:00  0.95 1171 20062 well     102
2.2 2016-07-04 06:00:00  0.90 1171 20062 well     102
2.3 2016-07-04 07:00:00  0.82 1171 20062 well     102
2.4 2016-07-04 08:00:00  0.71 1171 20062 well     102
2.5 2016-07-04 09:00:00  0.59 1171 20062 well     102
2.6 2016-07-04 10:00:00  0.48 1171 20062 well     102
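
The base R version depends on the fixed layout of the data rows: 17 characters for the timestamp, followed by up to 7 characters for the value. A minimal sketch of that read.fwf step, assuming two hypothetical in-memory rows in the same layout as the file:

```r
# Hypothetical data rows in the same layout as the file
rows <- c("04/07/16 05:00:00  12.65",
          "04/07/16 10:00:00  0.48")

# the timestamp occupies the first 17 characters of every row
nchar("04/07/16 05:00:00")
# 17

# read the rows as two fixed-width fields: timestamp (17) and value (7)
con <- textConnection(rows)
parsed <- read.fwf(con, widths = c(17, 7), header = FALSE)
close(con)

parsed$V2
# 12.65  0.48
```

If the real file uses different spacing, the widths would have to be adjusted accordingly.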
