R: parsing data from vertical lists in a text file separated by variable headers

McWilliams

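I am reading a standard text data file (gait cycle data) with the format shown below. A leading ! marks the text that follows as a variable name, and the lines after it hold the mean (m) and standard deviation (s) of that variable's data. Variables can be scalars, vectors, or tensors.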

#Some header lines
.
.
#in these variable names below the N just refers to the N used to compute mean and SD
#not the number of data points in the variable
!ScalarVariable1 N1 
m1 s1
!ScalarVariable2 N2
m2 s2
!VectorVariable3 N3
m3_1 s3_1
m3_2 s3_2
.
.
m3_100 s3_100
!VectorVariable4 N4
m4_1 s4_1
m4_2 s4_2
.
.
m4_100 s4_100

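I am trying to find a good way to read the variable names and store the data in a data frame, in individual arrays, or (ideally, I think) in a structure. Data frame storage is tricky because the lengths differ: as shown, there are scalars of length 1 and normalized vectors of known length (100 here), but there is another class of vectors whose length N depends on how long the trial lasted. Finally, there are tensors, some with 3 columns of data and some with 6, and these also have means and standard deviations.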

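I am importing the file with readLines(), which gives me every line of text in a single array. When I solved this problem in MATLAB, I read the data in as lines of text exactly as readLines() provides them, looped over the array detecting ! to count the lines between variable names, stored a list of indices for the values associated with each header, and then went back and read the data into a structure using the variable names (e.g. GCD.velocity). I am new to R and don't know how to approach this parsing problem. I am just looking for help getting started in the right direction, even if that means handling only the known array length of 100 for now. Thanks.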

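Here are some sample lines from the file covering all the data and header types. It starts with all the header lines, then scalars, then a vector, a 12-column tensor, and a 6-column tensor (the --- lines are only separators marking the jump to a new variable; they are not in the file):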

#!DST
$REFERENCE 5
G:\Gait Data\2015\GCD\xxx_07.gcd : Left(Angles,Forces) : Right(Angles,Forces) : +X
G:\Gait Data\2015\GCD\xxx_05.gcd : Left(Angles,Forces) : Right(Angles,Forces):-X
G:\Gait Data\2015\GCD\xxx_10.gcd : Left(Angles,Forces) : Right(Angles,Forces):-X
G:\Gait Data\2015\GCD\xxx_12.gcd : Left(Angles,Forces) : Right(Angles,Forces):-X
G:\Gait Data\2015\GCD\xxx_13.gcd : Left(Angles,Forces) : Right(Angles,Forces) : +X
!Mass 5    
29.0000 0.0000
!Height 5
1310.0000 0.0000
!LeftLegLength 5
640.0000 0.0000
!LeftTrunkObliquity 5
5.4914 1.8161
4.9017 1.7414
4.3771 1.6795
3.9143 1.6484
---------------------------------------------------------------------------
!LeftGroundReaction-3-2 5
-9.8387 -3.3189 30.0240 0.000 0.000 0.0418 7.9230 4.3737 17.2863 0.000 0.000 0.2978
-56.6241 -14.5228 123.1434 0.000 0.000 0.0923 6.0863 7.5595 35.1965 0.000 0.000 0.3562
-40.9967 4.3286 255.1618 0.000 0.000 0.5213 11.8429 9.5473 49.7839 0.000 0.000 0.3331
-85.3239 8.8256 428.0071 0.000 0.000 0.7669 9.9698 14.0490 44.5523 0.000 0.000 0.5099
-----------------------------------------------------------------------------
!RightPelvicOrigin-3 5
-446.5973 -2.6151 -22.3667 24.8248 8.4681 0.9047
-426.1199 -3.9391 -21.4263 23.8164 7.1312 0.7944
-407.3914 -4.9089 -19.4336 22.8752 6.0196 1.1956
-389.6329 -5.7206 -16.6267 22.0211 5.1385 1.7573
-373.1119 -6.4350 -13.3333 21.2372 4.5618 2.2868
-----------------------------------------------------------------------------

Abdessabour Mtk

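Using some regular expressions and R's built-in read.table function, we can achieve this. You can then use list2env to put the variables into the global environment, i.e. you will have the scalars as vectors and the vectors/tensors as data.frames, named as they are in the file: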

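library(stringr)
input <- readLines("example.file")
# Drop the preamble: keep everything from the first line that starts with "!"
# (indexing from the first header also works when the file starts with a header)
input <- input[grep("^!", input)[1]:length(input)]
# Drop separator lines made up only of dashes
input <- input[!grepl("^--+$", input)]
# Variable names: the run of non-space characters right after the leading "!"
cs <- str_extract(grep("^!", input, value = TRUE), "(?<=!)\\S+")

# Split into one chunk per header, then parse the data rows of each chunk
input.t <- lapply(str_split(paste0(input, collapse = "\n"), "\n(?=!)")[[1]], function(x) {
  res <- read.table(text = str_replace(x, "![^\n]+\n", ""))
  # Single-row blocks (scalars) become plain numeric vectors
  if (nrow(res) == 1) unlist(res, use.names = FALSE) else res
})

input.t <- setNames(input.t, cs)

list2env(input.t, .GlobalEnv)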
#> <environment: R_GlobalEnv>
ScalarVariable1
#> [1] 0.3186621 0.3861956
ScalarVariable2
#> [1] 1.8439012 0.3019289
head(VectorVariable3)
#>           V1        V2
#> 1 -0.1964990 0.9647295
#> 2 -0.4015327 0.5645811
#> 3 -1.1161385 0.6641921
#> 4  0.7292709 1.6256362
#> 5  0.6351160 0.5434198
#> 6  0.8395378 1.2967163
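
If you would rather not populate the global environment, the named list already behaves much like a MATLAB struct. The sketch below is only an illustration: `gcd` is a small hand-built stand-in for `input.t` using values from the sample file, and `split_mean_sd()` is a hypothetical helper. It assumes, which the thread does not confirm, that the first half of each block's columns holds the means and the second half the SDs.

```r
# Stand-in for the parsed result `input.t`; values copied from the sample file.
gcd <- list(
  Mass = c(29, 0),
  LeftTrunkObliquity = data.frame(V1 = c(5.4914, 4.9017), V2 = c(1.8161, 1.7414)),
  `RightPelvicOrigin-3` = data.frame(
    V1 = c(-446.5973, -426.1199), V2 = c(-2.6151, -3.9391), V3 = c(-22.3667, -21.4263),
    V4 = c(24.8248, 23.8164),     V5 = c(8.4681, 7.1312),   V6 = c(0.9047, 0.7944)
  )
)

# Hypothetical helper: split a block into its mean and SD parts.
# ASSUMPTION: first half of the columns = means, second half = SDs.
split_mean_sd <- function(x) {
  m <- as.matrix(if (is.data.frame(x)) x else matrix(x, nrow = 1))
  half <- ncol(m) / 2
  list(
    mean = unname(m[, seq_len(half), drop = FALSE]),
    sd   = unname(m[, half + seq_len(half), drop = FALSE])
  )
}

gcd_split <- lapply(gcd, split_mean_sd)

# Struct-like access; names with dashes need [[ ]] or backticks.
gcd_split$Mass$mean
gcd_split[["RightPelvicOrigin-3"]]$sd
```

Every element then has the same shape (a list with `mean` and `sd` matrices), so scalars, vectors, and tensors can be handled by the same downstream code.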

Alternatively, here is the same parser as a pure base R solution that does not need stringr:

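input <- readLines("example.file")
# Drop the preamble: keep everything from the first line that starts with "!"
input <- input[grep("^!", input)[1]:length(input)]
# Drop separator lines made up only of dashes
input <- input[!grepl("^--+$", input)]
# Variable names: text between "!" and the first space (or end of line)
cs <- sub("^!(\\S+).*$", "\\1", grep("^!", input, value = TRUE))

# Split into one chunk per header, then parse the data rows of each chunk
input.t <- lapply(strsplit(paste0(input, collapse = "\n"), "\n(?=!)", perl = TRUE)[[1]], function(x) {
  res <- read.table(text = sub("![^\n]+\n", "", x))
  # Single-row blocks (scalars) become plain numeric vectors
  if (nrow(res) == 1) unlist(res, use.names = FALSE) else res
})

setNames(input.t, cs)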
$Mass
[1] 29  0

$Height
[1] 1310    0

$LeftLegLength
[1] 640   0

$LeftTrunkObliquity
      V1     V2
1 5.4914 1.8161
2 4.9017 1.7414
3 4.3771 1.6795
4 3.9143 1.6484

$`LeftGroundReaction-3-2`
        V1       V2       V3 V4 V5     V6      V7      V8      V9 V10 V11    V12
1  -9.8387  -3.3189  30.0240  0  0 0.0418  7.9230  4.3737 17.2863   0   0 0.2978
2 -56.6241 -14.5228 123.1434  0  0 0.0923  6.0863  7.5595 35.1965   0   0 0.3562
3 -40.9967   4.3286 255.1618  0  0 0.5213 11.8429  9.5473 49.7839   0   0 0.3331
4 -85.3239   8.8256 428.0071  0  0 0.7669  9.9698 14.0490 44.5523   0   0 0.5099

$`RightPelvicOrigin-3`
         V1      V2       V3      V4     V5     V6
1 -446.5973 -2.6151 -22.3667 24.8248 8.4681 0.9047
2 -426.1199 -3.9391 -21.4263 23.8164 7.1312 0.7944
3 -407.3914 -4.9089 -19.4336 22.8752 6.0196 1.1956
4 -389.6329 -5.7206 -16.6267 22.0211 5.1385 1.7573
5 -373.1119 -6.4350 -13.3333 21.2372 4.5618 2.2868
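
For comparison, the index-based approach described in the question (detect the ! lines, then group the rows that follow each one) can be sketched in base R with cumsum() and split(). This is a minimal sketch rather than part of the original answer: the `lines` vector stands in for the output of readLines(), `parse_block()` is a hypothetical helper, and the code also keeps the N from each header line.

```r
# Sample lines as readLines() would return them (taken from the question).
lines <- c(
  "#!DST",
  "$REFERENCE 5",
  "!Mass 5    ",
  "29.0000 0.0000",
  "!LeftTrunkObliquity 5",
  "5.4914 1.8161",
  "4.9017 1.7414",
  "!RightPelvicOrigin-3 5",
  "-446.5973 -2.6151 -22.3667 24.8248 8.4681 0.9047",
  "-426.1199 -3.9391 -21.4263 23.8164 7.1312 0.7944"
)

# Drop dash-only separator lines, if the file has any.
lines <- lines[!grepl("^-+$", lines)]

is_header <- grepl("^!", lines)
block_id  <- cumsum(is_header)   # 0 for the preamble, then 1, 2, 3, ...
keep      <- block_id > 0        # discard everything before the first header
blocks    <- split(lines[keep], block_id[keep])

# Hypothetical helper: first line is "!Name N", the rest are data rows.
parse_block <- function(b) {
  head_parts <- strsplit(trimws(sub("^!", "", b[1])), "\\s+")[[1]]
  list(
    n    = as.integer(head_parts[2]),
    data = as.matrix(read.table(text = b[-1]))
  )
}

parsed <- lapply(blocks, parse_block)
names(parsed) <- unname(vapply(
  blocks, function(b) sub("^!(\\S+).*$", "\\1", b[1]), character(1)
))

parsed$Mass$n
parsed[["RightPelvicOrigin-3"]]$data
```

Because each block is parsed on its own, blocks of different lengths and widths can coexist in the same list. A header with no data rows would make read.table() fail, so real files may need a guard for that case.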

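Note:

This solution is column-agnostic: it works whether you have 1000 columns or just 1, as long as they are separated by whitespace. The separator can be set in the read.table call.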
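
As a small illustration of that last point, the sketch below parses a hypothetical comma-separated variant of one data block (the sample file itself is whitespace-separated) by passing `sep` to read.table:

```r
# Hypothetical comma-separated version of two rows from LeftTrunkObliquity.
block <- "5.4914,1.8161\n4.9017,1.7414"

# The default sep = "" splits on any whitespace; sep = "," splits on commas.
res <- read.table(text = block, sep = ",")
res
```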


Edited on 2021-08-17
