How to extract text from a specific part of a PDF

塔杰

I have no experience working with files in R, so please be gentle.

I have a PDF that looks like this:

[image]

I want to extract only the data inside the red rectangle from this text and save it to a data frame (I have thousands of PDFs like this one).

So far I have managed to read in the data. My code:

library(tidyverse)
library(pdftools)
library(here)

PDF_x <- pdf_text(here("pdf_project/example_for_pdf.pdf")) %>% 
  str_split("\n")

This gives:

[[1]]
 [1] "                                              BlaBla heaeder"                                                                                               
 [2] "                                           Mr. Bombastic XXXXXXXXXXXXX"                                                                                     
 [3] "                                                                                                                 Text1"                                     
 [4] "                                                                                                                 Text2"                                     
 [5] "                                                                                                                 Text3,"                                    
 [6] "                                                                                                                 Text4"                                     
 [7] "                                                                                                                 Text5"                                     
 [8] "                                                                                                                 Text6"                                     
 [9] "                                                                                                                 Text7"                                     
[10] "                                                                                                                                                 Text8"     
[11] "                                                                                                                                       Blabla, 12.01.2021"  
[12] "                                                                                                                                                     bobo /"
[13] "                                                                                                                                        blabla: 111111111"  
[14] "       Micheal Jackson, justo duo dolores et ea rebu"                                                                                                       
[15] "       accusam:           justo duo dolores et ea rebu"                                                                                                     
[16] "       dolores:           Bla Bla Bla"                                                                                                                      
[17] "                                                                              BLABLA_1"                                                                     
[18] "     X-Date: 17.07.2021"                                                                                                                                    
[19] "      1. Master1                        Tim"                                                                                                                
[20] "      1. Master2                        Jack"                                                                                                               
[21] "      1. Master3                        Monika"                                                                                                             
[22] "      1. Master4                        Jill"                                                                                                               
[23] "     Header1"                                                                                                                                               
[24] "     Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore"                                   
[25] "      magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd"                                     
[26] "      gubergren, no sea takimata"                                                                                                                           
[27] "     Header2"                                                                                                                                               
[28] "      Lorem ipsum dolor sit amet, consetetur sadipscing elitr."                                                                                             
[29] "     Header3"                                                                                                                                               
[30] "     Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore"                                   
[31] "      magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum."                                                     
[32] "     Header4"                                                                                                                                               
[33] "      Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna"                            
[34] "      aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea"                         
[35] "      takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy"                            
[36] "      eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo"                               
[37] "      dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."                                              
[38] "ipsum dolor sit a                            sed diam nonumy eirmod tempor invidunt ut labore et dolore magna         Master of Disaster Tim"               
[39] "ipsum dolor sit a                                                             invidunt ut labore et dolore magna                Chief master"               
[40] "            ipsum dolor sit a            invidunt ut labore et dolore magnainvidunt ut labore et dolore magna 2s"                                           
[41] ""                                                                                                                                                           

[[2]]
[1] "                  blablablablablablab"  "   invidunt ut labore et dolore magna" 
[3] "invidunt ut labore et dolore magna..at" ""  

I would really appreciate any guidance!

阿克伦

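Because str_split/strsplit returns a list, extract the first list element ([[1]]). After removing leading and trailing whitespace (trimws), find the position index of the line that starts with (^) 'X-Date:' and the position of 'Header4' (subtracting 1 to get the previous line), then build a sequence (:) between the two positions to subset the vector elements: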

v1 <- trimws(PDF_x[[1]])
v1[grep("^X-Date:", v1):(grep("Header4", v1)-1)]
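
The subsetting above returns a character vector, while the question asks for a data frame across many PDFs. A minimal sketch of one way to bridge that gap follows, using base R only. The helper names extract_block and block_to_df are hypothetical (they are not part of pdftools or the original answer), and the sketch assumes the wanted region always runs from the 'X-Date:' line to the line before 'Header4', as in the answer. It also returns an empty result instead of failing when a marker is missing, which may matter when looping over thousands of files.

```r
# Hypothetical helper (not from the original answer): return the lines from the
# first line matching `start` up to, but not including, the first later line
# matching `end`. Both patterns are applied after trimming whitespace.
extract_block <- function(lines, start = "^X-Date:", end = "^Header4") {
  lines <- trimws(lines)
  i_start <- grep(start, lines)
  i_end <- grep(end, lines)
  # Return an empty vector when a marker is missing, instead of raising an error
  if (length(i_start) == 0 || length(i_end) == 0) return(character(0))
  i_start <- i_start[1]
  i_end <- i_end[i_end > i_start]
  if (length(i_end) == 0) return(character(0))
  lines[i_start:(i_end[1] - 1)]
}

# Hypothetical helper: one row per extracted line, tagged with a source label
block_to_df <- function(block, source = NA_character_) {
  data.frame(
    source = rep(source, length(block)),
    line_no = seq_along(block),
    text = block,
    stringsAsFactors = FALSE
  )
}

# Stand-in for PDF_x[[1]], abbreviated from the output shown in the question
sample_lines <- c(
  "     BLABLA_1",
  "     X-Date: 17.07.2021",
  "      1. Master1                        Tim",
  "      1. Master2                        Jack",
  "     Header1",
  "     Lorem ipsum dolor sit amet",
  "     Header4",
  "      Lorem ipsum dolor sit amet"
)

block <- extract_block(sample_lines)
result <- block_to_df(block, source = "example_for_pdf.pdf")
print(result)
```

For many files, the same two helpers could presumably be applied to each path from list.files(), reading each page with pdf_text and splitting on "\n" as in the question, and the per-file data frames combined with do.call(rbind, ...). Splitting the text column into separate fields (for example the role and the name on the 'Master' lines) would depend on the real layout, which is only visible in the screenshot.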
