Clean OCR text extracted from an image in R

zachi

I am trying to extract text from an image with OCR and clean it. Each sentence comes back split across several lines, but I would like the text to be laid out the way it appears in the image.

The code:

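library(tidyverse)
library(tesseract)
library(magick)

heraclitus <- "greek.png"

image_greek <- image_read(heraclitus)

# Resize, crop to the text area, convert to grayscale, and boost contrast.
# image_write() without a path returns the encoded image as a raw vector.
image_greek <- image_greek %>% 
  image_scale("600") %>% 
  image_crop("600x400+220+150") %>% 
  image_convert(type = 'Grayscale') %>% 
  image_contrast(sharpen = 1) %>% 
  image_write(format = "jpg")

# Read the raw vector back in, run OCR, and split the result on line breaks
heraclitus_sentences <- magick::image_read(image_greek) %>% 
  ocr() %>% 
  str_split("\n")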

As you can see from the output, the result contains empty strings, and some sentences are split across two lines. I would like a vector or a list in which each element is one complete sentence.


jpsmith

The sentences are separated by a blank line, so split on \n\n rather than \n, then replace the single \n left inside each sentence with a space:

magick::image_read(image_greek) %>% 
  ocr() %>% 
  str_split("\n\n") %>%
  unlist() %>%
  str_replace_all("\n", " ")

Output:

[1] "© Much learning does not teach understanding."                                                       
[2] "© The road up and the road down is one and the same."                                                
[3] "© Our envy always lasts longer than the happiness of those we envy."                                 
[4] "© No man ever steps in the same river twice, for it's not the same river and he's not the same man. "
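
If you also want to drop the trailing space and the "©" that the OCR produced in place of the bullet, a slightly more defensive variant might look like the sketch below. It runs on a hard-coded string, so `ocr_text` is only a hypothetical stand-in for what `ocr()` returns here, and the choice to strip the "©" is an assumption about the output you want.

```r
library(stringr)

# Hypothetical stand-in for the string returned by ocr(): sentences are
# separated by a blank line, and one sentence wraps onto a second line.
ocr_text <- paste0(
  "© Much learning does not teach understanding.\n\n",
  "© No man ever steps in the same river twice, for it's not\n",
  "the same river and he's not the same man.\n"
)

# Split wherever two or more consecutive newlines occur
sentences <- str_split(ocr_text, "\n{2,}")[[1]]

# Join the wrapped lines inside each sentence
sentences <- str_replace_all(sentences, "\n", " ")

# Trim leading/trailing whitespace and collapse repeated spaces
sentences <- str_squish(sentences)

# Drop any empty elements left over from the split
sentences <- sentences[sentences != ""]

# Remove the "©" that stands in for the bullet (assumed to be unwanted)
sentences <- str_remove(sentences, "^©\\s*")

print(sentences)
```

With the real image you would replace `ocr_text` with the result of `magick::image_read(image_greek) %>% ocr()`.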
