How to detect and tabulate data from an excel file?

Adonis Cedeño

I have an Excel file (Available at Google Drive) with data which was saved with a very strange format in order to get printed easily:

And every table repeats daily for over 5 years. I need to analyze this data and tried to get a relational format in order to load it in R/Python-like tools and get only 5 columns:

DATE	CLIENT NAME	TEST	MEASUREMENT	TESTER
01-01-2023	JOHN SMITH	METABOLYTE A	0.01	PhD. IMA GU
01-01-2023	JOHN SMITH	METABOLYTE B	10	PhD. IMA GU
01-01-2023	JOHN SMITH	PCR	NEGATIVE	PhD. IMA GU
01-01-2023	JOHN SMITH	MUTATION	+++	PhD. IMA GU
01-01-2023	ALBUS DUMBLE	PREGNANT	NEGATIVE	TECH. GUIVER
01-01-2023	ALBUS DUMBLE	GLUCOSE	121	TECH. GUIVER
02-01-2023	MAYDAY JUNE	METABOLYTE A	0.01	PhD. IMA GU
02-01-2023	JOHN SMITH	METABOLYTE A	0.01	TECH. GUIVER
02-01-2023	JOHN SMITH	METABOLYTE B	10	TECH. GUIVER
02-01-2023	JOHN SMITH	PCR	NEGATIVE	TECH. GUIVER
02-01-2023	JOHN SMITH	MUTATION	+++	TECH. GUIVER

So, in order to get a conversion from non-relational data to relational table I have applied text-mining techniques available at this GitHub repo. But, basically, have converted everything into one column with tidyr::pivot_longer(). Is there any optimal function or method to detect and tabulate this kind of data, or should i try to do it with a loop (+843 files)?

Grzegorz Sapijaszko

My attempt is based on fact, that the entries are formatted identically, so we can use kind of 'moving window'.

a <- openxlsx::read.xlsx(xlsxFile = "/home/sapi/Downloads/ENERO_2023_prueba.xlsx",
                    colNames = FALSE
                    )

Now we have to define data frame for data storage. Comments like # [2,2] +0, +1 corresponds to row and column of a (loaded excel).

entry <- data.frame(
  NOMBRE = character(),    # [2,2] +0, +1
  FECHA = character(),     # [2,6] +0, +5
  MUESTRA = character(),   # [3,3] +1, +2
  place = character(),     # [3,5] +1, +4
  COLOR = character(),     # [6,3] +4, +2
  ASPECTO = character(),   # [7,3] +5< +2
  DENSIDAD = double(),     # [8,3] +6, +2
  PH = character(),        # [9,3] +7, +2
# ...
  LEUCOCITOS = character(),#[19,3] +17, +2
  BACTERIAS = character(), # [6,7] +4, +6
  PIOCITOS = character()   # [7,7] +5, +6
# ...
)

Now we have to find all rows with NOMBRE

nombre_rows <- which(a[,"X1"] == "NOMBRE")

and use it in loop like:

for (i in 1:length(nombre_rows)) {
  x <- nombre_rows[i]
  nombre_cols <- which(a[x,] == "NOMBRE") # the same for columns
  for (j in 1:length(nombre_cols)) {
    y <- nombre_cols[j]

    entry <- data.frame(
      NOMBRE = a[x, y+1],
      FECHA = a[x, y+5],
      MUESTRA = a[x+1, y+2],
      place = a[x+1, y+4],
      COLOR = a[x+4, y+2],
      ASPECTO = a[x+5, y+2],
      DENSIDAD = a[x+6, y+2],
      PH = a[x+7, y+2],
      # ...
      LEUCOCITOS = a[x+17, y+2],
      BACTERIAS = a[x+4, y+6],
      PIOCITOS = a[x+5, y+6]
      # ...
    ) |> rbind(entry)
  }
}

And finally the data:

head(entry)
#>             NOMBRE   FECHA MUESTRA            place      COLOR      ASPECTO
#> 1      RUANO EDITH 44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
#> 2    CUNIN ELVIRA  44957.0   ORINA HOSPÌTALIZACION   AMARILLO       TURBIO 
#> 3 LOACHAMIN MARIA  44957.0   ORINA         CEXTERNA  AMARILLO  TRANSPARENTE
#> 4    MANZANO RAUL  44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
#> 5    MERCHAN IVAN  44957.0   ORINA      HIDRATACION ANARANJADO      TURBIO 
#> 6   ACERO ANTHONY  44957.0   ORINA         CEXTERNA   AMARILLO  LIG.TURBIO 
#>   DENSIDAD  PH LEUCOCITOS    BACTERIAS     PIOCITOS
#> 1   1005.0 8.0   NEGATIVO            +    1-2/CAMPO
#> 2   1020.0 5.0        +++           ++ CAMPO LLENO 
#> 3   1005.0 6.0   NEGATIVO OCASIONALES    1-2/CAMPO 
#> 4   1010.0 7.0   NEGATIVO            +   3-7/CAMPO 
#> 5   1015.0 6.0         ++           ++ 50-60/CAMPO 
#> 6   1010.0 5.0   NEGATIVO            +   0-2/CAMPO

^{Created on 2023-06-18 with reprex v2.0.2}

You should extend the entry data frame to grab all variables from your data. And then loop it through all excels you have.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2023-06-20

Comments

0 comments

TOP Ranking

Article

How to detect and tabulate data from an excel file?

How to detect and tabulate data from an excel file?

Loopback Error: connect ECONNREFUSED 127.0.0.1:3306 (MAMP)

Can't pre-populate phone number and message body in SMS link on iPhones when SMS app is not running in the background

pump.io port in URL

How to import an asset in swift using Bundle.main.path() in a react-native native module

Failed to listen on localhost:8000 (reason: Cannot assign requested address)

Spring Boot JPA PostgreSQL Web App - Internal Authentication Error

Emulator wrong screen resolution in Android Studio 1.3

3D Touch Peek Swipe Like Mail

Double spacing in rmarkdown pdf

Svchost high CPU from Microsoft.BingWeather app errors

How to how increase/decrease compared to adjacent cell

Using Response.Redirect with Friendly URLS in ASP.NET

java.lang.NullPointerException: Cannot read the array length because "<local3>" is null

BigQuery - concatenate ignoring NULL

How to fix "pickle_module.load(f, **pickle_load_args) _pickle.UnpicklingError: invalid load key, '<'" using YOLOv3?

ngClass error (Can't bind ngClass since it isn't a known property of div) in Angular 11.0.3

Can a 32-bit antivirus program protect you from 64-bit threats

Make a B+ Tree concurrent thread safe

Bootstrap 5 Static Modal Still Closes when I Click Outside

Vector input in shiny R and then use it

Assembly definition can't resolve namespaces from external packages