How to detect and tabulate data from an excel file?

Adonis Cedeño

I have an Excel file (Available at Google Drive) with data which was saved with a very strange format in order to get printed easily:

Non-relational data, available at Google Drive

And every table repeats daily for over 5 years. I need to analyze this data and tried to get a relational format in order to load it in R/Python-like tools and get only 5 columns:

DATE CLIENT NAME TEST MEASUREMENT TESTER
01-01-2023 JOHN SMITH METABOLYTE A 0.01 PhD. IMA GU
01-01-2023 JOHN SMITH METABOLYTE B 10 PhD. IMA GU
01-01-2023 JOHN SMITH PCR NEGATIVE PhD. IMA GU
01-01-2023 JOHN SMITH MUTATION +++ PhD. IMA GU
01-01-2023 ALBUS DUMBLE PREGNANT NEGATIVE TECH. GUIVER
01-01-2023 ALBUS DUMBLE GLUCOSE 121 TECH. GUIVER
02-01-2023 MAYDAY JUNE METABOLYTE A 0.01 PhD. IMA GU
02-01-2023 JOHN SMITH METABOLYTE A 0.01 TECH. GUIVER
02-01-2023 JOHN SMITH METABOLYTE B 10 TECH. GUIVER
02-01-2023 JOHN SMITH PCR NEGATIVE TECH. GUIVER
02-01-2023 JOHN SMITH MUTATION +++ TECH. GUIVER

So, in order to get a conversion from non-relational data to relational table I have applied text-mining techniques available at this GitHub repo. But, basically, have converted everything into one column with tidyr::pivot_longer(). Is there any optimal function or method to detect and tabulate this kind of data, or should i try to do it with a loop (+843 files)?

Grzegorz Sapijaszko

My attempt is based on fact, that the entries are formatted identically, so we can use kind of 'moving window'.

a <- openxlsx::read.xlsx(xlsxFile = "/home/sapi/Downloads/ENERO_2023_prueba.xlsx",
                    colNames = FALSE
                    )

Now we have to define data frame for data storage. Comments like # [2,2] +0, +1 corresponds to row and column of a (loaded excel).

entry <- data.frame(
  NOMBRE = character(),    # [2,2] +0, +1
  FECHA = character(),     # [2,6] +0, +5
  MUESTRA = character(),   # [3,3] +1, +2
  place = character(),     # [3,5] +1, +4
  COLOR = character(),     # [6,3] +4, +2
  ASPECTO = character(),   # [7,3] +5< +2
  DENSIDAD = double(),     # [8,3] +6, +2
  PH = character(),        # [9,3] +7, +2
# ...
  LEUCOCITOS = character(),#[19,3] +17, +2
  BACTERIAS = character(), # [6,7] +4, +6
  PIOCITOS = character()   # [7,7] +5, +6
# ...
)

Now we have to find all rows with NOMBRE

nombre_rows <- which(a[,"X1"] == "NOMBRE")

and use it in loop like:

for (i in 1:length(nombre_rows)) {
  x <- nombre_rows[i]
  nombre_cols <- which(a[x,] == "NOMBRE") # the same for columns
  for (j in 1:length(nombre_cols)) {
    y <- nombre_cols[j]

    entry <- data.frame(
      NOMBRE = a[x, y+1],
      FECHA = a[x, y+5],
      MUESTRA = a[x+1, y+2],
      place = a[x+1, y+4],
      COLOR = a[x+4, y+2],
      ASPECTO = a[x+5, y+2],
      DENSIDAD = a[x+6, y+2],
      PH = a[x+7, y+2],
      # ...
      LEUCOCITOS = a[x+17, y+2],
      BACTERIAS = a[x+4, y+6],
      PIOCITOS = a[x+5, y+6]
      # ...
    ) |> rbind(entry)
  }
}

And finally the data:

head(entry)
#>             NOMBRE   FECHA MUESTRA            place      COLOR      ASPECTO
#> 1      RUANO EDITH 44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
#> 2    CUNIN ELVIRA  44957.0   ORINA HOSPÌTALIZACION   AMARILLO       TURBIO 
#> 3 LOACHAMIN MARIA  44957.0   ORINA         CEXTERNA  AMARILLO  TRANSPARENTE
#> 4    MANZANO RAUL  44957.0   ORINA         CEXTERNA  AMARILLO   LIG.TURBIO 
#> 5    MERCHAN IVAN  44957.0   ORINA      HIDRATACION ANARANJADO      TURBIO 
#> 6   ACERO ANTHONY  44957.0   ORINA         CEXTERNA   AMARILLO  LIG.TURBIO 
#>   DENSIDAD  PH LEUCOCITOS    BACTERIAS     PIOCITOS
#> 1   1005.0 8.0   NEGATIVO            +    1-2/CAMPO
#> 2   1020.0 5.0        +++           ++ CAMPO LLENO 
#> 3   1005.0 6.0   NEGATIVO OCASIONALES    1-2/CAMPO 
#> 4   1010.0 7.0   NEGATIVO            +   3-7/CAMPO 
#> 5   1015.0 6.0         ++           ++ 50-60/CAMPO 
#> 6   1010.0 5.0   NEGATIVO            +   0-2/CAMPO

Created on 2023-06-18 with reprex v2.0.2

You should extend the entry data frame to grab all variables from your data. And then loop it through all excels you have.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

Unable to tabulate data read in from file

How to modify the spaces in a text file to tabulate it with Excel "Text Import Wizard"?

How to parse data from file Json or Excel

Tabulate data extracted from a .pdf into pandas

How to tabulate nested JSON file with jq

How to detect #N/A in a data frame (data taken from xlsx file) using pandas?

How to get the filtered data from ReactDataGrid and export it in excel file?

How to get data from excel file 'as is' using openpyxl?

How to read data from XLS (Excel) file [Java, Android]

How to append specific data from excel file into a python list?

How do I use PHPExcel to read data from an Excel file?

How to import data in Excel from the clipboard as if it was in a text file?

How to copy data from csv file to other excel files

How can I make json data from requests into excel file?

How to read a data from excel file and store into variable?

How to place columns of data set from excel file into numeric arrays

How can i read data from an excel file to flutter

How To Read data from Microsoft Excel File by C++ Program

How to fetch data from another excel file and sort it under name?

How to create excel file out of data on the fly from web

How to export huge data from sql server into excel file

How to import data from excel file and identify subcategory

How can I replace some columns with data from an excel file?

return tabulate data from brush selection with d3.js

Read and tabulate table-like data from website

Read data from excel file Laravel Excel

How to copy data from one excel file to second excel file using conditions?

How to export data from multiple sheets in old Excel file into one sheet in new Excel file

Tabulate a text file