Tabulizer does not read top rows when there is no line on top #121

Open
1 task done
gorkang opened this issue Jul 1, 2020 · 1 comment


gorkang commented Jul 1, 2020

This issue is about a possible bug.

Please see the reprex of the issue below. In brief, tabulizer extracts only 684 of the 746 rows in this document. It mostly skips the first rows of each page (starting at page 2). Those rows don't have a line on top.

## rJava loads successfully
# install.packages("rJava")
library("rJava")

## load package
library("tabulizer")

## code goes here


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

table_list = extract_tables("https://github.com/ropensci/tabulizer/files/4860026/DOC.pdf", method = "lattice")
temp_df = 1:length(table_list) %>% map_df(~ table_list[[.x]]%>% as_tibble)  
#> Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.

nrow(temp_df)
#> [1] 684

# COMMENT: tabulizer extracts only 684 of the 746 rows. It mostly skips the first rows of each page (starting at page 2). Those rows don't have a line on top.
  


## session info for your system
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
#>  [3] LC_TIME=es_CL.UTF-8           LC_COLLATE=en_US.UTF-8       
#>  [5] LC_MONETARY=es_CL.UTF-8       LC_MESSAGES=en_US.UTF-8      
#>  [7] LC_PAPER=es_CL.UTF-8          LC_NAME=es_CL.UTF-8          
#>  [9] LC_ADDRESS=es_CL.UTF-8        LC_TELEPHONE=es_CL.UTF-8     
#> [11] LC_MEASUREMENT=es_CL.UTF-8    LC_IDENTIFICATION=es_CL.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] purrr_0.3.4     dplyr_1.0.0     tabulizer_0.2.2 rJava_0.9-12   
#> 
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.29          magrittr_1.5        tidyselect_1.1.0   
#>  [4] R6_2.4.1            rlang_0.4.6         stringr_1.4.0      
#>  [7] highr_0.8           tools_3.6.3         xfun_0.15          
#> [10] png_0.1-7           htmltools_0.5.0     ellipsis_0.3.1     
#> [13] yaml_2.2.1          digest_0.6.25       tibble_3.0.1       
#> [16] lifecycle_0.2.0     crayon_1.3.4        vctrs_0.3.1        
#> [19] glue_1.4.1          evaluate_0.14       rmarkdown_2.3      
#> [22] stringi_1.4.6       compiler_3.6.3      pillar_1.4.4       
#> [25] tabulizerjars_1.0.1 generics_0.0.2      pkgconfig_2.0.3

Created on 2020-07-01 by the reprex package (v0.3.0)

DOC.pdf
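
To see where the missing rows come from, it may help to count the rows each extracted table contributes and compare those counts against the PDF page by page. The following is a minimal sketch, not part of the original report: `rows_per_table()` is a hypothetical helper, and the matrices are stand-in data rather than the contents of DOC.pdf. It assumes the input is a list of matrices, which is what `extract_tables()` returns by default.

```r
# Hypothetical helper: tabulate how many rows each extracted table contributed.
# `tables` is assumed to be a list of matrices, like the default return value
# of tabulizer::extract_tables().
rows_per_table <- function(tables) {
  data.frame(
    table = seq_along(tables),
    rows  = vapply(tables, nrow, integer(1))
  )
}

# Stand-in data (NOT taken from DOC.pdf): three small character matrices.
example_tables <- list(
  matrix("a", nrow = 5, ncol = 3),
  matrix("b", nrow = 4, ncol = 3),
  matrix("c", nrow = 4, ncol = 3)
)

counts <- rows_per_table(example_tables)
print(counts)

# Total number of extracted rows across all tables
total_rows <- sum(counts$rows)
```

With the real data, `rows_per_table(table_list)` would give one row count per extracted table, which can be checked against the number of rows visible on the corresponding page.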


gorkang commented Jul 1, 2020

For the sake of completeness (or in case it is useful to someone), this is the solution I ended up using:

library(tabulizer)
library(dplyr)
library(purrr)

# Manually set areas for the tables (page 1 differs from the rest)
p1_area = list(c(154.24309, 17.75138, 782.15470, 600.81215))
p2_area = list(c(95.17127, 24.31492, 780, 594.24862))

# Location of the PDF
filename = "https://github.com/ropensci/tabulizer/files/4860026/DOC.pdf"

# Get the number of pages (assumes extract_tables() returns one table per page),
# then build a list with one area per page
num_pages = length(extract_tables(filename, method = "lattice"))
areas_all = c(p1_area, rep(p2_area, num_pages - 1))

# Extract tables
table_list = extract_tables(filename, method = "lattice", area = areas_all)
temp_df = 1:length(table_list) %>% map_df(~ table_list[[.x]] %>% as_tibble)
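
The area-building step above can be wrapped in a small function that also checks the shape of each area. This is only a sketch: `build_areas()` is a hypothetical helper, not a tabulizer function, and the page count of 4 is a placeholder rather than the real length of DOC.pdf. It assumes each area is a four-element numeric vector in the order `c(top, left, bottom, right)`, as the `area` argument of `extract_tables()` expects.

```r
# Hypothetical helper: build one area per page, with a different area for page 1.
# Each area is assumed to be c(top, left, bottom, right).
build_areas <- function(first_area, other_area, n_pages) {
  stopifnot(
    is.numeric(first_area), length(first_area) == 4,
    is.numeric(other_area), length(other_area) == 4,
    n_pages >= 1
  )
  # Page 1 gets `first_area`; every remaining page gets `other_area`
  c(list(first_area), rep(list(other_area), n_pages - 1))
}

areas <- build_areas(
  first_area = c(154.24309, 17.75138, 782.15470, 600.81215),
  other_area = c(95.17127, 24.31492, 780, 594.24862),
  n_pages    = 4  # placeholder page count, not taken from DOC.pdf
)

length(areas)
```

The resulting list could then be passed as `area = areas` in the `extract_tables()` call, in place of `areas_all`.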
