a possible bug
If you are reporting (1) a bug or (2) a question about code, please supply:
Please see the reprex below. In brief, tabulizer extracts only 684 of the 746 rows in this document. It mostly skips the first row of each page (starting at page 2). Those rows have no ruling line above them.
## rJava loads successfully
# install.packages("rJava")
library("rJava")
## load package
library("tabulizer")
## code goes here
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(purrr)
table_list= extract_tables("https://github.com/ropensci/tabulizer/files/4860026/DOC.pdf", method="lattice")
temp_df=1:length(table_list) %>% map_df(~table_list[[.x]]%>% as_tibble)
#> Warning: The `x` argument of `as_tibble.matrix()` must have column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
nrow(temp_df)
#> [1] 684
# COMMENT: Tabulizer extracts 684 out of 746 rows. Mostly ignores the first rows of each page (starting at page 2). Those rows don't have a line on top.
## session info for your system
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#>
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=es_CL.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=es_CL.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=es_CL.UTF-8       LC_NAME=es_CL.UTF-8
#>  [9] LC_ADDRESS=es_CL.UTF-8     LC_TELEPHONE=es_CL.UTF-8
#> [11] LC_MEASUREMENT=es_CL.UTF-8 LC_IDENTIFICATION=es_CL.UTF-8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] purrr_0.3.4     dplyr_1.0.0     tabulizer_0.2.2 rJava_0.9-12
#>
#> loaded via a namespace (and not attached):
#>  [1] knitr_1.29          magrittr_1.5        tidyselect_1.1.0
#>  [4] R6_2.4.1            rlang_0.4.6         stringr_1.4.0
#>  [7] highr_0.8           tools_3.6.3         xfun_0.15
#> [10] png_0.1-7           htmltools_0.5.0     ellipsis_0.3.1
#> [13] yaml_2.2.1          digest_0.6.25       tibble_3.0.1
#> [16] lifecycle_0.2.0     crayon_1.3.4        vctrs_0.3.1
#> [19] glue_1.4.1          evaluate_0.14       rmarkdown_2.3
#> [22] stringi_1.4.6       compiler_3.6.3      pillar_1.4.4
#> [25] tabulizerjars_1.0.1 generics_0.0.2      pkgconfig_2.0.3
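To narrow down which pages lose rows, it may help to count the rows in each extracted table instead of only the total. The following is a minimal sketch in base R; `rows_per_table()` is a hypothetical helper, and `mock_tables` is made-up stand-in data with the same shape as the list of matrices that `extract_tables()` returns, not the contents of DOC.pdf.

```r
# Hypothetical helper: number of rows in each element of a list of tables.
rows_per_table <- function(table_list) {
  data.frame(
    table = seq_along(table_list),
    rows  = vapply(table_list, nrow, integer(1))
  )
}

# Stand-in for extract_tables() output: a list of character matrices.
# The values are invented for illustration only.
mock_tables <- list(
  matrix("a", nrow = 3, ncol = 2),
  matrix("b", nrow = 2, ncol = 2)
)

counts <- rows_per_table(mock_tables)
print(counts)

# Total rows across all tables, comparable to nrow(temp_df) in the reprex.
total_rows <- sum(counts$rows)
```

Running the same helper on the real `table_list` would show whether the shortfall is spread evenly over pages 2 onward, which is what the missing first rows would suggest.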
For completeness (and in case it is useful to someone), this is the workaround I ended up using:
library(tabulizer)
library(dplyr)
library(purrr)
# Manually set areas for the tables (page 1 is different from the rest)
# Each area is c(top, left, bottom, right)
p1_area = list(c(154.24309, 17.75138, 782.15470, 600.81215))
p2_area = list(c(95.17127, 24.31492, 780, 594.24862))
# Define the file once so both calls below use the same document
filename = "https://github.com/ropensci/tabulizer/files/4860026/DOC.pdf"
# Get the number of pages, to create one area per page
# (this assumes extract_tables() returns exactly one table per page)
num_pages = length(extract_tables(filename, method = "lattice"))
areas_all = c(p1_area, rep(p2_area, num_pages - 1))
# Extract tables
table_list = extract_tables(filename, method = "lattice", area = areas_all)
temp_df = seq_along(table_list) %>% map_df(~ table_list[[.x]] %>% as_tibble)
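The per-page area list in the workaround can also be built by a small function that checks its inputs. This is only a sketch in base R: `build_page_areas()` is a hypothetical helper, and the page count of 4 is an assumed value for illustration, not the real length of DOC.pdf.

```r
# Hypothetical helper: one area for page 1, another repeated for every later page.
# Each area is a numeric vector c(top, left, bottom, right).
build_page_areas <- function(first_area, other_area, num_pages) {
  stopifnot(
    num_pages >= 1,
    length(first_area) == 4,
    length(other_area) == 4
  )
  c(list(first_area), rep(list(other_area), num_pages - 1))
}

areas <- build_page_areas(
  first_area = c(154.24309, 17.75138, 782.15470, 600.81215),
  other_area = c(95.17127, 24.31492, 780, 594.24862),
  num_pages  = 4  # assumed page count, for illustration only
)

# One list entry per page, ready to pass as the `area` argument.
length(areas)
```

The resulting list has the same shape as `areas_all` above, so it could be passed to `extract_tables()` in its place. The input checks make a wrong page count or a malformed area fail early instead of producing a list of the wrong length.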
Created on 2020-07-01 by the reprex package (v0.3.0)
DOC.pdf