How to extract bold text from a PDF using R












-1















I have searched through SO, and the closest I got to an answer was here.
But I need a simpler and more elegant way to extract bold text from a simple paragraph in a PDF. The pdftools package only extracts the plain-text component. Does anyone know of another way to detect bold tokens (or words) in a chunk of PDF text? I use R, so kindly restrict suggestions to R.



































  • "Please don't advise possibly using a particular tool because I am unwilling to do the work necessary to set up a usable data science environment" is not exactly going to cause folks to come running to this question. Thousands of R folks manage to have a working R + rJava environment. It has some headaches. Ultimately it's worth it b/c you get access to a whole world of great Java libraries. Anyway, not going to bother with a full answer but github.com/hrbrmstr/pdfbox can likely help (but there's that rJava "work" again).

    – hrbrmstr
    Nov 20 '18 at 18:50











  • Can you provide a sample PDF?

    – Ralf Stubner
    Nov 20 '18 at 19:02











  • Thank you for the direct advice, @hrbrmstr. I get what you are trying to say. Apologies for sounding lazy. I will check your link and also try to set up an rJava environment. What I am also understanding, reading between your lines, is that rJava is essential for a data science environment if I use R. Am I correct?

    – Sanjay Mehrotra
    Nov 21 '18 at 6:25













  • @RalfStubner: I use the PDF on this link here. Although it is a bit large, it has all its titles in bold, so it is an ideal case for text processing that uses bold titles as section headers. If you can suggest an easier way, I would be happy to use it (not intending to be lazy).

    – Sanjay Mehrotra
    Nov 21 '18 at 6:34











  • So your actual aim is not “identify bold text” but “identify section titles”?

    – Ralf Stubner
    Nov 21 '18 at 8:39
















r pdf














edited Nov 23 '18 at 7:50







Sanjay Mehrotra

















asked Nov 20 '18 at 17:41









Sanjay Mehrotra

























3 Answers


















2














You don't have to use tabulizer, but I don't know a way that does not involve Java. I had hoped that Apache Tika could be used via the rtika package, but it does not render bold text as such. One can, however, use pdfbox as shown in that ticket:



 java -jar <pdfbox-jar> ExtractText -html <pdf-file> <html-file>


This command would normally be started in a shell, but you can also use system() or system2() from within R. Then in R use



html <- xml2::read_html(<html-file>)
bold <- xml2::xml_find_all(html, '//b')
head(xml2::xml_contents(bold))


to process the HTML file.
With your document this returns



{xml_nodeset (6)}
[1] Preamble\n
[2] WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;\n
[3] History\n
[4] Ancient and Medieval Period\n
[5] The Introduction of English Law Into India\n
[6] Mofussal Courts\n
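
For completeness, the shell command above could be driven from R roughly as follows. This is only a minimal sketch under assumptions: `pdfbox_args()` and `run_pdfbox()` are hypothetical helper names (not part of any package), and the jar path is whichever pdfbox-app jar you have downloaded yourself.

```r
# Hypothetical helper: build the argument vector for pdfbox's ExtractText tool.
# shQuote() protects paths that contain spaces when system2() builds the command.
pdfbox_args <- function(jar, pdf, html) {
  c("-jar", shQuote(jar), "ExtractText", "-html", shQuote(pdf), shQuote(html))
}

# Hypothetical helper: run the conversion and return the HTML path invisibly.
run_pdfbox <- function(jar, pdf, html) {
  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call. = FALSE)
  }
  status <- system2(java, pdfbox_args(jar, pdf, html))
  if (status != 0) {
    stop("pdfbox exited with status ", status, call. = FALSE)
  }
  invisible(html)
}
```

Calling `run_pdfbox("<pdfbox-jar>", "<pdf-file>", "<html-file>")` with real paths should leave an HTML file that the `xml2` code can read.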





























  • Ralf, I think your answer uses Java, so I am not trying it as of now. Also, the remaining parts seem to be the same as @hrbrmstr's. So thanks a lot. I am now posting my working function as a new answer so that others who search for this (including myself) do not get stuck by selecting the incorrect HTML file.

    – Sanjay Mehrotra
    Nov 23 '18 at 11:19



















2














Along with having a flexible toolkit, data science regularly requires out-of-the-box thinking (at least in my profession).



But, first, a thing about PDF files.



I don't think they are what you think they are. "Bold" (or "italic", etc.) isn't "metadata". You should spend some time reading up on PDF files because they are complex, nasty, evil things that you are likely to encounter often when working with data. Read this — https://stackoverflow.com/a/19777953/1457051 — to see what finding bold text actually entails (follow the link to the 1.8.x Java pdfbox solution).
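
To make the point concrete: a PDF has no "bold" attribute; boldness usually comes from the font a run of text is set in. A rough heuristic, then, is to look at font names. The sketch below is an illustration only, not what pdfbox or poppler actually do: `looks_bold()` is a hypothetical helper, the font names are made up, and in practice you might get the names from something like `pdftools::pdf_fonts()`. It will miss "fake" bold produced by stroking or double-printing glyphs.

```r
# Hypothetical heuristic: treat a font as bold if its name says so.
# Real PDFs can also signal bold through font descriptor flags or rendering
# tricks, which this deliberately ignores.
looks_bold <- function(font_name) {
  grepl("bold|black|heavy", font_name, ignore.case = TRUE)
}

# Made-up font names of the kind found in PDFs (with subset prefixes).
fonts <- c(
  "ABCDEF+TimesNewRoman",
  "ABCDEF+TimesNewRoman-Bold",
  "Helvetica-BoldOblique",
  "Arial-ItalicMT"
)

looks_bold(fonts)
# FALSE TRUE TRUE FALSE
```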



Back to our irregularly scheduled answering



While I'm one of the YUGEst proponents of R, not everything needs to be done or should be done in R. Sure, we'll use R to eventually get your bold text but we'll use a helper command-line utility to do so.



The pdftools package is based on the poppler library. It comes with the source so "I'm just an R user" folks likely don't have the full poppler toolset on their system.



Mac folks can use Homebrew to install it (once you get Homebrew set up):




  • brew install poppler


Linux folks know how to do things. Windows folks are lost forever (there are poppler binaries for you, but your time would be better spent switching to a real operating system).



Once you do that, you can use the below to achieve your goal.



First, we'll make a helper function with lots of safety bumpers:



#' Uses the command-line pdftohtml function from the poppler library
#' to convert a PDF to HTML and then read it in with xml2::read_html()
#'
#' @md
#' @param path the path to the file; [path.expand()] will be run on this value
#' @param extra_args extra command-line arguments to be passed to `pdftohtml`.
#'        They should be supplied as you would supply arguments to the `args`
#'        parameter of [system2()].
read_pdf_as_html <- function(path, extra_args=character()) {

  # make sure poppler/pdftohtml is installed
  pdftohtml <- Sys.which("pdftohtml")
  if (pdftohtml == "") {
    stop("The pdftohtml command-line utility must be installed.", call.=FALSE)
  }

  # make sure the file exists
  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  # we're going to do the conversion in a temp directory space
  td <- tempfile(fileext = "_dir")
  dir.create(td)
  on.exit(unlink(td, recursive=TRUE), add=TRUE)

  # save our current working directory
  curwd <- getwd()
  on.exit(setwd(curwd), add=TRUE)

  # move to the temp space
  setwd(td)
  file.copy(path, td)

  # collect the extra arguments
  c(
    "-i" # ignore images
  ) -> args

  # "r-doc" is the output base name; the page content ends up in r-docs.html
  args <- c(args, extra_args, basename(path), "r-doc")

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  # we'll let stderr display so you can debug errors
  system2(
    command = pdftohtml,
    args = args,
    stdout = TRUE
  ) -> res

  res <- gsub("^Page-", "", res[length(res)])
  message("Converted ", res, " pages")

  # this will need to be changed if poppler ever does anything different
  xml2::read_html("r-docs.html")

}


Now, we'll use it:



doc <- read_pdf_as_html("~/Data/Mulla__Indian_Contract_Act2018-11-12_01-00.PDF")

bold_tags <- html_nodes(doc, xpath=".//b")

bold_words <- html_text(bold_tags)

head(bold_words, 20)
## [1] "Preamble"
## [2] "WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;"
## [3] "History"
## [4] "Ancient and Medieval Period"
## [5] "The Introduction of English Law Into India"
## [6] "Mofussal Courts"
## [7] "Legislation"
## [8] "The Indian Contract Act 1872"
## [9] "The Making of the Act"
## [10] "Law of Contract Until 1950"
## [11] "The Law of Contract after 1950"
## [12] "Amendments to This Act"
## [13] "Other Laws Affecting Contracts and Enforcement"
## [14] "Recommendations of the Indian Law Commission"
## [15] "Section 1."
## [16] "Short title"
## [17] "Extent, Commencement."
## [18] "Enactments Repealed."
## [19] "Applicability of the Act"
## [20] "Scheme of the Act"

length(bold_words)
## [1] 1939


No Java required at all and you've got your bold words.
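
Since the question asks for bold tokens (words) rather than whole bold runs, the strings can be split further. A minimal sketch, assuming whitespace is a good enough word boundary for your document; `bold_tokens()` is a hypothetical helper and `sample_bold` is just a stand-in for the real `bold_words` vector:

```r
# Hypothetical helper: split bold runs into individual whitespace-separated tokens.
bold_tokens <- function(x) {
  x <- trimws(x)      # drop leading/trailing whitespace left over from the HTML
  x <- x[nzchar(x)]   # ignore empty <b></b> runs
  unlist(strsplit(x, "[[:space:]]+"))
}

# Stand-in for a few entries of bold_words.
sample_bold <- c("Preamble", " Ancient and Medieval Period", "Section 1.")

bold_tokens(sample_bold)
# "Preamble" "Ancient" "and" "Medieval" "Period" "Section" "1."
```

Punctuation stays attached to the tokens here; strip it separately if you need clean words.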



If you do want to go the pdfbox-app route as Ralf noted, you can use this wrapper to make it easier to work with:



read_pdf_as_html_with_pdfbox <- function(path) {

  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call.=FALSE)
  }

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(httr, warn.conflicts = FALSE, quietly = TRUE)
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # download the pdfbox "app" if not installed
  if (!dir.exists("~/.pdfboxjars")) {
    message("~/.pdfboxjars not found. Creating it and downloading pdfbox-app jar...")
    dir.create("~/.pdfboxjars")
    httr::GET(
      url = "http://central.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.12/pdfbox-app-2.0.12.jar",
      httr::write_disk(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
      httr::progress()
    ) -> res
    httr::stop_for_status(res)
  }

  # we're going to do the conversion in a temp directory space
  tf <- tempfile(fileext = ".html")
  on.exit(unlink(tf), add=TRUE)

  c(
    "-jar",
    path.expand(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
    "ExtractText",
    "-html",
    path,
    tf
  ) -> args

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  system2(
    command = java,
    args = args
  ) -> res

  xml2::read_html(tf)

}































  • @hrbrmstr: thank you so much. I have started implementing this. Something seems amiss with the pdftohtml command. After installing poppler with Homebrew (it went through successfully on macOS Mojave), creating the temp directory, copying the PDF file there, etc., pdftohtml returns status 1. Warning message: In system2(command = "pdftohtml", args = args, stdout = TRUE) : running command ''pdftohtml' -i' had status 1. Please note I have not supplied extra_args to pdftohtml; only args = "-i" was used.

    – Sanjay Mehrotra
    Nov 21 '18 at 17:20













  • Try running it without the function. In a terminal, try just pdftohtml -i thenameofyourpdffile.pdf and then a read_html on the file that has an s towards the end.

    – hrbrmstr
    Nov 21 '18 at 22:00











  • I managed to convert the pdf to html using the command pdftohtml -i file.pdf followed by bold_tags <- html_nodes(doc, xpath=".//b"); bold_words <- html_text(bold_tags) but the variable bold_words is a 0 length character vector.

    – Sanjay Mehrotra
    Nov 22 '18 at 18:18











  • Can you add what you did and the exact (complete with library calls) R code after that to the original question?

    – hrbrmstr
    Nov 22 '18 at 18:20













  • Bingo it's working! I was using the incorrect html file out of the three files formed by default. Also, I suggest we simplify the answer for others. I have made this simple function that delivers the output without adding too many checks.

    – Sanjay Mehrotra
    Nov 23 '18 at 11:14



















0














This answer is based on the answers from @hrbrmstr and @Ralf Stubner, so thanks to them. I've made it simpler (mainly by taking out the handling of the HTML conversion and file naming). It is also tailored for macOS users (and perhaps Linux); I am not sure about Windows.

I presume you have pdftohtml installed on your machine. If not, use brew install poppler, which provides pdftohtml. If you do not have Homebrew on your Mac, install it first. A link is provided to help you with Homebrew.

Once you are sure pdftohtml is installed on the Mac, use this R function to extract bold text from any PDF document.



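library(magrittr)
library(rvest)
library(stringr)

# pass a pdf file in current directory to this function
extr_bold <- function(file) {
  # strip the extension (case-insensitive) to get pdftohtml's base name
  basefile <- str_remove(file, regex("\\.pdf$", ignore_case = TRUE))
  # pdftohtml writes the page content to <base>s.html
  htmlfile <- paste0(basefile, "s", ".html")
  # convert only if the HTML file is not already on disk
  if (!file.exists(htmlfile)) {
    system2("pdftohtml", args = c("-i", file), stdout = NULL)
  }
  nodevar <- read_html(htmlfile)
  x <- html_nodes(nodevar, xpath = ".//b")
  html_text(x)
}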
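
If you want to check the extraction step without converting a PDF first, the sketch below runs the same XPath against an inline HTML string. It is an illustration under assumptions: `content_html_for()` is a hypothetical helper that only mirrors the "file with an s towards the end" naming described above, and the HTML is a made-up stand-in for real pdftohtml output.

```r
library(xml2)

# Hypothetical helper: name of the page-content file that pdftohtml is
# expected to write next to the PDF (<base>s.html), per the naming above.
content_html_for <- function(pdf) {
  paste0(tools::file_path_sans_ext(pdf), "s.html")
}

content_html_for("Act.PDF")
# "Acts.html"

# Made-up stand-in for converted output: two bold runs and one plain line.
sample_html <- "<html><body><b>Preamble</b><br/>Plain text<br/><b>History</b></body></html>"

demo_doc  <- read_html(sample_html)
demo_bold <- xml_text(xml_find_all(demo_doc, ".//b"))
demo_bold
# "Preamble" "History"
```

If `demo_bold` comes back empty on your own file, the likely culprit is reading the frame or index HTML file rather than the content one.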































  • @hrbrmstr: could you please check my answer and see if it makes sense? Or does it need some improvement, without adding complexity and of course keeping it easy to read? Thanks.

    – Sanjay Mehrotra
    Nov 27 '18 at 6:09











answered Nov 21 '18 at 11:59









Ralf Stubner













  • I think Ralf your answer uses java hence I am not trying it out of now. Also, the remaining parts seem to be the same as @hrbmstr. So thanks a lot. I am now replicating my working function as a new answer so that others who search on this (including myself) donot get stuck by selecting the incorrect html file.

    – Sanjay Mehrotra
    Nov 23 '18 at 11:19



















  • I think Ralf your answer uses java hence I am not trying it out of now. Also, the remaining parts seem to be the same as @hrbmstr. So thanks a lot. I am now replicating my working function as a new answer so that others who search on this (including myself) donot get stuck by selecting the incorrect html file.

    – Sanjay Mehrotra
    Nov 23 '18 at 11:19

















I think Ralf your answer uses java hence I am not trying it out of now. Also, the remaining parts seem to be the same as @hrbmstr. So thanks a lot. I am now replicating my working function as a new answer so that others who search on this (including myself) donot get stuck by selecting the incorrect html file.

– Sanjay Mehrotra
Nov 23 '18 at 11:19





I think Ralf your answer uses java hence I am not trying it out of now. Also, the remaining parts seem to be the same as @hrbmstr. So thanks a lot. I am now replicating my working function as a new answer so that others who search on this (including myself) donot get stuck by selecting the incorrect html file.

– Sanjay Mehrotra
Nov 23 '18 at 11:19













2














Along with having a flexible toolkit, data science regularly requires out-of-the-box thinking (at least in my profession).



But, first, a thing about PDF files.



I don't think they are what you think they are. "Bold" (or "italic", etc.) isn't "metadata". You should spend some time reading up on PDF files because they are complex, nasty, evil things that you are likely to encounter often when working with data. Read this — https://stackoverflow.com/a/19777953/1457051 — to see what finding bold text actually entails (follow the link to the 1.8.x Java pdfbox solution).



Back to our irregularly scheduled answering



While I'm one of the YUGEst proponents of R, not everything needs to be done or should be done in R. Sure, we'll use R to eventually get your bold text but we'll use a helper command-line utility to do so.



The pdftools package is based on the poppler library. It comes with the source so "I'm just an R user" folks likely don't have the full poppler toolset on their system.



Mac folks can use Homebrew to (once you get Homebrew setup):




  • brew install poppler


Linux folks know how to do things. Windows folks are lost forever (there are poppler binaries for you, but your time would be better spent switching to a real operating system).



Once you do that, you can use the below to achieve your goal.



First, we'll make a helper function with lots of safety bumpers:



#' Uses the command-line pdftohtml function from the poppler library
#' to convert a PDF to HTML and then read it in with xml2::read_html()
#'
#' @md
#' @param path the path to the file; [path.expand()] will be run on this value
#' @param extra_args extra command-line arguments to be passed to `pdftohtml`.
#'        They should be supplied as you would supply arguments to the `args`
#'        parameter of [system2()].
read_pdf_as_html <- function(path, extra_args=character()) {

  # make sure poppler/pdftohtml is installed
  pdftohtml <- Sys.which("pdftohtml")
  if (pdftohtml == "") {
    stop("The pdftohtml command-line utility must be installed.", call.=FALSE)
  }

  # make sure the file exists
  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  # we're going to do the conversion in a temp directory space
  td <- tempfile(fileext = "_dir")
  dir.create(td)
  on.exit(unlink(td, recursive=TRUE), add=TRUE)

  # save our current working directory
  curwd <- getwd()
  on.exit(setwd(curwd), add=TRUE)

  # move to the temp space
  setwd(td)
  file.copy(path, td)

  # collect the extra arguments
  c(
    "-i" # ignore images
  ) -> args

  # quote the file name so that names containing spaces survive the shell;
  # with the "r-doc" prefix the converted text ends up in r-docs.html
  args <- c(args, extra_args, shQuote(basename(path)), "r-doc")

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  # we'll let stderr display so you can debug errors
  system2(
    command = pdftohtml,
    args = args,
    stdout = TRUE
  ) -> res

  res <- gsub("^Page-", "", res[length(res)])
  message("Converted ", res, " pages")

  # this will need to be changed if poppler ever does anything different
  xml2::read_html("r-docs.html")

}


Now, we'll use it:



doc <- read_pdf_as_html("~/Data/Mulla__Indian_Contract_Act2018-11-12_01-00.PDF")

bold_tags <- html_nodes(doc, xpath=".//b")

bold_words <- html_text(bold_tags)

head(bold_words, 20)
## [1] "Preamble"
## [2] "WHEREAS it is expedient to define and amend certain parts of the law relating to contracts;"
## [3] "History"
## [4] "Ancient and Medieval Period"
## [5] "The Introduction of English Law Into India"
## [6] "Mofussal Courts"
## [7] "Legislation"
## [8] "The Indian Contract Act 1872"
## [9] "The Making of the Act"
## [10] "Law of Contract Until 1950"
## [11] "The Law of Contract after 1950"
## [12] "Amendments to This Act"
## [13] "Other Laws Affecting Contracts and Enforcement"
## [14] "Recommendations of the Indian Law Commission"
## [15] "Section 1."
## [16] "Short title"
## [17] "Extent, Commencement."
## [18] "Enactments Repealed."
## [19] "Applicability of the Act"
## [20] "Scheme of the Act"

length(bold_words)
## [1] 1939
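To see what the `.//b` XPath is doing without running the whole conversion, here is a small self-contained sketch. The HTML string is a hand-written stand-in for converter output (an assumption, not real pdftohtml output), and it calls xml2's `xml_find_all()` and `xml_text()` directly; rvest's `html_nodes()` and `html_text()` do essentially the same job.

```r
library(xml2)

# a hand-written stand-in for the HTML that a PDF-to-HTML converter might emit
html <- paste0(
  "<html><body>",
  "<b>Preamble</b><br/>",
  "WHEREAS it is expedient to define certain parts of the law<br/>",
  "<b>History</b><br/>",
  "Some ordinary paragraph text<br/>",
  "</body></html>"
)

doc <- read_html(html)

# every <b> element anywhere below the document root
bold_tags <- xml_find_all(doc, ".//b")

# the text inside each of those elements
bold_words <- xml_text(bold_tags)
bold_words
# expected: "Preamble" "History"
```

Only the text wrapped in `<b>` comes back; the plain lines in between are ignored.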


No Java required at all and you've got your bold words.
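Since the real aim in the question's comments is to use the bold titles as section headers, a bit of post-processing may help. This is only a sketch: the sample vector is copied from the output above, and the assumption that section markers look like "Section N." is mine, based on that output, so adjust the pattern to your document.

```r
# a few entries copied from the output shown earlier
bold_words <- c("Recommendations of the Indian Law Commission", "Section 1.",
                "Short title", "Extent, Commencement.", "Enactments Repealed.")

# tidy up: trim whitespace and drop empty strings
bold_words <- trimws(bold_words)
bold_words <- bold_words[nzchar(bold_words)]

# assumed pattern for a section marker in this particular document
is_section <- grepl("^Section [0-9]+", bold_words)

# running section counter: 0 means front matter before the first "Section N."
section_id <- cumsum(is_section)

# group the remaining bold titles by the section they fall under
titles_by_section <- split(bold_words[!is_section], section_id[!is_section])
titles_by_section
```

Here group "0" holds the one title that precedes "Section 1." and group "1" holds the three titles that follow it.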



If you do want to go the pdfbox-app route as Ralf noted, you can use this wrapper to make it easier to work with:



read_pdf_as_html_with_pdfbox <- function(path) {

  java <- Sys.which("java")
  if (java == "") {
    stop("Java binary is not on the system PATH.", call.=FALSE)
  }

  # get by with a little help from our friends
  suppressPackageStartupMessages({
    library(httr, warn.conflicts = FALSE, quietly = TRUE)
    library(xml2, warn.conflicts = FALSE, quietly = TRUE)
    library(rvest, warn.conflicts = FALSE, quietly = TRUE)
  })

  path <- path.expand(path)
  stopifnot(file.exists(path))

  # pdf's should really have a PDF extension
  stopifnot(tolower(tools::file_ext(path)) == "pdf")

  # download the pdfbox "app" if not installed
  # (the original central.maven.org host has been retired;
  #  repo1.maven.org is the current Maven Central host)
  if (!dir.exists("~/.pdfboxjars")) {
    message("~/.pdfboxjars not found. Creating it and downloading pdfbox-app jar...")
    dir.create("~/.pdfboxjars")
    httr::GET(
      url = "https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/2.0.12/pdfbox-app-2.0.12.jar",
      httr::write_disk(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar")),
      httr::progress()
    ) -> res
    httr::stop_for_status(res)
  }

  # the converted HTML goes into a temp file that is removed on exit
  tf <- tempfile(fileext = ".html")
  on.exit(unlink(tf), add=TRUE)

  # quote the paths so that names containing spaces survive the shell
  c(
    "-jar",
    shQuote(path.expand(file.path("~/.pdfboxjars", "pdfbox-app-2.0.12.jar"))),
    "ExtractText",
    "-html",
    shQuote(path),
    shQuote(tf)
  ) -> args

  # this could take seconds so inform users what's going on
  message("Converting ", basename(path), "...")

  system2(
    command = java,
    args = args
  ) -> res

  xml2::read_html(tf)

}





edited Nov 21 '18 at 12:23

answered Nov 21 '18 at 12:02

hrbrmstr













  • @hrbrmstr: thank you so much. I have started the process of implementation. Something seems to be amiss in the pdftohtml command. After homebrewing poppler (which went through successfully on macOS Mojave), creating the temp directory, copying the pdf file there, etc., pdftohtml returns a status of 1. Warning message: In system2(command = "pdftohtml", args = args, stdout = TRUE) : running command ''pdftohtml' -i' had status 1; Please note I have not supplied extra_args to pdftohtml. Only args = "-i" was used.

    – Sanjay Mehrotra
    Nov 21 '18 at 17:20













  • Try running it without the function. In a terminal, try just pdftohtml -i thenameofyourpdffile.pdf and then a read_html on the file that has an s towards the end.

    – hrbrmstr
    Nov 21 '18 at 22:00











  • I managed to convert the pdf to html using the command pdftohtml -i file.pdf followed by bold_tags <- html_nodes(doc, xpath=".//b"); bold_words <- html_text(bold_tags) but the variable bold_words is a 0 length character vector.

    – Sanjay Mehrotra
    Nov 22 '18 at 18:18











  • Can you add what you did and the exact (complete with library calls) R code after that to the original question?

    – hrbrmstr
    Nov 22 '18 at 18:20













  • Bingo it's working! I was using the incorrect html file out of the three files formed by default. Also, I suggest we simplify the answer for others. I have made this simple function that delivers the output without adding too many checks.

    – Sanjay Mehrotra
    Nov 23 '18 at 11:14











































This answer is based on the answers received from @hrbrmstr and @RalfStubner, so thanks to them. I've made the answer simpler (mainly by taking out the peculiarities of the HTML conversion and file naming). It is tailored for macOS users (and perhaps Linux too); I'm not sure about Windows.



I presume you have pdftohtml installed on your machine. If not, use brew install poppler, which provides it. If you do not have Homebrew on your Mac, install it first. A link is provided to help you with Homebrew.



Once you are sure pdftohtml is installed on the Mac, use this R function to extract the bold text from any PDF document.



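library(magrittr)
library(rvest)
library(stringr)

# pass a pdf file in current directory to this function
extr_bold <- function(file) {
  # strip the extension to get the base name pdftohtml uses for its output
  basefile <- str_remove(file, "\\.(pdf|PDF)$")
  # pdftohtml writes the converted text to "<basefile>s.html"
  htmlfile <- paste0(basefile, "s", ".html")
  # only run the conversion if the HTML file is not already on disk
  if (!file.exists(htmlfile))
    system2("pdftohtml", args = c("-i", shQuote(file)), stdout = NULL)
  nodevar <- read_html(htmlfile)
  x <- html_nodes(nodevar, xpath = ".//b")
  html_text(x)
}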
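The "s" in the file name above is the part that tripped me up, so here is a small sketch of the file names pdftohtml appears to write in its default (framed) mode. The helper name `pdftohtml_outputs` is hypothetical, and the naming is based on what is described in this thread and on my understanding of pdftohtml's usual behaviour, so check it against your poppler version.

```r
# file names pdftohtml is expected to write next to the PDF (default framed mode)
pdftohtml_outputs <- function(pdf_file) {
  # drop the .pdf / .PDF extension, keeping any directory part
  base <- sub("\\.pdf$", "", pdf_file, ignore.case = TRUE)
  c(
    frameset = paste0(base, ".html"),     # frame container, no document text
    index    = paste0(base, "_ind.html"), # index of page links
    content  = paste0(base, "s.html")     # converted text, including the <b> tags
  )
}

out <- pdftohtml_outputs("file.pdf")
out[["content"]]
# expected: "files.html"
```

Reading the frameset or index file instead of the content file gives a zero-length result for `.//b`, which is the problem described in the comments.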































  • @hrbrmstr: could you please check my answer and see if it makes sense? Or does it need some improvement, without adding complexity and of course keeping it easy to read? Thanks.

    – Sanjay Mehrotra
    Nov 27 '18 at 6:09
















edited Nov 23 '18 at 12:14

























answered Nov 23 '18 at 11:30









Sanjay Mehrotra































