getAcroForm() returns null, but with PDFTextStripper I am able to read the complete text
I have a PDF document whose fields I want to read, but docCatalog.getAcroForm() returns null instead of a PDAcroForm object.
With PDFTextStripper I am able to get the complete PDF as text, but I want to read the fields.
The document is here.
pdfbox
Please add some core code logic you used, and language tag.
– psyco
Jan 3 at 9:46
Please share the PDF. Maybe the fields were "flattened".
– Tilman Hausherr
Jan 3 at 11:02
Actually, I can't see any option to upload a PDF file here. The code I am using is as follows: PDDocument pdDoc = null; try { pdDoc = PDDocument.load((new FileInputStream(new File("Application for Individual Life Insurance.pdf")))); } catch (IOException e) { // TODO Auto-generated catch block e.printStackTrace(); } PDDocumentCatalog docCatalog = pdDoc.getDocumentCatalog(); PDAcroForm acroForm = docCatalog.getAcroForm(); List fields = acroForm.getFields();
– Vijendra Singh
Jan 3 at 12:23
"Actually I can't see any option to upload pdf file here." - usually one uses a public file sharing service (Google drive, dropbox,...) and posts the url here.
– mkl
Jan 3 at 17:21
@halfer If getAcroForm() returns null, then there are no fields. But the user believes that there are fields, so she/he saw something. Further analysis requires some knowledge of the PDF specification that goes beyond the PDFBox API.
– Tilman Hausherr
Jan 3 at 19:31
edited Jan 4 at 18:12 by halfer
asked Jan 3 at 9:21 by Vijendra Singh
1 Answer
The PDF you shared does not contain any AcroForm form fields.
If you inspect the file using a PDF browser (like iText RUPS or PDFBox PDFDebugger), you'll see that the Catalog only contains a Pages and a Type entry.
In particular, there is no AcroForm entry, which is what would bundle the data of an AcroForm form. Thus, docCatalog.getAcroForm() returns null: there is no field structure for it to return.
Looking at the last Contents stream of, e.g., page 1, one sees:
Q
q
Q
q
1 0 0 1 329.78 655.45 cm
/Xi5 Do
Q
q
Q
q
1 0 0 1 324.17 624.51 cm
/Xi8 Do
Q
q
Q
q
1 0 0 1 265.95 702.31 cm
/Xi10 Do
Q
q
Q
q
1 0 0 1 554.46 655.6 cm
/Xi17 Do
Q
...
This is typical for a PDF that used to contain an AcroForm form definition which was then flattened into the page contents: for each former form field, an XObject (which previously defined the appearance of the form field's widget annotation) is now referenced directly from the page content stream.
Thus, the only way to extract the contents is via text extraction.
The obvious problem with text extraction is that it may be difficult to differentiate between former field contents and static form text like labels. Depending on the number of PDFs you have to extract data from, it might be worth extending the PDFTextStripper to add a marker for text extracted from XObject contents (in contrast to immediate page contents). Such markers would allow you to differentiate quite well.
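To make the flattened pattern from the content stream excerpt above more concrete, here is a minimal sketch in plain Java (no PDFBox involved) that scans such operator text and lists each XObject invoked by Do together with the translation set by the preceding cm. The class name FlattenedXObjectScan, the Placement type and the scan method are hypothetical names made up for this illustration. It assumes whitespace-separated tokens, non-nested q/Q blocks and pure translation matrices, as in the excerpt; a real content stream should be read with a proper parser and a real graphics state stack.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class FlattenedXObjectScan {

    /** One XObject invocation plus the translation that was active when it was drawn. */
    static final class Placement {
        final String name;
        final double x;
        final double y;

        Placement(String name, double x, double y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }

        @Override
        public String toString() {
            return name + " at (" + x + ", " + y + ")";
        }
    }

    /**
     * Scans whitespace-separated content stream operators.
     * Simplifying assumptions: q/Q blocks are not nested, and each cm is a
     * pure translation applied to the identity matrix (operands e and f).
     */
    static List<Placement> scan(String contentStream) {
        List<Placement> placements = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        double x = 0;
        double y = 0;
        for (String token : contentStream.trim().split("\\s+")) {
            switch (token) {
                case "q":
                    // Saving the graphics state changes nothing we track here.
                    operands.clear();
                    break;
                case "Q":
                    // Assumption: blocks are not nested, so restoring means "no translation".
                    x = 0;
                    y = 0;
                    operands.clear();
                    break;
                case "cm":
                    // a b c d e f cm: e and f carry the translation.
                    if (operands.size() == 6) {
                        x = Double.parseDouble(operands.get(4));
                        y = Double.parseDouble(operands.get(5));
                    }
                    operands.clear();
                    break;
                case "Do":
                    // /Name Do paints the named XObject at the current position.
                    if (operands.size() == 1 && operands.get(0).startsWith("/")) {
                        placements.add(new Placement(operands.get(0).substring(1), x, y));
                    }
                    operands.clear();
                    break;
                default:
                    // Anything else is treated as an operand of the next operator.
                    operands.add(token);
            }
        }
        return placements;
    }

    public static void main(String[] args) {
        String excerpt = "Q q Q q 1 0 0 1 329.78 655.45 cm /Xi5 Do "
                + "Q q Q q 1 0 0 1 324.17 624.51 cm /Xi8 Do Q";
        for (Placement p : scan(excerpt)) {
            System.out.println(p);
        }
    }
}
```

The positions found this way only tell you where the former field appearances were painted. Telling former field values apart from static labels still requires knowing which extracted text came from which XObject, which is what the marker idea above is about.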
answered Jan 7 at 16:00 by mkl