getAcroForm() method returning null values, but with PDFTextStripper I am able to read complete text





I have a PDF document and want to read its form fields, but docCatalog.getAcroForm() returns null instead of a PDAcroForm object. With PDFTextStripper I can extract the complete PDF as text, but I want to read the fields.



The document is here.










  • Please add some core code logic you used, and language tag.

    – psyco
    Jan 3 at 9:46








  • Please share the PDF. Maybe the fields were "flattened".

    – Tilman Hausherr
    Jan 3 at 11:02











  • Actually, I can't see any option to upload a PDF file here. The code I am using is below:

        PDDocument pdDoc = null;
        try {
            pdDoc = PDDocument.load(new FileInputStream(new File("Application for Individual Life Insurance.pdf")));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        PDDocumentCatalog docCatalog = pdDoc.getDocumentCatalog();
        PDAcroForm acroForm = docCatalog.getAcroForm();
        List fields = acroForm.getFields();

    – Vijendra Singh
    Jan 3 at 12:23













  • "Actually I can't see any option to upload pdf file here." - Usually one uses a public file sharing service (Google Drive, Dropbox, ...) and posts the URL here.

    – mkl
    Jan 3 at 17:21






  • @halfer If getAcroForm() returns null, then there are no fields. But the user believes that there are fields, so she/he saw something. Further analysis requires some knowledge of the PDF specification that goes beyond the PDFBox API.

    – Tilman Hausherr
    Jan 3 at 19:31


















pdfbox






edited Jan 4 at 18:12









halfer

asked Jan 3 at 9:21









Vijendra Singh

1 Answer
The PDF you shared does not contain any AcroForm form fields.



If you inspect the file using a PDF browser (like iText RUPS or PDFBox PDFDebugger), you'll see that the Catalog only contains a Pages and a Type entry:



[Screenshot: the Catalog dictionary]



In particular, there is no AcroForm entry, which is what bundles the data of an AcroForm form. Thus, docCatalog.getAcroForm() returns null: there is no field structure for it to return.



Looking at the last Contents stream of, e.g., page 1, one sees:



Q
q
Q
q
1 0 0 1 329.78 655.45 cm
/Xi5 Do
Q
q
Q
q
1 0 0 1 324.17 624.51 cm
/Xi8 Do
Q
q
Q
q
1 0 0 1 265.95 702.31 cm
/Xi10 Do
Q
q
Q
q
1 0 0 1 554.46 655.6 cm
/Xi17 Do
Q
...


This is typical for a PDF that used to contain an AcroForm form definition which was then flattened into the page contents: for each former form field, an XObject (which previously defined the appearance of the form field's widget annotation) is now referenced directly from the page content stream.
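To make that pattern easier to spot across many pages, the "cm followed by /Name Do" sequence can be searched for mechanically. The following is only a rough sketch in plain Java, not PDFBox code: it assumes you already have the decoded (uncompressed) content stream as a string, and the class and method names are made up for illustration. It is a naive regular-expression scan, not a real content stream parser, so it can be fooled by string literals or inline images.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Lists XObject invocations ("/Name Do") and the translation set by the "cm" right before them. */
class XObjectInvocationScan {

    /** One "/Name Do" call together with the x/y offset taken from the preceding "cm" operator. */
    static final class Invocation {
        final String name;
        final double x;
        final double y;

        Invocation(String name, double x, double y) {
            this.name = name;
            this.x = x;
            this.y = y;
        }

        @Override
        public String toString() {
            return name + " at (" + x + ", " + y + ")";
        }
    }

    // A PDF number such as 1, 329.78 or -0.5 (simplified).
    private static final String NUM = "[-+]?\\d*\\.?\\d+";

    // Matches "a b c d e f cm /Name Do"; groups: 1 = e, 2 = f, 3 = XObject name.
    private static final Pattern CM_THEN_DO = Pattern.compile(
            "(?:" + NUM + "\\s+){4}(" + NUM + ")\\s+(" + NUM + ")\\s+cm\\s+/(\\S+)\\s+Do\\b");

    /** Returns every XObject invocation that directly follows a "cm" operator, in stream order. */
    static List<Invocation> scan(String contentStream) {
        List<Invocation> found = new ArrayList<>();
        Matcher m = CM_THEN_DO.matcher(contentStream);
        while (m.find()) {
            found.add(new Invocation(m.group(3),
                    Double.parseDouble(m.group(1)), Double.parseDouble(m.group(2))));
        }
        return found;
    }

    public static void main(String[] args) {
        // Excerpt of the content stream quoted above.
        String sample = "q\n1 0 0 1 329.78 655.45 cm\n/Xi5 Do\nQ\nq\nQ\nq\n"
                + "1 0 0 1 324.17 624.51 cm\n/Xi8 Do\nQ\n";
        for (Invocation inv : scan(sample)) {
            System.out.println(inv);
        }
    }
}
```

A page whose content stream yields many such small, individually positioned XObjects while the Catalog has no AcroForm entry is a reasonable hint (not a proof) that a form was flattened.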



Thus, the only way to extract contents is via text extraction.





The obvious problem with text extraction is that it may be difficult to differentiate between former field contents and static form text like labels. Depending on the number of PDFs you have to extract data from, it might be worth extending the PDFTextStripper to add a marker for text extracted from XObject contents (in contrast to text in the immediate page contents). Such markers would allow you to differentiate the two quite well.
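As a rough illustration of how such markers could be used afterwards, here is a small plain-Java sketch. It does not show the PDFTextStripper extension itself; it assumes a customized stripper has already wrapped text coming from XObjects in the hypothetical markers "[[" and "]]". The marker strings, the class name and the sample text are all invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Splits marker-annotated extracted text into former field values and static text. */
class MarkedTextSplitter {

    // Hypothetical markers; any strings that cannot occur in the document text would do.
    static final String OPEN = "[[";
    static final String CLOSE = "]]";

    // Non-greedy match of everything between one OPEN and the next CLOSE.
    private static final Pattern MARKED = Pattern.compile(
            Pattern.quote(OPEN) + "(.*?)" + Pattern.quote(CLOSE), Pattern.DOTALL);

    /** Returns the text found between markers (the former field contents), in document order. */
    static List<String> fieldValues(String extracted) {
        List<String> values = new ArrayList<>();
        Matcher m = MARKED.matcher(extracted);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    /** Returns the extracted text with all marked spans removed (labels and other static text). */
    static String staticText(String extracted) {
        return MARKED.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up sample of what a marker-emitting stripper might produce.
        String extracted = "Name: [[Jane Doe]]\nDate of birth: [[1980-01-01]]\n";
        System.out.println(fieldValues(extracted));
        System.out.println(staticText(extracted));
    }
}
```

Keeping the markers as plain strings in the text output means the rest of the pipeline stays simple string processing; the price is that the markers must be chosen so they cannot clash with real document text.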






        answered Jan 7 at 16:00









        mkl
