I want to get all objects except text object as an image from PDF using iTextSharp



























I am developing a program to convert PDF to PPTX for specific reasons using iTextSharp.
What I've done so far is to get all text objects and image objects and locations.
But I'm finding it difficult to get the table objects without their text.
Actually, it would be better if I could get them as images.
My plan is to merge all objects except the text objects into a background image and then put the text objects at their proper locations.
I tried to find similar questions here but have had no luck so far.
If anyone knows how to do this particular job, please answer.
Thanks.

































  • There is nothing like a table object in a PDF (unless it's properly tagged, and even then it's merely a logical table object, not a graphical one). There are only chunks of text (or whatever table content you see) and probably some graphical objects like lines or colored rectangles. Thus, it is unclear what you want.

    – mkl
    Jan 2 at 11:12













  • mkl, thanks for your reply. I hope I can get your help again on this question. I agree that there should be no table objects, but it's interesting that when I extract all images I don't see any for the tables. I used IRenderListener. Looking forward to your answer.

    – Piao David
    Jan 2 at 12:52






  • Implement IExtRenderListener which extends IRenderListener but has additional callbacks for vector graphics related instructions. Most likely these additional callbacks will be invoked for the lines or colored rectangles structuring your table.

    – mkl
    Jan 2 at 17:45











  • Thanks a lot, mkl. I tried IExtRenderListener but have no idea how to use Path. Basically, what I need to do is draw all objects on the PPTX. I'm afraid Path includes all texts and images, too. Alternatively, I'm thinking of removing all text objects from the PDF to get a temporary PDF. Then I could render the whole page (with the text objects removed) as an image and use it as a background. Do you have any idea how to implement this, i.e. removing the text objects and making a new PDF without text? Thanks in advance.

    – Piao David
    Jan 3 at 3:39











  • @mkl, I'm still struggling. Looking forward to your answer

    – Piao David
    Jan 4 at 3:02
















c# pdf itext
















asked Jan 2 at 9:25









Piao David

























2 Answers














You say

"What I've done so far is to get all text objects and image objects and locations."

but you don't go into detail about how you do so. I assume you use a matching IRenderListener implementation.

But IRenderListener, as you found out yourself, "only extracts images and texts."

The main missing objects are paths and their usages.

To extract them, too, you should implement IExtRenderListener, which extends IRenderListener but also retrieves information about paths. To understand the callback methods, please first be aware of how path-related instructions work in PDFs:





  • First there are instructions for building the actual path; these instructions essentially





    • move to some position,

    • add a line to some position from the previous position,

    • add a Bézier curve to some position from the previous position using some control points, or

    • add an upright rectangle at some position using some width and height information.



  • Then there is an optional instruction to intersect the current clip path with the generated path.


  • Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.
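
The three phases listed above can be modeled with a small recorder. This is only a sketch to make the sequence concrete: PathRecorder, Segment, SegmentKind and PaintedPath are hypothetical stand-in types invented for this illustration, not part of iTextSharp. The point it shows is that construction instructions accumulate, clipping is an optional flag, and the painting instruction ends the path so the next one starts empty.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in types for illustration; they are NOT iTextSharp classes.
public enum SegmentKind { MoveTo, LineTo, CurveTo, Rectangle }

public sealed class Segment
{
    public SegmentKind Kind { get; }
    public float[] Numbers { get; }   // coordinates or dimensions as given in the instruction
    public Segment(SegmentKind kind, params float[] numbers) { Kind = kind; Numbers = numbers; }
}

public sealed class PaintedPath
{
    public IReadOnlyList<Segment> Segments { get; }
    public bool Fill { get; }
    public bool Stroke { get; }
    public bool UsedForClipping { get; }
    public PaintedPath(IReadOnlyList<Segment> segments, bool fill, bool stroke, bool usedForClipping)
    {
        Segments = segments; Fill = fill; Stroke = stroke; UsedForClipping = usedForClipping;
    }
}

public sealed class PathRecorder
{
    private List<Segment> _current = new List<Segment>();
    private bool _clipRequested;

    public List<PaintedPath> Finished { get; } = new List<PaintedPath>();

    // Phase 1: every construction instruction extends the current path.
    public void Add(SegmentKind kind, params float[] numbers)
    {
        _current.Add(new Segment(kind, numbers));
    }

    // Phase 2 (optional): remember that this path also narrows the clip area.
    public void Clip()
    {
        _clipRequested = true;
    }

    // Phase 3: painting ends the path. Fill and stroke may both be false,
    // which is what a path used only for clipping looks like.
    public void Paint(bool fill, bool stroke)
    {
        Finished.Add(new PaintedPath(_current, fill, stroke, _clipRequested));
        _current = new List<Segment>();
        _clipRequested = false;
    }
}
```

For a table, one would typically expect many short entries in Finished: filled rectangles for shaded cells and stroked lines or rectangles for the borders.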



This corresponds to the callbacks you receive in your IExtRenderListener implementation:



/**
* Called when the current path is being modified. E.g. new segment is being added,
* new subpath is being started etc.
*
* @param renderInfo Contains information about the path segment being added to the current path.
*/
void ModifyPath(PathConstructionRenderInfo renderInfo);


is called one or more times to build the actual path. PathConstructionRenderInfo contains the instruction type in its Operation property (compare it to the PathConstructionRenderInfo constants MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that is currently set to be applied to all drawing operations.
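
To place a segment on the page you have to apply that transformation to the segment coordinates yourself. The following is a minimal sketch of the arithmetic only. It assumes you have already read the six matrix numbers a, b, c, d, e, f out of the Ctm (how you do that depends on your iTextSharp version) and that rectangle data arrives as x, y, width, height, as in the PDF rectangle instruction. PathGeometry is a hypothetical helper name, not a library class.

```csharp
using System;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class PathGeometry
{
    // Maps (x, y) through the PDF affine matrix [a b 0; c d 0; e f 1]:
    //   x' = a*x + c*y + e
    //   y' = b*x + d*y + f
    // m holds the six numbers in the order a, b, c, d, e, f.
    public static float[] Transform(float[] m, float x, float y)
    {
        if (m == null || m.Length != 6)
            throw new ArgumentException("expected six numbers a, b, c, d, e, f", nameof(m));
        return new[] { m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5] };
    }

    // Turns rectangle data (x, y, width, height) into its four transformed corners.
    // After a rotation or skew the result is no longer an upright rectangle,
    // which is why all four corners are returned.
    public static float[][] RectangleCorners(float[] m, float x, float y, float w, float h)
    {
        return new[]
        {
            Transform(m, x, y),
            Transform(m, x + w, y),
            Transform(m, x + w, y + h),
            Transform(m, x, y + h),
        };
    }
}
```

With the identity matrix (1, 0, 0, 1, 0, 0) the coordinates come back unchanged, which is a quick sanity check when wiring this up.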



Then



/**
* Called when the current path should be set as a new clipping path.
*
* @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
*/
void ClipPath(int rule);


is called if the current clip path is to be intersected with the constructed path.



Finally



/**
* Called when the current path should be rendered.
*
* @param renderInfo Contains information about the current path which should be rendered.
* @return The path which can be used as a new clipping path.
*/
Path RenderPath(PathPaintingRenderInfo renderInfo);


is called. PathPaintingRenderInfo contains the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.
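
If you redraw filled paths yourself, the two rules can give different results. Here is a small sketch of how they differ, under simplifying assumptions: every subpath is a closed polygon with straight edges (curves would have to be flattened first), and FillRules is a hypothetical helper written for this illustration, not an iTextSharp class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class FillRules
{
    // Each subpath is a closed polygon given as x0, y0, x1, y1, ...
    // Casts a ray from (px, py) towards +x and counts edge crossings:
    // winding adds +1 for upward and -1 for downward edges, crossings counts both.
    private static void Count(IEnumerable<float[]> subpaths, float px, float py,
                              out int winding, out int crossings)
    {
        winding = 0;
        crossings = 0;
        foreach (var p in subpaths)
        {
            int n = p.Length / 2;
            for (int i = 0; i < n; i++)
            {
                int j = (i + 1) % n;
                float x1 = p[2 * i], y1 = p[2 * i + 1];
                float x2 = p[2 * j], y2 = p[2 * j + 1];
                // Positive when the point lies to the left of the directed edge.
                float side = (x2 - x1) * (py - y1) - (px - x1) * (y2 - y1);
                if (y1 <= py && py < y2 && side > 0) { winding++; crossings++; }
                else if (y2 <= py && py < y1 && side < 0) { winding--; crossings++; }
            }
        }
    }

    // Nonzero winding rule: inside if the signed crossings do not cancel out.
    public static bool InsideNonZero(IEnumerable<float[]> subpaths, float px, float py)
    {
        int winding, crossings;
        Count(subpaths, px, py, out winding, out crossings);
        return winding != 0;
    }

    // Even-odd rule: inside if the ray crosses an odd number of edges.
    public static bool InsideEvenOdd(IEnumerable<float[]> subpaths, float px, float py)
    {
        int winding, crossings;
        Count(subpaths, px, py, out winding, out crossings);
        return crossings % 2 == 1;
    }
}
```

For example, take an outer square and a smaller inner square drawn in the same direction as two subpaths of one path. A point inside the inner square is filled under the nonzero winding rule but left as a hole under the even-odd rule.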






























  • Thanks a lot, @mkl! I think this also answers my other question. Please take a look and post it there as an answer. I would really appreciate any comments on my idea in that question. Thanks again. Link: stackoverflow.com/questions/54059341/…

    – Piao David
    Jan 7 at 12:20



































Try implementing IRenderListener:



using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageExtractor : IRenderListener
{
    private int _currentPage = 1;
    private int _imageCount = 0;
    private readonly string _outputFilePrefix;
    private readonly string _outputFolder;
    private readonly bool _overwriteExistingFiles;

    private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
    }

    /// <summary>
    /// Extract all images from a PDF file
    /// </summary>
    /// <param name="pdfPath">Full path and file name of PDF file</param>
    /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
    /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
    /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
    /// <returns>Count of number of images extracted.</returns>
    public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;

        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);

        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");

            var pdfParser = new PdfReaderContentParser(pdfReader);

            // Feed every page's content stream to this listener
            while (instance._currentPage <= pdfReader.NumberOfPages)
            {
                pdfParser.ProcessContent(instance._currentPage, instance);

                instance._currentPage++;
            }
        }

        return instance._imageCount;
    }

    #region Implementation of IRenderListener

    // Text callbacks are intentionally empty: this listener only collects images
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        var imageObject = renderInfo.GetImage();
        if (imageObject == null)
            return;

        // Number the files so that each image gets its own name, and use the
        // file type reported by iTextSharp (e.g. "jpg" or "png") as the extension
        var imageFileName = _outputFilePrefix + _imageCount + "." + imageObject.GetFileType();
        var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);

        if (_overwriteExistingFiles || !File.Exists(imagePath))
        {
            var imageRawBytes = imageObject.GetImageAsBytes();
            // Create a new file (or overwrite the existing one)
            File.WriteAllBytes(imagePath, imageRawBytes);
        }

        _imageCount++;
    }

    #endregion // Implementation of IRenderListener

}





























  • Yes, I already tried IRenderListener. This method only extracts images and texts. It does not return anything about tables. There's no table-related function.

    – Piao David
    Jan 2 at 10:26













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54003886%2fi-want-to-get-all-objects-except-text-object-as-an-image-from-pdf-using-itextsha%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









2














You say




What I've done so far is to get all text objects and image objects and locations.




but you don't go into detail how you do so. I assume you use a matching IRenderListener implementation.



But IRenderListener, as you found out yourself,




only extracts images and texts.




The main missing objects are paths and their usages.



To extract them, too, you should implement IExtRenderListener which extends IRenderListener but also retrieves information about paths. To understand the callback methods, please first be aware how path related instructions work in PDFs:





  • First there are instructions for building the actual path; these instructions essentially





    • move to some position,

    • add a line to some position from the previous position,

    • add a Bézier curve to some position from the previous position using some control points, or

    • add an upright rectangle at some position using some width and height information.



  • Then there is an optional instruction to intersect the current clip path with the generated path.


  • Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.



This corresponds to the callbacks you retrieve in your IExtRenderListener implementation:



/**
* Called when the current path is being modified. E.g. new segment is being added,
* new subpath is being started etc.
*
* @param renderInfo Contains information about the path segment being added to the current path.
*/
void ModifyPath(PathConstructionRenderInfo renderInfo);


is called once or more often to build the actual path, PathConstructionRenderInfo containing the actual instruction type in its Operation property (compare to the PathConstructionRenderInfo constant members MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that currently is set to be applied to all drawing operations.



Then



/**
* Called when the current path should be set as a new clipping path.
*
* @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
*/
void ClipPath(int rule);


is called if the current clip path shall be intersected with the constructed path.



Finally



/**
* Called when the current path should be rendered.
*
* @param renderInfo Contains information about the current path which should be rendered.
* @return The path which can be used as a new clipping path.
*/
Path RenderPath(PathPaintingRenderInfo renderInfo);


is called, PathPaintingRenderInfo containing the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.






share|improve this answer
























  • Thanks a lot, @mkl! I think this will be an answer to my another question. Please check and share this link as an answer. I would really appreciate your any comments on my thought in that question. Thanks again. LINK:stackoverflow.com/questions/54059341/…

    – Piao David
    Jan 7 at 12:20


















2














You say




What I've done so far is to get all text objects and image objects and locations.




but you don't go into detail how you do so. I assume you use a matching IRenderListener implementation.



But IRenderListener, as you found out yourself,




only extracts images and texts.




The main missing objects are paths and their usages.



To extract them, too, you should implement IExtRenderListener which extends IRenderListener but also retrieves information about paths. To understand the callback methods, please first be aware how path related instructions work in PDFs:





  • First there are instructions for building the actual path; these instructions essentially





    • move to some position,

    • add a line to some position from the previous position,

    • add a Bézier curve to some position from the previous position using some control points, or

    • add an upright rectangle at some position using some width and height information.



  • Then there is an optional instruction to intersect the current clip path with the generated path.


  • Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.



This corresponds to the callbacks you retrieve in your IExtRenderListener implementation:



/**
* Called when the current path is being modified. E.g. new segment is being added,
* new subpath is being started etc.
*
* @param renderInfo Contains information about the path segment being added to the current path.
*/
void ModifyPath(PathConstructionRenderInfo renderInfo);


is called once or more often to build the actual path, PathConstructionRenderInfo containing the actual instruction type in its Operation property (compare to the PathConstructionRenderInfo constant members MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that currently is set to be applied to all drawing operations.



Then



/**
* Called when the current path should be set as a new clipping path.
*
* @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
*/
void ClipPath(int rule);


is called if the current clip path shall be intersected with the constructed path.



Finally



/**
* Called when the current path should be rendered.
*
* @param renderInfo Contains information about the current path which should be rendered.
* @return The path which can be used as a new clipping path.
*/
Path RenderPath(PathPaintingRenderInfo renderInfo);


is called, PathPaintingRenderInfo containing the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.






share|improve this answer
























  • Thanks a lot, @mkl! I think this will be an answer to my another question. Please check and share this link as an answer. I would really appreciate your any comments on my thought in that question. Thanks again. LINK:stackoverflow.com/questions/54059341/…

    – Piao David
    Jan 7 at 12:20
















2












2








2







You say




What I've done so far is to get all text objects and image objects and locations.




but you don't go into detail how you do so. I assume you use a matching IRenderListener implementation.



But IRenderListener, as you found out yourself,




only extracts images and texts.




The main missing objects are paths and their usages.



To extract them, too, you should implement IExtRenderListener which extends IRenderListener but also retrieves information about paths. To understand the callback methods, please first be aware how path related instructions work in PDFs:





  • First there are instructions for building the actual path; these instructions essentially





    • move to some position,

    • add a line to some position from the previous position,

    • add a Bézier curve to some position from the previous position using some control points, or

    • add an upright rectangle at some position using some width and height information.



  • Then there is an optional instruction to intersect the current clip path with the generated path.


  • Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.



This corresponds to the callbacks you retrieve in your IExtRenderListener implementation:



/**
* Called when the current path is being modified. E.g. new segment is being added,
* new subpath is being started etc.
*
* @param renderInfo Contains information about the path segment being added to the current path.
*/
void ModifyPath(PathConstructionRenderInfo renderInfo);


is called once or more often to build the actual path, PathConstructionRenderInfo containing the actual instruction type in its Operation property (compare to the PathConstructionRenderInfo constant members MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that currently is set to be applied to all drawing operations.



Then



/**
* Called when the current path should be set as a new clipping path.
*
* @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
*/
void ClipPath(int rule);


is called if the current clip path shall be intersected with the constructed path.



Finally



/**
* Called when the current path should be rendered.
*
* @param renderInfo Contains information about the current path which should be rendered.
* @return The path which can be used as a new clipping path.
*/
Path RenderPath(PathPaintingRenderInfo renderInfo);


is called, PathPaintingRenderInfo containing the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.






answered Jan 7 at 12:04

mkl













  • Thanks a lot, @mkl! I think this also answers my other question. Could you check it and post this as an answer there? I would really appreciate any comments on my approach in that question. Thanks again. Link: stackoverflow.com/questions/54059341/…

    – Piao David
    Jan 7 at 12:20



































Try implementing IRenderListener:



using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageExtractor : IRenderListener
{
    private int _currentPage = 1;
    private int _imageCount = 0;
    private readonly string _outputFilePrefix;
    private readonly string _outputFolder;
    private readonly bool _overwriteExistingFiles;

    private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
    }

    /// <summary>
    /// Extract all images from a PDF file
    /// </summary>
    /// <param name="pdfPath">Full path and file name of PDF file</param>
    /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
    /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
    /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
    /// <returns>Count of number of images extracted.</returns>
    public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;

        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);

        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");

            var pdfParser = new PdfReaderContentParser(pdfReader);

            // Feed every page through the listener; RenderImage is called for each image found
            while (instance._currentPage <= pdfReader.NumberOfPages)
            {
                pdfParser.ProcessContent(instance._currentPage, instance);

                instance._currentPage++;
            }
        }

        return instance._imageCount;
    }

    #region Implementation of IRenderListener

    // Text callbacks are not needed for image extraction
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        var imageObject = renderInfo.GetImage();
        if (imageObject == null)
            return; // nothing to export for this image

        // Number the files so that every image gets its own file,
        // and take the extension (jpg, png, ...) from the image itself
        var imageFileName = _outputFilePrefix + _imageCount + "." + imageObject.GetFileType();
        var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);

        if (_overwriteExistingFiles || !File.Exists(imagePath))
        {
            var imageRawBytes = imageObject.GetImageAsBytes();
            File.WriteAllBytes(imagePath, imageRawBytes);
        }

        _imageCount++;
    }

    #endregion // Implementation of IRenderListener
}





























  • Yes, I already tried IRenderListener. It only extracts images and text and does not return anything about tables. There is no table-related callback.

    – Piao David
    Jan 2 at 10:26


















answered Jan 2 at 10:08

Amine
































