I want to get all objects except text object as an image from PDF using iTextSharp
I am developing a program that converts PDF to PPTX, for specific reasons using iTextSharp.
So far I can get all text objects and image objects together with their locations.
But I'm finding it difficult to get table objects without their text.
Actually, it would be better if I could get them as images.
My plan is to merge all objects except text objects into a background image and then put the text objects at their proper locations.
I looked for similar questions here but have had no luck so far.
If anyone knows how to do this, please answer.
Thanks.
c# pdf itext
There is no such thing as a table object in a PDF (unless it's properly tagged, and even then it's merely a logical table object, not a graphical one); there are only chunks of text (or whatever table content you see) and probably some graphical objects like lines or colored rectangles. Thus, it is unclear what you want.
– mkl
Jan 2 at 11:12
mkl, thanks for your reply. Hope I can get help from you again on this question. I agree that there should be no table objects but it's interesting that when I get all images I can't see ones for tables. I used IRenderListener. Looking forward to your answer.
– Piao David
Jan 2 at 12:52
Implement IExtRenderListener, which extends IRenderListener but has additional callbacks for vector graphics related instructions. Most likely these additional callbacks will be invoked for the lines or colored rectangles structuring your table.
– mkl
Jan 2 at 17:45
Thanks a lot, mkl. I tried IExtRenderListener but have no idea how to use Path. Basically, what I need to do is draw all objects on the PPTX slide. I'm afraid Path includes all texts and images too. Alternatively, I'm thinking of removing all text objects from the PDF to get a temporary PDF. Then I could render the whole page (with text objects removed) as an image and use it as a background. Do you have any idea how to implement it this way, i.e. removing the text objects and making a new PDF without text? Thanks in advance.
– Piao David
Jan 3 at 3:39
@mkl, I'm still struggling. Looking forward to your answer
– Piao David
Jan 4 at 3:02
asked Jan 2 at 9:25 by Piao David
2 Answers
You say "What I've done so far is to get all text objects and image objects and locations." but you don't go into detail how you do so. I assume you use a matching IRenderListener implementation.
But IRenderListener, as you found out yourself, only extracts images and texts. The main missing objects are paths and their usages.
To extract them, too, you should implement IExtRenderListener, which extends IRenderListener but also retrieves information about paths. To understand the callback methods, please first be aware how path related instructions work in PDFs:
First there are instructions for building the actual path; these instructions essentially
- move to some position,
- add a line to some position from the previous position,
- add a Bézier curve to some position from the previous position using some control points, or
- add an upright rectangle at some position using some width and height information.
Then there is an optional instruction to intersect the current clip path with the generated path.
Finally, there is a drawing instruction for any combination of filling the inside of the path and stroking along the path, i.e. for doing both, either one, or neither one.
This corresponds to the callbacks you receive in your IExtRenderListener implementation:
/**
* Called when the current path is being modified. E.g. new segment is being added,
* new subpath is being started etc.
*
* @param renderInfo Contains information about the path segment being added to the current path.
*/
void ModifyPath(PathConstructionRenderInfo renderInfo);
is called once or more often to build the actual path, PathConstructionRenderInfo containing the actual instruction type in its Operation property (compare to the PathConstructionRenderInfo constant members MOVETO, LINETO, etc. to determine the operation type) and the required coordinates / dimensions in its SegmentData property. The Ctm property additionally returns the affine transformation that currently is set to be applied to all drawing operations.
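Since table borders and cell backgrounds are usually drawn as rectangles, it may help to see how rectangle data and a transformation combine into page coordinates. The following is only a sketch under assumptions: PathGeometry is a hypothetical helper that is not part of iTextSharp, and it takes the transformation as six plain numbers [a b c d e f] in the usual PDF order. With iTextSharp those values would have to be read from the Matrix returned by Ctm (as far as I know through its I11, I12, I21, I22, I31 and I32 entries).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp: maps path coordinates to page space.
public static class PathGeometry
{
    // Applies a PDF affine matrix [a b c d e f] to a point:
    // x' = a*x + c*y + e, y' = b*x + d*y + f.
    public static float[] Transform(float[] m, float x, float y)
    {
        return new[] { m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5] };
    }

    // Expands rectangle data (x, y, width, height) into its four corners.
    // Width and height are not points themselves, so the corners are built
    // first and each corner is transformed afterwards. The order follows the
    // PDF rectangle operator: (x, y), (x+w, y), (x+w, y+h), (x, y+h).
    public static List<float[]> RectangleCorners(IList<float> segmentData, float[] m)
    {
        float x = segmentData[0], y = segmentData[1];
        float w = segmentData[2], h = segmentData[3];
        return new List<float[]>
        {
            Transform(m, x, y),
            Transform(m, x + w, y),
            Transform(m, x + w, y + h),
            Transform(m, x, y + h)
        };
    }
}
```

For example, a rectangle with data (1, 1, 3, 2) under a transformation that scales by 2 and moves by (10, 20), i.e. [2 0 0 2 10 20], ends up with the corners (12, 22), (18, 22), (18, 26) and (12, 26).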
Then
/**
* Called when the current path should be set as a new clipping path.
*
* @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
*/
void ClipPath(int rule);
is called if the current clip path shall be intersected with the constructed path.
Finally
/**
* Called when the current path should be rendered.
*
* @param renderInfo Contains information about the current path which should be rendered.
* @return The path which can be used as a new clipping path.
*/
Path RenderPath(PathPaintingRenderInfo renderInfo);
is called, PathPaintingRenderInfo containing the drawing operation in its Operation property (any combination of the PathPaintingRenderInfo constants STROKE and FILL), the rule for determining what "inside the path" means in its Rule property (NONZERO_WINDING_RULE or EVEN_ODD_RULE), and some other drawing details in the Ctm, LineWidth, LineCapStyle, LineJoinStyle, MiterLimit, and LineDashPattern properties.
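Put together, a listener that only cares about vector graphics could look roughly like the sketch below. It assumes iTextSharp 5.5.x; the class name PathCollector and the text format of the collected entries are made up for illustration. It ignores text and images, records the construction steps of the current path, and keeps them once the path is actually stroked or filled. Returning null from RenderPath is an assumption that should be fine when the result is not needed as a clip path.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: collects painted paths, ignores text and images.
internal class PathCollector : IExtRenderListener
{
    // One entry per painted path, e.g. "stroke: rect(0 0 10 5)".
    public readonly List<string> PaintedPaths = new List<string>();

    // Construction steps of the path currently being built.
    private readonly List<string> _currentPath = new List<string>();

    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        string name;
        switch (renderInfo.Operation)
        {
            case PathConstructionRenderInfo.MOVETO: name = "moveTo"; break;
            case PathConstructionRenderInfo.LINETO: name = "lineTo"; break;
            case PathConstructionRenderInfo.RECT: name = "rect"; break;
            case PathConstructionRenderInfo.CLOSE: name = "close"; break;
            default: name = "curve"; break; // the Bézier curve variants
        }
        // Closing a subpath carries no coordinates, so the data may be null.
        var data = renderInfo.SegmentData == null
            ? ""
            : string.Join(" ", renderInfo.SegmentData);
        _currentPath.Add(name + "(" + data + ")");
    }

    public void ClipPath(int rule)
    {
        // Clipping is ignored in this sketch.
    }

    public Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        bool stroke = (renderInfo.Operation & PathPaintingRenderInfo.STROKE) != 0;
        bool fill = (renderInfo.Operation & PathPaintingRenderInfo.FILL) != 0;
        if (stroke || fill)
        {
            string kind = stroke && fill ? "fill+stroke" : (stroke ? "stroke" : "fill");
            PaintedPaths.Add(kind + ": " + string.Join(" ", _currentPath));
        }
        // The path is finished after painting, whether it was drawn or not.
        _currentPath.Clear();
        return null;
    }

    // Text and images are handled elsewhere, so these callbacks stay empty.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

Such a listener can be passed to PdfReaderContentParser.ProcessContent just like an IRenderListener. The coordinates it reports are still in the coordinate system in effect when the path was built, so the Ctm of each callback has to be applied before drawing anything on a slide.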
Thanks a lot, @mkl! I think this also answers another question of mine. Please have a look and post this there as an answer. I would really appreciate any comments on my approach in that question. Thanks again. Link: stackoverflow.com/questions/54059341/…
– Piao David
Jan 7 at 12:20
Try implementing IRenderListener:

    using System;
    using System.IO;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    internal class ImageExtractor : IRenderListener
    {
        private int _currentPage = 1;
        private int _imageCount = 0;
        private readonly string _outputFilePrefix;
        private readonly string _outputFolder;
        private readonly bool _overwriteExistingFiles;

        private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
        {
            _outputFilePrefix = outputFilePrefix;
            _outputFolder = outputFolder;
            _overwriteExistingFiles = overwriteExistingFiles;
        }

        /// <summary>
        /// Extract all images from a PDF file
        /// </summary>
        /// <param name="pdfPath">Full path and file name of PDF file</param>
        /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
        /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
        /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
        /// <returns>Count of number of images extracted.</returns>
        public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
        {
            // Handle setting of any default values
            outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
            outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;
            var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);
            using (var pdfReader = new PdfReader(pdfPath))
            {
                if (pdfReader.IsEncrypted())
                    throw new ApplicationException(pdfPath + " is encrypted.");
                var pdfParser = new PdfReaderContentParser(pdfReader);
                // Feed every page through this listener
                while (instance._currentPage <= pdfReader.NumberOfPages)
                {
                    pdfParser.ProcessContent(instance._currentPage, instance);
                    instance._currentPage++;
                }
            }
            return instance._imageCount;
        }

        #region Implementation of IRenderListener
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderText(TextRenderInfo renderInfo) { }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
            var imageObject = renderInfo.GetImage();
            // One file per image: the running count keeps the names unique and
            // the extension is the type reported by GetFileType() (e.g. jpg, png)
            var imageFileName = _outputFilePrefix + _imageCount + "." + imageObject.GetFileType();
            var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);
            if (_overwriteExistingFiles || !File.Exists(imagePath))
            {
                var imageRawBytes = imageObject.GetImageAsBytes();
                // Create a new file (or overwrite the existing one)
                File.WriteAllBytes(imagePath, imageRawBytes);
            }
            _imageCount++;
        }
        #endregion // Implementation of IRenderListener
    }
Yes, I already tried IRenderListener. This method only extracts images and texts. It does not return anything about tables. There's no table-related function.
– Piao David
Jan 2 at 10:26
The IExtRenderListener answer was posted by mkl, answered Jan 7 at 12:04.
Thanks a lot, @mkl! I think this will be an answer to my another question. Please check and share this link as an answer. I would really appreciate your any comments on my thought in that question. Thanks again. LINK:stackoverflow.com/questions/54059341/…
– Piao David
Jan 7 at 12:20
Try implementing IRenderListener:
using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageExtractor : IRenderListener
{
    private int _currentPage = 1;
    private int _imageCount = 0;
    private readonly string _outputFilePrefix;
    private readonly string _outputFolder;
    private readonly bool _overwriteExistingFiles;

    private ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
    }

    /// <summary>
    /// Extract all images from a PDF file
    /// </summary>
    /// <param name="pdfPath">Full path and file name of PDF file</param>
    /// <param name="outputFilePrefix">Basic name of exported files. If null then uses same name as PDF file.</param>
    /// <param name="outputFolder">Where to save images. If null or empty then uses same folder as PDF file.</param>
    /// <param name="overwriteExistingFiles">True to overwrite existing image files, false to skip past them</param>
    /// <returns>Count of number of images extracted.</returns>
    public static int ExtractImagesFromFile(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;
        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles);
        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");
            var pdfParser = new PdfReaderContentParser(pdfReader);
            // Feed every page to the same listener instance
            while (instance._currentPage <= pdfReader.NumberOfPages)
            {
                pdfParser.ProcessContent(instance._currentPage, instance);
                instance._currentPage++;
            }
        }
        return instance._imageCount;
    }

    #region Implementation of IRenderListener
    // Text callbacks are not needed for image extraction
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        var imageObject = renderInfo.GetImage();
        // One file per image: number the files and append the type reported by iTextSharp (for example jpg or png)
        var imageFileName = _outputFilePrefix + "_" + _imageCount + "." + imageObject.GetFileType();
        var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);
        if (_overwriteExistingFiles || !File.Exists(imagePath))
        {
            var imageRawBytes = imageObject.GetImageAsBytes();
            // Create (or overwrite) the image file
            File.WriteAllBytes(imagePath, imageRawBytes);
        }
        // Count every image, even if writing was skipped, so the numbering stays stable
        _imageCount++;
    }
    #endregion // Implementation of IRenderListener
}
answered Jan 2 at 10:08
Amine
Yes, I already tried IRenderListener. It only extracts images and text. It does not return anything about tables; there is no table-related callback.
– Piao David
Jan 2 at 10:26
There is nothing like a table object in a pdf (unless it's properly tagged, and even then it's merely a logical table object, not a graphical one), there only are chunks of text (or whatever table content you see) and probably some graphical objects like lines or colored rectangles. Thus, it is unclear what you want.
– mkl
Jan 2 at 11:12
mkl, thanks for your reply. Hope I can get help from you again on this question. I agree that there should be no table objects but it's interesting that when I get all images I can't see ones for tables. I used IRenderListener. Looking forward to your answer.
– Piao David
Jan 2 at 12:52
Implement IExtRenderListener, which extends IRenderListener but has additional callbacks for vector graphics related instructions. Most likely these additional callbacks will be invoked for the lines or colored rectangles structuring your table.
– mkl
Jan 2 at 17:45
Thanks a lot, mkl. I tried IExtRenderListener but have no idea how to use Path. Basically, what I need to do is draw all objects on the PPTX slide. I'm afraid Path includes all texts and images too. Alternatively, I'm thinking of removing all text objects from the PDF to get a temporary PDF. Then I could render the whole page (with text objects removed) as an image and use it as a background. Do you have any ideas on how to implement this, that is, removing the text objects and creating a new PDF without text? Thanks in advance.
– Piao David
Jan 3 at 3:39
@mkl, I'm still struggling. Looking forward to your answer
– Piao David
Jan 4 at 3:02