Is there a simple way to identify if a pdf is scanned?
up vote
6
down vote
favorite
I have thousand of documents and some of them are scanned. So I need a script to test all pdf files that belong to a directory. Is there a simple way to do that?
- Most pdfs are reports. Thus they have a lot of text.
They are very different, but the scanned ones as mentioned below one can find some text due to a precarious ocr process coupled to the scan.
- NotScanned
- Scanned1
- Scanned2
The proposal due to Sudodus in the comments below seems to be very interesting. Look at the difference between a scanned to a not scanned-pdf:
Scanned:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
Not Scanned:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
The number of images per page are much bigger (about one per page)!
command-line pdf
|
show 6 more comments
up vote
6
down vote
favorite
I have thousand of documents and some of them are scanned. So I need a script to test all pdf files that belong to a directory. Is there a simple way to do that?
- Most pdfs are reports. Thus they have a lot of text.
They are very different, but the scanned ones as mentioned below one can find some text due to a precarious ocr process coupled to the scan.
- NotScanned
- Scanned1
- Scanned2
The proposal due to Sudodus in the comments below seems to be very interesting. Look at the difference between a scanned to a not scanned-pdf:
Scanned:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
Not Scanned:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
The number of images per page are much bigger (about one per page)!
command-line pdf
7
Do you mean whether they're text or images?
– DK Bose
yesterday
8
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday
3
@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday
1
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday
1
If apdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
, which can be found with the command linegrep --color -a 'Image' filename.pdf
. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
yesterday
|
show 6 more comments
up vote
6
down vote
favorite
up vote
6
down vote
favorite
I have thousand of documents and some of them are scanned. So I need a script to test all pdf files that belong to a directory. Is there a simple way to do that?
- Most pdfs are reports. Thus they have a lot of text.
They are very different, but the scanned ones as mentioned below one can find some text due to a precarious ocr process coupled to the scan.
- NotScanned
- Scanned1
- Scanned2
The proposal due to Sudodus in the comments below seems to be very interesting. Look at the difference between a scanned to a not scanned-pdf:
Scanned:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
Not Scanned:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
The number of images per page are much bigger (about one per page)!
command-line pdf
I have thousand of documents and some of them are scanned. So I need a script to test all pdf files that belong to a directory. Is there a simple way to do that?
- Most pdfs are reports. Thus they have a lot of text.
They are very different, but the scanned ones as mentioned below one can find some text due to a precarious ocr process coupled to the scan.
- NotScanned
- Scanned1
- Scanned2
The proposal due to Sudodus in the comments below seems to be very interesting. Look at the difference between a scanned to a not scanned-pdf:
Scanned:
grep --color -a 'Image' AR-G1002.pdf
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream
<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream
<rdf:li xml:lang="x-default">Image</rdf:li>
<rdf:li xml:lang="x-default">Image</rdf:li>
Not Scanned:
grep --color -a 'Image' AR-G1003.pdf
<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream
<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>
<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>
<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>
<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>
The number of images per page are much bigger (about one per page)!
command-line pdf
command-line pdf
edited 21 hours ago
muru
133k19282480
133k19282480
asked yesterday
DanielTheRocketMan
3321314
3321314
7
Do you mean whether they're text or images?
– DK Bose
yesterday
8
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday
3
@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday
1
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday
1
If apdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
, which can be found with the command linegrep --color -a 'Image' filename.pdf
. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
yesterday
|
show 6 more comments
7
Do you mean whether they're text or images?
– DK Bose
yesterday
8
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday
3
@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday
1
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday
1
If apdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
, which can be found with the command linegrep --color -a 'Image' filename.pdf
. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
yesterday
7
7
Do you mean whether they're text or images?
– DK Bose
yesterday
Do you mean whether they're text or images?
– DK Bose
yesterday
8
8
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday
3
3
@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday
@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday
1
1
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday
1
1
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/
, which can be found with the command line grep --color -a 'Image' filename.pdf
. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).– sudodus
yesterday
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/
, which can be found with the command line grep --color -a 'Image' filename.pdf
. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).– sudodus
yesterday
|
show 6 more comments
5 Answers
5
active
oldest
votes
up vote
1
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
22 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago
add a comment |
up vote
6
down vote
- Put all the .pdf files in one folder.
- No .txt file in that folder.
- In terminal change directory to that folder with
cd <path to dir>
- Make one more directory for non scanned files. Example:
mkdir ./x
for file in *.pdf; do
if [ $(pdftotext "$file")"x" == "x" ] ; then mv "$file" ./x; fi
rm *.txt
done
All the pdf scanned files will remain in the folder and other files will move to another folder.
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
8
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output offile pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.
– Elder Geek
yesterday
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output offile
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday
|
show 5 more comments
up vote
1
down vote
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta
and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings
will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
New contributor
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
add a comment |
up vote
1
down vote
If this is more about actually detecting if PDF was created by scanning rather than pdf has images instead of text then you might need to dig into the metadata of the file, not just content.
In general, for the files I could find on my computer and your test files, following is true:
- Scanned files have less than 1000chars/page vs. non scanned ones who always have more than 1000chars/page
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software
- PDFs with "Microsoft Word" as creator are likely to not be scanned, as they are word exports. But someone could scan to word, then export to PDF - some people have very strange workflow.
I'm using Windows at the moment, so I used node.js
for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };
(async () => {
const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });
for (let i = 0, l = pdfs.length; i < l; ++i) {
const pdffilename = pdfs[i];
try {
debug("nnFILE: ", pdffilename);
const buffer = await fs.readFile(pdffilename);
const data = await pdf_parse(buffer);
if (!data.info)
data.indo = {};
if (!data.metadata) {
data.metadata = {
_metadata: {}
};
}
// PDF info
debug(data.info);
// PDF metadata
debug(data.metadata);
// text length
const textLen = data.text ? data.text.length : 0;
const textPerPage = textLen / (data.numpages);
debug("Text length: ", textLen);
debug("Chars per page: ", textLen / data.numpages);
// PDF.js version
// check https://mozilla.github.io/pdf.js/getting_started/
debug(data.version);
if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
console.log(path.resolve(".", pdffilename));
}
}
catch (e) {
if (strict && !debug) {
console.error("Failed to evaluate " + item);
}
{
debug("Failed to evaluate " + item);
debug(e.stack);
}
if (strict) {
process.exit(1);
}
}
}
})();
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity or return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;
function evalScanned(pdfdata, textLen, textPerPage) {
if (textPerPage < 300 && pdfdata.numpages>1) {
// really low number, definitelly not text pdf
return IS_SCANNED;
}
// definitelly has enough text
// might be scanned but OCRed
// we return this if no
// suspition of scanning is found
let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
// this is always scanned, canon is brand name
return IS_SCANNED;
}
return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned show PDFs thought to be scanned (otherwise shows not scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error
This example is not considered finished solution, but with the debug
flag, you get some insight into meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:xxxxpdf>node detect_scanned.js scanned
D:xxxxpdfAR-G1002-scanned.pdf
D:xxxxpdfAR-G1002_scanned.pdf
D:xxxxpdfBR-L1411-3-scanned.pdf
D:xxxxpdfWHO_TRS_696-scanned.pdf
D:xxxxpdf>node detect_scanned.js
D:xxxxpdfAR-G1003-not-scanned.pdf
D:xxxxpdfASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:xxxxpdfMULTIMODE ABSORBER-not-scanned.pdf
D:xxxxpdfReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs, it will always have one full path per line.
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
add a comment |
up vote
0
down vote
2 ways I can think of:
Using select text tool: if you are using a scanned PDF the texts can not be selected, rather a box will appear. You can use this fact to create the script. I know in C++ QT there is a way, not sure in Linux though.
Search for word in file: In a non-scanned PDF your search will work, however not in scanned file. You just need to find some words common to all PDFs or I would rather say search for letter 'e' in all the PDFs. It has the highest frequency distribution so chances are you will find it in all the documents which have text (Unless its gadsby)
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools
1
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
1
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
1
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
add a comment |
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
22 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago
add a comment |
up vote
1
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
22 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
Shellscript
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
.In the same way you can search for the string
/Text
to tell if a pdf file contains text (not scanned).
I made the shellscript pdf-text-or-image
, and it might work in most cases with your files. The shellscript looks for the text strings /Image/
and /Text
in the pdf
files.
#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
grep -aq '/Image/' "$file"
if [ $? -eq 0 ]
then
image=true
else
image=false
fi
grep -aq '/Text' "$file"
if [ $? -eq 0 ]
then
text=true
else
text=false
fi
if $image && $text
then
mv "$file" "s-and-t"
elif $image
then
mv "$file" "scanned"
elif $text
then
mv "$file" "text"
else
echo "$file undecided"
fi
done
Make the shellscript executable,
chmod ugo+x pdf-text-or-image
Change directory to where you have the pdf
files and run the shellscript.
Identified files are moved to the following subdirectories
scanned
text
s-and-t
(for documents with both [scanned?] images and text content)
Unidentified file objects, 'UFOs', remain in the current directory.
Test
I tested the shellscript with two of your files, AR-G1002.pdf
and AR-G1003.pdf
, and with some own pdf
files (that I have created using Libre Office Impress).
$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t mkUSB-quick-start-manual-11.pdf mkUSB-quick-start-manual-nox-11.pdf
scanned mkUSB-quick-start-manual-12-0.pdf mkUSB-quick-start-manual-nox.pdf
text mkUSB-quick-start-manual-12.pdf mkUSB-quick-start-manual.pdf
AR-G1002.pdf mkUSB-quick-start-manual-74.pdf OBI-quick-start-manual.pdf
AR-G1003.pdf mkUSB-quick-start-manual-75.pdf oem.pdf
DescriptionoftheOneButtonInstaller.pdf mkUSB-quick-start-manual-8.pdf pdf-text-or-image
GrowIt.pdf mkUSB-quick-start-manual-9.pdf pdf-text-or-image0
list-files.pdf mkUSB-quick-start-manual-bas.pdf README.pdf
Is it OK to use this shellscript in this directory? (y/N) y
$ ls -1 *
pdf-text-or-image
pdf-text-or-image0
s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf
scanned:
AR-G1002.pdf
text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf
Let us hope that
- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images
edited 10 hours ago
answered yesterday
sudodus
21.2k32770
21.2k32770
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
22 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago
add a comment |
instead of redirecting to /dev/null you can just usegrep -q
– phuclv
22 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, becausegrep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago
instead of redirecting to /dev/null you can just use
grep -q
– phuclv
22 hours ago
instead of redirecting to /dev/null you can just use
grep -q
– phuclv
22 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago
1
1
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because
grep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).– sudodus
10 hours ago
@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because
grep -q
exits immediately with zero status if any match is found (instead of seaching through the whole files).– sudodus
10 hours ago
add a comment |
up vote
6
down vote
- Put all the .pdf files in one folder.
- No .txt file in that folder.
- In terminal change directory to that folder with
cd <path to dir>
- Make one more directory for non scanned files. Example:
mkdir ./x
for file in *.pdf; do
if [ $(pdftotext "$file")"x" == "x" ] ; then mv "$file" ./x; fi
rm *.txt
done
All the pdf scanned files will remain in the folder and other files will move to another folder.
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
8
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output offile pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.
– Elder Geek
yesterday
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output offile
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday
|
show 5 more comments
up vote
6
down vote
- Put all the .pdf files in one folder.
- No .txt file in that folder.
- In terminal change directory to that folder with
cd <path to dir>
- Make one more directory for non scanned files. Example:
mkdir ./x
for file in *.pdf; do
if [ $(pdftotext "$file")"x" == "x" ] ; then mv "$file" ./x; fi
rm *.txt
done
All the pdf scanned files will remain in the folder and other files will move to another folder.
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
8
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output offile pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.
– Elder Geek
yesterday
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output offile
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday
|
show 5 more comments
up vote
6
down vote
up vote
6
down vote
- Put all the .pdf files in one folder.
- No .txt file in that folder.
- In terminal change directory to that folder with
cd <path to dir>
- Make one more directory for non scanned files. Example:
mkdir ./x
for file in *.pdf; do
if [ $(pdftotext "$file")"x" == "x" ] ; then mv "$file" ./x; fi
rm *.txt
done
All the pdf scanned files will remain in the folder and other files will move to another folder.
- Put all the .pdf files in one folder.
- No .txt file in that folder.
- In terminal change directory to that folder with
cd <path to dir>
- Make one more directory for non scanned files. Example:
mkdir ./x
for file in *.pdf; do
if [ $(pdftotext "$file")"x" == "x" ] ; then mv "$file" ./x; fi
rm *.txt
done
All the pdf scanned files will remain in the folder and other files will move to another folder.
edited yesterday
dessert
21k55896
21k55896
answered yesterday
Hobbyist
979617
979617
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
8
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output offile pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.
– Elder Geek
yesterday
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output offile
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday
|
show 5 more comments
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
8
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output offile pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.
– Elder Geek
yesterday
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output offile
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday
8
8
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
Scanned PDFs often always contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
2
2
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday
1
1
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of
file pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.– Elder Geek
yesterday
@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of
file pdf-filename.pdf
will produce a version number. I was unable to search for specific text in BR-L1411-3.pdf BR-L1411-3.pdf: PDF document, version 1.3 but was able to search for text in both of the other files you provided, which are version 1.5 and 1.6 and get one or more matches. I used PDF XChange viewer to search these files but had similar results with evince. the version 1.3 document matched nothing.– Elder Geek
yesterday
1
1
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output of
file
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.– Elder Geek
yesterday
@DanielTheRocketMan If that's the case you might find sorting the documents by version using the output of
file
helpful in completing your project. Although I as it seems others are still unclear on exactly what you are attempting to accomplish.– Elder Geek
yesterday
|
show 5 more comments
up vote
1
down vote
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta
and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings
will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
New contributor
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
add a comment |
up vote
1
down vote
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta
and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings
will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
New contributor
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
add a comment |
up vote
1
down vote
up vote
1
down vote
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta
and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings
will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
New contributor
Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta
and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings
will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.
I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
New contributor
New contributor
answered yesterday
ichabod
111
111
New contributor
New contributor
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
add a comment |
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
2
2
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
I am also trying to use python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents that you cannot select text presents some text when it is converted to txt. For instance, I am using pdf miner in Python and I can find some text in the conversion even for pdfs that select tool does not work.
– DanielTheRocketMan
yesterday
add a comment |
up vote
1
down vote
If this is more about actually detecting if PDF was created by scanning rather than pdf has images instead of text then you might need to dig into the metadata of the file, not just content.
In general, for the files I could find on my computer and your test files, following is true:
- Scanned files have less than 1000chars/page vs. non scanned ones who always have more than 1000chars/page
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software
- PDFs with "Microsoft Word" as creator are likely to not be scanned, as they are word exports. But someone could scan to word, then export to PDF - some people have very strange workflow.
I'm using Windows at the moment, so I used node.js
for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };
(async () => {
const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });
for (let i = 0, l = pdfs.length; i < l; ++i) {
const pdffilename = pdfs[i];
try {
debug("nnFILE: ", pdffilename);
const buffer = await fs.readFile(pdffilename);
const data = await pdf_parse(buffer);
if (!data.info)
data.indo = {};
if (!data.metadata) {
data.metadata = {
_metadata: {}
};
}
// PDF info
debug(data.info);
// PDF metadata
debug(data.metadata);
// text length
const textLen = data.text ? data.text.length : 0;
const textPerPage = textLen / (data.numpages);
debug("Text length: ", textLen);
debug("Chars per page: ", textLen / data.numpages);
// PDF.js version
// check https://mozilla.github.io/pdf.js/getting_started/
debug(data.version);
if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
console.log(path.resolve(".", pdffilename));
}
}
catch (e) {
if (strict && !debug) {
console.error("Failed to evaluate " + item);
}
{
debug("Failed to evaluate " + item);
debug(e.stack);
}
if (strict) {
process.exit(1);
}
}
}
})();
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity or return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;
function evalScanned(pdfdata, textLen, textPerPage) {
if (textPerPage < 300 && pdfdata.numpages>1) {
// really low number, definitelly not text pdf
return IS_SCANNED;
}
// definitelly has enough text
// might be scanned but OCRed
// we return this if no
// suspition of scanning is found
let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
// this is always scanned, canon is brand name
return IS_SCANNED;
}
return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned show PDFs thought to be scanned (otherwise shows not scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error
This example is not considered finished solution, but with the debug
flag, you get some insight into meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:xxxxpdf>node detect_scanned.js scanned
D:xxxxpdfAR-G1002-scanned.pdf
D:xxxxpdfAR-G1002_scanned.pdf
D:xxxxpdfBR-L1411-3-scanned.pdf
D:xxxxpdfWHO_TRS_696-scanned.pdf
D:xxxxpdf>node detect_scanned.js
D:xxxxpdfAR-G1003-not-scanned.pdf
D:xxxxpdfASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:xxxxpdfMULTIMODE ABSORBER-not-scanned.pdf
D:xxxxpdfReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs, it will always have one full path per line.
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
add a comment |
up vote
1
down vote
If this is more about actually detecting if PDF was created by scanning rather than pdf has images instead of text then you might need to dig into the metadata of the file, not just content.
In general, for the files I could find on my computer and your test files, following is true:
- Scanned files have less than 1000chars/page vs. non scanned ones who always have more than 1000chars/page
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software
- PDFs with "Microsoft Word" as creator are likely to not be scanned, as they are word exports. But someone could scan to word, then export to PDF - some people have very strange workflow.
I'm using Windows at the moment, so I used node.js
for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };
(async () => {
const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });
for (let i = 0, l = pdfs.length; i < l; ++i) {
const pdffilename = pdfs[i];
try {
debug("nnFILE: ", pdffilename);
const buffer = await fs.readFile(pdffilename);
const data = await pdf_parse(buffer);
if (!data.info)
data.indo = {};
if (!data.metadata) {
data.metadata = {
_metadata: {}
};
}
// PDF info
debug(data.info);
// PDF metadata
debug(data.metadata);
// text length
const textLen = data.text ? data.text.length : 0;
const textPerPage = textLen / (data.numpages);
debug("Text length: ", textLen);
debug("Chars per page: ", textLen / data.numpages);
// PDF.js version
// check https://mozilla.github.io/pdf.js/getting_started/
debug(data.version);
if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
console.log(path.resolve(".", pdffilename));
}
}
catch (e) {
if (strict && !debug) {
console.error("Failed to evaluate " + item);
}
{
debug("Failed to evaluate " + item);
debug(e.stack);
}
if (strict) {
process.exit(1);
}
}
}
})();
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity or return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;
function evalScanned(pdfdata, textLen, textPerPage) {
if (textPerPage < 300 && pdfdata.numpages>1) {
// really low number, definitelly not text pdf
return IS_SCANNED;
}
// definitelly has enough text
// might be scanned but OCRed
// we return this if no
// suspition of scanning is found
let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
// this is always scanned, canon is brand name
return IS_SCANNED;
}
return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned show PDFs thought to be scanned (otherwise shows not scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error
This example is not considered finished solution, but with the debug
flag, you get some insight into meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:xxxxpdf>node detect_scanned.js scanned
D:xxxxpdfAR-G1002-scanned.pdf
D:xxxxpdfAR-G1002_scanned.pdf
D:xxxxpdfBR-L1411-3-scanned.pdf
D:xxxxpdfWHO_TRS_696-scanned.pdf
D:xxxxpdf>node detect_scanned.js
D:xxxxpdfAR-G1003-not-scanned.pdf
D:xxxxpdfASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:xxxxpdfMULTIMODE ABSORBER-not-scanned.pdf
D:xxxxpdfReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs, it will always have one full path per line.
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
add a comment |
up vote
1
down vote
up vote
1
down vote
If this is more about actually detecting if PDF was created by scanning rather than pdf has images instead of text then you might need to dig into the metadata of the file, not just content.
In general, for the files I could find on my computer and your test files, following is true:
- Scanned files have less than 1000chars/page vs. non scanned ones who always have more than 1000chars/page
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software
- PDFs with "Microsoft Word" as creator are likely to not be scanned, as they are word exports. But someone could scan to word, then export to PDF - some people have very strange workflow.
I'm using Windows at the moment, so I used node.js
for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };
(async () => {
const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });
for (let i = 0, l = pdfs.length; i < l; ++i) {
const pdffilename = pdfs[i];
try {
debug("nnFILE: ", pdffilename);
const buffer = await fs.readFile(pdffilename);
const data = await pdf_parse(buffer);
if (!data.info)
data.indo = {};
if (!data.metadata) {
data.metadata = {
_metadata: {}
};
}
// PDF info
debug(data.info);
// PDF metadata
debug(data.metadata);
// text length
const textLen = data.text ? data.text.length : 0;
const textPerPage = textLen / (data.numpages);
debug("Text length: ", textLen);
debug("Chars per page: ", textLen / data.numpages);
// PDF.js version
// check https://mozilla.github.io/pdf.js/getting_started/
debug(data.version);
if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
console.log(path.resolve(".", pdffilename));
}
}
catch (e) {
if (strict && !debug) {
console.error("Failed to evaluate " + item);
}
{
debug("Failed to evaluate " + item);
debug(e.stack);
}
if (strict) {
process.exit(1);
}
}
}
})();
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity or return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;
function evalScanned(pdfdata, textLen, textPerPage) {
if (textPerPage < 300 && pdfdata.numpages>1) {
// really low number, definitelly not text pdf
return IS_SCANNED;
}
// definitelly has enough text
// might be scanned but OCRed
// we return this if no
// suspition of scanning is found
let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
// this is always scanned, canon is brand name
return IS_SCANNED;
}
return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned show PDFs thought to be scanned (otherwise shows not scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error
This example is not considered finished solution, but with the debug
flag, you get some insight into meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:xxxxpdf>node detect_scanned.js scanned
D:xxxxpdfAR-G1002-scanned.pdf
D:xxxxpdfAR-G1002_scanned.pdf
D:xxxxpdfBR-L1411-3-scanned.pdf
D:xxxxpdfWHO_TRS_696-scanned.pdf
D:xxxxpdf>node detect_scanned.js
D:xxxxpdfAR-G1003-not-scanned.pdf
D:xxxxpdfASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:xxxxpdfMULTIMODE ABSORBER-not-scanned.pdf
D:xxxxpdfReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs, it will always have one full path per line.
If this is more about actually detecting if PDF was created by scanning rather than pdf has images instead of text then you might need to dig into the metadata of the file, not just content.
In general, for the files I could find on my computer and your test files, following is true:
- Scanned files have less than 1000chars/page vs. non scanned ones who always have more than 1000chars/page
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software
- PDFs with "Microsoft Word" as creator are likely to not be scanned, as they are word exports. But someone could scan to word, then export to PDF - some people have very strange workflow.
I'm using Windows at the moment, so I used node.js
for the following example:
const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;
const debug = DEBUG ? console.error : () => { };
(async () => {
const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });
for (let i = 0, l = pdfs.length; i < l; ++i) {
const pdffilename = pdfs[i];
try {
debug("nnFILE: ", pdffilename);
const buffer = await fs.readFile(pdffilename);
const data = await pdf_parse(buffer);
if (!data.info)
data.indo = {};
if (!data.metadata) {
data.metadata = {
_metadata: {}
};
}
// PDF info
debug(data.info);
// PDF metadata
debug(data.metadata);
// text length
const textLen = data.text ? data.text.length : 0;
const textPerPage = textLen / (data.numpages);
debug("Text length: ", textLen);
debug("Chars per page: ", textLen / data.numpages);
// PDF.js version
// check https://mozilla.github.io/pdf.js/getting_started/
debug(data.version);
if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
console.log(path.resolve(".", pdffilename));
}
}
catch (e) {
if (strict && !debug) {
console.error("Failed to evaluate " + item);
}
{
debug("Failed to evaluate " + item);
debug(e.stack);
}
if (strict) {
process.exit(1);
}
}
}
})();
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity or return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;
function evalScanned(pdfdata, textLen, textPerPage) {
if (textPerPage < 300 && pdfdata.numpages>1) {
// really low number, definitelly not text pdf
return IS_SCANNED;
}
// definitelly has enough text
// might be scanned but OCRed
// we return this if no
// suspition of scanning is found
let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
// this is always scanned, canon is brand name
return IS_SCANNED;
}
return implicitAssumption;
}
To run it, you need to have Node.js installed (should be a single command) and you also need to call:
npm install mz pdf-parse
Usage:
node howYouNamedIt.js [scanned] [debug] [strict]
- scanned show PDFs thought to be scanned (otherwise shows not scanned)
- debug shows the debug info such as metadata and error stack traces
- strict kills the program on first error
This example is not considered finished solution, but with the debug
flag, you get some insight into meta information of a file:
FILE: BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
IsAcroFormPresent: false,
IsXFAPresent: false,
Creator: 'Canon ',
Producer: ' ',
CreationDate: 'D:20131212150500-03'00'',
ModDate: 'D:20140709104225-03'00'' }
Metadata {
_metadata:
{ 'xmp:createdate': '2013-12-12T15:05-03:00',
'xmp:creatortool': 'Canon',
'xmp:modifydate': '2014-07-09T10:42:25-03:00',
'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
'pdf:producer': '',
'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
'dc:format': 'application/pdf' } }
Text length: 772
Chars per page: 2
1.10.100
D:webso-odpovedipdfBR-L1411-3-scanned.pdf
The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.
D:xxxxpdf>node detect_scanned.js scanned
D:xxxxpdfAR-G1002-scanned.pdf
D:xxxxpdfAR-G1002_scanned.pdf
D:xxxxpdfBR-L1411-3-scanned.pdf
D:xxxxpdfWHO_TRS_696-scanned.pdf
D:xxxxpdf>node detect_scanned.js
D:xxxxpdfAR-G1003-not-scanned.pdf
D:xxxxpdfASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:xxxxpdfMULTIMODE ABSORBER-not-scanned.pdf
D:xxxxpdfReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf
You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can pass the output of the program to other programs, it will always have one full path per line.
answered yesterday
Tomáš Zato
169113
169113
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
add a comment |
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago
add a comment |
up vote
0
down vote
2 ways I can think of:
Using select text tool: if you are using a scanned PDF the texts can not be selected, rather a box will appear. You can use this fact to create the script. I know in C++ QT there is a way, not sure in Linux though.
Search for word in file: In a non-scanned PDF your search will work, however not in scanned file. You just need to find some words common to all PDFs or I would rather say search for letter 'e' in all the PDFs. It has the highest frequency distribution so chances are you will find it in all the documents which have text (Unless its gadsby)
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools
1
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
1
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
1
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
add a comment |
up vote
0
down vote
2 ways I can think of:
Using select text tool: if you are using a scanned PDF the texts can not be selected, rather a box will appear. You can use this fact to create the script. I know in C++ QT there is a way, not sure in Linux though.
Search for word in file: In a non-scanned PDF your search will work, however not in scanned file. You just need to find some words common to all PDFs or I would rather say search for letter 'e' in all the PDFs. It has the highest frequency distribution so chances are you will find it in all the documents which have text (Unless its gadsby)
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools
1
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
1
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
1
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
add a comment |
up vote
0
down vote
up vote
0
down vote
2 ways I can think of:
Using select text tool: if you are using a scanned PDF the texts can not be selected, rather a box will appear. You can use this fact to create the script. I know in C++ QT there is a way, not sure in Linux though.
Search for word in file: In a non-scanned PDF your search will work, however not in scanned file. You just need to find some words common to all PDFs or I would rather say search for letter 'e' in all the PDFs. It has the highest frequency distribution so chances are you will find it in all the documents which have text (Unless its gadsby)
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools
2 ways I can think of:
Using select text tool: if you are using a scanned PDF the texts can not be selected, rather a box will appear. You can use this fact to create the script. I know in C++ QT there is a way, not sure in Linux though.
Search for word in file: In a non-scanned PDF your search will work, however not in scanned file. You just need to find some words common to all PDFs or I would rather say search for letter 'e' in all the PDFs. It has the highest frequency distribution so chances are you will find it in all the documents which have text (Unless its gadsby)
eg
grep -rnw '/path/to/pdf/' -e 'e'
Use any of the text processing tools
edited yesterday
phuclv
318224
318224
answered yesterday
swapedoc
416
416
1
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
1
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
1
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
add a comment |
1
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
1
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
1
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
1
1
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday
1
1
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday
1
1
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1094198%2fis-there-a-simple-way-to-identify-if-a-pdf-is-scanned%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
7
Do you mean whether they're text or images?
– DK Bose
yesterday
8
Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday
3
@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday
1
Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday
1
If a
pdf
file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string/Image/
, which can be found with the command linegrep --color -a 'Image' filename.pdf
. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).– sudodus
yesterday