Is there a simple way to identify if a pdf is scanned?

I have thousands of documents, and some of them are scanned. I need a script that tests every PDF file in a directory. Is there a simple way to do that?

Most of the PDFs are reports, so they contain a lot of text.

The files are very different from one another, but as mentioned in the comments below, even the scanned ones contain some text, because a poor-quality OCR step was coupled to the scan. Examples:
- NotScanned
- Scanned1
- Scanned2

The proposal by sudodus in the comments below looks very promising. Compare the output for a scanned PDF with the output for a non-scanned one:

Scanned:

grep --color -a 'Image' AR-G1002.pdf

<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 340615/Name/Obj13/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40452/Name/Obj18/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41680/Name/Obj23/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 41432/Name/Obj28/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59084/Name/Obj33/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 472681/Name/Obj38/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 469340/Name/Obj43/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 371863/Name/Obj48/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 344092/Name/Obj53/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 59416/Name/Obj58/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 48308/Name/Obj63/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 51564/Name/Obj68/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 63184/Name/Obj73/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 40824/Name/Obj78/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 23320/Name/Obj83/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 31504/Name/Obj93/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 18996/Name/Obj98/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 8/ColorSpace/DeviceRGB/Filter[/DCTDecode]/Height 2197/Length 292932/Name/Obj103/Subtype/Image/Type/XObject/Width 1698>>stream

<</BitsPerComponent 1/ColorSpace/DeviceGray/DecodeParms<</Columns 1698/K -1>>/Filter/CCITTFaxDecode/Height 2197/Length 27720/Name/Obj108/Subtype/Image/Type/XObject/Width 1698>>stream

               <rdf:li xml:lang="x-default">Image</rdf:li>

               <rdf:li xml:lang="x-default">Image</rdf:li>

Not Scanned:

grep --color -a 'Image' AR-G1003.pdf

<</Lang(en-US)/MarkInfo<</Marked true>>/Metadata 167 0 R/Pages 2 0 R/StructTreeR<</Contents 4 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F4 11 0 R/F5 13 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/StructParents 0/Tabs/S/Type/<</Filter/FlateDecode/Length 5463>>stream

<</BaseFont/Times#20New#20Roman,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontD<</Ascent 891/AvgWidth 427/CapHeight 677/Descent -216/Flags 32/FontBBox[-558 -216 2000 677]/FontName/Times#20New#20Roman,Bold/FontWeight 700/ItalicAngle 0/Leadi<</BaseFont/Times#20New#20Roman/Encoding/WinAnsiEncoding/FirstChar 32/FontDescri<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontName/Times#20New#20Roman/FontWeight 400/ItalicAngle 0/Leading 42<</BaseFont/Arial,Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 10 0<</Ascent 905/AvgWidth 479/CapHeight 728/Descent -210/Flags 32/FontBBox[-628 -210 2000 728]/FontName/Arial,Bold/FontWeight 700/ItalicAngle 0/Leading 33/MaxWidth<</BaseFont/Times#20New#20Roman,Italic/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 12 0 R/LastChar 118/Name/F4/Subtype/TrueType/Type/Font/Widths 164 0 <</Ascent 891/AvgWidth 402/CapHeight 694/Descent -216/Flags 32/FontBBox[-498 -216 1333 694]/FontName/Times#20New#20Roman,Italic/FontWeight 400/ItalicAngle -16.4<</BaseFont/Arial/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 14 0 R/La<</Ascent 905/AvgWidth 441/CapHeight 728/Descent -210/Flags 32/FontBBox[-665 -210 2000 728]/FontName/Arial/FontWeight 400/ItalicAngle 0/Leading 33/MaxWidth 2665<</Contents 16 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 7534>>streamarents 1/Tabs/S/Type/Page>>

<</Contents 18 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R>>/ProcSet[<</Filter/FlateDecode/Length 6137>>streamarents 2/Tabs/S/Type/Page>>

<</Contents 20 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 7 0 R/F5 13 0 R/F6 21 0 R><</Filter/FlateDecode/Length 6533>>stream>>/StructParents 3/Tabs/S/Type/Page>>

<</BaseFont/Times#20New#20Roman/DescendantFonts 22 0 R/Encoding/Identity-H/Subty<</BaseFont/Times#20New#20Roman/CIDSystemInfo 24 0 R/CIDToGIDMap/Identity/DW 100<</Ascent 891/AvgWidth 401/CapHeight 693/Descent -216/Flags 32/FontBBox[-568 -216 2000 693]/FontFile2 160 0 R/FontName/Times#20New#20Roman/FontWeight 400/Italic<</Contents 27 0 R/Group<</CS/DeviceRGB/S/Transparency/Type/Group>>/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</ExtGState<</GS28 28 0 R/GS29 29 0 R>>/Font<</F1 5 0 R/F2 7 0 R/F3 9 0 R/F5 13 0 R/F6 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC<</Filter/FlateDecode/Length 5369>>streamge>>

The scanned file contains far more images: about one per page!
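A minimal sketch of how this observation could be turned into a test, assuming pdfinfo from poppler-utils is installed to supply the page count. The function names and the threshold of one image per page are illustrative assumptions, not something established above, and a text report with many figures would be misclassified. The pattern matches /Subtype/Image rather than a bare Image, because the non-scanned output also contains ProcSet entries such as /ImageB and /ImageC.

```shell
#!/bin/sh
# Sketch: flag a PDF as "scanned" when it holds at least one image per page.
# Assumptions: pdfinfo (poppler-utils) is installed, and the one-image-per-page
# threshold is a guess that should be tuned on real files.

# Number of image XObject dictionaries found in the raw file.
# "|| true" keeps a no-match exit status from grep out of the pipeline.
count_images() {
    { LC_ALL=C grep -a -o '/Subtype[[:space:]]*/Image' "$1" || true; } |
        wc -l | tr -d '[:space:]'
}

# Page count as reported by pdfinfo (a line such as "Pages:   17").
count_pages() {
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ { print $2 }'
}

# classify IMAGES PAGES -> prints "scanned", "not-scanned" or "unknown".
classify() {
    images=$1
    pages=$2
    if [ -z "$pages" ] || [ "$pages" -le 0 ]; then
        echo unknown
    elif [ "$images" -ge "$pages" ]; then
        echo scanned
    else
        echo not-scanned
    fi
}

# Report every PDF in a directory (default: the current directory).
report_dir() {
    for f in "${1:-.}"/*.pdf; do
        [ -e "$f" ] || continue   # the glob stays literal when nothing matches
        printf '%s\t%s\n' \
            "$(classify "$(count_images "$f")" "$(count_pages "$f")")" "$f"
    done
}
```

After sourcing the file, something like report_dir /path/to/reports (a hypothetical path) would print one verdict per file, which can then be checked against a few known cases.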

edited 21 hours ago by muru

asked yesterday by DanielTheRocketMan

7

Do you mean whether they're text or images?
– DK Bose
yesterday

8

Why do you want to know whether a PDF file is scanned or not? How do you intend to use that information?
– sudodus
yesterday

3

@sudodus asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you distinguish between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday

1

Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday

1

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/, which can be found with the command line grep --color -a 'Image' filename.pdf. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
yesterday
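The grep test from this comment could be applied to a whole directory roughly as follows. This is only a sketch: the function names are made up for illustration, and the pattern is narrowed from a bare Image to /Subtype/Image, because in the output shown in the question the bare word also matches ProcSet entries in the text-only file.

```shell
#!/bin/sh
# Sketch: split the PDFs in a directory into "images" and "text-only".

# Succeeds when the raw file contains an image XObject dictionary.
has_images() {
    LC_ALL=C grep -a -q '/Subtype[[:space:]]*/Image' "$1"
}

# Print one line per PDF: "images" or "text-only", a tab, then the file name.
split_by_images() {
    for f in "${1:-.}"/*.pdf; do
        [ -e "$f" ] || continue   # the glob stays literal when nothing matches
        if has_images "$f"; then
            printf 'images\t%s\n' "$f"
        else
            printf 'text-only\t%s\n' "$f"
        fi
    done
}
```

As the comment notes, the "images" group would still mix fully scanned files with text documents that merely contain a logo or an illustration, so it narrows the candidates rather than settling the question.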

Tags: command-line, pdf

edited 21 hours ago

muru

133k19282480

asked yesterday

DanielTheRocketMan

3321314

edited 21 hours ago

muru

133k19282480

asked yesterday

DanielTheRocketMan

3321314

edited 21 hours ago

muru

133k19282480

edited 21 hours ago

muru

133k19282480

edited 21 hours ago

muru

133k19282480

asked yesterday

DanielTheRocketMan

3321314

asked yesterday

DanielTheRocketMan

3321314

asked yesterday

DanielTheRocketMan

3321314

7

Do you mean whether they're text or images?
– DK Bose
yesterday

8

Why do you want to know, if a pdf file is scanned or not? How do you intend to use that information?
– sudodus
yesterday

3

@sudodus Asks a very good question. For example, most scanned PDFs have their text available for selection, converted using OCR. Do you make a difference between such files and text files? Do you know the source of your PDFs?
– pipe
yesterday

1

Is there any difference in the metadata of scanned and not scanned documents? That would offer a very clean and easy way.
– dessert
yesterday

1

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/, which can be found with the command line grep --color -a 'Image' filename.pdf. This will separate files which contain only text from those containing images (full page images as well as text pages with small logos and medium-sized illustrating pictures).
– sudodus
yesterday

5 Answers

up vote
1
down vote

accepted

Shellscript

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.

In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).

I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.

#!/bin/bash

# show the directory content and ask for confirmation, because files will be moved
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
 exit
fi

mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"

for file in *.pdf
do
 # skip the literal pattern when no pdf file matches
 [ -e "$file" ] || continue

 # -a treats the binary pdf file as text, -q only sets the exit status
 if grep -aq '/Image/' "$file"
 then
  image=true
 else
  image=false
 fi
 if grep -aq '/Text' "$file"
 then
  text=true
 else
  text=false
 fi

 if $image && $text
 then
  mv "$file" "s-and-t"
 elif $image
 then
  mv "$file" "scanned"
 elif $text
 then
  mv "$file" "text"
 else
  echo "$file undecided"
 fi
done

Make the shellscript executable,

chmod ugo+x pdf-text-or-image

Change directory to where you have the pdf files and run the shellscript.

Identified files are moved to the following subdirectories:

- scanned
- text
- s-and-t (for documents with both [scanned?] images and text content)

Unidentified file objects, 'UFOs', remain in the current directory.
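
Before moving thousands of files it may be worth doing a dry run. The following is a minimal sketch that only prints the category each file would get; it uses the same marker strings /Image/ and /Text as the shellscript, and classify_pdf is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/bash
# Sketch: report the category of each pdf file without moving anything.
# classify_pdf is an illustrative helper name.

classify_pdf() {   # usage: classify_pdf FILE
    local image=false text=false
    grep -aq '/Image/' "$1" && image=true
    grep -aq '/Text' "$1" && text=true
    if $image && $text; then
        echo "s-and-t"
    elif $image; then
        echo "scanned"
    elif $text; then
        echo "text"
    else
        echo "undecided"
    fi
}

for file in *.pdf; do
    [ -e "$file" ] || continue      # no pdf files: skip the literal pattern
    printf '%s\t%s\n' "$(classify_pdf "$file")" "$file"
done
```

The output is one tab-separated line per file (category, then file name), so it can be sorted or counted with the usual tools before anything is moved.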

Test

I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some of my own pdf files (that I created using LibreOffice Impress).

$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t                                 mkUSB-quick-start-manual-11.pdf    mkUSB-quick-start-manual-nox-11.pdf
scanned                                 mkUSB-quick-start-manual-12-0.pdf  mkUSB-quick-start-manual-nox.pdf
text                                    mkUSB-quick-start-manual-12.pdf    mkUSB-quick-start-manual.pdf
AR-G1002.pdf                            mkUSB-quick-start-manual-74.pdf    OBI-quick-start-manual.pdf
AR-G1003.pdf                            mkUSB-quick-start-manual-75.pdf    oem.pdf
DescriptionoftheOneButtonInstaller.pdf  mkUSB-quick-start-manual-8.pdf     pdf-text-or-image
GrowIt.pdf                              mkUSB-quick-start-manual-9.pdf     pdf-text-or-image0
list-files.pdf                          mkUSB-quick-start-manual-bas.pdf   README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf

Let us hope that

- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images

edited 10 hours ago

answered yesterday

sudodus

21.2k32770

instead of redirecting to /dev/null you can just use grep -q
– phuclv
22 hours ago

1

@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of searching through the whole file).
– sudodus
10 hours ago

up vote
6
down vote

Put all the .pdf files in one folder.

In a terminal, change directory to that folder with cd <path to dir>

Make one more directory for non-scanned files. Example:

    mkdir ./x
    for file in *.pdf; do
        # "-" makes pdftotext write to standard output, so no .txt files are created;
        # tr removes the whitespace and form feeds that pdftotext emits even for empty pages
        if [ -n "$(pdftotext "$file" - 2>/dev/null | tr -d '[:space:]')" ]; then mv "$file" ./x; fi
    done

All the scanned pdf files (those without any extractable text) will remain in the folder and the other files will be moved to ./x.

edited yesterday

dessert

21k55896

answered yesterday

Hobbyist

979617

this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday

8

Scanned PDFs often also contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday

2

Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday

1

@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of file pdf-filename.pdf includes the PDF version number. I was unable to search for specific text in BR-L1411-3.pdf (BR-L1411-3.pdf: PDF document, version 1.3), but I was able to search for text in both of the other files you provided, which are versions 1.5 and 1.6, and got one or more matches. I used PDF XChange Viewer to search these files and had similar results with evince: the version 1.3 document matched nothing.
– Elder Geek
yesterday

1

@DanielTheRocketMan If that's the case, you might find it helpful to sort the documents by version using the output of file. That said, I, like others it seems, am still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday

up vote
1
down vote

Hobbyist offers a good solution if the document collection's scanned documents do not have text added with optical character recognition (OCR). If this is a possibility, you may want to do some scripting that reads the output of pdfinfo -meta and checks for the tool used to create the file, or employ a Python routine that uses one of the Python libraries to examine them. Searching for text with a tool like strings will be unreliable because PDF content can be compressed. And checking the creation tool is not failsafe, either, since PDF pages can be combined; I routinely combine PDF text documents with scanned images to keep things together.

I'm sorry that I am unable to offer specific suggestions. It's been a while since I poked at the PDF internal structure, but depending on how stringent your requirements are, you may want to know that it's kind of complicated. Good luck!
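
For what it is worth, the "check the creation tool" idea could be scripted roughly as follows. This is a hedged sketch under stated assumptions: it reads the Creator and Producer lines of plain pdfinfo output (poppler-utils) rather than the XML printed by pdfinfo -meta, the helper names are hypothetical, and the pattern list is a made-up starting point that has to be adapted to the scanners actually in use.

```shell
#!/bin/bash
# Sketch: read the Creator and Producer fields that pdfinfo prints.
# Assumes pdfinfo (poppler-utils); helper names and patterns are illustrative.

# Print the value of one "Key: value" field from pdfinfo output on stdin.
info_field() {   # usage: pdfinfo FILE | info_field Creator
    sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}

# Succeed if the tool name in $1 matches a known scanner pattern.
SCANNER_PATTERN='canon|scan'        # assumption: extend for your own devices
is_scanner_tool() {
    printf '%s\n' "$1" | grep -Eqi "$SCANNER_PATTERN"
}

# Report the creating tool of every pdf in the current directory.
report_creators() {
    local f info creator producer
    for f in *.pdf; do
        [ -e "$f" ] || continue
        info=$(pdfinfo "$f" 2>/dev/null)
        creator=$(printf '%s\n' "$info" | info_field Creator)
        producer=$(printf '%s\n' "$info" | info_field Producer)
        if is_scanner_tool "$creator $producer"; then
            echo "$f: scanner software ($creator)"
        else
            echo "$f: other tool (${creator:-unknown})"
        fi
    done
}
```

As noted above, this is not failsafe: a combined document keeps only one Creator entry, and many files have an empty one.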

answered yesterday

ichabod

111

New contributor

2

I am also trying to use Python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents in which you cannot select text yield some text when they are converted to txt. For instance, I am using pdfminer in Python and I can find some text in the conversion even for pdfs where the select tool does not work.
– DanielTheRocketMan
yesterday

up vote
1
down vote

If this is more about detecting whether a PDF was created by scanning, rather than whether it has images instead of text, then you might need to dig into the metadata of the file, not just its content.

In general, for the files I could find on my computer and your test files, the following is true:

- Scanned files have fewer than 1000 chars/page, whereas non-scanned ones always have more than 1000 chars/page.
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referencing Canon scanner software.
- PDFs with "Microsoft Word" as creator are likely not scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.

I'm using Windows at the moment, so I used node.js for the following example:

const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

// command-line flags
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;

const debug = DEBUG ? console.error : () => { };

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });

    for (let i = 0, l = pdfs.length; i < l; ++i) {
        const pdffilename = pdfs[i];
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);

            // make sure the fields read below exist
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }

            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / (data.numpages);
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);

            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text pdf
        return IS_SCANNED;
    }
    // definitely has enough text
    // might be scanned but OCRed
    // we return this if no
    // suspicion of scanning is found
    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, canon is brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}

To run it, you need to have Node.js installed (should be a single command) and you also need to call:

npm install mz pdf-parse

Usage:

node howYouNamedIt.js [scanned] [debug] [strict]

 - scanned show PDFs thought to be scanned (otherwise shows not scanned)
 - debug shows the debug info such as metadata and error stack traces
 - strict kills the program on first error

This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:

FILE:  BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
  IsAcroFormPresent: false,
  IsXFAPresent: false,
  Creator: 'Canon ',
  Producer: ' ',
  CreationDate: 'D:20131212150500-03'00'',
  ModDate: 'D:20140709104225-03'00'' }
Metadata {
  _metadata:
   { 'xmp:createdate': '2013-12-12T15:05-03:00',
     'xmp:creatortool': 'Canon',
     'xmp:modifydate': '2014-07-09T10:42:25-03:00',
     'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
     'pdf:producer': '',
     'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
     'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
     'dc:format': 'application/pdf' } }
Text length:  772
Chars per page:  2
1.10.100

D:\web\so-odpovedi\pdf\BR-L1411-3-scanned.pdf

The naive function that I wrote has 100% success on the documents that I could find on my computer (including your samples). I named the files based on what their status was before running the program, to make it possible to see if results are correct.

D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf

You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can also pass the output of the program to other programs; it always prints one full path per line.
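
Because the output is one full path per line, it can be consumed line by line. As a minimal sketch for a Linux shell (move_listed is a hypothetical helper, not part of the program above, and the destination directory name is only an example), the listed files could be moved into a separate directory like this:

```shell
#!/bin/bash
# Sketch: move every path read from stdin (one per line) into a directory.
# move_listed is an illustrative helper name.

move_listed() {   # usage: some-command | move_listed DEST_DIR
    local dest=$1 path
    mkdir -p -- "$dest" || return 1
    while IFS= read -r path; do
        path=${path%$'\r'}                     # tolerate CRLF line endings
        [ -f "$path" ] && mv -- "$path" "$dest"/
    done
    return 0
}

# Example (assumes the program was saved as detect_scanned.js):
#   node detect_scanned.js scanned | move_listed ./scanned
```

Reading with IFS= read -r keeps spaces and backslashes in file names intact, which matters for names such as "MULTIMODE ABSORBER-not-scanned.pdf".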

answered yesterday

Tomáš Zato

169113

Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago

up vote
0
down vote

Two ways I can think of:

Using the select text tool: in a scanned PDF the text cannot be selected; a box will appear instead. You can use this fact to create the script. I know there is a way in C++ Qt, but I am not sure about Linux.

Search for a word in the file: in a non-scanned PDF your search will work, but not in a scanned file. You just need to find some words common to all the PDFs, or I would rather say search for the letter 'e' in all the PDFs. It is the most frequent letter, so chances are you will find it in all the documents that have text (unless it's Gadsby).

grep -rnw '/path/to/pdf/' -e 'e'

Use any of the text processing tools
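
Since PDF content can be compressed, searching the raw file bytes may miss text that is actually there. A hedged sketch of the same idea applied to extracted text instead, assuming pdftotext from poppler-utils is installed (contains_letter and list_pdfs_with_text are hypothetical helper names):

```shell
#!/bin/bash
# Sketch: look for a letter in the extracted text instead of the raw pdf bytes.
# Assumes pdftotext (poppler-utils); helper names are illustrative.

# Succeed if the text on stdin contains the letter $1 (default: e).
contains_letter() {
    grep -qi -- "${1:-e}"
}

# Print the pdf files in the current directory whose extracted text has an 'e'.
list_pdfs_with_text() {
    local f
    for f in *.pdf; do
        [ -e "$f" ] || continue
        # "-" sends the extracted text to standard output
        if pdftotext "$f" - 2>/dev/null | contains_letter e; then
            echo "$f"
        fi
    done
}
```

Scanned files that carry an OCR text layer will still be listed, so this only separates files with no extractable text at all from the rest.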

edited yesterday

phuclv

318224

answered yesterday

swapedoc

416

1

a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday

@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday

1

@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday

1

I think the OP needs to rethink/rephrase the definition of scanned in that case, or stop using Acrobat X, which takes a scanned copy and treats it as OCR text rather than as an image
– swapedoc
yesterday

add a comment |

Your Answer

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "89"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2faskubuntu.com%2fquestions%2f1094198%2fis-there-a-simple-way-to-identify-if-a-pdf-is-scanned%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

5 Answers
5

active

oldest

votes

5 Answers
5

active

oldest

votes

up vote
1
down vote

accepted

Shellscript

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.

In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).

I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.

#!/bin/bash



echo "shellscript $0"

ls --color --group-directories-first

read -p "Is it OK to use this shellscript in this directory? (y/N) " ans

if [ "$ans" != "y" ]

then

 exit

fi



mkdir -p scanned

mkdir -p text

mkdir -p "s-and-t"



for file in *.pdf

do

 grep -aq '/Image/' "$file"

 if [ $? -eq 0 ]

 then

  image=true

 else

  image=false

 fi

 grep -aq '/Text' "$file"

 if [ $? -eq 0 ]

 then

  text=true

 else

  text=false

 fi





 if $image && $text

 then

  mv "$file" "s-and-t"

 elif $image

 then

  mv "$file" "scanned"

 elif $text

 then

  mv "$file" "text"

 else

  echo "$file undecided"

 fi

done

Make the shellscript executable,

chmod ugo+x pdf-text-or-image

Change directory to where you have the pdf files and run the shellscript.

Identified files are moved to the following subdirectories

scanned

text

s-and-t (for documents with both [scanned?] images and text content)

Unidentified file objects, 'UFOs', remain in the current directory.

Test

I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).

$ ./pdf-text-or-image

shellscript ./pdf-text-or-image

s-and-t                                 mkUSB-quick-start-manual-11.pdf    mkUSB-quick-start-manual-nox-11.pdf

scanned                                 mkUSB-quick-start-manual-12-0.pdf  mkUSB-quick-start-manual-nox.pdf

text                                    mkUSB-quick-start-manual-12.pdf    mkUSB-quick-start-manual.pdf

AR-G1002.pdf                            mkUSB-quick-start-manual-74.pdf    OBI-quick-start-manual.pdf

AR-G1003.pdf                            mkUSB-quick-start-manual-75.pdf    oem.pdf

DescriptionoftheOneButtonInstaller.pdf  mkUSB-quick-start-manual-8.pdf     pdf-text-or-image

GrowIt.pdf                              mkUSB-quick-start-manual-9.pdf     pdf-text-or-image0

list-files.pdf                          mkUSB-quick-start-manual-bas.pdf   README.pdf

Is it OK to use this shellscript in this directory? (y/N) y



$ ls -1 *

pdf-text-or-image

pdf-text-or-image0



s-and-t:

DescriptionoftheOneButtonInstaller.pdf

GrowIt.pdf

mkUSB-quick-start-manual-11.pdf

mkUSB-quick-start-manual-12-0.pdf

mkUSB-quick-start-manual-12.pdf

mkUSB-quick-start-manual-8.pdf

mkUSB-quick-start-manual-9.pdf

mkUSB-quick-start-manual.pdf

OBI-quick-start-manual.pdf

README.pdf



scanned:

AR-G1002.pdf



text:

AR-G1003.pdf

list-files.pdf

mkUSB-quick-start-manual-74.pdf

mkUSB-quick-start-manual-75.pdf

mkUSB-quick-start-manual-bas.pdf

mkUSB-quick-start-manual-nox-11.pdf

mkUSB-quick-start-manual-nox.pdf

oem.pdf

Let us hope that

there are no UFOs in your set of files

the sorting is correct concerning text versus scanned/images

edited 10 hours ago

answered yesterday

sudodus

21.2k32770

instead of redirecting to /dev/null you can just use grep -q
– phuclv
22 hours ago

@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago

1

@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago

add a comment |

up vote
1
down vote

accepted

Shellscript

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.

In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).

I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.

#!/bin/bash



echo "shellscript $0"

ls --color --group-directories-first

read -p "Is it OK to use this shellscript in this directory? (y/N) " ans

if [ "$ans" != "y" ]

then

 exit

fi



mkdir -p scanned

mkdir -p text

mkdir -p "s-and-t"



for file in *.pdf

do

 grep -aq '/Image/' "$file"

 if [ $? -eq 0 ]

 then

  image=true

 else

  image=false

 fi

 grep -aq '/Text' "$file"

 if [ $? -eq 0 ]

 then

  text=true

 else

  text=false

 fi





 if $image && $text

 then

  mv "$file" "s-and-t"

 elif $image

 then

  mv "$file" "scanned"

 elif $text

 then

  mv "$file" "text"

 else

  echo "$file undecided"

 fi

done

Make the shellscript executable,

chmod ugo+x pdf-text-or-image

Change directory to where you have the pdf files and run the shellscript.

Identified files are moved to the following subdirectories

scanned

text

s-and-t (for documents with both [scanned?] images and text content)

Unidentified file objects, 'UFOs', remain in the current directory.

Test

I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).

$ ./pdf-text-or-image

shellscript ./pdf-text-or-image

s-and-t                                 mkUSB-quick-start-manual-11.pdf    mkUSB-quick-start-manual-nox-11.pdf

scanned                                 mkUSB-quick-start-manual-12-0.pdf  mkUSB-quick-start-manual-nox.pdf

text                                    mkUSB-quick-start-manual-12.pdf    mkUSB-quick-start-manual.pdf

AR-G1002.pdf                            mkUSB-quick-start-manual-74.pdf    OBI-quick-start-manual.pdf

AR-G1003.pdf                            mkUSB-quick-start-manual-75.pdf    oem.pdf

DescriptionoftheOneButtonInstaller.pdf  mkUSB-quick-start-manual-8.pdf     pdf-text-or-image

GrowIt.pdf                              mkUSB-quick-start-manual-9.pdf     pdf-text-or-image0

list-files.pdf                          mkUSB-quick-start-manual-bas.pdf   README.pdf

Is it OK to use this shellscript in this directory? (y/N) y



$ ls -1 *

pdf-text-or-image

pdf-text-or-image0



s-and-t:

DescriptionoftheOneButtonInstaller.pdf

GrowIt.pdf

mkUSB-quick-start-manual-11.pdf

mkUSB-quick-start-manual-12-0.pdf

mkUSB-quick-start-manual-12.pdf

mkUSB-quick-start-manual-8.pdf

mkUSB-quick-start-manual-9.pdf

mkUSB-quick-start-manual.pdf

OBI-quick-start-manual.pdf

README.pdf



scanned:

AR-G1002.pdf



text:

AR-G1003.pdf

list-files.pdf

mkUSB-quick-start-manual-74.pdf

mkUSB-quick-start-manual-75.pdf

mkUSB-quick-start-manual-bas.pdf

mkUSB-quick-start-manual-nox-11.pdf

mkUSB-quick-start-manual-nox.pdf

oem.pdf

Let us hope that

there are no UFOs in your set of files

the sorting is correct concerning text versus scanned/images

edited 10 hours ago

answered yesterday

sudodus

21.2k32770

instead of redirecting to /dev/null you can just use grep -q
– phuclv
22 hours ago

@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago

1

@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of seaching through the whole files).
– sudodus
10 hours ago

add a comment |

up vote
1
down vote

accepted

Shellscript

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.

In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).

I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.

#!/bin/bash



echo "shellscript $0"

ls --color --group-directories-first

read -p "Is it OK to use this shellscript in this directory? (y/N) " ans

if [ "$ans" != "y" ]

then

 exit

fi



mkdir -p scanned

mkdir -p text

mkdir -p "s-and-t"



for file in *.pdf

do

 grep -aq '/Image/' "$file"

 if [ $? -eq 0 ]

 then

  image=true

 else

  image=false

 fi

 grep -aq '/Text' "$file"

 if [ $? -eq 0 ]

 then

  text=true

 else

  text=false

 fi





 if $image && $text

 then

  mv "$file" "s-and-t"

 elif $image

 then

  mv "$file" "scanned"

 elif $text

 then

  mv "$file" "text"

 else

  echo "$file undecided"

 fi

done

Make the shellscript executable,

chmod ugo+x pdf-text-or-image

Change directory to where you have the pdf files and run the shellscript.

Identified files are moved to the following subdirectories

scanned

text

s-and-t (for documents with both [scanned?] images and text content)

Unidentified file objects, 'UFOs', remain in the current directory.

Test

I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).

$ ./pdf-text-or-image

shellscript ./pdf-text-or-image

s-and-t                                 mkUSB-quick-start-manual-11.pdf    mkUSB-quick-start-manual-nox-11.pdf

scanned                                 mkUSB-quick-start-manual-12-0.pdf  mkUSB-quick-start-manual-nox.pdf

text                                    mkUSB-quick-start-manual-12.pdf    mkUSB-quick-start-manual.pdf

AR-G1002.pdf                            mkUSB-quick-start-manual-74.pdf    OBI-quick-start-manual.pdf

AR-G1003.pdf                            mkUSB-quick-start-manual-75.pdf    oem.pdf

DescriptionoftheOneButtonInstaller.pdf  mkUSB-quick-start-manual-8.pdf     pdf-text-or-image

GrowIt.pdf                              mkUSB-quick-start-manual-9.pdf     pdf-text-or-image0

list-files.pdf                          mkUSB-quick-start-manual-bas.pdf   README.pdf

Is it OK to use this shellscript in this directory? (y/N) y



$ ls -1 *

pdf-text-or-image

pdf-text-or-image0



s-and-t:

DescriptionoftheOneButtonInstaller.pdf

GrowIt.pdf

mkUSB-quick-start-manual-11.pdf

mkUSB-quick-start-manual-12-0.pdf

mkUSB-quick-start-manual-12.pdf

mkUSB-quick-start-manual-8.pdf

mkUSB-quick-start-manual-9.pdf

mkUSB-quick-start-manual.pdf

OBI-quick-start-manual.pdf

README.pdf



scanned:

AR-G1002.pdf



text:

AR-G1003.pdf

list-files.pdf

mkUSB-quick-start-manual-74.pdf

mkUSB-quick-start-manual-75.pdf

mkUSB-quick-start-manual-bas.pdf

mkUSB-quick-start-manual-nox-11.pdf

mkUSB-quick-start-manual-nox.pdf

oem.pdf

Let us hope that

there are no UFOs in your set of files

the sorting is correct concerning text versus scanned/images

edited 10 hours ago

answered yesterday

sudodus

21.2k32770

Shellscript

If a pdf file contains an image (inserted in a document alongside text or as whole pages, 'scanned pdf'), the file often (maybe always) contains the string /Image/.

In the same way you can search for the string /Text to tell if a pdf file contains text (not scanned).

I made the shellscript pdf-text-or-image, and it might work in most cases with your files. The shellscript looks for the text strings /Image/ and /Text in the pdf files.

#!/bin/bash



echo "shellscript $0"

ls --color --group-directories-first

read -p "Is it OK to use this shellscript in this directory? (y/N) " ans

if [ "$ans" != "y" ]

then

 exit

fi



mkdir -p scanned

mkdir -p text

mkdir -p "s-and-t"



for file in *.pdf

do

 grep -aq '/Image/' "$file"

 if [ $? -eq 0 ]

 then

  image=true

 else

  image=false

 fi

 grep -aq '/Text' "$file"

 if [ $? -eq 0 ]

 then

  text=true

 else

  text=false

 fi





 if $image && $text

 then

  mv "$file" "s-and-t"

 elif $image

 then

  mv "$file" "scanned"

 elif $text

 then

  mv "$file" "text"

 else

  echo "$file undecided"

 fi

done

Make the shellscript executable,

chmod ugo+x pdf-text-or-image

Change directory to where you have the pdf files and run the shellscript.

Identified files are moved to the following subdirectories

scanned

text

s-and-t (for documents with both [scanned?] images and text content)

Unidentified file objects, 'UFOs', remain in the current directory.

Test

I tested the shellscript with two of your files, AR-G1002.pdf and AR-G1003.pdf, and with some own pdf files (that I have created using Libre Office Impress).

$ ./pdf-text-or-image
shellscript ./pdf-text-or-image
s-and-t                                 mkUSB-quick-start-manual-11.pdf    mkUSB-quick-start-manual-nox-11.pdf
scanned                                 mkUSB-quick-start-manual-12-0.pdf  mkUSB-quick-start-manual-nox.pdf
text                                    mkUSB-quick-start-manual-12.pdf    mkUSB-quick-start-manual.pdf
AR-G1002.pdf                            mkUSB-quick-start-manual-74.pdf    OBI-quick-start-manual.pdf
AR-G1003.pdf                            mkUSB-quick-start-manual-75.pdf    oem.pdf
DescriptionoftheOneButtonInstaller.pdf  mkUSB-quick-start-manual-8.pdf     pdf-text-or-image
GrowIt.pdf                              mkUSB-quick-start-manual-9.pdf     pdf-text-or-image0
list-files.pdf                          mkUSB-quick-start-manual-bas.pdf   README.pdf
Is it OK to use this shellscript in this directory? (y/N) y

$ ls -1 *
pdf-text-or-image
pdf-text-or-image0

s-and-t:
DescriptionoftheOneButtonInstaller.pdf
GrowIt.pdf
mkUSB-quick-start-manual-11.pdf
mkUSB-quick-start-manual-12-0.pdf
mkUSB-quick-start-manual-12.pdf
mkUSB-quick-start-manual-8.pdf
mkUSB-quick-start-manual-9.pdf
mkUSB-quick-start-manual.pdf
OBI-quick-start-manual.pdf
README.pdf

scanned:
AR-G1002.pdf

text:
AR-G1003.pdf
list-files.pdf
mkUSB-quick-start-manual-74.pdf
mkUSB-quick-start-manual-75.pdf
mkUSB-quick-start-manual-bas.pdf
mkUSB-quick-start-manual-nox-11.pdf
mkUSB-quick-start-manual-nox.pdf
oem.pdf

Let us hope that

- there are no UFOs in your set of files
- the sorting is correct concerning text versus scanned/images

edited 10 hours ago

answered yesterday

sudodus


instead of redirecting to /dev/null you can just use grep -q
– phuclv
22 hours ago

@phuclv, Thanks for the tip :-)
– sudodus
17 hours ago

@phuclv, Thanks for the tip :-) This makes it somewhat faster too, particularly with big files, because grep -q exits immediately with zero status if any match is found (instead of searching through the whole file).
– sudodus
10 hours ago


up vote
6
down vote

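Put all the .pdf files in one folder.

Make sure there are no .txt files in that folder, because the loop below creates and deletes them.

In a terminal, change directory to that folder with cd <path to dir>

Make one more directory for the non-scanned files. Example:

    mkdir ./x

    for file in *.pdf; do
        pdftotext "$file"    # writes the text to "${file%.pdf}.txt"
        # move the pdf if the extracted text holds any non-blank character
        if grep -q '[^[:space:]]' "${file%.pdf}.txt"; then mv "$file" ./x; fi
        rm -f "${file%.pdf}.txt"
    done

The scanned pdf files (those without any extractable text) will remain in the folder and the other files will move to ./x.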
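Scanned files that went through OCR still yield some text, so an all-or-nothing test can misfile them. One possible refinement is to compare the amount of extracted text per page against a threshold. The following is only a sketch: the function names and the default threshold of 1000 bytes of text per page are assumptions chosen for illustration, and it assumes pdftotext and pdfinfo from poppler-utils are installed.

```shell
#!/bin/bash

# Hypothetical helper: pure arithmetic, so it can be checked without any PDF.
# $1 = bytes of extracted text, $2 = number of pages,
# $3 = minimum bytes per page to count as a text pdf (assumed default: 1000)
classify_by_density() {
    local chars=$1 pages=$2 threshold=${3:-1000}
    if [ "$pages" -le 0 ]; then
        echo "undecided"
    elif [ $(( chars / pages )) -ge "$threshold" ]; then
        echo "text"
    else
        echo "scanned"
    fi
}

# Hypothetical wrapper around the poppler tools for a single file.
classify_pdf() {
    local file=$1 chars pages
    # "-" makes pdftotext write to stdout; wc -c counts bytes, not characters
    chars=$(pdftotext "$file" - 2>/dev/null | wc -c)
    pages=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ { print $2 }')
    classify_by_density "${chars:-0}" "${pages:-0}"
}
```

It could be used as: for f in *.pdf; do printf '%s\t%s\n' "$(classify_pdf "$f")" "$f"; done. The threshold would have to be tuned on a sample of the real documents.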

edited yesterday

dessert


answered yesterday

Hobbyist


this is great. However, this file goes to the other folder and it is scanned: drive.google.com/open?id=12xIQdRo_cyTf27Ck6DQKvRyRvlkYEzjl What is happening?
– DanielTheRocketMan
yesterday

Scanned PDFs often contain the OCRed text content, so I'd guess that simple test would fail for them. A better indicator might be one large image per page, regardless of text content.
– Joey
yesterday
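The "one large image per page" idea can be approximated with pdfimages and pdfinfo from poppler-utils. The sketch below counts the pages that hold at least one wide image and compares that with the page count. The function names and the 1000-pixel width threshold are assumptions for illustration, and scans whose pages are stored as masks or split into strips would need a looser rule.

```shell
#!/bin/bash

# Hypothetical helper: true when every page carries a large image.
# $1 = pages with a large image, $2 = total pages
all_pages_are_images() {
    local image_pages=$1 total_pages=$2
    [ "$total_pages" -gt 0 ] && [ "$image_pages" -ge "$total_pages" ]
}

# Count distinct pages that hold an image at least $1 pixels wide.
# Reads the table printed by "pdfimages -list" on stdin
# (two header lines, then: page num type width height ...).
count_image_pages() {
    awk -v min="${1:-1000}" '
        NR > 2 && $3 == "image" && $4 >= min && !seen[$1]++ { n++ }
        END { print n + 0 }'
}

# Hypothetical wrapper for one file.
looks_scanned() {
    local file=$1 image_pages total_pages
    image_pages=$(pdfimages -list "$file" 2>/dev/null | count_image_pages 1000)
    total_pages=$(pdfinfo "$file" 2>/dev/null | awk '/^Pages:/ { print $2 }')
    all_pages_are_images "${image_pages:-0}" "${total_pages:-0}"
}
```

A possible use: for f in *.pdf; do looks_scanned "$f" && echo "$f"; done, which prints only the files where every page carries a large image.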


Downvoted because of the very obvious flaw: how do you know if the files are scanned or not in the first place? That's what the OP is asking: how to programmatically test for scanned or not.
– jamesqf
yesterday

@DanielTheRocketMan The version of the PDF file is likely having an impact on the tool you are using to select text. The output of file pdf-filename.pdf includes a version number. I was unable to search for specific text in BR-L1411-3.pdf (BR-L1411-3.pdf: PDF document, version 1.3), but I was able to search for text in both of the other files you provided, which are versions 1.5 and 1.6, and got one or more matches. I used PDF-XChange Viewer to search these files and had similar results with evince: the version 1.3 document matched nothing.
– Elder Geek
yesterday

@DanielTheRocketMan If that's the case, you might find sorting the documents by version, using the output of file, helpful in completing your project. Although I, like others it seems, am still unclear on exactly what you are attempting to accomplish.
– Elder Geek
yesterday
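Grouping the documents by PDF version, as suggested in the comment above, could look like the sketch below. It parses the line printed by file; the function names are hypothetical, and whether the version really correlates with scanning is a conjecture from the comments, not something confirmed here.

```shell
#!/bin/bash

# Hypothetical helper: pull the version number out of one line of "file" output,
# e.g. "BR-L1411-3.pdf: PDF document, version 1.3". Prints nothing for non-PDFs.
version_from_file_output() {
    printf '%s\n' "$1" | sed -n 's/.*PDF document, version \([0-9][0-9.]*\).*/\1/p'
}

# Print "version<TAB>filename" for every pdf in the current directory, sorted.
list_pdfs_by_version() {
    local f
    for f in *.pdf; do
        [ -e "$f" ] || continue    # skip the literal pattern when nothing matches
        printf '%s\t%s\n' "$(version_from_file_output "$(file -- "$f")")" "$f"
    done | sort
}
```

Running list_pdfs_by_version in the directory with the documents prints one line per file, so the files with an old version such as 1.3 end up together at the top.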


up vote
1
down vote

answered yesterday

ichabod


New contributor

I am also trying to use Python, but it is not trivial to know whether a pdf is scanned or not. The point is that even documents in which you cannot select text yield some text when converted to txt. For instance, I am using pdfminer in Python and I can find some text in the conversion even for pdfs where the select tool does not work.
– DanielTheRocketMan
yesterday


up vote
1
down vote

If this is more about detecting whether a PDF was created by scanning, rather than whether it has images instead of text, then you might need to dig into the metadata of the file, not just its content.

In general, for the files I could find on my computer and your test files, the following is true:

- Scanned files have fewer than 1000 chars/page, whereas non-scanned ones always have more than 1000 chars/page.
- Multiple independent scanned files had "Canon" listed as the PDF creator, probably referring to Canon scanner software.
- PDFs with "Microsoft Word" as creator are likely not scanned, as they are Word exports. But someone could scan to Word and then export to PDF; some people have very strange workflows.

I'm using Windows at the moment, so I used Node.js for the following example:

const fs = require("mz/fs");
const pdf_parse = require("pdf-parse");
const path = require("path");

// Command-line flags
const SHOW_SCANNED_ONES = process.argv.indexOf("scanned") != -1;
const DEBUG = process.argv.indexOf("debug") != -1;
const STRICT = process.argv.indexOf("strict") != -1;

// Debug output goes to stderr so that stdout carries only file paths
const debug = DEBUG ? console.error : () => { };

(async () => {
    const pdfs = (await fs.readdir(".")).filter((fname) => { return fname.endsWith(".pdf") });

    for (let i = 0, l = pdfs.length; i < l; ++i) {
        const pdffilename = pdfs[i];
        try {
            debug("\n\nFILE: ", pdffilename);
            const buffer = await fs.readFile(pdffilename);
            const data = await pdf_parse(buffer);

            // Make sure the fields used below always exist
            if (!data.info)
                data.info = {};
            if (!data.metadata) {
                data.metadata = {
                    _metadata: {}
                };
            }

            // PDF info
            debug(data.info);
            // PDF metadata
            debug(data.metadata);
            // text length
            const textLen = data.text ? data.text.length : 0;
            const textPerPage = textLen / (data.numpages);
            debug("Text length: ", textLen);
            debug("Chars per page: ", textPerPage);
            // PDF.js version
            // check https://mozilla.github.io/pdf.js/getting_started/
            debug(data.version);

            if (evalScanned(data, textLen, textPerPage) == SHOW_SCANNED_ONES) {
                console.log(path.resolve(".", pdffilename));
            }
        }
        catch (e) {
            if (STRICT && !DEBUG) {
                console.error("Failed to evaluate " + pdffilename);
            }
            else {
                debug("Failed to evaluate " + pdffilename);
                debug(e.stack);
            }
            if (STRICT) {
                process.exit(1);
            }
        }
    }
})();

const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;
// just defined for better clarity of return values
const IS_SCANNED = true;
const IS_NOT_SCANNED = false;

function evalScanned(pdfdata, textLen, textPerPage) {
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        // really low number, definitely not a text pdf
        return IS_SCANNED;
    }
    // with more than 1000 chars/page there is definitely enough text
    // (it might still be scanned but OCRed);
    // we return this if no suspicion of scanning is found
    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;
    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {
        // this is always scanned, Canon is a brand name
        return IS_SCANNED;
    }
    return implicitAssumption;
}

To run it, you need to have Node.js installed (should be a single command) and you also need to call:

npm install mz pdf-parse

Usage:

node howYouNamedIt.js [scanned] [debug] [strict]

 - scanned shows PDFs thought to be scanned (otherwise shows those thought not to be scanned)
 - debug shows debug info such as metadata and error stack traces
 - strict kills the program on the first error

This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:

FILE:  BR-L1411-3-scanned.pdf
{ PDFFormatVersion: '1.3',
  IsAcroFormPresent: false,
  IsXFAPresent: false,
  Creator: 'Canon ',
  Producer: ' ',
  CreationDate: 'D:20131212150500-03\'00\'',
  ModDate: 'D:20140709104225-03\'00\'' }
Metadata {
  _metadata:
   { 'xmp:createdate': '2013-12-12T15:05-03:00',
     'xmp:creatortool': 'Canon',
     'xmp:modifydate': '2014-07-09T10:42:25-03:00',
     'xmp:metadatadate': '2014-07-09T10:42:25-03:00',
     'pdf:producer': '',
     'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',
     'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',
     'dc:format': 'application/pdf' } }
Text length:  772
Chars per page:  2
1.10.100
D:\web\so-odpovedi\pdf\BR-L1411-3-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js scanned
D:\xxxx\pdf\AR-G1002-scanned.pdf
D:\xxxx\pdf\AR-G1002_scanned.pdf
D:\xxxx\pdf\BR-L1411-3-scanned.pdf
D:\xxxx\pdf\WHO_TRS_696-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js
D:\xxxx\pdf\AR-G1003-not-scanned.pdf
D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf
D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf
D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf

You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can also pass the output of the program to other programs; it always has one full path per line.
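Because the program prints one full path per line, its output can be piped into a small shell loop. The sketch below is one possible consumer, assuming a bash-like shell; the function name move_listed_files and the target directory name are made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: read one path per line on stdin and move each
# existing regular file into the directory given as $1.
move_listed_files() {
    local dest=$1 path
    mkdir -p -- "$dest"
    while IFS= read -r path; do
        if [ -f "$path" ]; then
            mv -- "$path" "$dest"/
        fi
    done
    return 0
}
```

With the script saved as detect_scanned.js, as in the sample run above, this could be used as: node detect_scanned.js scanned | move_listed_files scanned. Reading with IFS= and read -r keeps paths that contain spaces or backslashes intact.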

answered yesterday

Tomáš Zato


Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago


function evalScanned(pdfdata, textLen, textPerPage) {

    if (textPerPage < 300 && pdfdata.numpages>1) {

        // really low number, definitelly not text pdf

        return IS_SCANNED;

    }

    // definitelly has enough text

    // might be scanned but OCRed

    // we return this if no 

    // suspition of scanning is found

    let implicitAssumption = textPerPage > 1000 ? IS_NOT_SCANNED : IS_SCANNED;

    if (IS_CREATOR_CANON.test(pdfdata.info.Creator)) {

        // this is always scanned, canon is brand name

        return IS_SCANNED;

    }

    return implicitAssumption;

}
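The script defines IS_CREATOR_MS_WORD but never uses it. As a rough sketch of how the "Microsoft Word" observation could be folded in, the variant below treats a Word creator as evidence against scanning. The function name evalScannedWithWord and the order of the checks are assumptions made for illustration, not part of the script above; the thresholds are the same ones it uses.

```javascript
const IS_CREATOR_CANON = /canon/i;
const IS_CREATOR_MS_WORD = /microsoft.*?word/i;

// Hypothetical variant of evalScanned: returns true when the PDF looks scanned.
function evalScannedWithWord(pdfdata, textPerPage) {
    // Tolerate a missing info object or a missing Creator field
    const creator = (pdfdata.info && pdfdata.info.Creator) || "";
    if (textPerPage < 300 && pdfdata.numpages > 1) {
        return true; // almost no text on a multi-page file
    }
    if (IS_CREATOR_CANON.test(creator)) {
        return true; // scanner software named as the creator
    }
    if (IS_CREATOR_MS_WORD.test(creator)) {
        return false; // assumption: a Word export with some text is not a scan
    }
    return textPerPage <= 1000; // same fallback threshold as the script above
}
```

Whether the Word rule should win over the 1000 chars/page fallback depends on your documents, so treat the ordering as something to tune with the debug output.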

To run the script, you need to have Node.js installed (should be a single command), and you also need to install its dependencies:

npm install mz pdf-parse

Usage:

node howYouNamedIt.js [scanned] [debug] [strict]



 - scanned shows PDFs thought to be scanned (otherwise the ones thought not to be scanned are shown)

 - debug shows debug info such as metadata and error stack traces

 - strict kills the program on the first error

This example is not a finished solution, but with the debug flag you get some insight into the meta information of a file:

FILE:  BR-L1411-3-scanned.pdf

{ PDFFormatVersion: '1.3',

  IsAcroFormPresent: false,

  IsXFAPresent: false,

  Creator: 'Canon ',

  Producer: ' ',

  CreationDate: 'D:20131212150500-03\'00\'',

  ModDate: 'D:20140709104225-03\'00\'' }

Metadata {

  _metadata:

   { 'xmp:createdate': '2013-12-12T15:05-03:00',

     'xmp:creatortool': 'Canon',

     'xmp:modifydate': '2014-07-09T10:42:25-03:00',

     'xmp:metadatadate': '2014-07-09T10:42:25-03:00',

     'pdf:producer': '',

     'xmpmm:documentid': 'uuid:79a14710-88e2-4849-96b1-512e89ee8dab',

     'xmpmm:instanceid': 'uuid:1d2b2106-a13f-48c6-8bca-6795aa955ad1',

     'dc:format': 'application/pdf' } }

Text length:  772

Chars per page:  2

1.10.100

D:\web\so-odpovedi\pdf\BR-L1411-3-scanned.pdf

D:\xxxx\pdf>node detect_scanned.js scanned

D:\xxxx\pdf\AR-G1002-scanned.pdf

D:\xxxx\pdf\AR-G1002_scanned.pdf

D:\xxxx\pdf\BR-L1411-3-scanned.pdf

D:\xxxx\pdf\WHO_TRS_696-scanned.pdf



D:\xxxx\pdf>node detect_scanned.js

D:\xxxx\pdf\AR-G1003-not-scanned.pdf

D:\xxxx\pdf\ASEE_-_thermoelectric_paper_-_final-not-scanned.pdf

D:\xxxx\pdf\MULTIMODE ABSORBER-not-scanned.pdf

D:\xxxx\pdf\ReductionofOxideMineralsbyHydrogenPlasma-not-scanned.pdf

You can use the debug mode along with a tiny bit of programming to vastly improve your results. You can also pass the output of the program to other programs; it always prints one full path per line.
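Because the output is one full path per line, a consuming program only has to split it on line breaks. The following is a minimal sketch of such a consumer; the helper name parsePathList and the file name consume.js are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: turn the detector's stdout into a list of paths.
function parsePathList(output) {
    return output
        .split(/\r?\n/)                     // accept Unix and Windows line endings
        .map((line) => line.trim())         // strip stray whitespace
        .filter((line) => line.length > 0); // drop blank lines
}

// Possible wiring when input is piped on stdin (file descriptor 0):
// const paths = parsePathList(require("fs").readFileSync(0, "utf8"));
```

With the wiring line enabled, it could be used as node detect_scanned.js scanned | node consume.js. Debug output goes to stderr, so it does not end up in the piped list.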

answered yesterday

Tomáš Zato

169113


Re "Microsoft Word" as creator, that's going to depend on the source of the original documents. If for instance they're scientific papers, many if not most are going to have been created by something in the LaTeX toolchain.
– jamesqf
5 hours ago


up vote
0
down vote

Two ways I can think of:

Using the text selection tool: in a scanned PDF the text cannot be selected; a box appears instead. You can use this fact to create the script. I know there is a way to do it in C++ with Qt, but I'm not sure about Linux.

Searching for a word in the file: in a non-scanned PDF the search will work, but not in a scanned file. You just need to find some words common to all PDFs, or rather search for the letter 'e' in all the PDFs. It is the most frequent letter, so chances are you will find it in every document that has text (unless it's Gadsby).

grep -rnw '/path/to/pdf/' -e 'e'

Use any of the text processing tools
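Text inside a PDF is often stored in compressed streams, so a search over the raw bytes may miss it. A hedged sketch of the letter check, assuming the text has already been extracted by some tool (for example pdftotext or pdf-parse); the helper names countLetter and hasSearchableText are hypothetical.

```javascript
// Hypothetical helper: count how often a letter occurs in extracted text.
function countLetter(extractedText, letter = "e") {
    const target = letter.toLowerCase();
    let count = 0;
    for (const ch of extractedText.toLowerCase()) {
        if (ch === target) count += 1;
    }
    return count;
}

// Hypothetical check: call a document searchable if the letter occurs at all.
function hasSearchableText(extractedText, letter = "e") {
    return countLetter(extractedText, letter) > 0;
}
```

A scanned file with a poor OCR layer can still pass this check, so a count per page is likely more informative than a plain yes or no.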

edited yesterday

phuclv

318224

answered yesterday

swapedoc

416

1

a scanned PDF can also have selectable texts because OCR is not a strange thing nowadays and even many free PDF readers have OCR feature
– phuclv
yesterday

@phuclv: But if the file was converted to text with OCR, it is no longer a "scanned" file, at least as I understand the OP's purpose. Though really you'd now have 3 types of pdf files: text ab initio, text from OCR, and "text" that is a scanned image.
– jamesqf
yesterday

1

@jamesqf please look at the example above. They are scanned pdf. Most of the text I cannot retrieve using a conventional pdfminer.
– DanielTheRocketMan
yesterday

1

i think the op needs to rethink/rephrase the definition of scanned in that case or stop using acrobat x, which takes scanned copy and takes it as an ocr rather than image
– swapedoc
yesterday

