Which file types can the Crawler Connector extract text content from?

Johan Kronberg

Some quick investigation suggests that crawled Word documents get their content extracted, but PDFs do not.

I can't seem to find (hehe) anything on this in the docs.

#207429

Sep 19, 2019 16:59

Bob Bolt

Johan, I'm researching this with the Find team. Do you have examples of PDF documents whose text is not being extracted properly?

#207438

Sep 19, 2019 20:17

After following up on my own, it seems the customer has a lot of scanned paper documents saved as PDFs rather than real documents "exported to PDF".

The customer compares this with Google search, where Google seems to run image text recognition on those files when indexing.

I've now also found crawled PDFs where SearchText has extracted text.

#207462

Sep 20, 2019 9:39

Bob Bolt

Johan,

Are you saying that you don't have examples of PDF documents whose text is not being extracted properly?

#207471

Sep 20, 2019 16:11

Yes. After looking further, the examples I have are essentially just scanned A4 pages (images saved as PDF), which I can understand you don't support extracting text from.

The examples I have of PDFs that they've exported from an actual digital document seem to get text extracted as expected.

I guess "character recognition in images to put in SearchText of the WebContent if the crawled file looks like a scanned document" would be my feature request. I think Azure Cognitive Services would do a good job with this.
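A minimal sketch of the "looks like a scanned document" check, in Python. It assumes the crawler already has whatever text it managed to extract plus a page count. The function name, its parameters, and the 20-characters-per-page threshold are all hypothetical and not part of Find or the Crawler Connector.

```python
def looks_scanned(extracted_text: str, page_count: int, min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is a scanned image with no usable text layer.

    Hypothetical heuristic: a PDF exported from a digital document normally
    yields plenty of text per page, while a scanned page yields little or none.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so stray line breaks don't count as text.
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars / page_count < min_chars_per_page


# A scanned A4 page typically produces an empty extraction result.
print(looks_scanned("", page_count=1))
# An exported report with real text on each page.
print(looks_scanned("Quarterly report. " * 50, page_count=2))
```

Files flagged this way could then be handed to an OCR service before SearchText is populated. The threshold would need tuning against real crawled documents.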

For this thread, a list of which file types get text extracted would still be welcome.

#207477

Edited, Sep 20, 2019 16:59

Bob Bolt

Word, Excel, PowerPoint, PDF
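
To check up front which crawled URLs fall into that list, a rough Python sketch might look like this. The extension-to-format mapping is an assumption (the list names formats, not file extensions), and `extracts_text` is a hypothetical helper, not a Find API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extensions for the four listed formats; not confirmed by the Find team.
EXTRACTABLE_EXTENSIONS = {
    ".doc": "Word", ".docx": "Word",
    ".xls": "Excel", ".xlsx": "Excel",
    ".ppt": "PowerPoint", ".pptx": "PowerPoint",
    ".pdf": "PDF",
}


def extracts_text(url: str) -> bool:
    """Return True if the URL's file extension maps to one of the listed formats."""
    path = urlparse(url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in EXTRACTABLE_EXTENSIONS


print(extracts_text("https://example.com/files/Report.PDF"))
print(extracts_text("https://example.com/images/scan.png"))
```

As noted earlier in the thread, a matching extension does not guarantee text: a scanned image saved as PDF still yields nothing to extract.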

#207480

Sep 20, 2019 18:35
