Which file types can the Crawler Connector extract text content from?

A quick investigation suggests that crawled Word documents get their content extracted, but PDFs do not.

I can't seem to find (hehe) anything on this in the docs.

#207429
Sep 19, 2019 16:59

Johan, I'm researching this with the Find team. Do you have examples of PDF documents whose text is not being extracted properly?

#207438
Sep 19, 2019 20:17

Having followed up on my own, it seems the customer has a lot of scanned paper documents saved as PDFs rather than real documents "exported to PDF".

The customer compares this with Google search, where Google seems to run image text recognition on such files when indexing.

I've now also found crawled PDFs where SearchText has extracted text.

#207462
Sep 20, 2019 9:39

Johan,

Are you saying that you don't have examples of PDF documents whose text is not being extracted properly?

#207471
Sep 20, 2019 16:11

Yes. After looking further, the examples I have are essentially just scanned A4 pages (images saved as PDF), which I can understand you don't support extracting text from.

The examples I have of PDFs that they've exported from an actual digital document seem to get text extracted as expected.

I guess "character recognition in images to put in SearchText of the WebContent if the crawled file looks like a scanned document" would be my feature request. I think Azure Cognitive Services would do a good job with this.

For this thread, a list of which file types get text extracted would still be welcome.
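
To make the feature request a bit more concrete, here is a minimal sketch in Python of the "looks like a scanned document" check. It is not how the Crawler Connector or Find works. The function name, the characters-per-page idea, and the threshold of 50 are all my own assumptions, and the inputs (the text already extracted from the file and its page count) would have to come from whatever the crawler uses today.

```python
def looks_like_scanned_document(extracted_text: str, page_count: int,
                                min_chars_per_page: int = 50) -> bool:
    """Hypothetical heuristic: flag a file as a probable scan when the
    text extracted from it is very sparse relative to its page count.

    The threshold is an assumed starting value, not a tested one.
    """
    if page_count <= 0:
        # No pages reported, so there is nothing to judge.
        return False
    # Count only non-whitespace characters so layout padding is ignored.
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars / page_count < min_chars_per_page
```

Files flagged this way could then be handed to an OCR service, and the recognized text stored in SearchText instead of the near-empty extraction.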

#207477
Edited, Sep 20, 2019 16:59

Word, Excel, PowerPoint, PDF

#207480
Sep 20, 2019 18:35