A quick investigation suggests that crawled Word documents get their content extracted, but PDFs do not.
I can't find (hehe) anything on this in the docs, it seems.
Johan, I'm researching this with the Find team. Do you have examples of PDF documents whose text is not being extracted properly?
Having followed up on my own, it seems the customer has a lot of scanned paper documents saved as PDFs rather than real documents "exported to PDF".
The customer compares with Google search, where Google seems to run image text recognition on those files when indexing.
I've now also found crawled PDFs where SearchText has extracted text.
Are you saying that you don't have examples of PDF documents whose text is not being extracted properly?
Yes. After looking further, the examples I have are essentially just a scanned A4 page (an image saved as PDF), which I can understand you don't support extracting text from.
The examples I have of PDFs that they've exported from an actual digital document seem to get text extracted as expected.
I guess "character recognition in images to put in SearchText of the WebContent if the crawled file looks like a scanned document" would be my feature request. I think Azure Cognitive Services would do a good job with this.
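To make the "looks like a scanned document" part a bit more concrete, here is a rough sketch in Python of one way such a check could work. It is only a heuristic under my own assumptions, not anything Find does today: the function name `looks_scanned`, its parameters, and the threshold of 50 characters per page are all made up for illustration. The idea is that a scanned PDF yields almost no text from normal extraction, so very little text per page hints that OCR would be worth running.

```python
def looks_scanned(extracted_text: str, page_count: int,
                  min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is a scanned image without a usable text layer.

    Hypothetical helper: the threshold is an arbitrary assumption and
    would need tuning against real crawled documents.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so blank padding is not treated as text.
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    # Very little text per page suggests the pages are images.
    return visible_chars / page_count < min_chars_per_page
```

A crawler could call this with whatever ordinary text extraction returned, and only send the file to an OCR service when it returns True, so that properly exported PDFs are not processed twice.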
For this thread, a list of which file types get text extracted would still be welcome.
Word, Excel, PowerPoint, PDF