Class TikaDocumentParser

java.lang.Object
org.craftercms.search.opensearch.impl.AbstractDocumentParser
org.craftercms.search.opensearch.impl.tika.TikaDocumentParser
All Implemented Interfaces:
DocumentParser

public class TikaDocumentParser extends AbstractDocumentParser
Implementation of DocumentParser that uses Apache Tika
Author:
joseross
  • Field Details

    • charLimit

      protected int charLimit
      The maximum number of characters to parse from the document. Defaults to 0, in which case only metadata is parsed.
    • objectMapper

      protected com.fasterxml.jackson.databind.ObjectMapper objectMapper
      Jackson ObjectMapper instance
    • metadataExtractors

      protected final List<MetadataExtractor<org.apache.tika.metadata.Metadata>> metadataExtractors
      List of metadata extractors to apply after parsing documents
    • tika

      protected org.apache.tika.Tika tika
      Apache Tika instance
    • fileTypeMap

      protected final jakarta.activation.FileTypeMap fileTypeMap
  • Constructor Details

    • TikaDocumentParser

      public TikaDocumentParser(List<MetadataExtractor<org.apache.tika.metadata.Metadata>> metadataExtractors)
  • Method Details

    • setCharLimit

      public void setCharLimit(int charLimit)
    • setObjectMapper

      public void setObjectMapper(com.fasterxml.jackson.databind.ObjectMapper objectMapper)
    • setTika

      public void setTika(org.apache.tika.Tika tika)
    • parseToXml

      public String parseToXml(String filename, org.springframework.core.io.Resource resource, Map<String,Object> additionalFields)
      Parses the given document and generates an XML document from it
      Parameters:
      filename - the name of the file
      resource - the document to parse
      additionalFields - additional fields to add
      Returns:
      an XML document ready to be indexed
    • extractMetadata

      protected String extractMetadata(String filename, org.springframework.core.io.Resource resource, String parsedContent, org.apache.tika.metadata.Metadata metadata, Map<String,Object> additionalFields)
      Prepares the document to be indexed
      Parameters:
      resource - the content of the parsed file
      metadata - the metadata of the parsed file
      additionalFields - additional fields to be added
      Returns:
      the XML ready to be indexed
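
Example (illustrative only):
The following is a minimal, dependency-free sketch of the flow that the charLimit field and the parseToXml method describe: limit the parsed text, merge it with the metadata and the additional fields, and emit XML. It does not use Apache Tika or Jackson and is not the class's implementation. The class name ParseToXmlSketch, the helpers limitContent, escape and toXml, and the element names doc and content are all hypothetical. The only behavior taken from the documentation above is that a charLimit of 0 leaves only metadata to be indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseToXmlSketch {

    // Hypothetical helper: keeps at most charLimit characters of the parsed text.
    // With charLimit = 0 no body text is kept, so only metadata is indexed.
    static String limitContent(String parsedContent, int charLimit) {
        if (parsedContent == null || charLimit <= 0) {
            return "";
        }
        return parsedContent.length() <= charLimit
                ? parsedContent
                : parsedContent.substring(0, charLimit);
    }

    // Escapes the five XML special characters in a text value.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Hypothetical stand-in for the final step: one XML element per field.
    // Field names are assumed to be valid XML element names.
    static String toXml(Map<String, String> metadata, String content,
                        Map<String, Object> additionalFields) {
        Map<String, Object> fields = new LinkedHashMap<>(metadata);
        if (!content.isEmpty()) {
            fields.put("content", content); // assumed element name
        }
        if (additionalFields != null) {
            fields.putAll(additionalFields);
        }
        StringBuilder xml = new StringBuilder("<doc>"); // assumed root element
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            xml.append('<').append(field.getKey()).append('>')
               .append(escape(String.valueOf(field.getValue())))
               .append("</").append(field.getKey()).append('>');
        }
        return xml.append("</doc>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Annual Report");

        // charLimit = 0: the body text is dropped and only metadata remains
        String content = limitContent("Full text of the report", 0);
        System.out.println(toXml(metadata, content, Map.of("site", "demo")));
    }
}
```

In the real class, text extraction would go through the configured tika instance and serialization through the configured objectMapper; how those are wired, and the exact XML layout, are not specified on this page.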