Class TikaDocumentParser
java.lang.Object
org.craftercms.search.opensearch.impl.AbstractDocumentParser
org.craftercms.search.opensearch.impl.tika.TikaDocumentParser
All Implemented Interfaces:
DocumentParser

Implementation of DocumentParser that uses Apache Tika.

Author:
joseross

Field Summary
Fields
Modifier and Type | Field | Description
protected int | charLimit | The maximum number of characters to parse from the document.
protected final jakarta.activation.FileTypeMap | fileTypeMap |
protected final List<MetadataExtractor<org.apache.tika.metadata.Metadata>> | metadataExtractors | List of metadata extractors to apply after parsing documents
protected com.fasterxml.jackson.databind.ObjectMapper | objectMapper | Jackson ObjectMapper instance
protected org.apache.tika.Tika | tika | Apache Tika instance

Fields inherited from class org.craftercms.search.opensearch.impl.AbstractDocumentParser
fieldNameAuthor, fieldNameContent, fieldNameContentType, fieldNameCreated, fieldNameDescription, fieldNameKeywords, fieldNameModified, fieldNAmeTitle

Constructor Summary
Constructors
Constructor | Description
TikaDocumentParser(List<MetadataExtractor<org.apache.tika.metadata.Metadata>> metadataExtractors) |

Method Summary
Modifier and Type | Method | Description
protected String | extractMetadata(String filename, org.springframework.core.io.Resource resource, String parsedContent, org.apache.tika.metadata.Metadata metadata, Map<String, Object> additionalFields) | Prepares the document to be indexed
String | parseToXml(String filename, org.springframework.core.io.Resource resource, Map<String, Object> additionalFields) | Parses the given document and generates an XML file
void | setCharLimit(int charLimit) |
void | setObjectMapper(com.fasterxml.jackson.databind.ObjectMapper objectMapper) |
void | setTika(org.apache.tika.Tika tika) |

Methods inherited from class org.craftercms.search.opensearch.impl.AbstractDocumentParser
setFieldNameAuthor, setFieldNameContent, setFieldNameContentType, setFieldNameCreated, setFieldNameDescription, setFieldNameKeywords, setFieldNameModified, setFieldNAmeTitle

Field Details

charLimit
protected int charLimit
The maximum number of characters to parse from the document. Defaults to 0 to parse only metadata.
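
The effect of the limit can be pictured with a small, library-free sketch. This is not the class's actual implementation: the helper `limitChars` and the class `CharLimitSketch` are hypothetical names, and the sketch assumes the limit acts as a plain cap on the extracted text, with 0 leaving no body text (so only metadata would remain to index). Negative values are not described on this page, so the sketch rejects them.

```java
final class CharLimitSketch {

    // Hypothetical helper, not part of TikaDocumentParser: caps extracted
    // text at charLimit characters. A limit of 0 keeps no body text at all.
    static String limitChars(String parsedContent, int charLimit) {
        if (charLimit < 0) {
            // Assumption: negative limits are undocumented here, so reject them.
            throw new IllegalArgumentException("charLimit must be >= 0");
        }
        if (parsedContent.length() <= charLimit) {
            return parsedContent; // already within the limit
        }
        return parsedContent.substring(0, charLimit);
    }

    public static void main(String[] args) {
        String text = "Quarterly report: revenue grew in all regions.";
        System.out.println("[" + limitChars(text, 0) + "]");  // prints "[]"
        System.out.println("[" + limitChars(text, 16) + "]"); // prints "[Quarterly report]"
    }
}
```

On a real parser instance the limit is changed through `setCharLimit(int)`; a value above the default of 0 is what would allow document text, and not only metadata, to be indexed.
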

objectMapper
protected com.fasterxml.jackson.databind.ObjectMapper objectMapper
Jackson ObjectMapper instance

metadataExtractors
protected final List<MetadataExtractor<org.apache.tika.metadata.Metadata>> metadataExtractors
List of metadata extractors to apply after parsing documents

tika
protected org.apache.tika.Tika tika
Apache Tika instance

fileTypeMap
protected final jakarta.activation.FileTypeMap fileTypeMap

Constructor Details

TikaDocumentParser
public TikaDocumentParser(List<MetadataExtractor<org.apache.tika.metadata.Metadata>> metadataExtractors)

Method Details

setCharLimit
public void setCharLimit(int charLimit)

setObjectMapper
public void setObjectMapper(com.fasterxml.jackson.databind.ObjectMapper objectMapper)

setTika
public void setTika(org.apache.tika.Tika tika)

parseToXml
public String parseToXml(String filename, org.springframework.core.io.Resource resource, Map<String, Object> additionalFields)
Parses the given document and generates an XML file
Parameters:
filename - the name of the file
resource - the document to parse
additionalFields - additional fields to add
Returns:
an XML ready to be indexed
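
As a rough picture of what "an XML ready to be indexed" with additional fields can look like, here is a minimal sketch that merges extracted fields with caller-supplied ones and serializes them as flat XML. It is an illustration under assumptions, not this class's code: the class uses a Jackson `ObjectMapper`, while the sketch swaps in the JDK's StAX writer to stay dependency-free, and the names `IndexXmlSketch`, `toXml`, the `doc` root element, and the sample field names are all hypothetical.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class IndexXmlSketch {

    // Hypothetical helper: writes one child element per field under a <doc> root.
    // Keys must be valid XML element names. Assumption: on a key collision the
    // additional field wins; the real class may behave differently.
    static String toXml(Map<String, Object> extracted, Map<String, Object> additionalFields) {
        Map<String, Object> fields = new LinkedHashMap<>(extracted);
        if (additionalFields != null) {
            fields.putAll(additionalFields);
        }
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("doc");
            for (Map.Entry<String, Object> field : fields.entrySet()) {
                if (field.getValue() == null) {
                    continue; // skip fields without a value
                }
                xml.writeStartElement(field.getKey());
                // writeCharacters escapes &, < and > in the value
                xml.writeCharacters(String.valueOf(field.getValue()));
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize fields to XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> extracted = new LinkedHashMap<>();
        extracted.put("title", "Annual Report");
        extracted.put("content", "Revenue grew");
        Map<String, Object> additional = new LinkedHashMap<>();
        additional.put("site", "demo");
        System.out.println(toXml(extracted, additional));
    }
}
```

In the real class the caller only supplies `filename`, `resource`, and `additionalFields`; the extracted fields in this sketch stand in for whatever Tika and the configured metadata extractors produce.
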

extractMetadata
protected String extractMetadata(String filename, org.springframework.core.io.Resource resource, String parsedContent, org.apache.tika.metadata.Metadata metadata, Map<String, Object> additionalFields)
Prepares the document to be indexed
Parameters:
resource - the content of the parsed file
metadata - the metadata of the parsed file
additionalFields - additional fields to be added
Returns:
the XML ready to be indexed