Apache Tika lets search engines, content management systems, and other web applications that work with digital documents detect file types and extract metadata and text content from major file formats.
Here are some key features of "Apache Tika":
Supported formats:
· HyperText Markup Language
· XML and derived formats
· Microsoft Office document formats
· OpenDocument Format
· Portable Document Format
· Electronic Publication Format
· Rich Text Format
· Compression and packaging formats
· Text formats
· Audio formats
· Image formats
· Video formats
· Java class files and archives
· The mbox format
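To make the "detect" step concrete, here is a minimal sketch of signature-based type detection for a few of the formats listed above. It is not Tika's implementation or API: the class name MagicSniffer, the method sniff, and the small signature table are hypothetical and chosen only for illustration, and the returned media type strings are illustrative labels. A real detector would also consider file names and container contents.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: guesses a media type from the first bytes of a file.
public class MagicSniffer {
    // Leading byte signatures ("magic numbers") for a handful of formats.
    private static final byte[][] MAGIC = {
        "%PDF-".getBytes(StandardCharsets.US_ASCII),                // Portable Document Format
        "{\\rtf".getBytes(StandardCharsets.US_ASCII),               // Rich Text Format
        {'P', 'K', 3, 4},                                           // ZIP container (OpenDocument, EPUB, JAR)
        {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE}        // Java class file
    };

    // Media type label for the signature at the same index in MAGIC.
    private static final String[] TYPES = {
        "application/pdf",
        "application/rtf",
        "application/zip",
        "application/java-vm"
    };

    // Returns the label of the first matching signature, or a generic fallback.
    public static String sniff(byte[] head) {
        for (int i = 0; i < MAGIC.length; i++) {
            if (startsWith(head, MAGIC[i])) {
                return TYPES[i];
            }
        }
        return "application/octet-stream";
    }

    // True if data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

Because OpenDocument, EPUB, and JAR files are all ZIP containers, a signature check alone can only report the container type; telling them apart requires looking inside the archive.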
What's New in This Release:
· Apache Tika 1.2 contains a number of improvements and bug fixes.