GNU Libextractor
GNU Libextractor is a library used to extract meta data from files.
The goal is to provide developers of file-sharing networks, browsers
or WWW-indexing bots with a universal library to obtain simple
keywords and meta data to match against queries and to show to users
instead of only relying on filenames. libextractor contains the shell
command extract that, similar to the well-known file command, can
extract meta data from a file and print the results to stdout.
Currently, libextractor supports the following formats: HTML, MAN, PS, DVI, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), FLAC, MP3 (ID3v1 and ID3v2), OGG, WAV, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), NSF(E) (NES music), SID (C64 music), EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), LZH, LHA, RAR, ZIP, CAB, 7-ZIP, AR, MTREE, PAX, CPIO, ISO9660, SHAR, RAW, XAR, FLV, REAL, RIFF (AVI), MPEG, QT and ASF. Also, various additional MIME types are detected.
GNU libextractor uses helper-libraries (plugins) to perform the actual extraction. As a result, GNU libextractor can be extended simply by installing additional plugins. Writing robust parsers can be difficult. GNU libextractor protects the main application from hanging or crashing plugins by executing all plugins out-of-process.
Table of Contents
- Downloading
- Documentation
- Mailing lists
- Getting involved
- Quick Introduction
- Licensing
Downloading Libextractor
- Source Code
- Libextractor is available from the main GNU FTP server via HTTP(S) and FTP. It can also be found on the GNU mirrors; please use a mirror if possible.
- Debian .deb package
- The Debian package can be downloaded from the official Debian archive. The extract package can be found under Utilities and the library under Libraries. The respective packages for libextractor are extract, libextractor, and for development, libextractor-dev.
- Tar Package
- The latest version can be found on GNU mirrors. If the mirror does not work, you should be able to find it on the main FTP server.
Latest release is libextractor-latest.tar.gz.
Latest Java-binding is libextractor-java-1.0.0.tar.gz.
Latest Mono-binding is libextractor-mono-0.5.23.tar.gz.
Latest Python-binding is libextractor-python-0.5.tar.gz.
- RPM Packages
- RPMs are available for Fedora, Mageia and several other distributions.
- Windows
- The latest Windows binary is libextractor-w32-1.0.0.zip.
Documentation
Documentation for Libextractor is available online, as is documentation for most GNU software. You may also find more information about Libextractor by running info libextractor, man libextractor, or man extract, or by looking at /usr/share/doc/libextractor/, /usr/local/doc/libextractor/, or similar directories on your system. A brief summary is available by running extract --help.
You might also be interested in an API compatibility report
comparing the various Libextractor versions.
Articles related to libextractor:
- Reading File Metadata with extract and libextractor
- How to recover lost files after you accidentally wipe your hard drive
- All your Metadata are belong to Us
Mailing lists
Libextractor has the following mailing lists:
- bug-libextractor is used to discuss most aspects of Libextractor, including development and enhancement requests, as well as bug reports.
- help-libextractor is for general user help and discussion.
Announcements about Libextractor and most other GNU software are made on info-gnu (archive).
Security reports that should not be made immediately public can be added as private reports on the bugtracker. If there is no response to an urgent issue, you can escalate to the general security mailing list for advice.
Getting involved
Development of Libextractor, and GNU in general, is a volunteer effort, and you can contribute. For information, please read How to help GNU. If you'd like to get involved, it's a good idea to join the discussion mailing list (see above).
- Development
- Known bugs and open feature requests are tracked in our bugtracker.
- Git access
-
You can access the current development version of libextractor using
$ git clone https://git.gnunet.org/libextractor.git
A Java binding for libextractor is in
$ git clone https://git.gnunet.org/libextractor-java
A Mono binding for libextractor is in
$ git clone https://git.gnunet.org/libextractor-mono
A Python binding can be found under
$ git clone https://git.gnunet.org/libextractor-python
A source package is available on the GNU FTP server. This binding has also been packaged as a Python egg.
A second Python binding includes a binding for doodle.
A Perl binding is in CPAN. The latest version of the Perl binding is available using
$ git clone git://git.perldition.org/File-Extractor.git/
Ruby bindings have been published on raa.ruby-lang.org (mirror) and rubyforge.org (mirror).
An initial draft of a PHP binding can be found under
$ git clone https://git.gnunet.org/libextractor-php
- Translating Libextractor
- To translate Libextractor's messages into other languages, please see the Translation Project page for Libextractor. If you have a new translation of the message strings, or updates to the existing strings, please have the changes made in this repository. Only translations from this site will be incorporated into Libextractor. For more information, see the Translation Project.
Quick Introduction
- Installation
- The simplest way to install GNU libextractor is to use one of the binary packages which are available online for many distributions. Note that under Debian, the extract tool is in a separate package extract and headers required to compile other applications against libextractor are in libextractor-dev. Thus, under Debian, you should use:
# apt-get install libextractor-dev extract
Compiling by hand follows the usual sequence:
$ tar xzvf libextractor.x.y.z.tar.gz
$ cd libextractor.x.y.z
$ ./configure
$ make
# make install
Note that you need various dependencies (read README for an up-to-date list) in order to compile all of the plugins.
- Using the extract tool
- After installing GNU libextractor, the extract tool can be used to obtain meta data from documents. By default, the extract tool uses the canonical set of plugins, which consists of all format-specific plugins supported by the current version of libextractor together with the mime-type detection plugin. If you are a user of BibTeX, the option -b is likely to come in handy to automatically create BibTeX entries from documents that have been properly equipped with meta data (if available).
Further options are described in the extract manpage (man 1 extract).
- Example Output
-
$ extract libextractor-0.1.3-1.src.rpm
Keywords for file libextractor-0.1.3-1.src.rpm:
os - linux
resource-identifier - http://ovmj.org/libextractor/
group - System Environment/Libraries
license - LGPL
copyright - LGPL
size - 251545
build-host - wedge.cs.purdue.edu
creation date - Wed Dec 25 07:50:07 2002
description - libextractor is a simple library...
summary - keyword extraction library
release - 1
version - 0.1.3
title - libextractor
unknown - SOURCE RPM 3.0
mimetype - application/x-rpm
$ extract extractor_logo.png
Keywords for file extractor_logo.png:
image dimensions - 272x188
thumbnail - (binary, 5932 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
image dimensions - 272x188
thumbnail - (binary, 6427 bytes)
mimetype - image/png
mimetype - image/png
image dimensions - 272x188
keywords - The libextractor logo
- Using the GNU libextractor library in your programs
-
The following listing shows the code of a minimalistic program that uses GNU libextractor. Compiling the fragment requires passing the option -lextractor to gcc. For details and additional functions for loading plugins and manipulating the keyword list, see the libextractor manpage (man 3 libextractor). Java programmers should note that a Java class that uses JNI to communicate with libextractor is also available. Python programmers will find that libextractor (since 0.5.0) can also be used from Python; just import Extractor.
#include <stdio.h>
#include <extractor.h>

int main (int argc, char * argv[])
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    return 1; /* no file name given */
  /* load the default plugins, run them on the file, print every item */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  EXTRACTOR_extract (plugins, argv[1], NULL, 0, &EXTRACTOR_meta_data_print, stdout);
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}
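EXTRACTOR_meta_data_print is only one possible callback. As a rough sketch of what an application-defined callback might look like, the fragment below prints string items, skips binary ones, and asks for extraction to stop once a limit is reached. It follows the parameter order of the EXTRACTOR_MetaDataProcessor callback type, but so that it compiles without libextractor installed it takes the meta data type as a plain int and declares a stand-in enum DEMO_MetaFormat. Those stand-ins, struct Collector and collect_strings are names invented for this illustration and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the library's format enum (assumption, illustration only). */
enum DEMO_MetaFormat
{
  DEMO_FORMAT_UTF8,      /* string already converted to UTF-8 */
  DEMO_FORMAT_C_STRING,  /* string of unknown encoding */
  DEMO_FORMAT_BINARY     /* raw bytes, for example a thumbnail */
};

/* Hypothetical application state, handed to the callback through cls. */
struct Collector
{
  unsigned int seen;   /* string items accepted so far */
  unsigned int limit;  /* ask to stop once this many were accepted */
};

/* Same parameter order as EXTRACTOR_MetaDataProcessor; the two enum
   parameters are replaced by int and by the stand-in enum above. */
int
collect_strings (void *cls, const char *plugin_name, int type,
                 enum DEMO_MetaFormat format, const char *data_mime_type,
                 const char *data, size_t data_len)
{
  struct Collector *c = cls;

  assert (NULL != c);
  if (DEMO_FORMAT_BINARY == format)
    return 0;  /* skip binary items and keep going */
  printf ("%s: type %d (%s) - %.*s\n",
          plugin_name, type, data_mime_type, (int) data_len, data);
  c->seen++;
  return (c->seen >= c->limit) ? 1 : 0;  /* non-zero asks to stop */
}
```

With the real header, a function of this shape would be passed to EXTRACTOR_extract in place of &EXTRACTOR_meta_data_print, with a pointer to the application state in place of stdout.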
- Writing new Plugins for GNU libextractor
-
The most complicated thing when writing a new plugin for GNU libextractor is the writing of the actual parser for a specific format. Nevertheless, the basic pattern is always the same. The plugin library must be called libextractor_XXX.so, where XXX denotes the file format supported by the plugin, and must be placed in the plugin directory (typically $PREFIX/lib/libextractor/). The library must export a method EXTRACTOR_XXX_extract_method with the following signature:
void EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
ec provides a callback to invoke with meta data as well as functions for reading data from the file that is being processed. Most plugins start by reading the first bytes of the file and checking that the header of the data matches the specific format. The extract function is expected to call ec->proc with each meta data item found. ec->cls must be passed as the first argument to proc and to the other functions invoked from within ec. Finally, ec->config is an arbitrary string of options that the plugin is free to interpret. Most plugins ignore config.
If the meta data extracted is a string, it is supposed to be converted into the UTF-8 character set by the plugin. However, in cases where the character encoding used in the document is unknown, no conversion should be done. Binary meta data can also be extracted. Plugins indicate the format of the meta data using the format argument to proc. Supported formats are UTF-8 strings, C strings (for strings of unknown encoding) and binary data. In addition to this rough categorization, the plugin is also supposed to indicate the mime type of the meta data. For strings, that mime type is most often "text/plain". Finally, the plugin must specify the meta data type. Common meta data types are "author", "title" and "mime-type". The full signature of the proc callback is:
typedef int (*EXTRACTOR_MetaDataProcessor)(void *cls,
                                           const char *plugin_name,
                                           enum EXTRACTOR_MetaType type,
                                           enum EXTRACTOR_MetaFormat format,
                                           const char *data_mime_type,
                                           const char *data,
                                           size_t data_len);
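The pattern described above (check the header, then hand each item to proc) can be sketched for an invented format. To stay self-contained, the sketch does not include extractor.h: the DEMO_ enums, DEMO_Processor and struct DEMO_ExtractContext are simplified stand-ins made up for this illustration, and the context holds the file bytes in memory instead of offering the read functions of the real struct EXTRACTOR_ExtractContext. The "DEMO" magic bytes and the count_items helper are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the library's enums (assumptions, illustration only). */
enum DEMO_MetaType { DEMO_TYPE_MIMETYPE, DEMO_TYPE_TITLE };
enum DEMO_MetaFormat { DEMO_FORMAT_UTF8, DEMO_FORMAT_C_STRING, DEMO_FORMAT_BINARY };

/* Callback type with the parameter order given in the text. */
typedef int (*DEMO_Processor) (void *cls, const char *plugin_name,
                               enum DEMO_MetaType type,
                               enum DEMO_MetaFormat format,
                               const char *data_mime_type,
                               const char *data, size_t data_len);

/* Simplified stand-in for struct EXTRACTOR_ExtractContext. */
struct DEMO_ExtractContext
{
  void *cls;            /* closure, first argument of proc */
  const char *config;   /* plugin options, ignored here */
  const char *data;     /* file contents (assumption: already in memory) */
  size_t size;          /* number of bytes in data */
  DEMO_Processor proc;  /* called once per meta data item */
};

/* Entry point for an invented format: the magic bytes "DEMO" followed
   by a 0-terminated title of unknown encoding. */
void
DEMO_demo_extract_method (struct DEMO_ExtractContext *ec)
{
  static const char magic[] = { 'D', 'E', 'M', 'O' };
  static const char mime[] = "application/x-demo";
  const char *title;
  const char *end;

  assert (NULL != ec);
  /* 1. Check the header; return silently if this is not our format. */
  if ( (ec->size < sizeof (magic)) ||
       (0 != memcmp (ec->data, magic, sizeof (magic))) )
    return;
  /* 2. Report the mime type; abort if the callback returns non-zero. */
  if (0 != ec->proc (ec->cls, "demo", DEMO_TYPE_MIMETYPE, DEMO_FORMAT_UTF8,
                     "text/plain", mime, sizeof (mime)))
    return;
  /* 3. Report the title only if it is terminated inside the buffer.
     Its encoding is unknown, so it is passed on unconverted. */
  title = ec->data + sizeof (magic);
  end = memchr (title, '\0', ec->size - sizeof (magic));
  if ( (NULL == end) || (end == title) )
    return;
  (void) ec->proc (ec->cls, "demo", DEMO_TYPE_TITLE, DEMO_FORMAT_C_STRING,
                   "text/plain", title, (size_t) (end - title) + 1);
}

/* Minimal callback for trying the sketch: counts items, never aborts. */
int
count_items (void *cls, const char *plugin_name, enum DEMO_MetaType type,
             enum DEMO_MetaFormat format, const char *data_mime_type,
             const char *data, size_t data_len)
{
  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  ++*(unsigned int *) cls;
  return 0;
}
```

Calling DEMO_demo_extract_method on a buffer that starts with "DEMO" and continues with a terminated title invokes the callback twice; a buffer with any other header produces no items. A real plugin would obtain the bytes through the functions provided by ec rather than from a ready-made buffer.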
If proc returns non-zero, the plugin should abort processing the current file and return.
- Related projects and useful resources
- Projects that use libextractor
Licensing
Libextractor is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.