The GNU libextractor Reference Manual

The GNU libextractor Reference Manual
1 Introduction
2 Preparation
3 Generalities
4 Extracting meta data
5 Language bindings
6 Utility functions
- 6.1 Utility Constants
- 6.2 Meta data printing
7 Existing Plugins
8 Writing new Plugins
- 8.1 Example for a minimal extract method
9 Internal utility functions
10 Reporting bugs
Appendix A GNU GENERAL PUBLIC LICENSE
- How to Apply These Terms to Your New Programs
Concept Index
Function and Data Index
Type Index

This manual is for GNU libextractor (version 1.0.0, 5 September 2012).

GNU libextractor is a GNU package.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

Introduction: What is GNU libextractor.
Preparation: What you should do before using the library.
Generalities: General library functions and data types.
Extracting meta data: How to use GNU libextractor to obtain meta data.
Language bindings: How to use GNU libextractor from languages other than C.
Utility functions: Utility functions of GNU libextractor.
Existing Plugins: What plugins are available.
Writing new Plugins: How to write new plugins for GNU libextractor.
Internal utility functions: Utility functions of GNU libextractor for writing plugins.
Reporting bugs: How to report bugs or request new features.

1 Introduction

GNU libextractor is GNU's library for extracting meta data from files. Meta data includes format information (such as mime type, image dimensions, color depth, recording frequency), content descriptions (such as document title or document description) and copyright information (such as license, author and contributors). Meta data extraction is an inherently uncertain business: a parse error can indicate a corrupt file, an incompatibility in the file format version, an entirely different file format or a bug in the parser. Because of this uncertainty, GNU libextractor deliberately never reports errors. Unexpected file contents simply result in less or possibly no meta data being extracted.

GNU libextractor uses plugins to handle various file formats. Technically a plugin can support multiple file formats; however, most plugins only support one particular format. By default, GNU libextractor will use all plugins that are available and found in the plugin installation directory. Applications can request the use of only specific plugins or the exclusion of certain plugins.

GNU libextractor is distributed with the extract command¹ which is a command-line tool for extracting meta data. extract is given a list of filenames and prints the resulting meta data to the console. The extract source code also serves as an advanced example for how to use GNU libextractor.

This manual focuses on providing documentation for writing software with GNU libextractor. The only relevant parts for end-users are the chapter on compiling and installing GNU libextractor (see Preparation) and possibly the chapter on existing plugins (see Existing Plugins). Additional documentation for end-users can be found in the man page for extract (using man extract).

GNU libextractor is licensed under the GNU General Public License. Specifically, since version 0.7, GNU libextractor is licensed under GPLv3 or any later version.

2 Preparation

This chapter first describes the general build instructions that should apply to all systems. Specific instructions for known problems for particular platforms are then described in individual sections afterwards.

Compiling GNU libextractor follows the standard GNU autotools build process using configure and make. For details on the GNU autotools build process, read the INSTALL file and query ./configure --help for additional options.

GNU libextractor has various dependencies, most of which are optional. Instead of specifying the names of the software packages, we will give the list in terms of the names of the respective Debian (unstable) packages that should be installed.

You absolutely need:

libtool
gcc
make
g++
libltdl7-dev

Recommended dependencies are:

zlib1g-dev
libbz2-dev
libgif-dev
libvorbis-dev
libflac-dev
libmpeg2-4-dev
librpm-dev
libgtk2.0-dev
libgsf-1-dev
libqt4-dev
libpoppler-dev
libexiv2-dev
libavformat-dev
libswscale-dev

For Subversion access and compilation one also needs:

subversion
autoconf
automake

Please notify us if we missed some dependencies (note that the list is supposed to only list direct dependencies, not transitive dependencies).

Once you have compiled and installed GNU libextractor, you should have a file extractor.h installed in your include/ directory. This file should be the starting point for your C and C++ development with GNU libextractor. The build process also installs the extract binary and man pages for extract and GNU libextractor. The extract man page documents the extract tool. The GNU libextractor man page gives a brief summary of the C API for GNU libextractor.

When you install GNU libextractor, various plugins will be installed in the lib/libextractor/ directory. The main library will be installed as lib/libextractor.so. Note that GNU libextractor will attempt to find the plugins relative to the path of the main library. Consequently, a package manager can move the library and its plugins to a different location later — as long as the relative path between the main library and the plugins is preserved. As a method of last resort, the user can specify an environment variable LIBEXTRACTOR_PREFIX. If GNU libextractor cannot locate a plugin, it will look in LIBEXTRACTOR_PREFIX/lib/libextractor/.
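
As a rough illustration of the last-resort rule above, the following C sketch builds the directory that would be searched for a given prefix. It only mirrors the path layout described in this chapter and is not the library's own lookup code; the helper name plugin_dir_from_prefix is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of GNU libextractor): writes
   "<prefix>/lib/libextractor/" into buf.
   Returns 0 on success, -1 if prefix is NULL or buf is too small. */
int
plugin_dir_from_prefix (const char *prefix, char *buf, size_t buf_size)
{
  int n;

  if (NULL == prefix)
    return -1;
  n = snprintf (buf, buf_size, "%s/lib/libextractor/", prefix);
  if ( (n < 0) || ((size_t) n >= buf_size) )
    return -1; /* output error or truncated path */
  return 0;
}
```

A caller would typically pass getenv ("LIBEXTRACTOR_PREFIX") as the prefix; the NULL check covers the case where the variable is not set.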

2.1 Installation on GNU/Linux

Should work using the standard instructions without problems.

2.2 Installation on FreeBSD

Should work using the standard instructions without problems.

2.3 Installation on OpenBSD

OpenBSD 3.8 does not have CODESET in langinfo.h. GNU libextractor uses CODESET in about three places, which causes problems during compilation.

2.4 Installation on NetBSD

No reports so far.

2.5 Installation using MinGW

Linking -lstdc++ with the provided libtool fails on Cygwin. This is a problem with libtool: there is unfortunately no flag to tell libtool how to do its job on Cygwin, and it seems that setting the library check to 'pass_all' cannot be made the default. Patching libtool may help.

Note: this is a rather dated report and may no longer apply.

2.6 Installation on OS X

libextractor has two installation methods on Mac OS X: it can be installed as a Mac OS X framework or with the standard ./configure; make; make install shell commands. The framework package is self-contained, but currently omits some of the extractor plugins that can be compiled in if libextractor is installed with ./configure; make; make install (provided that the required dependencies exist.)

2.6.1 Installing and uninstalling the framework

The binary framework is distributed as a disk image (Extractor-x.x.xx.dmg). Installation is done by opening the disk image and clicking Extractor.pkg inside it. The Mac OS X installer application will then run. The framework is installed to the root volume's /Library/Frameworks folder and installing will require admin privileges.

The framework can be uninstalled by dragging /Library/Frameworks/Extractor.framework to the Trash.

2.6.2 Using the framework

In the framework, the extract command line tool can be found at /Library/Frameworks/Extractor.framework/Versions/Current/bin/extract

The framework can be used in software projects as a framework or as a dynamic library.

When using the framework as a dynamic library in projects using autotools, one would most likely want to add "-I/Library/Frameworks/Extractor.framework/Versions/Current/include" to CPPFLAGS and "-L/Library/Frameworks/Extractor.framework/Versions/Current/lib" to LDFLAGS.

2.6.3 Example for using the framework

     // hello.c
     #include <Extractor/extractor.h>
     
     int
     main (int argc, char **argv)
     {
       struct EXTRACTOR_PluginList *el;
       el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
       // ...
       EXTRACTOR_plugin_remove_all (el);
       return 0;
     }

You can then compile the example using

$ gcc -o hello hello.c -framework Extractor

2.6.4 Example for using the dynamic library

     // hello.c
     #include <extractor.h>
     int main()
     {
       struct EXTRACTOR_PluginList *el;
       el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
       // ...
       EXTRACTOR_plugin_remove_all (el);
       return 0;
     }

You can then compile the example using

$ gcc -I/Library/Frameworks/Extractor.framework/Versions/Current/include \
  -o hello hello.c \
  -L/Library/Frameworks/Extractor.framework/Versions/Current/lib \
  -lextractor

Notice the difference in the #include line.

2.7 Note to package maintainers

The suggested way to package GNU libextractor is to split it into roughly the following binary packages:

libextractor (main library only, only hard dependency for other packages depending on GNU libextractor)
extract (command-line tool and man page extract.1)
libextractor-dev (extractor.h header and man page libextractor.3)
libextractor-doc (this manual)
libextractor-plugins (plugins without external dependencies; recommended but not required by extract and libextractor package)
libextractor-plugin-XXX (plugin with dependency on libXXX, for example for XXX=mpeg this would be libextractor_mpeg.so)
libextractor-plugins-all (meta package that requires all plugins except experimental plugins)

This would enable minimal installations (e.g. for embedded systems) to not include any plugins, as well as moderate-size installations (that do not trigger GTK and X11) for systems that have limited resources. Right now, the MP4 plugin is experimental and does nothing, so it should not be packaged at all. The gstreamer plugin is experimental but largely works with the correct version of gstreamer; it can thus be packaged (especially if the dependency is available on the target system) but should probably not be part of libextractor-plugins-all.

3 Generalities

3.1 Introduction to the “extract” command

The extract command takes a list of file names as arguments, extracts meta data from each of those files and prints the result to the console. By default, extract will use all available plugins and print all (non-binary) meta data that is found.

The set of plugins used by extract can be controlled using the “-l” and “-n” options. Use “-n” to not load all of the default plugins. Use “-l NAME” to specifically load a certain plugin. For example, specify “-n -l mime” to only use the MIME plugin.

Using the “-p” option the output of extract can be limited to only certain keyword types. Similarly, using the “-x” option, certain keyword types can be excluded. A list of all known keyword types can be obtained using the “-L” option.

The output format of extract can be influenced with the “-V” (more verbose, lists filenames), “-g” (grep-friendly, all meta data on a single line per file) and “-b” (bibTeX style) options.

3.2 Common usage examples for “extract”

     $ extract test/test.jpg
     comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
     mimetype - image/jpeg
     
     $ extract -V -x comment test/test.jpg
     Keywords for file test/test.jpg:
     mimetype - image/jpeg
     
     $ extract -p comment test/test.jpg
     comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
     
     $ extract -nV -l png.so -p comment test/test.jpg test/test.png
     Keywords for file test/test.jpg:
     Keywords for file test/test.png:
     comment - Testing keyword extraction

3.3 Introduction to the libextractor library

Each public symbol exported by GNU libextractor has the prefix EXTRACTOR_. All-caps names are used for constants. For the impatient, the minimal C code for using GNU libextractor (on the file named by the first command-line argument) looks like this:

#include <stdio.h>
#include <extractor.h>

int
main (int argc, char **argv)
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    return 1; /* no file name given */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* extract from the named file; no in-memory data is passed */
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}

The minimal API illustrated by this example is actually sufficient for many applications. The full external C API of GNU libextractor is described in the chapter on extracting meta data (see Extracting meta data). Bindings for other languages are described in the chapter on language bindings (see Language bindings). The API for writing new plugins is described in the chapter on writing new plugins (see Writing new Plugins).

4 Extracting meta data

In order to extract meta data with GNU libextractor you first need to load the respective plugins and then call the extraction API with the plugins and the data to process. This section documents how to load and unload plugins, the various types and formats in which meta data is returned to the application and finally the extraction API itself.

4.1 Plugin management

Using GNU libextractor from a multi-threaded parent process requires some care. The problem is that on most platforms GNU libextractor starts sub-processes for the actual extraction work. This is useful to isolate the parent process from potential bugs; however, it can cause problems if the parent process is multi-threaded. The issue is that at the time of the fork, another thread of the application may hold a lock (e.g. in gettext or libc). That lock would then never be released in the child process (as the other thread is not present in the child process). As a result, the child process would deadlock trying to acquire the lock and never terminate. This has actually been observed with a lock in GNU gettext that is triggered by the plugin startup code when it interacts with libltdl.

The problem can be solved by loading the plugins using the EXTRACTOR_OPTION_IN_PROCESS option, which will run GNU libextractor in-process and thus avoid the locking issue. In this case, all of the functions for loading and unloading plugins, including EXTRACTOR_plugin_add_defaults and EXTRACTOR_plugin_remove_all, are thread-safe and reentrant. However, using the same plugin list from multiple threads at the same time is not safe.

All plugin code is expected to be reentrant and state-less, but due to the extensive use of 3rd party libraries this cannot be guaranteed.

— C Struct: EXTRACTOR_PluginList

A plugin list represents a set of GNU libextractor plugins. Most of the GNU libextractor API is concerned with either constructing a plugin list or using it to extract meta data. The internal representation of the plugin list is of no concern to users or plugin developers.

— Function: void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)

Unload all of the plugins in the given list.

— Function: struct EXTRACTOR_PluginList * EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char*name)

Unloads a particular plugin. The given name should be the short name of the plugin, for example “mime” for the mime-type extractor or “mpeg” for the MPEG extractor.

— Function: struct EXTRACTOR_PluginList * EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char* name,const char* options, enum EXTRACTOR_Options flags)

Loads a particular plugin. The plugin is added to the existing list, which can be NULL. The second argument specifies the name of the plugin (e.g. “ogg”). The third argument can be NULL and specifies plugin-specific options. Finally, the last argument specifies whether the plugin should be executed out-of-process (EXTRACTOR_OPTION_DEFAULT_POLICY) or in-process (EXTRACTOR_OPTION_IN_PROCESS).

— Function: struct EXTRACTOR_PluginList * EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char* config, enum EXTRACTOR_Options flags)

Loads and unloads plugins based on a configuration string, modifying the existing list, which can be NULL. The string has the format “[-]NAME(OPTIONS){:[-]NAME(OPTIONS)}*”. Prefixing the plugin name with a “-” means that the plugin should be unloaded.
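
To make the configuration string format more concrete, here is a minimal sketch of how such a string could be split into its entries. It is an illustration of the grammar only, not the parser used inside GNU libextractor. The type ConfigEntry and the function next_config_entry are hypothetical, and the sketch assumes that the “(OPTIONS)” part may be omitted and that option text contains no “)”.

```c
#include <stddef.h>
#include <string.h>

/* One entry of a plugin configuration string (illustrative type). */
struct ConfigEntry
{
  int remove;         /* 1 if the entry was prefixed with '-' */
  char name[64];      /* short plugin name, for example "mime" */
  char options[128];  /* text between '(' and ')', or "" if absent */
};

/* Parses the entry at *pos and advances *pos past it and past a
   following ':'.  Returns 1 if an entry was parsed, 0 at the end of
   the string or if the entry is malformed or too long. */
int
next_config_entry (const char **pos, struct ConfigEntry *e)
{
  const char *p = *pos;
  size_t n;

  if ('\0' == *p)
    return 0;
  e->remove = ('-' == *p);
  if (e->remove)
    p++;
  n = strcspn (p, "(:");          /* name ends at '(', ':' or the end */
  if ( (0 == n) || (n >= sizeof (e->name)) )
    return 0;
  memcpy (e->name, p, n);
  e->name[n] = '\0';
  p += n;
  e->options[0] = '\0';
  if ('(' == *p)
    {
      p++;
      n = strcspn (p, ")");
      if ( (')' != p[n]) || (n >= sizeof (e->options)) )
        return 0;                 /* missing ')' or options too long */
      memcpy (e->options, p, n);
      e->options[n] = '\0';
      p += n + 1;
    }
  if (':' == *p)
    p++;                          /* separator before the next entry */
  else if ('\0' != *p)
    return 0;                     /* unexpected text after an entry */
  *pos = p;
  return 1;
}
```

Calling next_config_entry repeatedly on “mime:-ogg” would yield a load request for “mime” followed by an unload request for “ogg”.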

— Function: struct EXTRACTOR_PluginList * EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)

Loads all of the plugins in the plugin directory. This function is what most GNU libextractor applications should use to set up the plugins.

4.2 Meta types

enum EXTRACTOR_MetaType is a C enum which defines a list of over 100 different types of meta data. The total number can differ between different GNU libextractor releases; the maximum value for the current release can be obtained using the EXTRACTOR_metatype_get_max function. All values in this enumeration are of the form EXTRACTOR_METATYPE_XXX.

— Function: const char * EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type)

The function EXTRACTOR_metatype_to_string can be used to obtain a short English string ‘s’ describing the meta data type. The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (dgettext("libextractor", s)).

— Function: const char * EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type)

The function EXTRACTOR_metatype_to_description can be used to obtain a longer English string ‘s’ describing the meta data type. The description may be empty if the short description returned by EXTRACTOR_metatype_to_string is already comprehensive. The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (dgettext("libextractor", s)).

4.3 Meta formats

enum EXTRACTOR_MetaFormat is a C enum which defines on a high level how the extracted meta data is represented. Currently, the library uses three formats: UTF-8 strings, C strings and binary data. A fourth value, EXTRACTOR_METAFORMAT_UNKNOWN is defined but not used. UTF-8 strings are 0-terminated strings that have been converted to UTF-8. The format code is EXTRACTOR_METAFORMAT_UTF8. Ideally, most text meta data will be of this format. Some file formats fail to specify the encoding used for the text. In this case, the text cannot be converted to UTF-8. However, the meta data is still known to be 0-terminated and presumably human-readable. In this case, the format code used is EXTRACTOR_METAFORMAT_C_STRING; however, this should not be understood to mean that the encoding is the same as that used by the C compiler. Finally, for binary data (mostly images), the format EXTRACTOR_METAFORMAT_BINARY is used.

Naturally this is not a precise description of the meta format. Plugins can provide a more precise description (if known) by providing the respective mime type of the meta data. For example, binary image meta data could be also tagged as “image/png” and normal text would typically be tagged as “text/plain”.

4.4 Extracting

— Function Pointer: int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)

Type of a function that libextractor calls for each meta data item found.

cls
closure (user-defined)
plugin_name
name of the plugin that produced this value; special values can be used (e.g. '<zlib>' for zlib being used in the main libextractor library and yielding meta data);
type
libextractor-type describing the meta data;
format
basic format information about data;
data_mime_type
mime-type of data (not of the original file); can be NULL (if mime-type is not known);
data
actual meta-data found
data_len
number of bytes in data

Return 0 to continue extracting, 1 to abort.
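
The following sketch shows what a processor function with this signature might look like: it remembers the first textual mime type it is given and then returns 1 to stop the extraction. So that it compiles on its own, the sketch declares stand-in enums; in a real program these declarations would be replaced by #include <extractor.h>, and the numeric values used here are placeholders. EXTRACTOR_METATYPE_COMMENT is assumed to exist for the sake of the example, and the names MimeResult and find_mime_type are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in declarations so that the sketch is self-contained.
   Real code should #include <extractor.h> instead; the values
   below are placeholders, not necessarily the library's. */
enum EXTRACTOR_MetaType
{
  EXTRACTOR_METATYPE_MIMETYPE = 1,
  EXTRACTOR_METATYPE_COMMENT = 2   /* assumed name, for illustration */
};
enum EXTRACTOR_MetaFormat
{
  EXTRACTOR_METAFORMAT_UNKNOWN = 0,
  EXTRACTOR_METAFORMAT_UTF8 = 1,
  EXTRACTOR_METAFORMAT_C_STRING = 2,
  EXTRACTOR_METAFORMAT_BINARY = 3
};

/* Closure: result buffer plus a counter of items seen so far. */
struct MimeResult
{
  char mime[64];
  int items_seen;
};

/* Stores the first textual mime type in the closure.
   Returns 1 (abort) once it is found, 0 (continue) otherwise. */
int
find_mime_type (void *cls, const char *plugin_name,
                enum EXTRACTOR_MetaType type,
                enum EXTRACTOR_MetaFormat format,
                const char *data_mime_type,
                const char *data, size_t data_len)
{
  struct MimeResult *res = cls;
  size_t n;

  (void) plugin_name;      /* unused in this sketch */
  (void) data_mime_type;   /* unused in this sketch */
  res->items_seen++;
  if (EXTRACTOR_METATYPE_MIMETYPE != type)
    return 0;              /* not what we are looking for */
  if ( (EXTRACTOR_METAFORMAT_UTF8 != format) &&
       (EXTRACTOR_METAFORMAT_C_STRING != format) )
    return 0;              /* binary or unknown data is not text */
  /* copy at most sizeof (mime) - 1 bytes and force termination */
  n = (data_len < sizeof (res->mime) - 1)
    ? data_len : sizeof (res->mime) - 1;
  memcpy (res->mime, data, n);
  res->mime[n] = '\0';
  return 1;                /* found it: abort extraction */
}
```

Such a function would be passed to EXTRACTOR_extract as the ‘proc’ argument, with a pointer to a zero-initialized struct MimeResult as ‘proc_cls’.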

— Function: void EXTRACTOR_extract (struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)

This is the main function for extracting keywords with GNU libextractor. The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data. The ‘filename’ argument is optional and can be used to specify the name of a file to process. If ‘filename’ is NULL, then the ‘data’ argument must point to the in-memory data to extract meta data from. If ‘filename’ is non-NULL, ‘data’ can be NULL. If ‘data’ is non-NULL, then ‘size’ is the size of ‘data’ in bytes. Otherwise ‘size’ should be zero. For each meta data item found, GNU libextractor will call the ‘proc’ function, passing ‘proc_cls’ as the first argument to ‘proc’. The other arguments to ‘proc’ depend on the specific meta data found.

Meta data extraction should never really fail; at worst, GNU libextractor will simply not call ‘proc’ with any meta data. By design, GNU libextractor should never crash or leak memory, even given corrupt files as input. Note, however, that running GNU libextractor on a corrupt file system (or incorrectly mmapped files) can result in the operating system sending a SIGBUS (bus error) to the process. While GNU libextractor runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it. During decompression it is possible to encounter a SIGBUS. GNU libextractor will not attempt to catch this signal and your application is likely to crash. Note again that this should only happen if the file system is corrupt (not if individual files are corrupt). If this is not acceptable, you might want to consider running GNU libextractor itself also out-of-process (as done, for example, by doodle).

5 Language bindings

GNU libextractor works immediately with C and C++ code. Bindings for Java, Mono, Ruby, Perl, PHP and Python are available for download from the main GNU libextractor website. Documentation for these bindings (if available) is part of the downloads for the respective binding. In all cases, a full installation of the C library is required before the binding can be installed.

5.1 Java

Compiling the GNU libextractor Java binding follows the usual process of running configure and make. The result will be a shared C library libextractor_java.so with the native code and a JAR file (installed to $PREFIX/share/java/libextractor.jar).

A minimal example for using GNU libextractor's Java binding would look like this:

import org.gnu.libextractor.*;
import java.util.ArrayList;

// the class name is arbitrary; main must live inside some class
public class ExtractExample {
  public static void main(String[] args) {
    Extractor ex = Extractor.getDefault();
    for (int i=0;i<args.length;i++) {
      // one list of keywords per file named on the command line
      ArrayList keywords = ex.extract(args[i]);
      System.out.println("Keywords for " + args[i] + ":");
      for (int j=0;j<keywords.size();j++)
        System.out.println(keywords.get(j));
    }
  }
}

The GNU libextractor library and the libextractor_java.so JNI binding have to be in the library search path for this to work. Furthermore, the libextractor.jar file should be on the classpath.

Note that the API does not use Java 5 style generics in order to work with older versions of Java.

5.2 Mono

This binding is undocumented at this point.

5.3 Perl

This binding is undocumented at this point.

5.4 Python

This binding is undocumented at this point.

5.5 PHP

This binding is undocumented at this point.

5.6 Ruby

This binding is undocumented at this point.

6 Utility functions

This chapter describes various utility functions for GNU libextractor usage. All of the functions are reentrant.

6.1 Utility Constants

The constant EXTRACTOR_VERSION is a hexadecimal representation of the version number of the installed libextractor header. The hexadecimal format is 0xAABBCCDD where AA is the major version (so far always 0), BB is the minor version, CC is the revision and DD the patch number. For example, for version 0.5.18, we would have AA=0, BB=5, CC=18 and DD=0. Minor releases such as 0.5.18a or significant changes in unreleased versions would be marked with DD=1 or higher.
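
The 0xAABBCCDD layout can be taken apart with shifts and masks. The sketch below is only an illustration of that layout; the type VersionParts and the function decode_version are hypothetical names that are not provided by the library. Following the example in the text, version 0.5.18 corresponds to the value 0x00051200, since 18 is 0x12 in hexadecimal.

```c
#include <stdint.h>

/* The four components of a version number in 0xAABBCCDD format. */
struct VersionParts
{
  unsigned int major_ver;  /* AA */
  unsigned int minor_ver;  /* BB */
  unsigned int revision;   /* CC */
  unsigned int patch;      /* DD */
};

/* Splits a 32-bit version value into its four 8-bit components. */
struct VersionParts
decode_version (uint32_t v)
{
  struct VersionParts p;

  p.major_ver = (v >> 24) & 0xFF;
  p.minor_ver = (v >> 16) & 0xFF;
  p.revision = (v >> 8) & 0xFF;
  p.patch = v & 0xFF;
  return p;
}
```

With the header installed, an application could call decode_version (EXTRACTOR_VERSION) to check the version it was compiled against.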

6.2 Meta data printing

EXTRACTOR_meta_data_print is a simple function that prints the meta data found with libextractor to a file. It is mostly useful for debugging and as an example of how to process the extracted meta data. It can be passed as the ‘proc’ argument to EXTRACTOR_extract; the file to print to should be passed as ‘proc_cls’ (which must be of type FILE *), for example stdout.

7 Existing Plugins

ARCHIVE (using libarchive)
DVI
EXIV2 (using libexiv2, 0.23 or later preferred)
FLAC (using libFLAC)
GIF (using libgif)
GSTREAMER (using libgstreamer v1.0 or later)
HTML (using libtidy)
IT
JPEG (using libjpeg v8 or later)
MAN
MIDI (using libsmf)
MIME (using libmagic)
MPEG (using libmpeg2)
NSF
NSFE
ODF
OLE2 (with libgsf)
OGG (with libogg)
PNG
PS
RIFF
RPM (using librpm)
S3M
SID
ThumbnailFFMPEG (using libavformat and related libav-libraries, including libswscale)
ThumbnailGtk (using libgtk)
TIFF (with libtiff, tested with v4)
WAV
XM
ZIP

gzip and bzip2 compressed versions of these formats are also supported (as well as meta data embedded by gzip itself) if zlib or libbz2 are available.

8 Writing new Plugins

Writing a new plugin for libextractor usually requires writing or interfacing with an actual parser for a specific format. How this can be accomplished depends on the format and cannot be specified in general. However, care should be taken for the code to be reentrant and highly fault-tolerant, especially with respect to malformed inputs.

Plugins should start by verifying that the header of the data matches the specific format and immediately return if that is not the case. Even if the header matches the expected file format, plugins must not assume that the remainder of the file is well formed.
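
The defensive style described above can be sketched as follows. The file format used here is made up for the example (a 4-byte magic “DEMO”, one length byte N, then N bytes of title text), and the function demo_get_title is hypothetical. The point is only that the header is checked first and that every length read from the file is validated against the number of bytes actually available.

```c
#include <stddef.h>
#include <string.h>

/* Made-up format for illustration: magic "DEMO", one length byte N,
   then N bytes of title text.
   Returns 1 and stores the 0-terminated title on success, 0 if the
   input is not in this format or is malformed. */
int
demo_get_title (const unsigned char *data, size_t size,
                char *title, size_t title_size)
{
  size_t n;

  if ( (size < 5) || (0 != memcmp (data, "DEMO", 4)) )
    return 0;               /* wrong header: give up immediately */
  n = data[4];              /* length claimed by the file */
  if ( (n > size - 5) ||    /* claims more bytes than are present */
       (n >= title_size) )  /* or more than the caller can hold */
    return 0;
  memcpy (title, &data[5], n);
  title[n] = '\0';
  return 1;
}
```

A file that starts with the right magic but claims a title longer than the remaining data is rejected instead of being read past its end.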

The plugin library must be called libextractor_XXX.so, where XXX denotes the file format of the plugin. The library must export a function EXTRACTOR_XXX_extract_method with the following signature:

void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);

‘ec’ contains various information the plugin may need for its execution. Most importantly, it contains functions for reading (“read”) and seeking (“seek”) the input data and for returning extracted data (“proc”). The “config” member can contain additional configuration options. “proc” should be called on each meta data item found. If “proc” returns non-zero, processing should be aborted (if possible).

In order to test new plugins, the extract command can be run with the options “-ni” and “-l XXX”. This will run the plugin in-process (making it easier to debug) and without any of the other plugins.

8.1 Example for a minimal extract method

The following example shows how a plugin can return the mime type of a file.

     #include <string.h>
     #include <sys/types.h>
     #include <extractor.h>

     void
     EXTRACTOR_mymime_extract_method (struct EXTRACTOR_ExtractContext *ec)
     {
       void *data;
       ssize_t data_size;
     
       /* ask for the first four bytes of the input */
       if (-1 == (data_size = ec->read (ec->cls, &data, 4)))
         return; /* read error */
       if (data_size < 4)
         return; /* file too small */
       if (0 != memcmp (data, "\177ELF", 4))
         return; /* not ELF */
       if (0 != ec->proc (ec->cls, 
                          "mymime",
                          EXTRACTOR_METATYPE_MIMETYPE,
                          EXTRACTOR_METAFORMAT_UTF8,
                          "text/plain",
                          "application/x-executable",
                          1 + strlen("application/x-executable")))
         return; /* the application asked us to stop */
       /* more calls to 'proc' here as needed */
     }

9 Internal utility functions

Some plugins link against the libextractor_common library which provides common abstractions needed by many plugins. This section documents this internal API for plugin developers. Note that the headers for this library are (intentionally) not installed: we do not consider this API stable and it should hence only be used by plugins that are built and shipped with GNU libextractor. Third-party plugins should not use it.

convert_numeric.h defines various conversion functions for numbers (in particular, byte-order conversion for floating point numbers).

unzip.h defines an API for accessing compressed files.

pack.h provides an interpreter for unpacking structs of integer numbers from streams and converting from big or little endian to host byte order at the same time.

convert.h provides a function for character set conversion described below.
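
As background for the byte-order handling mentioned for convert_numeric.h and pack.h, the sketch below shows the basic technique of reading a 32-bit integer from a byte stream independently of the host byte order. It does not reproduce the API of either header, which is not documented here; the function names read_be32 and read_le32 are hypothetical.

```c
#include <stdint.h>

/* Reads a 32-bit big-endian value from four bytes; the result is
   correct regardless of the byte order of the host. */
uint32_t
read_be32 (const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
    | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Reads a 32-bit little-endian value from four bytes. */
uint32_t
read_le32 (const unsigned char *p)
{
  return ((uint32_t) p[3] << 24) | ((uint32_t) p[2] << 16)
    | ((uint32_t) p[1] << 8) | (uint32_t) p[0];
}
```

Assembling the value byte by byte avoids both alignment problems and any dependence on the host's endianness.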

— Function: char * EXTRACTOR_common_convert_to_utf8 (const char *input, size_t len, const char *charset)

Various GNU libextractor plugins make use of the internal convert.h header, which defines the function EXTRACTOR_common_convert_to_utf8 for converting text from any character set to UTF-8. This conversion is important since the linked list of keywords that is returned by GNU libextractor is expected to contain only UTF-8 strings. Naturally, proper conversion may not always be possible since some file formats fail to specify the character set. In that case, it is often better to not convert at all.

The arguments to EXTRACTOR_common_convert_to_utf8 are the input string (which does not have to be zero-terminated), the length of the input string, and the name of the character set (which must be zero-terminated). Which character sets are supported depends on the platform; a list can generally be obtained using the iconv -l command. The return value from EXTRACTOR_common_convert_to_utf8 is a zero-terminated string in UTF-8 format. The responsibility to free the string is with the caller, so storing the string in the keyword list is acceptable.
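
For readers unfamiliar with this kind of conversion, the following sketch shows how a function with the same argument list could be written on top of the POSIX iconv interface. It is an assumption-laden illustration, not the implementation in libextractor_common: the name sketch_convert_to_utf8 is hypothetical, and returning NULL on failure is a choice made for this sketch only.

```c
#include <iconv.h>
#include <stdlib.h>

/* Converts 'len' bytes of 'input' from 'charset' to UTF-8.
   Returns a newly allocated 0-terminated string that the caller
   must free, or NULL on failure (a choice made for this sketch). */
char *
sketch_convert_to_utf8 (const char *input, size_t len, const char *charset)
{
  iconv_t cd;
  char *out;
  char *in_ptr = (char *) input;  /* iconv takes a non-const pointer */
  char *out_ptr;
  size_t in_left = len;
  size_t out_left = 4 * len;      /* generous bound on the output size */

  cd = iconv_open ("UTF-8", charset);
  if ((iconv_t) -1 == cd)
    return NULL;                  /* character set not supported */
  out = malloc (out_left + 1);
  if (NULL == out)
    {
      iconv_close (cd);
      return NULL;
    }
  out_ptr = out;
  if ( ((size_t) -1 == iconv (cd, &in_ptr, &in_left, &out_ptr, &out_left)) ||
       /* flush any pending shift state of stateful encodings */
       ((size_t) -1 == iconv (cd, NULL, NULL, &out_ptr, &out_left)) )
    {
      free (out);
      iconv_close (cd);
      return NULL;                /* invalid input or buffer too small */
    }
  *out_ptr = '\0';
  iconv_close (cd);
  return out;
}
```

As with the function documented above, the input does not need to be zero-terminated because its length is passed explicitly, and the caller owns the returned string.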

10 Reporting bugs

GNU libextractor uses the Mantis bug tracking system. If possible, please report bugs there. You can also e-mail the GNU libextractor mailing list at libextractor@gnu.org.

Appendix A GNU GENERAL PUBLIC LICENSE

Version 2, June 1991

     Copyright © 1989, 1991 Free Software Foundation, Inc.
     59 Temple Place – Suite 330, Boston, MA 02111-1307, USA
     
     Everyone is permitted to copy and distribute verbatim copies
     of this license document, but changing it is not allowed.

A.0.1 Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software—to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.

The precise terms and conditions for copying, distribution and modification follow.

0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The “Program”, below, refers to any such program or work, and a “work based on the Program” means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term “modification”.) Each licensee is addressed as “you”.
Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.
You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.
2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:
a. You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.
b. You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.
c. If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it.
Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program.
In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License.
3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following:
a. Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
b. Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or,
c. Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License.
7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances.
It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.
This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and “any later version”, you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.
10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the “copyright” line and a pointer to where the full notice is found.

     one line to give the program's name and an idea of what it does.
     Copyright (C) 19yy  name of author
     
     This program is free software; you can redistribute it and/or
     modify it under the terms of the GNU General Public License
     as published by the Free Software Foundation; either version 2
     of the License, or (at your option) any later version.
     
     This program is distributed in the hope that it will be useful,
     but WITHOUT ANY WARRANTY; without even the implied warranty of
     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     GNU General Public License for more details.
     
     You should have received a copy of the GNU General Public License along
     with this program; if not, write to the Free Software Foundation, Inc.,
     59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

     Gnomovision version 69, Copyright (C) 19yy name of author
     Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
     type `show w'.  This is free software, and you are welcome
     to redistribute it under certain conditions; type `show c'
     for details.

The hypothetical commands ‘show w’ and ‘show c’ should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than ‘show w’ and ‘show c’; they could even be mouse-clicks or menu items—whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a “copyright disclaimer” for the program, if necessary. Here is a sample; alter the names:

     Yoyodyne, Inc., hereby disclaims all copyright
     interest in the program `Gnomovision'
     (which makes passes at compilers) written
     by James Hacker.
     
     signature of Ty Coon, 1 April 1989
     Ty Coon, President of Vice

This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.

Concept Index

bug: Reporting bugs
bus error: Extracting
character set: Internal utility functions
concurrency: Utility functions
concurrency: Extracting
concurrency: Plugin management
directory structure: Preparation
environment variables: Preparation
error handling: Introduction
gettext: Meta types
GPL, GNU General Public License: Copying
internationalization: Meta types
Java: Language bindings
license: Introduction
Mono: Language bindings
packaging: Preparation
Perl: Language bindings
PHP: Language bindings
plugin: Preparation
plugin: Introduction
Python: Language bindings
reentrant: Utility functions
reentrant: Extracting
reentrant: Plugin management
Ruby: Language bindings
SIGBUS: Extracting
thread-safety: Utility functions
thread-safety: Extracting
thread-safety: Plugin management
threads: Utility functions
threads: Extracting
threads: Plugin management
UTF-8: Internal utility functions

Function and Data Index

Type Index

enum EXTRACTOR_MetaFormat: Meta formats
enum EXTRACTOR_MetaType: Meta types
enum EXTRACTOR_Options: Plugin management
EXTRACTOR_MetaDataProcessor: Extracting
EXTRACTOR_PluginList: Plugin management
struct EXTRACTOR_PluginList: Plugin management

Footnotes

[1] Some distributions ship extract in a separate package.
