Command-line interface

The crossmap software provides a single command-line interface that handles several distinct tasks, including building new instances and performing queries.

The interface is invoked with the following pattern:

python crossmap.py ACTION --config config.yaml \
                   --PARAM_1 VAL_1 --PARAM_2 VAL_2 ... \
                   --FLAG_1 ...

The first argument is always an action code and other settings are provided in parameter/value pairs or as flags.

One of the parameters is usually --config and its value is a path to a configuration file (above, the file is assumed to be config.yaml). The remaining arguments change depending on the desired action.
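
The invocation pattern can also be assembled programmatically. The sketch below is only an illustration: build_command is a hypothetical helper (not part of crossmap), and it assumes the script is started as python crossmap.py from the current directory.

```python
import shlex


def build_command(action, config="config.yaml", params=None, flags=()):
    """Assemble an argument list: action first, then --config,
    then parameter/value pairs, then flags.

    Hypothetical helper for illustration; not part of crossmap.
    """
    argv = ["python", "crossmap.py", action, "--config", config]
    for name, value in (params or {}).items():
        argv += ["--" + name, str(value)]
    for flag in flags:
        argv.append("--" + flag)
    return argv


argv = build_command("search",
                     params={"data": "data.yaml", "n": 5},
                     flags=["pretty"])
# Show the equivalent shell command
print(shlex.join(argv))
# python crossmap.py search --config config.yaml --data data.yaml --n 5 --pretty
```

Keeping the arguments in a list (rather than one long string) means the same object can be passed directly to subprocess.run without shell quoting.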

Building instances

Building a new instance requires a configuration file in a yaml format. Assuming this file is called config.yaml, the build command is

python crossmap.py build --config config.yaml

By default, this provides a moderate level of logging. It is possible to adjust the level of logging via the --logging argument.

Search & decomposition

Once an instance is created, all its files are stored in a directory adjacent to its configuration file. Search and decomposition read an external data file and compare objects therein to the contents of the instance.

The data file must be prepared in yaml format, for example using one of the data preparation/conversion utilities. Assuming the data file is called data.yaml, search and decomposition are performed as follows

python crossmap.py search --config config.yaml \
                   --data data.yaml
python crossmap.py decompose --config config.yaml \
                   --data data.yaml

The outputs are json-formatted arrays with results for all the documents in the data file.
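
When the search is driven from a script, the json array can be loaded directly. The sketch below assumes that the results are written to standard output, which is not stated above; parse_results and run_search are hypothetical helper names, and the structure of the individual array elements is not inspected.

```python
import json
import subprocess


def parse_results(text):
    """Parse search/decompose output and check that it is a json array."""
    results = json.loads(text)
    if not isinstance(results, list):
        raise ValueError("expected a json-formatted array")
    return results


def run_search(config="config.yaml", data="data.yaml"):
    """Run a minimal search and return the parsed results.

    Assumption: crossmap.py is in the current directory and prints
    its results to standard output.
    """
    argv = ["python", "crossmap.py", "search",
            "--config", config, "--data", data]
    completed = subprocess.run(argv, capture_output=True,
                               text=True, check=True)
    return parse_results(completed.stdout)
```

Each element of the returned list corresponds to one document in the data file.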

Because the above commands are minimalistic, the search and decomposition analyses are performed using a number of assumptions. In practice, several additional arguments help to adjust the analysis.

  • --dataset [dataset label] - specifies the data collection to search against. The provided value must match a collection name from the configuration file. The default behavior is to use the first data collection in the configuration file.
  • --n [integer] - specifies the number of hits to report for each object in the data file. The default is 1.
  • --pretty [flag, no value necessary] - formats the output using human-readable spacing.
  • --tsv [flag, no value necessary] - formats the output into a tab-separated table instead of json.
  • --diffusion [character string] - specifies the type of diffusion process to apply to the query before search/decomposition. The string must be provided as a json-formatted dictionary, without any spaces, mapping data collections to numbers. The default is "{}", which disables diffusion.

Using all these arguments, and assuming the instance has a data collection named collection, a complete search query might be as follows

python crossmap.py search --config config.yaml \
                   --data data.yaml \
                   --pretty --n 5 \
                   --dataset collection \
                   --diffusion "{\"collection\":0.5}"

Unfortunately, specifying the diffusion setting on the command line is tedious. However, the formatting is convenient when the search is performed programmatically.
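
As a minimal sketch of the programmatic route, the diffusion string can be produced with Python's json module. The helper name diffusion_arg is hypothetical; the separators argument removes the spaces that json.dumps would otherwise insert.

```python
import json
import shlex


def diffusion_arg(weights):
    """Serialize a dict of collection weights as compact json (no spaces).

    Hypothetical helper for illustration; not part of crossmap.
    """
    return json.dumps(weights, separators=(",", ":"))


value = diffusion_arg({"collection": 0.5})
print(value)
# {"collection":0.5}

# Argument list for the complete search query shown above
argv = ["python", "crossmap.py", "search",
        "--config", "config.yaml",
        "--data", "data.yaml",
        "--pretty", "--n", "5",
        "--dataset", "collection",
        "--diffusion", value]
# shlex.join adds the quoting needed to paste the command into a shell
print(shlex.join(argv))
```

When the list is passed to subprocess.run, no manual escaping of the inner quotes is needed, because no shell is involved.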

Adding new data

The add action enables users to insert new data items into a running instance.

New items must be prepared in the same yaml format as used during the build stage. The action requires two pieces of information.

  • --dataset [dataset label] - specifies a new data collection to create with the supplied data, or an existing collection to augment.
  • --data [path to file] - new data items in yaml format.

Example command:

python crossmap.py add --config config.yaml \
                   --data data.yaml --dataset new_collection

The add action has an important constraint. New data items can only be inserted into a new dataset (a new dataset label) or into an existing dataset that was created at run-time, i.e. after the build stage. In other words, datasets processed during the build stage remain static. This constraint exists partly for performance reasons and partly to keep background datasets separate from user-specific datasets.

Distances and matrix-breakdowns

While search and decomposition compare external data to entire collections in the crossmap instance, it is also possible to query how external data relates to specific objects. Two relevant actions are distances and matrix.

Here, let’s assume the instance has a data collection called ‘collection’, which contains objects ‘obj:1’ and ‘obj:2’. The distance utility is executed as follows

python crossmap.py distances --config config.yaml \
                   --data data.yaml \
                   --dataset collection --ids obj:1,obj:2 \
                   --pretty --diffusion "{\"collection\":1}"

The first three lines of this command provide the essential components; the fourth line tunes the calculation and output (see above). The output is a json-formatted object with distance values.

The matrix utility has a similar syntax, but provides a detailed breakdown of the features that participate in the calculation of distances.

Note: the distance and matrix utilities only process the first object defined in the external data file.
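
A distances command can be put together from a list of object identifiers, as in the sketch below. The helpers ids_arg and distances_command are hypothetical names; the check for commas is an assumption based on the comma-separated form of --ids shown above.

```python
import json


def ids_arg(ids):
    """Join object identifiers into the comma-separated form used by --ids."""
    for item in ids:
        if "," in item:
            raise ValueError("identifier contains a comma: " + item)
    return ",".join(ids)


def distances_command(ids, dataset="collection", config="config.yaml",
                      data="data.yaml", diffusion=None):
    """Build an argument list for the distances action.

    Hypothetical helper for illustration; not part of crossmap.
    """
    argv = ["python", "crossmap.py", "distances",
            "--config", config, "--data", data,
            "--dataset", dataset, "--ids", ids_arg(ids)]
    if diffusion is not None:
        # Compact json without spaces, as required by --diffusion
        argv += ["--diffusion", json.dumps(diffusion, separators=(",", ":"))]
    return argv


argv = distances_command(["obj:1", "obj:2"], diffusion={"collection": 1})
print(argv[-3])
# obj:1,obj:2
```

Because only the first object in the data file is processed, a script that compares several external objects would need one data file (and one command) per object.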

Diffusion

The diffuse action provides a means to extract before-diffusion and after-diffusion data representations. Inputs can be specified as plain text or in data files.

  • --data [path to file] - specifies a path to a data file.
  • --text [character string] - a comma-separated list of inputs, limited to strings without spaces and special characters.

Example queries are as follows

python crossmap.py diffuse --config config.yaml --text abcd \
                       --pretty --diffusion "{}"
python crossmap.py diffuse --config config.yaml --text abcd \
                       --pretty --diffusion "{\"collection\":0.5}"

The outputs are json-formatted tables that describe how each text input is broken into features, and how those features are weighted.
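
Since --text accepts only simple strings, a script may want to validate its inputs before building the argument. The sketch below is an illustration only: text_arg is a hypothetical helper, and the exact set of permitted characters is an assumption (letters, digits, and underscores), because the description above does not define "special characters" precisely.

```python
import re

# Assumption: only letters, digits, and underscores are treated as safe
SAFE_TOKEN = re.compile(r"[A-Za-z0-9_]+")


def text_arg(items):
    """Join plain-text inputs into the comma-separated form used by --text."""
    for item in items:
        if not SAFE_TOKEN.fullmatch(item):
            raise ValueError("unsupported characters in input: " + repr(item))
    return ",".join(items)


argv = ["python", "crossmap.py", "diffuse",
        "--config", "config.yaml",
        "--text", text_arg(["abcd", "efgh"]),
        "--pretty", "--diffusion", "{}"]
print(argv[6])
# abcd,efgh
```

Inputs that fail this check are better supplied through a data file with --data.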

Removing datasets

The remove action deletes a whole dataset from a crossmap instance. This action removes data from the database as well as dataset-specific files in the instance data directory.

  • --dataset [dataset label] - specifies the dataset to remove.

Assuming an instance has a dataset called ‘collection’, the dataset is removed with the following command.

python crossmap.py remove --config config.yaml --dataset collection

Removing instances

The delete action deletes all datasets, the whole crossmap database, and the disk directory. In contrast to remove, this action therefore affects all datasets in the instance.

The delete action only requires the crossmap configuration file, e.g.

python crossmap.py delete --config config.yaml