Command-line interface¶
The crossmap software provides a single command-line interface that can be
used for several distinct tasks, including building new instances and
performing queries.
The interface is invoked with the following pattern:
python crossmap.py ACTION --config config.yaml \
--PARAM_1 VAL_1 --PARAM_2 VAL_2 ... \
--FLAG_1 ...
The first argument is always an action code and other settings are provided in parameter/value pairs or as flags.
One of the parameters is usually --config and its value is a path to a
configuration file (above, the file is assumed to be config.yaml).
The remaining arguments change depending on the desired action.
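When the interface is driven from a script, the pattern above can be assembled as an argument list rather than a shell string. The following is a minimal sketch: the helper name build_command is hypothetical and not part of crossmap, and the code assumes that crossmap.py sits in the working directory.

```python
import sys


def build_command(action, config="config.yaml", params=None, flags=()):
    """Assemble an argument list following the pattern
    python crossmap.py ACTION --config ... --PARAM VAL ... --FLAG ...

    build_command is a hypothetical helper, not part of crossmap.
    """
    cmd = [sys.executable, "crossmap.py", action, "--config", config]
    # parameter/value pairs, e.g. {"data": "data.yaml", "n": 5}
    for name, value in (params or {}).items():
        cmd += ["--" + name, str(value)]
    # flags carry no value, e.g. ["pretty"]
    cmd += ["--" + flag for flag in flags]
    return cmd


cmd = build_command("search", params={"data": "data.yaml", "n": 5},
                    flags=["pretty"])
# show the arguments that follow the interpreter path
print(" ".join(cmd[1:]))
```

A list of this kind can be handed to subprocess.run without going through a shell, so no quoting or line-continuation characters are needed.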
Building instances¶
Building a new instance requires a configuration file in a yaml format.
Assuming this file is called config.yaml, the build command is
python crossmap.py build --config config.yaml
By default, this provides a moderate level of logging. It is
possible to adjust the level of logging via the --logging argument.
Search & decomposition¶
Once an instance is created, all its files are stored in a directory adjacent to its configuration file. Search and decomposition read an external data file and compare objects therein to the contents of the instance.
The data file must be prepared in a yaml format, for example using one of the
data preparation/conversion utilities. Assuming the data file is called
data.yaml, search and decomposition are performed as follows
python crossmap.py search --config config.yaml \
--data data.yaml
python crossmap.py decompose --config config.yaml \
--data data.yaml
The outputs are json-formatted arrays with results for all the documents in the data file.
Because the above commands are minimalistic, the search and decomposition analyses are performed using a number of assumptions. In practice, several additional arguments help to adjust the analysis.
--dataset [dataset label] - specifies the data collection to search against. The provided value must match a collection name from the configuration file. The default behavior is to use the first data collection in the configuration file.
--n [integer] - specifies the number of hits to report for each object in the data file. The default is 1.
--pretty [flag, no value necessary] - formats the output using human-readable spacing.
--tsv [flag, no value necessary] - formats the output as a tab-separated table instead of json.
--diffusion [character string] - specifies the type of diffusion process to apply to the query before search/decomposition. The string must be provided as a json-formatted dictionary, without any spaces, mapping data collections to numbers. The default is "{}", which disables diffusion.
Using all these arguments, and assuming the instance has a data collection
named collection, a complete search query might be as follows
python crossmap.py search --config config.yaml \
--data data.yaml \
--pretty --n 5 \
--dataset collection \
--diffusion "{\"collection\":0.5}"
Unfortunately, specifying the diffusion setting on the command line is tedious. However, the formatting is convenient when the search is performed programmatically.
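As an illustration of the programmatic case, the diffusion string can be generated with Python's standard json module. This is a sketch only: the function name diffusion_arg is hypothetical, and the collection name is the one assumed in the example above.

```python
import json


def diffusion_arg(weights):
    """Encode diffusion weights as a json dictionary without spaces.

    diffusion_arg is a hypothetical helper, not part of crossmap.
    """
    # compact separators drop the spaces that json.dumps inserts by default
    return json.dumps(weights, separators=(",", ":"))


arg = diffusion_arg({"collection": 0.5})
print(arg)  # {"collection":0.5}
```

When the string is passed as one element of an argument list (for example to subprocess.run, without a shell), the backslash escapes shown in the command above are not required.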
Adding new data¶
The add action enables users to insert new data items into a running
instance.
New items must be prepared in the same yaml format as used during the build stage. The action requires two pieces of information.
--dataset [dataset label] - specifies a new data collection to create with the supplied data, or an existing collection to augment.
--data [path to file] - new data items in yaml format.
Example command:
python crossmap.py add --config config.yaml \
--data data.yaml --dataset new_collection
The add action has an important constraint. It is only possible to insert new data items into a new dataset (a new dataset label) or into an existing dataset that was created at run-time, i.e. after the build stage. In other words, datasets processed during the build stage remain static and unchanged. This constraint exists partly for performance reasons, and partly to separate background datasets from user-specific datasets.
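The constraint can be expressed as a simple check before calling the add action. The sketch below is illustrative only: the function can_add is hypothetical, and it assumes the caller already knows which dataset labels were created during the build stage (how to obtain them is not covered here).

```python
def can_add(label, build_datasets):
    """Return True if new items may be added under the dataset label.

    can_add is a hypothetical helper, not part of crossmap.
    build_datasets: labels of datasets processed during the build stage;
    these remain static, so they cannot receive new items.
    """
    return label not in set(build_datasets)


# 'collection' is assumed to have been created during the build stage
print(can_add("new_collection", ["collection"]))
print(can_add("collection", ["collection"]))
```

A new label and a label created earlier by the add action both pass this check; only build-stage labels are rejected.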
Distances and matrix-breakdowns¶
While search and decomposition compare external data to entire collections in
the crossmap instance, it is also possible to query how external data
relate to specific objects. Two relevant actions are distances and
matrix.
Here, let’s assume the instance has a data collection called ‘collection’, which contains objects ‘obj:1’ and ‘obj:2’. The distance utility is executed as follows
python crossmap.py distances --config config.yaml \
--data data.yaml \
--dataset collection --ids obj:1,obj:2 \
--pretty --diffusion "{\"collection\":1}"
The last line of this command tunes the calculation and output (see above); the preceding lines provide the essential components. The output is a json-formatted object with distance values.
The matrix utility has a similar syntax, but provides a detailed
breakdown of the features that participate in the calculation of
distances.
Note: the distance and matrix utilities only process the first object
defined in the external data file.
Diffusion¶
The diffuse action provides a means to extract before-diffusion and
after-diffusion data representations. Inputs can be specified as plain
text or in data files.
--data [path to file] - specifies a path to a data file.
--text [character string] - comma-separated list of inputs, limited to strings without spaces and special characters.
Example queries are as follows
python crossmap.py diffuse --config config.yaml --text abcd \
--pretty --diffusion "{}"
python crossmap.py diffuse --config config.yaml --text abcd \
--pretty --diffusion "{\"collection\":0.5}"
The outputs are json-formatted tables that describe how each text input is broken into features, and how those features are weighted.
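Because --text only accepts simple strings, a script may want to validate its inputs before building the comma-separated value. The following sketch makes an assumption that should be checked against the actual behavior: "without spaces and special characters" is interpreted here as letters, digits, and underscores only. The function name text_arg is hypothetical.

```python
import re

# Assumption: only letters, digits, and underscores count as "simple";
# the rule applied by crossmap itself may differ.
_SIMPLE = re.compile(r"[A-Za-z0-9_]+")


def text_arg(inputs):
    """Join inputs into a comma-separated value for --text.

    text_arg is a hypothetical helper, not part of crossmap.
    Raises ValueError if any input contains other characters.
    """
    bad = [s for s in inputs if not _SIMPLE.fullmatch(s)]
    if bad:
        raise ValueError("not suitable for --text: %r" % bad)
    return ",".join(inputs)


print(text_arg(["abcd", "efgh"]))  # abcd,efgh
```

Inputs that fail this check can be placed in a data file and supplied through --data instead.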
Removing datasets¶
The remove action deletes a whole dataset from a crossmap instance. This
action removes data from the database as well as dataset-specific files in
the instance data directory.
--dataset [dataset label] - specifies the dataset to remove.
Assuming an instance has a dataset called ‘collection’, that dataset is removed with the following command.
python crossmap.py remove --config config.yaml --dataset collection
Removing instances¶
The delete action deletes all datasets, the whole crossmap database, and
the disk directory. In contrast to remove, this action therefore affects
all datasets in the instance.
The delete action only requires the crossmap configuration file, e.g.
python crossmap.py delete --config config.yaml