crossmap¶

crossmap is proof-of-concept software for the exploration of text-based datasets.

The software manages a database and indexes of nearest neighbors, and provides protocols for querying data using unstructured queries. It can be used with any text-based data, but its feature set is influenced by the needs in terms of data exploration in the biological sciences.

As an example, a crossmap instance can be loaded with text documents containing gene pathways, diseases phenotypes, and disease descriptions. Its protocols can then be used to search this knowledge-base using gene sets, phenotypes, keywords, or any combination at once.

Features¶

Querying heterogeneous data. crossmap can manage multiple datasets at once. Queries can be performed against each dataset individually. Protocols includes classical search and data decomposition.
Data diffusion. Data diffusion is an approach for data imputation. It uses information in one dataset to adjust augment user queries before carrying out classical search or data decomposition.
User-driven learning. crossmap can record data items from users (in batch and using an interactive interface) and use these data to fine-tune search results. This allows users to train the search mechanism in real-time.
Interactive user interface. The software provides a graphical user interface for interactive data exploration. The interface is based on chat and is accessible via a web browser.
Programmatic interface. All software features are accesible through a command-line interface that is suitable for batch processing.

Getting started¶

There are a few steps to start using crossmap in a practical project. The brief bullets summarize these stages, and other documentation pages provide in-depth information on each topic.

Installation. The software and dependencies are installed. This may require installation of python, javascript, and docker. [more]
Instance configuration. A configuration file is prepared to instruct the software what data to load, and how the data should be indexed. Configuration files offer many settings, but a minimal/typical file consists of just a few lines. [more]
Instance build. Prepared data are copied into an ‘instance’. During this stage, the software transfer data into a database, encodes data items into numeric representations, and constructs indexes for efficient querying. After this stage, the instance is ready for use. [more]
Querying. A crossmap instance is queried using a command-line or graphical user interface. Queries can consist of simple searches, searches enhanced by diffusion processes, and decomposition queries.
Training. Optionally, an existing instance can be trained with additional data. New data can be added in batch using the command-line interface, or one item at a time using the graphical interface. The new data can then be used to modify and fine-tune search results.

crossmap processes text-based data in yaml format. Data must be prepared in this format during an initial project setup and throughout querying.

Data preparation. Raw data are prepared into a format that can be processed by the crossmap software. This data format is not highly structured, so it can accommodate many types of text-based data. [more]

Many features are available through a graphical user interface.

Chat-based interface. Optionally, instances can be explored via a graphical user interface that runs within a web browser. The interface supports data querying and training. [more]

Documentation:

Preparing datasets:

Interactive use: