crossmap
crossmap is proof-of-concept software for the exploration of text-based datasets.
The software manages a database and indexes of nearest neighbors, and provides protocols for querying data using unstructured queries. It can be used with any text-based data, but its feature set is shaped by the needs of data exploration in the biological sciences.
As an example, a crossmap instance can be loaded with text documents containing gene pathways, disease phenotypes, and disease descriptions. Its protocols can then be used to search this knowledge base using gene sets, phenotypes, keywords, or any combination of these at once.
Features
- Querying heterogeneous data. crossmap can manage multiple datasets at once. Queries can be performed against each dataset individually. Protocols include classical search and data decomposition.
- Data diffusion. Data diffusion is an approach for data imputation. It uses information in one dataset to augment user queries before carrying out classical search or data decomposition.
- User-driven learning. crossmap can record data items from users (in batch and using an interactive interface) and use these data to fine-tune search results. This allows users to train the search mechanism in real time.
- Interactive user interface. The software provides a graphical user interface for interactive data exploration. The interface is based on chat and is accessible via a web browser.
- Programmatic interface. All software features are accessible through a command-line interface that is suitable for batch processing.
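The idea behind data diffusion can be sketched in a few lines of Python. This is a toy illustration only, not crossmap's implementation: the functions `tokenize`, `cooccurrence`, and `diffuse`, the `weight` parameter, and the use of word co-occurrence counts are all assumptions made for the example. The sketch takes a query, looks up tokens that co-occur with the query terms in a second dataset, and adds them to the query with reduced weight.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return text.lower().split()


def cooccurrence(documents):
    """Count how often pairs of distinct tokens appear in the same document."""
    co = defaultdict(Counter)
    for doc in documents:
        tokens = set(tokenize(doc))
        for a in tokens:
            for b in tokens:
                if a != b:
                    co[a][b] += 1
    return co


def diffuse(query, co, weight=0.5):
    """Augment a query with down-weighted tokens that co-occur with its terms.

    `weight` (hypothetical) is the total weight shared among the
    co-occurring neighbors of each query token.
    """
    original = tokenize(query)
    weights = Counter(original)  # original tokens keep their full counts
    for tok in original:
        neighbors = co.get(tok, {})
        total = sum(neighbors.values())
        for other, count in neighbors.items():
            weights[other] += weight * count / total
    return dict(weights)


# A second dataset supplies the information used to augment the query.
pathways = ["brca1 dna repair", "brca1 breast cancer", "tp53 dna repair"]
augmented = diffuse("brca1", cooccurrence(pathways))
```

The augmented query would then be passed to an ordinary search or decomposition step, so that documents mentioning related terms can match even when they do not contain the original query tokens.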
Getting started
There are a few steps to start using crossmap in a practical project. The bullets below summarize these stages, and other documentation pages provide in-depth information on each topic.
- Installation. The software and dependencies are installed. This may require installation of Python, JavaScript, and Docker. [more]
- Instance configuration. A configuration file is prepared to tell the software which data to load and how the data should be indexed. Configuration files offer many settings, but a minimal/typical file consists of just a few lines. [more]
- Instance build. Prepared data are copied into an ‘instance’. During this stage, the software transfers data into a database, encodes data items into numeric representations, and constructs indexes for efficient querying. After this stage, the instance is ready for use. [more]
- Querying. A crossmap instance is queried using a command-line or graphical user interface. Queries can consist of simple searches, searches enhanced by diffusion processes, and decomposition queries.
- Training. Optionally, an existing instance can be trained with additional data. New data can be added in batch using the command-line interface, or one item at a time using the graphical interface. The new data can then be used to modify and fine-tune search results.
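The build and query stages above can be illustrated with a small, self-contained sketch. It is not crossmap's code or API: the functions `encode`, `cosine`, `build`, and `search`, the item identifiers, and the choice of character k-mer counts as the numeric representation are all assumptions for illustration, and the brute-force ranking stands in for a real nearest-neighbor index.

```python
import math
from collections import Counter


def encode(text, k=3):
    """Encode text as a sparse vector of character k-mer counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + k] for i in range(len(padded) - k + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build(items):
    """'Build' step: encode every data item into its numeric representation."""
    return {item_id: encode(text) for item_id, text in items.items()}


def search(index, query, n=2):
    """'Query' step: rank all items by similarity to the query (brute force)."""
    q = encode(query)
    ranked = sorted(index, key=lambda item_id: cosine(q, index[item_id]), reverse=True)
    return ranked[:n]


# Hypothetical data items; identifiers and text are made up for the example.
items = {
    "pathway:1": "dna repair and homologous recombination",
    "disease:1": "hereditary breast cancer",
    "phenotype:1": "abnormal heart morphology",
}
index = build(items)
hits = search(index, "breast cancer", n=1)
```

Encoding once at build time and reusing the vectors at query time mirrors the separation between the build and querying stages. A production system would replace the full scan in `search` with an approximate nearest-neighbor index so that queries stay fast on large datasets.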
crossmap processes text-based data in YAML format. Data must be prepared in this format both during initial project setup and when submitting queries.
- Data preparation. Raw data are prepared into a format that can be processed by the crossmap software. This data format is not highly structured, so it can accommodate many types of text-based data. [more]
Many features are available through a graphical user interface.
- Chat-based interface. Optionally, instances can be explored via a graphical user interface that runs within a web browser. The interface supports data querying and training. [more]