Data conversion from other formats¶
crossmap requires input in a specific format. At the same time, it is meant to integrate many types of data. To reconcile these two factors, the repository provides a suite of scripts to convert data in existing formats into a form that can be loaded into crossmap instances.
The suite is implemented as module crossprep.py, which is located in directory
crossprep. This is a script that can be executed with the following pattern.
python crossprep.py COMPONENT [...]
Here, COMPONENT is a type of dataset to prepare and [...] are arguments
that pertain to that component.
Ontology definitions¶
Ontologies store concept definitions that are relevant to a domain, along with
relations between them. A common file format to encode ontology data is ‘obo’.
crossprep obo is a utility for parsing ‘obo’ files and preparing their
contents for loading into a crossmap instance.
Ontology files must be downloaded separately, for example from the OBO Foundry. The utility can then process the local file,
python crossprep.py obo --obo file.obo --name obo
Optional settings can tune the data transferred from the ‘obo’ file to
the crossmap data file. One of these, --obo_root, sets the root node
for the output dataset. By default, the utility processes the whole
ontology hierarchy. The --obo_root argument can instead create a dataset
for a particular branch of the ontology.
python crossprep.py obo --obo file.obo --name obo \
--obo_root NODE:00001
Another setting, –obo_aux, determines what kind of information is transferred from the ontology definition into the output dataset. The allowed values are parents, ancestors, children, siblings, or combinations thereof (separated by commas). By default, the utility incorporates data about a node’s parent within the definition of that node. As an example, to incorporate information about parents and siblings, the command would be:
python crossprep.py obo --obo file.obo \
--name obo_parents_siblings \
--obo_aux parents,siblings
Pubmed abstracts¶
Pubmed is an NCBI service that indexes scientific articles published in the life sciences.
crossprep pubmed_baseline is a utility for downloading article data from
pubmed,
and crossprep pubmed is an associated utility for processing that data.
The first utility downloads baseline article data. An example call to this utility is as follows:
python crossprep.py pubmed_baseline --outdir /output/dir
This creates an output directory and a subdirectory baseline, then attempts
to download all baseline files from the NCBI servers. It is possible to
restrict the downloads via file indexes, e.g.
python crossprep.py baseline --outdir /output/dir \
--baseline_indexes 1-10
The crossprep pubmed utility scans the downloaded baseline files and builds yaml datasets.
python crossprep.py pubmed --outdir /output/dir --name pubmed-all
It is possible to tune the output dataset using year ranges, pattern matches, and size thresholds, e.g.
python crossprep.py pubmed --outdir /output/dir \
--name pubmed-recent-human \
--pubmed_year 2010-2019 \
--pubmed_pattern human \
--pubmed_length 500
This will create a dataset holding articles from the years 2010-2019, containing the text pattern ‘human’ and containing at least 500 characters in the title and abstract fields.
Gene sets¶
There are many file formats used to convey sets of genes. One of the simplest is the gmt format. This stores sets ‘horizontally’, with each set occupying one line of text.
The crossprep genesets utility converts gene sets in the gmt format into a dataset for crossmap. The utility can be used to filter gene sets by size.
python crossprep.py genesets --outdir /output/dir \
--name geneset \
--gmt path-to-gmt.gmt.gz \
--gmt_min_size 5 --gmt_max_size 100
This will read gene sets specified via argument --gmt create a dataset
geneset.yaml.gz. The output will contain genesets of size in the range
given by –gmt_min_size and –gmt_max_size.
Orphanet diseases¶
Orphanet is a curated knowledge-base on diseases , including their phenotypes and associated genetic causes. Parts of their database are available for download as xml files.
The orphanet utility parses these files and prepare diseases summaries.
python crossprep.py orphanet --outdir /output/dir \
--name orphanet \
--orphanet_nomenclature en_product1.xml \
--orphanet_phenotypes en_product4.xml \
--orphanet_genes en_product6.xml
Wiktionary¶
Wiktionary is an online dictionary that is part of Wikimedia. It provides bulk downloads of all the word definitions in its database.
The wiktionary utility parses the word definitions and constructs data
files for crossmap.
python crossprep.py wiktionary --outdir /output/dir \
--name wiktionary \
--wiktionary enwiktionary-pages-articles.xml.bz2 \
--wiktionary_length 10
This command processes compressed xml files, as provided by the wiktionary download page. The last argument is a numerical factor that instructs the utility to skip some words if the length of the definition (measured in characters) is not longer than 10 times the length of the word itself; this is a mechanism to skip stub entries in the dictionary.