Data format¶
Data files must be prepared in yaml format. Given that crossmap is meant
for integration of many different types of data, it may seem ironic that the
software only accepts one data format. However, this data format is actually
quite accommodating and content from other file formats can be transferred
into yaml.
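As an illustration of such a transfer, the sketch below converts a small table in csv format into the yaml layout described on this page. It is not part of crossmap; the function name `rows_to_yaml`, the column names, and the sample table are hypothetical. It uses only the Python standard library and relies on the fact that JSON string literals are also valid yaml double-quoted scalars.

```python
import csv
import io
import json


def rows_to_yaml(rows, id_column, text_column):
    """Render rows as a yaml mapping: identifier -> {data: text}."""
    lines = []
    for row in rows:
        # json.dumps produces a double-quoted string, which yaml accepts,
        # so identifiers containing ':' are quoted safely
        lines.append(f"{json.dumps(row[id_column])}:")
        lines.append(f"  data: {json.dumps(row[text_column])}")
    return "\n".join(lines) + "\n"


# hypothetical input table with an identifier column and a text column
csv_text = "id,description\nitem:1,content for item 1\nitem:2,content for item 2\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
yaml_text = rows_to_yaml(rows, "id", "description")
print(yaml_text)
```

For larger conversions, a dedicated yaml library would be a more robust choice than assembling the text by hand.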
Primary data¶
Yaml files are expected to be mappings from item identifiers to associated
data. As an example, assuming we have two data items with identifiers
item:1 and item:2, a dataset file might look as follows:
item:1:
  data: content for item 1
item:2:
  data: content for item 2
The keyword data instructs crossmap to convert those strings into
the primary data associated with the item identifiers. Thus, after loading
these items into a crossmap instance, it will be possible to retrieve the
items via queries such as ‘item’, ‘content’, or the numbers ‘1’ and ‘2’.
Metadata¶
Each item can be associated with a metadata field. The content of this field is recorded in the crossmap database but is not used in any calculations. The field can serve to enhance readability or to support secondary analysis.
Examples:
item:3:
  data: A B C
  metadata:
    id: item:3
    source: item source
item:4:
  data: D E F
  metadata: description for item 4
There are no constraints on the form or the content of metadata. In the examples above, item:3 uses a dictionary to organize information into key/value pairs, whereas item:4 uses a plain string.
Structured and nested data¶
The content of the data field can be structured as arbitrary objects.
In particular, dictionaries of key/value pairs and arrays are valid.
Examples:
item:nested:
  data:
    key1: value1
    key2: value2
    key3:
      key4: value4
item:array:
  data: [value1, value2]
Note that when data is a dictionary, its top-level keys are not
transferred into the object representations. Thus, searching the above data
collection for the string ‘key1’ would not produce a hit. However, nested
objects are stringified and their keys can become part of the object
representations. Thus, searching for ‘key4’ would match item:nested.
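The behavior described above can be mimicked with a short sketch. This is only an illustration of the stated rule, not crossmap's actual implementation: the function `data_to_text` is hypothetical, and the use of JSON for stringifying nested objects is an assumption.

```python
import json


def data_to_text(data):
    """Hypothetical flattening of a data field into searchable text."""
    if isinstance(data, dict):
        # top-level keys are dropped; only the values contribute
        parts = list(data.values())
    elif isinstance(data, list):
        parts = data
    else:
        parts = [data]
    # nested objects are stringified (here with json, an assumption),
    # so their keys survive as part of the text
    return " ".join(p if isinstance(p, str) else json.dumps(p) for p in parts)


nested = {"key1": "value1", "key2": "value2", "key3": {"key4": "value4"}}
text = data_to_text(nested)
print(text)
```

In this sketch the resulting text contains 'key4' but neither 'key1' nor 'key3', which matches the search behavior described above.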
Weighting of text-based data¶
Text under the data fields is split into k-mers, and each k-mer is
weighted according to a strategy determined during the instance build process.
However, there is some flexibility to increase or decrease the weight
associated with certain features / k-mers. Weights can be increased
through repetition. Negative weights can be assigned through a data_neg
field.
Examples:
item:single:
  data: A B
item:increase:
  data: A B B B
item:negative:
  data_pos: A
  data_neg: B
In these examples, the numeric representation of item:increase will be
skewed in favor of feature B compared to the representation of
item:single. The numeric representation of item:negative will include
a negative value for feature B.
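The effect of repetition and of negative fields can be pictured with a toy calculation. The sketch below is a deliberate simplification and not crossmap's weighting scheme: it counts whole tokens rather than k-mers, applies no build-time weighting strategy, and assumes that data and data_pos both contribute positively. The function `toy_vector` is hypothetical.

```python
from collections import Counter


def toy_vector(item):
    """Toy feature counts: positive text adds weight, data_neg subtracts."""
    weights = Counter()
    # assumption: data and data_pos are both treated as positive text
    for field in ("data", "data_pos"):
        weights.update(item.get(field, "").split())
    # Counter.subtract keeps negative counts
    weights.subtract(item.get("data_neg", "").split())
    return dict(weights)


single = toy_vector({"data": "A B"})
increase = toy_vector({"data": "A B B B"})
negative = toy_vector({"data_pos": "A", "data_neg": "B"})
print(single, increase, negative)
```

Here feature B carries more weight in `increase` than in `single`, and a negative weight in `negative`.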
Numeric weighting¶
To achieve more control over the feature weighting, items
can be specified through a field value instead of data.
Examples:
item:equal:
  value:
    A: 1.0
    B: 1.0
item:skewed:
  value:
    A: 1.0
    B: 2.5
item:biased:
  data: A B
  value:
    X: 2.0
Specifying values for each feature gives full control over the relative weighting between features.
Note that whereas text under data is parsed automatically into k-mers
and then used to construct a numeric representation, features
under value are used as-is.
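To make the distinction concrete, the sketch below splits text into overlapping k-mers, whereas a dictionary of values is left untouched. The details are assumptions chosen for illustration (k=3, lowercasing, and keeping short words whole); crossmap's own tokenization settings are determined by its configuration and may differ. The function `kmers` is hypothetical.

```python
def kmers(text, k=3):
    """Overlapping substrings of length k for each whitespace-separated word."""
    result = []
    for word in text.lower().split():
        if len(word) <= k:
            # words no longer than k are kept whole (an assumption)
            result.append(word)
        else:
            result.extend(word[i:i + k] for i in range(len(word) - k + 1))
    return result


# text under data: broken into k-mers before weighting
text_features = kmers("content for item")

# features under value: names and weights are taken as given
value_features = {"A": 1.0, "B": 2.5}

print(text_features)
print(value_features)
```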
The data and value fields can also be specified together, as in item:biased.
crossmap will then use both fields to construct a joint numeric
representation of the items. This representation will arise from a
deterministic procedure, but the relative weighting of the various
features will not be obvious from the data file alone.