CFood-Definition#
Note
This page has been migrated from the old documentation and has not yet been fully revised. There might be inconsistencies or errors when it is used with current LinkAhead versions.
CFoods specify how data from a file hierarchy is mapped to LinkAhead Records.
In the simplest case, the CFood is just one YAML file with a single document that includes at least a converter tree specification, as shown in Example 1.
If metadata and macro definitions are provided, there must be a second document with these
definitions preceding the converter tree specification. Both documents may be in the same
YAML file, separated using the --- syntax.
It is highly recommended to specify the version of the LinkAhead crawler for which the cfood is written in the metadata section, see below.
There may be some examples in which the custom converter definition is included in the converter tree document for historical reasons, see example 2. This feature is deprecated, and the converter definition should instead be included in the metadata and macro document (see below).
Examples#
1. A single document with a converter tree specification:
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
2. A single document with a converter tree specification and a custom converters section:
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
3. A YAML multi-document that defines metadata and some macros in the first document and declares two custom converters, together with the converter tree, in the second document. This syntax is not recommended; the preferred syntax is shown in Example 4.
---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
---
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
4. The recommended way of defining metadata, custom converters, macros and the main cfood specification is shown in the following code example:
---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2
---
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
List Mode#
Property values can start with one of two special characters, + or *, in order to automatically create lists or multi properties instead of single values:
Experiment1:
    Measurement: +Measurement # Element in List (list is cleared before run)
                 *Measurement # Multi Property (properties are removed before run)
                 Measurement  # Overwrite
Values and units#
Property values can be specified as simple strings (as above) or as dictionaries that may also specify the collection mode. Strings starting with a “$” will be replaced by the corresponding variable, if there is one. See the tutorials chapter of this documentation for more elaborate examples of how exactly the variable replacement works. A simple example could look as follows:
ValueElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "(?P<value>.*)"  # Anything in here is stored in the variable "value"
  records:
    MyRecord:
      MyProp: $value  # will be replaced by whatever is stored in the "value" variable set above.
If not given explicitly, the collection mode will be determined from the first character of the property value as explained above. This means the following three definitions are all equivalent:
MyProp: +$value

MyProp:
  value: +$value

and

MyProp:
  value: $value
  collection_mode: list
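The equivalence of the three forms can be illustrated with a small Python sketch. This is not the crawler's own code: the function name resolve_collection_mode is hypothetical, and the mode names "multiproperty" and "overwrite" are assumptions (only "list" appears in the example above).

```python
# Hypothetical helper, for illustration only: it mimics how a collection mode
# could be derived from a property definition as described above.
PREFIX_MODES = {"+": "list", "*": "multiproperty"}  # "multiproperty" is an assumed name


def resolve_collection_mode(prop):
    """Return a (value, collection_mode) tuple for a string or dict definition."""
    if isinstance(prop, dict):
        if "collection_mode" in prop:
            # An explicit collection mode takes precedence over any prefix.
            return prop["value"], prop["collection_mode"]
        # Without an explicit mode, fall back to the prefix of the value.
        return resolve_collection_mode(prop["value"])
    if isinstance(prop, str) and prop[:1] in PREFIX_MODES:
        return prop[1:], PREFIX_MODES[prop[:1]]
    return prop, "overwrite"  # assumed name for the default mode


# The three definitions from the text resolve to the same result.
print(resolve_collection_mode("+$value"))
print(resolve_collection_mode({"value": "+$value"}))
print(resolve_collection_mode({"value": "$value", "collection_mode": "list"}))
```

All three calls yield the value "$value" together with the collection mode "list".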
Units of numeric values can be set by providing the property value as a mapping with two entries,
using the keys value and unit, as shown in this example:
ValueWithUnitElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "^(?P<number>\\d+\\.?\\d*)\\s+(?P<unit>.+)"  # Extract value and unit from a string which
                                                            # has a number followed by at least one whitespace
                                                            # character followed by a unit.
  records:
    MyRecord:
      MyProp:
        value: $number
        unit: $unit
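To see what the regular expression in match_value captures, it can be tried out directly with Python's re module. The sketch below assumes that the pattern is applied with ordinary Python regex semantics from the start of the string. The helper name split_value_and_unit and the sample inputs are made up for illustration.

```python
import re

# In the YAML double-quoted string, "\\d" denotes the regex token \d.
PATTERN = re.compile(r"^(?P<number>\d+\.?\d*)\s+(?P<unit>.+)")


def split_value_and_unit(text):
    """Return the named groups as a dict, or None if the text does not match."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


print(split_value_and_unit("12.5 kg"))   # number and unit are both captured
print(split_value_and_unit("300 m/s"))   # integers match as well
print(split_value_and_unit("kg"))        # no leading number, so no match
```

The named groups number and unit correspond to the variables $number and $unit used in the records section above. Note that the captured number is still a string at this point.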
File Entities#
In order to use File Entities, you must set the appropriate role: File.
Additionally, the path and file keys have to be given, with values that set the paths remotely and
locally, respectively. You can use the variable <converter name>.path (written as ${<converter name>.path}),
which is automatically created by converters dealing with file system related StructureElements.
The file object itself is stored in a variable with the same name, as is the case for other Records.
somefile:
  type: SimpleFile
  match: ^params.*$  # match any file that starts with "params"
  records:
    fileEntity:
      role: File              # necessary to create a File Entity
      path: ${somefile.path}  # defines the path in LinkAhead
      file: ${somefile.path}  # path where the file is found locally
    SomeRecord:
      ParameterFile: $fileEntity  # creates a reference to the file
Transform Functions#
You can use transform functions to alter variable values that the crawler consumes (e.g. a string that was matched with a regular expression). For more information, refer to the Converter and Transform Functions tutorials.
You can define your own transform functions by adding them the same way you add custom converters:
Transformers:
  transform_foo:
    package: some.package
    function: some_foo
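The function referenced above would live in the Python module some.package. A minimal sketch of what such a function might look like is given below. It assumes that a transform function receives the incoming value and a dictionary of parameters and returns the transformed value. Please check the Transform Functions tutorial for the exact signature expected by your crawler version. The suffix parameter is purely hypothetical.

```python
from typing import Any


def some_foo(in_value: Any, in_parameters: dict) -> Any:
    """Strip surrounding whitespace and, optionally, a trailing suffix.

    Assumed signature: (value, parameters) -> transformed value.
    The "suffix" parameter is a hypothetical example parameter.
    """
    text = str(in_value).strip()
    suffix = in_parameters.get("suffix", "")
    if suffix and text.endswith(suffix):
        text = text[: -len(suffix)]
    return text


# Example call, e.g. for a matched file name:
print(some_foo("  run_001.dat ", {"suffix": ".dat"}))
```

Keeping transform functions free of side effects, as in this sketch, makes them easy to test independently of the crawler.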
Automatically generated keys#
Some variable names are automatically generated and can be used with the $<variable name> syntax.
These include:
- <converter name>: access the path of converter names to the current converter
- <converter name>.path: defined only for file system related converters, contains the file system path to the structure element. You need curly brackets to use them: ${<converter name>.path}
- <Record key>: all entities created in the records section are available under the same key as used in that section
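Why the curly brackets matter for names containing a dot can be illustrated with Python's string.Template. This is only a sketch of the general idea and not a statement about how the crawler performs the replacement internally. The class DottedTemplate and the variable contents are hypothetical.

```python
from string import Template


class DottedTemplate(Template):
    # Allow dots only inside ${...}; a bare $name keeps the default
    # identifier rule and therefore stops at the first dot.
    braceidpattern = r"(?a:[_a-z][_a-z0-9.]*)"


# Hypothetical variable store with made-up contents.
variables = {
    "somefile": "<converter path>",
    "somefile.path": "/data/params.yml",
}

with_braces = DottedTemplate("${somefile.path}").safe_substitute(variables)
without_braces = DottedTemplate("$somefile.path").safe_substitute(variables)

print(with_braces)     # the variable "somefile.path" is looked up
print(without_braces)  # only "somefile" is replaced, ".path" stays literal
```

With curly brackets, the complete name including the dot is used for the lookup. Without them, the name ends at the dot and a different variable is replaced.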