Transform Functions
Note
This page has been migrated from the old documentation and has not yet been fully revised. There might be inconsistencies or errors when it is used with current LinkAhead versions.
Sometimes a value cannot be used as it is found and needs post-processing: an integer may need an offset added, or a string may need to be split into a list of substrings. For such simple conversions, a converter definition can include transform functions that modify variable values. Each transform node names the input variable to read, the output variable to write, and the functions to apply, as in the following template:
<NodeName>:
  type: <ConverterName>
  match: ".*"
  transform:
    <TransformNodeName>:
      in: $<in_var_name>
      out: $<out_var_name>
      functions:
      - <func_name>:                       # name of the function to be applied
          <func_arg1>: <func_arg1_value>   # key value pairs that are passed as parameters
          <func_arg2>: <func_arg2_value>
          # ...
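To make the roles of in, out and functions concrete, here is a minimal Python sketch of how a transform node like the template above could be evaluated. It is not the crawler's implementation: apply_transform, add_offset and the registry dictionary are hypothetical names, and the sketch assumes that the listed functions are applied in order, each receiving the previous result.

```python
from typing import Any, Callable, Dict


def add_offset(in_value: Any, in_parameters: dict) -> Any:
    """Hypothetical transform function: add an integer offset to the value."""
    return int(in_value) + int(in_parameters.get("offset", 0))


def apply_transform(variables: dict, node: dict,
                    registry: Dict[str, Callable[[Any, dict], Any]]) -> None:
    """Read the `in` variable, apply each function in order, write to `out`."""
    value = variables[node["in"].lstrip("$")]
    for entry in node["functions"]:
        # Each list entry is a mapping with a single key: the function name.
        name, params = next(iter(entry.items()))
        # A function listed without arguments has no parameter mapping.
        value = registry[name](value, params or {})
    variables[node["out"].lstrip("$")] = value


# The node below mirrors the YAML structure of a transform node.
variables = {"a": "41"}
node = {"in": "$a", "out": "$b", "functions": [{"add_offset": {"offset": 1}}]}
apply_transform(variables, node, {"add_offset": add_offset})
print(variables["b"])  # 42
```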
An example that splits the variable a and puts the generated list in b is the following:
Experiment:
  type: Dict
  match: ".*"
  transform:
    param_split:
      in: $a
      out: $b
      functions:
      - split:           # split is a function that is defined by default
          marker: "|"    # its only parameter is the marker that is used to split the string
  records:
    Report:
      tags: $b
In this example, the transformer splits the string in ‘$a’ and stores the resulting list in ‘$b’,
which is then added to the Report Record as a list-valued property.
Note that from LinkAhead Crawler 0.11.0 onwards, the value of the marker parameter in the
above example can also be read from a variable using the usual $ notation:
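The effect of the split step can be illustrated in plain Python. The function below is only a stand-in written for this page, not the crawler's own code; the name split_sketch and the input string are made up for illustration.

```python
from typing import Any


def split_sketch(in_value: Any, in_parameters: dict) -> Any:
    """Illustrative stand-in for the default `split` function."""
    if "marker" not in in_parameters:
        raise RuntimeError("Parameter `marker` missing.")
    if not isinstance(in_value, str):
        raise RuntimeError("`split` expects a string value.")
    # Split the string at every occurrence of the marker.
    return in_value.split(in_parameters["marker"])


tags = split_sketch("calibration|test|2024", {"marker": "|"})
print(tags)  # ['calibration', 'test', '2024']
```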
# ... variable ``separator`` is defined somewhere above this part, e.g.,
# by reading a config file.
Experiment:
  type: Dict
  match: ".*"
  transform:
    param_split:
      in: $a
      out: $b
      functions:
      - split:
          marker: $separator  # Now the separator is read in from a
                              # variable, so we can, e.g., change from
                              # '|' to ';' without changing the cfood
                              # definition.
  records:
    Report:
      tags: $b
There are a number of transform functions that are defined by the crawler itself and therefore
available by default (see src/caoscrawler/default_transformers.yml). You can define custom
transform functions by adding them to the cfood definition.
Custom Transformers
Custom transformers are implemented as Python functions adhering to the transformer function signature. They need to be registered in the cfood definition in order to be available during the scanning process.
Let’s assume we want to implement a transformer that replaces each of a given set of letters
in the value of a variable with a corresponding other letter. Passing “abc” as in_letters and
“xyz” as out_letters would then transform the string “scan started” into “szxn stxrted”. We could
implement this in Python using the following code:
from typing import Any


def replace_letters(in_value: Any, in_parameters: dict) -> Any:
    """
    Replace letters in variables
    """
    # The arguments to the transformer (as given by the definition in the cfood)
    # are contained in `in_parameters`. We need to make sure they are set or
    # set their defaults otherwise:
    if "in_letters" not in in_parameters:
        raise RuntimeError("Parameter `in_letters` missing.")
    if "out_letters" not in in_parameters:
        raise RuntimeError("Parameter `out_letters` missing.")
    l_in = in_parameters["in_letters"]
    l_out = in_parameters["out_letters"]
    if len(l_in) != len(l_out):
        raise RuntimeError("`in_letters` and `out_letters` must have the same length.")
    # Replacements are applied one after another, in the order given.
    for l1, l2 in zip(l_in, l_out):
        in_value = in_value.replace(l1, l2)
    return in_value
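Because the loop in replace_letters applies its replacements one after another, letter sets that overlap interact: with “ab” as in_letters and “ba” as out_letters, every “a” first becomes “b” and then all of them become “a”. If a simultaneous mapping is wanted, one possible variant uses Python's str.translate. This is a sketch with the same signature; the name replace_letters_simultaneous is chosen here for illustration only.

```python
from typing import Any


def replace_letters_simultaneous(in_value: Any, in_parameters: dict) -> Any:
    """Variant of `replace_letters` that maps all letters in a single pass."""
    if "in_letters" not in in_parameters:
        raise RuntimeError("Parameter `in_letters` missing.")
    if "out_letters" not in in_parameters:
        raise RuntimeError("Parameter `out_letters` missing.")
    l_in = in_parameters["in_letters"]
    l_out = in_parameters["out_letters"]
    if len(l_in) != len(l_out):
        raise RuntimeError("`in_letters` and `out_letters` must have the same length.")
    # str.maketrans builds a character-to-character table; translate applies
    # the whole table at once, so earlier replacements cannot affect later ones.
    return in_value.translate(str.maketrans(l_in, l_out))


params = {"in_letters": "abc", "out_letters": "xyz"}
print(replace_letters_simultaneous("scan started", params))  # szxn stxrted
swap = {"in_letters": "ab", "out_letters": "ba"}
print(replace_letters_simultaneous("ab", swap))  # ba
```

For non-overlapping letter sets such as “abc” and “xyz”, both versions give the same result.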
The module containing replace_letters must be importable when the crawler runs. One possibility is to install the package into the same virtual environment that is used to run the crawler.
Then, the transformer needs to be registered in the cfood. In this example, the function
replace_letters is in a file called replace_letters.py, which is stored in a package
called utilities.
---
metadata:
  crawler-version: 0.10.2
  macros:
---
Converters:  # put custom converters here
Transformers:
  replace_letters:  # This name will be made available in the cfood
    function: replace_letters
    package: utilities.replace_letters
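The package and function keys together identify a Python callable. A plausible way to resolve such a pair is via importlib, as in the following sketch; load_transformer is a hypothetical helper, and whether the crawler resolves the entry in exactly this way is an assumption.

```python
import importlib
from typing import Any, Callable


def load_transformer(package: str, function: str) -> Callable[..., Any]:
    """Import the module named by `package` and return its attribute `function`."""
    module = importlib.import_module(package)
    return getattr(module, function)


# For the cfood entry above, the call would be
#   load_transformer("utilities.replace_letters", "replace_letters")
# A standard-library function stands in here so that the snippet runs anywhere.
basename = load_transformer("os.path", "basename")
print(basename("/data/experiment/report.md"))  # report.md
```

If the import fails with ModuleNotFoundError in such a check, the package is most likely not installed in the environment that runs the crawler.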
The transformer can then be used in a converter:
Experiment:
  type: Dict
  match: ".*"
  transform:
    replace_letters:
      in: $a
      out: $b
      functions:
      - replace_letters:  # This is the name of our custom transformer
          in_letters: "abc"
          out_letters: "xyz"
  records:
    Report:
      tags: $b