---
last_review: "2025-03-13"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler, advanced-user
```

# Custom Converters

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised. There
might be inconsistencies or errors when using with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/81
% TODO: Tone, split into tutorial + other documents, Archive documentation if for old crawler

As mentioned in the previous tutorials, it is possible to create custom converters.
These custom converters can be used to integrate arbitrary data extraction and ETL capabilities
into the LinkAhead {term}`crawler` and make these extensions available to any yaml specification.

## Tell the crawler about a custom converter

To use a custom converter, it must be defined in the `Converters` section of the {term}`CFood`
yaml file. The basic syntax for adding a custom converter to a definition file is:

```yaml
Converters:
  <NameOfTheConverterInYamlFile>:
    package: <python>.<module>.<name>
    converter: <PythonClassName>
```

The Converters section can be put either into the first or into the second document of the cfood
yaml file. It can also be part of a single-document yaml cfood file. Please refer to
[the cfood tutorial](./cfood) for more details.

Details:

- **\<NameOfTheConverterInYamlFile\>**: This is the name of the converter as it is going to be
  used in the present yaml file.
- **\<python\>.\<module\>.\<name\>**: The name of the module where the converter class resides.
- **\<PythonClassName\>**: Within this specified module there must be a class inheriting from
  base class {py:class}`caoscrawler.converters.converters.Converter`.

## Implementing a custom converter

Converters inherit from the {py:class}`~caoscrawler.converters.converters.Converter` class.

The following methods are abstract and need to be overridden by your custom converter:

- {py:meth}`~caoscrawler.converters.converters.Converter.create_children`:
  Return a list of child {term}`StructureElement` objects.
- {py:meth}`~caoscrawler.converters.converters.Converter.match`
- {py:meth}`~caoscrawler.converters.converters.Converter.typecheck`

## Example

In the following, we will explain the process of adding a custom converter to a yaml file using
a SourceResolver that is able to attach a source element to another {term}`entity`.

First we will create our package and module structure, which might be:

```
scifolder_package/
  README.md
  setup.cfg
  setup.py
  Makefile
  tox.ini
  src/
    scifolder/
      __init__.py
      converters/
        __init__.py
        sources.py  # <- the actual file containing
                    #    the converter class
  doc/
  unittests/
```

Now we need to create a class called "SourceResolver" in the file "sources.py". In this more
advanced example, our converter will inherit from
{py:class}`~caoscrawler.converters.converters.TextElementConverter` rather than directly from
{py:class}`~caoscrawler.converters.converters.Converter`. This converter already implements
{py:meth}`~caoscrawler.converters.converters.Converter.match` and
{py:meth}`~caoscrawler.converters.converters.Converter.typecheck`, so we only have to provide an
implementation of {py:meth}`~caoscrawler.converters.converters.Converter.create_children`.
Furthermore, we will customize the method
{py:meth}`~caoscrawler.converters.converters.Converter.create_records`, which allows us to specify
a more complex {term}`record` generation procedure than the standard implementation provides. One
specific limitation of the standard implementation is that the yaml definition can only generate a
fixed number of records, so any application that requires an arbitrary number of records needs a
customized implementation of {py:meth}`~caoscrawler.converters.converters.Converter.create_records`.
This can be implemented using the {func}`caoscrawler.converters.converters.create_records`
function, which creates records from python dictionaries with the same structure as would be given
in a yaml definition, see next section below.

```python
import re
from caoscrawler.stores import GeneralStore, RecordStore
from caoscrawler.converters import TextElementConverter, create_records
from caoscrawler.structure_elements import StructureElement, TextElement


class SourceResolver(TextElementConverter):
    """
    This resolver uses a source list element (e.g. from the markdown readme file)
    to link sources correctly.
    """

    def __init__(self, definition: dict, name: str,
                 converter_registry: dict):
        """
        Initialize a new source resolver.
        """
        super().__init__(definition, name, converter_registry)

    def create_children(self, generalStore: GeneralStore,
                        element: StructureElement):

        # The source resolver does not create children:

        return []

    def create_records(self, values: GeneralStore,
                       records: RecordStore,
                       element: StructureElement,
                       file_path_prefix):
        if not isinstance(element, TextElement):
            raise RuntimeError("SourceResolver can only be applied to a TextElement.")

        # This function must return a list containing tuples, each one for a modified
        # property: (name_of_entity, name_of_property)
        keys_modified = []

        # This is the name of the entity where the source is going to be attached:
        attach_to_scientific_activity = self.definition["scientific_activity"]
        rec = records[attach_to_scientific_activity]

        # The "source" is a path to a source project, so it should have the form:
        # /<Category>/<project>/<scientific_activity>/
        # obtain this information from the structure element:
        val = element.value
        regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
                  '/(?P<project_date>.*?)_(?P<project_identifier>.*)'
                  '/(?P<date>[0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})(_(?P<identifier>.*))?/')

        res = re.match(regexp, val)
        if res is None:
            raise RuntimeError("Source cannot be parsed correctly.")

        # Mapping of categories on the file system to corresponding record types in CaosDB:
        cat_map = {
            "SimulationData": "Simulation",
            "ExperimentalData": "Experiment",
            "DataAnalysis": "DataAnalysis"}
        linkrt = cat_map[res.group("category")]

        keys_modified.extend(create_records(values, records, {
            "Project": {
                "date": res.group("project_date"),
                "identifier": res.group("project_identifier"),
            },
            linkrt: {
                "date": res.group("date"),
                "identifier": res.group("identifier"),
                "project": "$Project"
            },
            attach_to_scientific_activity: {
                "sources": "+$" + linkrt
            }}, file_path_prefix))

        # Process the records section of the yaml definition:
        keys_modified.extend(
            super().create_records(values, records, element, file_path_prefix))

        # The create_records function must return the modified keys to make it compatible
        # with the crawler functions:
        return keys_modified
```

If the recommended (python) package structure is used, the package containing the converter
definition can be installed using `pip install .` or `pip install -e .` from the
`scifolder_package` directory.

The following yaml block will register the converter in a yaml file:

```yaml
Converters:
  SourceResolver:
    package: scifolder.converters.sources
    converter: SourceResolver
```

## Using the `create_records` API function

The function {func}`caoscrawler.converters.converters.create_records` mentioned above is the
recommended way to create new records from custom converters. Let's have a look at the function
signature:

```python
def create_records(values: GeneralStore,  # <- pass the current variables store here
                   records: RecordStore,  # <- pass the current store of CaosDB records here
                   def_records: dict):    # <- This is the actual definition of new records!
    ...
```

`def_records` is the actual definition of new records according to the yaml cfood specification.
With it, you can do everything you could do in the yaml document, using python source code.
Let's have a look at a few examples:

```yaml
DirConverter:
  type: Directory
  match: (?P<dir_name>.*)
  records:
    Experiment:
      identifier: $dir_name
```

This block will create a new record with {term}`parent` `Experiment` and one {term}`property`
`identifier` with a value derived from the matching regular expression.

Let's formulate that using `create_records`:

```python
dir_name = "directory name"

record_def = {
    "Experiment": {
        "identifier": dir_name
    }
}

keys_modified = create_records(values, records,
                               record_def)
```

The `dir_name` is set explicitly here; everything else is identical to the yaml statements.

## The role of `keys_modified`

You probably have noticed already that {func}`caoscrawler.converters.converters.create_records`
returns `keys_modified`, which is a list of tuples. Each tuple in `keys_modified` has two
elements:

- Element 0 is the name of the record that is modified (as used in the record store `records`).
- Element 1 is the name of the property that is modified.

It is important that the correct list of modified keys is returned by
{py:meth}`~caoscrawler.converters.converters.Converter.create_records`.

So, a sketch of a typical implementation within a custom converter could look like this:

```python
def create_records(self, values: GeneralStore,
                   records: RecordStore,
                   element: StructureElement,
                   file_path_prefix: str):

    # Modify some records:
    record_def = {
        # ...
    }

    keys_modified = create_records(values, records,
                                   record_def)

    # You can of course do it multiple times:
    keys_modified.extend(create_records(values, records,
                                        record_def))

    # You can also process the records section of the yaml definition:
    keys_modified.extend(
        super().create_records(values, records, element, file_path_prefix))
    # This essentially allows users of your converter to customize the creation of records
    # by providing a custom "records" section in addition to the modifications provided
    # in this implementation of the Converter.

    # Important: Return the list of modified keys!
    return keys_modified
```

## More complex example

Let's have a look at a more complex example, defining multiple records:

```yaml
DirConverter:
  type: Directory
  match: (?P<dir_name>.*)
  records:
    Project:
      identifier: project_name
    Experiment:
      identifier: $dir_name
      Project: $Project
    ProjectGroup:
      projects: +$Project
```

This block will create two new records:

- A project with a constant identifier
- An experiment with an identifier derived from a regular expression, and a reference to the new
  project.

Furthermore, a record `ProjectGroup` will be edited (its initial definition is not given in the
yaml block): The project that was just created will be added as a list element to the property
`projects`.

Let's formulate that using `create_records` (again, `dir_name` is constant here):

```python
dir_name = "directory name"

record_def = {
    "Project": {
        "identifier": "project_name",
    },
    "Experiment": {
        "identifier": dir_name,
        "Project": "$Project",
    },
    "ProjectGroup": {
        "projects": "+$Project",
    }
}

keys_modified = create_records(values, records,
                               record_def)
```

## Debugging

You can add the key `debug_match` to the definition of a Converter in order to create debugging
output for the match step. The following snippet illustrates this:

```yaml
DirConverter:
  type: Directory
  match: (?P<dir_name>.*)
  debug_match: True
  records:
    Project:
      identifier: project_name
```

Whenever this Converter tries to match a StructureElement, it logs what is matched against what,
together with the result.
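
In addition to `debug_match`, it can help to try a `match` expression on sample input outside the crawler. The following is a minimal sketch using only Python's `re` module. It assumes that the crawler applies the expression with semantics similar to `re.match`; the pattern and the sample directory names are made up for illustration.

```python
import re

# Pattern in the style of a yaml `match` entry. The named groups are what
# would become variables such as $date and $identifier in a cfood.
pattern = r"(?P<date>[0-9]{4}-[0-9]{2}-[0-9]{2})(_(?P<identifier>.*))?"

# Hypothetical directory names, made up for this sketch.
samples = ["2020-01-03_velocity", "2020-01-03", "notes"]

results = {}
for name in samples:
    res = re.match(pattern, name)
    # Store the named groups for a match, or None if nothing matched.
    results[name] = None if res is None else res.groupdict()
    print(name, "->", results[name])
```

Checking a few representative names this way shows quickly whether optional groups, such as `identifier` here, behave as intended before the expression is put into a cfood.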