---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial
```

# Further converters

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised. There might be inconsistencies or errors when using it with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/81
% TODO: Much of this information can probably be moved to the docstrings.

More converters, together with cfood definitions and examples, can be found in the [LinkAhead Crawler Extensions Subgroup](https://gitlab.com/linkahead/crawler-extensions) on GitLab. In the following, we list converters that are shipped with the {term}`crawler` library itself but are not part of the set of standard converters and may require the library to be installed with additional optional dependencies.

## HDF5 Converters

For treating [HDF5 Files](https://www.hdfgroup.org/solutions/hdf5/), there are four individual converters corresponding to the internal structure of HDF5 files: the [H5FileConverter](#h5fileconverter) opens the file itself and creates further structure elements from HDF5 groups, datasets, and included multidimensional arrays. These are in turn treated by the [H5GroupConverter](#h5groupconverter), the [H5DatasetConverter](#h5datasetconverter), and the [H5NdarrayConverter](#h5ndarrayconverter), respectively. To use these converters, you need to install the LinkAhead crawler with its `h5-crawler` dependency option (`pip install caoscrawler[h5-crawler]`).

The basic idea when crawling HDF5 files is to treat them very similarly to [dictionaries](standard_converters.md#dictelement-converter), in which the attributes on root, group, or dataset level are essentially treated like `BooleanElement`, `TextElement`, `FloatElement`, and `IntegerElement` in a dictionary: They are appended as children and can be accessed via the subtree.
The file itself and the groups within may contain further groups and datasets, which can have their own attributes, subgroups, and datasets, very much like `DictElements` within a dictionary.

The main difference from any other dictionary type is the presence of multidimensional arrays within HDF5 datasets. Since LinkAhead doesn't have any datatype corresponding to these, and since it isn't desirable to store these arrays directly within LinkAhead for reasons of performance and searchability, we wrap them within a {term}`Record` as explained [below](#h5ndarrayconverter), together with their metadata and their internal path within the HDF5 file. This means users can query for datasets and their arrays according to their metadata within LinkAhead and then use the internal path information to access the dataset within the file directly.

The type of this record and the {term}`property` for storing the internal path need to be reflected in the schema. Using the default names, you would need a schema like

```yaml
H5Ndarray:
  obligatory_properties:
    internal_hdf5_path:
      datatype: TEXT
```

although the names of both the property and the record type can be configured within the cfood definition.

A simple example of a cfood definition for HDF5 files can be found in the [unit tests](https://gitlab.com/linkahead/linkahead-crawler/-/blob/main/unittests/h5_cfood.yml?ref_type=heads). It shows how the individual converters are used to crawl a [simple example file](https://gitlab.com/linkahead/linkahead-crawler/-/blob/main/unittests/hdf5_dummy_file.hdf5?ref_type=heads) containing groups, subgroups, and datasets, together with their respective attributes.

### H5FileConverter

This is an extension of the {py:class}`~caoscrawler.converters.converters.SimpleFileConverter` class. It opens the HDF5 file and creates children for any contained group or dataset. Additionally, the root-level attributes of the HDF5 file are accessible as children.

### H5GroupConverter

This is an extension of the {py:class}`~caoscrawler.converters.converters.DictElementConverter` class. Children are created for all subgroups and datasets in this HDF5 group. Additionally, the group-level attributes are accessible as children.

### H5DatasetConverter

This is an extension of the {py:class}`~caoscrawler.converters.converters.DictElementConverter` class. Most importantly, it stores the array data of the HDF5 dataset in an {py:class}`~caoscrawler.converters.hdf5_converter.H5NdarrayElement`, which is added to its children together with the dataset attributes.

### H5NdarrayConverter

This converter creates a wrapper record for the contained dataset. The name of this record needs to be specified in the cfood definition of this converter via the `recordname` option. The {term}`RecordType` of this record can be configured with the `array_recordtype_name` option and defaults to `H5Ndarray`. Via the given `recordname`, this record can be used within the cfood. Most importantly, this record stores the internal path of this array within the HDF5 file in a text property, the name of which can be configured with the `internal_path_property_name` option and defaults to `internal_hdf5_path`.

## ROCrateConverter

The ROCrateConverter unpacks ro-crate files and creates one instance of the `ROCrateEntity` structure element for each contained object. Currently, only zipped ro-crate files are supported. The created ROCrateEntities wrap a `rocrate.model.entity.Entity` together with a path to the folder in which the ROCrate data is saved. They are appended as children and can then be accessed via the subtree and treated using the [ROCrateEntityConverter](#rocrateentityconverter).

To use the ROCrateConverter, you need to install the LinkAhead crawler with its optional `rocrate` dependency.

### ELNFileConverter

As .eln files are zipped ro-crate files, the ELNFileConverter works analogously to the ROCrateConverter and also creates ROCrateEntities for contained objects.

### ROCrateEntityConverter

The ROCrateEntityConverter unpacks the `rocrate.model.entity.Entity` wrapped within a ROCrateEntity, and appends all properties, contained files, and parts as children. Properties are converted to a basic element matching their value (`BooleanElement`, `IntegerElement`, etc.) and can be matched using `match_properties`. Each `rocrate.model.file.File` is converted to a crawler File object, which can be matched with `SimpleFile`. Each subpart of the ROCrateEntity is likewise converted to a ROCrateEntity, which can then again be treated using this converter.

The `match_entity_type` keyword can be used to match a ROCrateEntity using its `entity_type`. With the `match_properties` keyword, properties of a ROCrateEntity can be either matched or extracted, as shown in the example cfood below.

### Example cfood

A short cfood that generates records for each .eln file in a directory and for its metadata file could look like this:

```yaml
---
metadata:
  crawler-version: 0.9.0
---
Converters:
  ELNFile:
    converter: ELNFileConverter
    package: caoscrawler.converters.rocrate
  ROCrateEntity:
    converter: ROCrateEntityConverter
    package: caoscrawler.converters.rocrate

ParentDirectory:
  type: Directory
  match: (.*)
  subtree:
    ELNFile:
      type: ELNFile
      match: (?P<filename>.*)\.eln
      records:
        ELNExampleRecord:
          filename: $filename
      subtree:
        ROCrateEntity:
          type: ROCrateEntity
          match_properties:
            "@id": ro-crate-metadata.json
            dateCreated: (?P<dateCreated>.*)
          records:
            MDExampleRecord:
              parent: $ELNFile
              filename: ro-crate-metadata.json
              time: $dateCreated
```

With `match_properties: "@id": ro-crate-metadata.json`, the ROCrateEntities are filtered so that only the metadata json files match. With `match_properties: dateCreated: (?P<dateCreated>.*)`, the `dateCreated` entry of that metadata json file is extracted and made accessible through the `dateCreated` variable. The example could then be extended to use any other entry present in the metadata json to filter the results, or to insert the extracted information into generated records.
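
The variables `filename` and `dateCreated` used in this example come from named groups in regular expressions. The following Python sketch illustrates only that regex mechanism with the standard `re` module. It is not the crawler's implementation, and the file name and date value are made-up sample inputs.

```python
import re

# Patterns with named groups, as used for `match` and `match_properties`.
FILENAME_PATTERN = r"(?P<filename>.*)\.eln"
DATE_PATTERN = r"(?P<dateCreated>.*)"

# Hypothetical sample inputs, chosen only for illustration.
sample_file = "experiment_01.eln"
sample_date = "2024-05-17T09:30:00+00:00"

file_match = re.match(FILENAME_PATTERN, sample_file)
date_match = re.match(DATE_PATTERN, sample_date)

# groupdict() maps each group name to the text that the group captured.
variables = {}
if file_match is not None:
    variables.update(file_match.groupdict())
if date_match is not None:
    variables.update(date_match.groupdict())

print(variables)
```

In this sketch, `variables` ends up with the keys `filename` and `dateCreated`, which are the names the cfood refers to as `$filename` and `$dateCreated`.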