---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: [ ]
---

```{tags} how-to
```

# Crawler Concept

:::{note}
This new documentation page has not yet been fully reviewed and may be incomplete.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/81
% TODO: This is an explanation page. There is currently no real good place to put this with the
% TODO: current folder structure, but it should be moved as soon as we organize the crawler docs.

## The datamodel

For each component in eLabFTW, a corresponding {term}`RecordType` with fitting {term}`properties ` is created, with related components inheriting from shared base classes. Most {term}`Records ` are identified using their eLab ID and the eLab instance they were crawled from, so that several eLab instances can be synchronized to the same LinkAhead instance.

### Overview of relevant RecordTypes

The following class diagram provides an overview of the most relevant RecordTypes created in LinkAhead. Some names may be shortened for easier display, and the eLab-specific prefix has been removed.

```mermaid
:zoom:

classDiagram
    BaseItem <|-- Compound
    BaseItem <|-- ComplexItem
    ComplexItem <|-- Experiment
    ComplexItem <|-- Resource
    ComplexItem <|-- Template
    BaseItem <|-- Status
    Status <|-- ExpStatus
    Status <|-- ResStatus
    Status <|-- ExpCategory
    BaseItem <|-- Category
    Category <|-- ResCategory
    BaseItem : eLabInstance Instance
    BaseItem : INTEGER externalID
    ComplexItem : TEXT url
    ComplexItem : TEXT mainText
    ComplexItem : LIST experiments
    ComplexItem : LIST resources
    ComplexItem : LIST compounds
    ComplexItem : LIST tags
    ComplexItem : LIST extraFields
    ComplexItem : LIST steps
    ComplexItem : LIST files
    ComplexItem : INTEGER rating
    Resource : TEXT UniqueID
    Resource : ResStatus Status
    Resource : ResCategory Category
    Experiment : TEXT UniqueID
    Experiment : ExpStatus Status
    Experiment : ExpCategory Category
    Template : ExpStatus Status
    Template : ExpCategory Category
    Status : TEXT color
    Status : BOOLEAN is_default
    Category : TEXT color
    ResCategory : ResourceStatus default_status
    Compound : DATETIME date
    namespace eLab Items {
        class Compound
        class Experiment
        class Resource
    }
    namespace eLab Structure {
        class Template
        class ExpStatus
        class ResStatus
        class ExpCategory
        class ResCategory
    }
```

## The crawl process

Crawling data from eLabFTW to LinkAhead is done in three steps. First, the data is retrieved from eLab through its {term}`API`, accessed via a Python client. This data includes the following eLab objects: Experiments, Resources, Templates, Experiment categories, Resource categories, Experiment status, Resource status, Compounds, Tags, and Users. This is done in the `crawl_elab` method of `crawl.py`. The retrieved objects are then saved as JSON data to a set of files in a shared data folder using the `save_data` method. In the second step, the created JSON files are crawled, and corresponding LinkAhead {term}`entities ` for the eLab objects are generated from them.
In the last step, these entities are uploaded to the server in the `sync_to_linkahead` method, using the LinkAhead Python client library. As all used RecordTypes inherit from BaseItem and can be uniquely identified using their RecordType, eLabInstance and externalID, existing entries are updated instead of duplicated.

While the content of a new Record is generally copied unchanged from the corresponding eLab object, certain properties are processed. Most notably, the HTML content of the Experiment main text property is cleaned by stripping images and script tags as well as style attributes. Additionally, the current {term}`crawler ` version has some limitations; for example, custom fields are synchronized as text data rather than in their native data type. This means that bidirectional syncing is not yet possible with the current version of the crawler.

## Possible extensions

If the content of the crawled Experiment main text properties always adheres to a common structure, it can also be crawled using the built-in XML converter, by adjusting the used `cfood.yml` and `model.yml`. If, for example, the experiment description always adheres to the default eLab structure and only has the headers `Goal`, `Procedure`, and `Results`, it would be possible to split the description into three separate corresponding properties, so that searching for specific keywords only in `Procedure`, for example, becomes possible.
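
To illustrate the splitting idea described above, the following is a minimal sketch in plain Python. It is not the crawler's XML converter and not a `cfood.yml` definition; it only shows the logic of separating a description into per-header sections. The class `SectionSplitter`, the function `split_main_text`, and the assumption that the headers appear as `<h1>` elements (optionally followed by a colon) are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Header names taken from the example in the text above.
SECTION_HEADERS = ("Goal", "Procedure", "Results")


class SectionSplitter(HTMLParser):
    """Collect the text that follows each known <h1> header (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.sections = {name: [] for name in SECTION_HEADERS}
        self._in_header = False
        self._header_text = []
        self._current = None  # name of the section currently being filled

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_header = True
            self._header_text = []

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_header = False
            # Tolerate variants such as "Goal :" or "Goal:".
            title = "".join(self._header_text).strip().rstrip(":").strip()
            # Text below an unknown header is ignored.
            self._current = title if title in self.sections else None

    def handle_data(self, data):
        if self._in_header:
            self._header_text.append(data)
        elif self._current is not None and data.strip():
            self.sections[self._current].append(data.strip())


def split_main_text(html):
    """Return a dict mapping each header name to the plain text below it."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.sections.items()}


# Made-up example description, for illustration only.
example = (
    "<h1>Goal :</h1><p>Measure the melting point.</p>"
    "<h1>Procedure :</h1><p>Heat the sample slowly.</p>"
    "<p>Record the temperature.</p>"
    "<h1>Results :</h1><p>Melting point at 54 C.</p>"
)
sections = split_main_text(example)
print(sections["Procedure"])
```

Each of the three resulting strings could then be stored in its own property, which is what would make a keyword search restricted to `Procedure` possible. In the crawler itself, the equivalent mapping would be expressed through the converter configuration rather than through custom code like this.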