Crawler Concept

Note

This new documentation page has not yet been fully reviewed and may be incomplete.

The data model

For each component in eLabFTW, a corresponding RecordType with fitting properties is created, with related components inheriting from shared base classes. Most Records are identified using their eLab ID and the eLab instance they were crawled from, so that several eLab instances can be synchronized to the same LinkAhead instance.

Overview of relevant RecordTypes

The following class diagram provides an overview of the most relevant RecordTypes created in LinkAhead. Some names are shortened for easier display, and the eLab-specific prefix has been removed.

        classDiagram
    BaseItem <|-- Compound
    BaseItem <|-- ComplexItem
    ComplexItem <|-- Experiment
    ComplexItem <|-- Resource
    ComplexItem <|-- Template
    BaseItem <|-- Status
    Status <|-- ExpStatus
    Status <|-- ResStatus
    Status <|-- ExpCategory
    BaseItem <|-- Category
    Category <|-- ResCategory

    BaseItem : eLabInstance Instance
    BaseItem : INTEGER externalID
    ComplexItem : TEXT url
    ComplexItem : TEXT mainText
    ComplexItem : LIST<Experiment> experiments
    ComplexItem : LIST<Resource> resources
    ComplexItem : LIST<Compound> compounds
    ComplexItem : LIST<Tag> tags
    ComplexItem : LIST<CustomField> extraFields
    ComplexItem : LIST<ExperimentStep> steps
    ComplexItem : LIST<FILE> files
    ComplexItem : INTEGER rating
    Resource : TEXT UniqueID
    Resource : ResStatus Status
    Resource : ResCategory Category
    Experiment : TEXT UniqueID
    Experiment : ExpStatus Status
    Experiment : ExpCategory Category
    Template : ExpStatus Status
    Template : ExpCategory Category
    
    Status : TEXT color
    Status : BOOLEAN is_default
    Category : TEXT color
    ResCategory : ResStatus default_status
    Compound : DATETIME date
namespace eLab Items {
    class Compound
    class Experiment
    class Resource
}
namespace eLab Structure {
    class Template
    class ExpStatus
    class ResStatus
    class ExpCategory
    class ResCategory
}

The crawl process

Crawling data from eLabFTW to LinkAhead is done in three steps.

First, the data is retrieved from eLab through its API, accessed via a Python client. This data includes the following eLab objects: Experiments, Resources, Templates, Experiment categories, Resource categories, Experiment statuses, Resource statuses, Compounds, Tags, and Users. This is done in the crawl_elab method of crawl.py. The retrieved objects are then saved as JSON data to a set of files in a shared data folder using the save_data method.
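The save step can be pictured with a minimal sketch. It is not the crawler's actual save_data implementation: the function name save_objects_as_json, the one-file-per-object-type layout, and the shape of the input dictionary are all assumptions made for illustration.

```python
import json
from pathlib import Path


def save_objects_as_json(objects_by_type, data_dir):
    """Write one JSON file per eLab object type into data_dir.

    Hypothetical helper: `objects_by_type` maps a type name such as
    "experiments" to a list of plain dicts, as an API client might return them.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for type_name, objects in objects_by_type.items():
        target = data_dir / f"{type_name}.json"
        with target.open("w", encoding="utf-8") as handle:
            # ensure_ascii=False keeps non-ASCII text readable in the files
            json.dump(objects, handle, indent=2, ensure_ascii=False)
        written.append(target)
    return written
```

Keeping the retrieved objects on disk separates retrieval from entity generation, so the second step can be rerun without querying eLab again.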

In the second step, the created JSON files are crawled, and the corresponding LinkAhead entities for the eLab objects are generated from them. In the last step, these entities are uploaded to the server in the sync_to_linkahead method, using the LinkAhead Python client library. Because all RecordTypes used inherit from BaseItem and can be uniquely identified by their RecordType, eLabInstance, and externalID, existing entries are updated instead of duplicated.
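The update-instead-of-duplicate behaviour can be illustrated with a small in-memory sketch. It does not use the LinkAhead client and is not how sync_to_linkahead is implemented; the dictionary store, the upsert function, and the field names record_type, elab_instance, and external_id are hypothetical stand-ins for the three identifying values named above.

```python
def upsert(store, record):
    """Insert `record`, or update the stored entry with the same identity.

    The identity is the triple (record type, eLab instance, external ID).
    `store` is a plain dict standing in for the server (an assumption).
    """
    key = (record["record_type"], record["elab_instance"], record["external_id"])
    action = "update" if key in store else "insert"
    # Merge so that fields present in the new record overwrite the old ones
    store[key] = {**store.get(key, {}), **record}
    return action


if __name__ == "__main__":
    store = {}
    draft = {"record_type": "Experiment", "elab_instance": "lab-a",
             "external_id": 7, "title": "Draft"}
    print(upsert(store, draft))                                # first crawl
    print(upsert(store, {**draft, "title": "Final"}))          # same identity
    print(upsert(store, {**draft, "elab_instance": "lab-b"}))  # other instance
    print(len(store))
```

Because the eLab instance is part of the key, an Experiment with the same external ID from a second instance becomes a separate entry rather than overwriting the first.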

While the content of a new Record is generally copied unchanged from the corresponding eLab object, certain properties are processed first. Most notably, the HTML content of the Experiment main text property is cleaned by stripping images and script tags as well as style attributes. Additionally, the current crawler version has some limitations; for example, custom fields are synchronized as text data rather than as their native data type. As a result, bidirectional syncing is not yet possible with the current version of the crawler.
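The kind of cleaning described above can be sketched with Python's standard html.parser module. This is an illustration under assumptions, not the crawler's own cleaning code: the names MainTextCleaner and clean_main_text are hypothetical, and the crawler may well use a different library or different rules.

```python
from html import escape
from html.parser import HTMLParser


class MainTextCleaner(HTMLParser):
    """Re-emit HTML while dropping <img>, <script> blocks and style attributes.

    Comments and doctype declarations are also dropped, because their
    handlers are not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._script_depth = 0

    def _emit_tag(self, tag, attrs, closing=">"):
        kept = [(k, v) for k, v in attrs if k != "style"]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self._parts.append(f"<{tag}{rendered}{closing}")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_depth += 1
        elif tag != "img" and not self._script_depth:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag not in ("img", "script") and not self._script_depth:
            self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        if tag == "script":
            self._script_depth = max(0, self._script_depth - 1)
        elif tag != "img" and not self._script_depth:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside <script> is discarded; other text is re-escaped
        if not self._script_depth:
            self._parts.append(escape(data, quote=False))

    def result(self):
        return "".join(self._parts)


def clean_main_text(html_text):
    """Return html_text without images, scripts and inline styles."""
    cleaner = MainTextCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return cleaner.result()
```

For example, `clean_main_text('<p style="color:red" class="x">Hi <img src="a.png"> there</p>')` keeps the paragraph and its class attribute but drops the image and the inline style.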

Possible extensions

If the content of the crawled Experiment main text properties always adheres to a common structure, it can also be crawled with the built-in XML converter by adjusting the cfood.yml and model.yml in use. For example, if the experiment description always follows the default eLab structure and contains only the headers Goal, Procedure, and Results, the description could be split into three corresponding properties. This would make it possible to search for specific keywords in Procedure only.
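The idea of splitting a description by its headers can be sketched in plain Python. This is not the XML-converter and cfood.yml approach described above, only an illustration of the resulting split. It assumes that each section is introduced by an `<h1>` heading whose text is the section name, optionally followed by a colon; the names SectionSplitter and split_main_text are hypothetical.

```python
from html.parser import HTMLParser

SECTION_NAMES = ("Goal", "Procedure", "Results")


class SectionSplitter(HTMLParser):
    """Collect the plain text that follows each known <h1> heading."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = {name: [] for name in SECTION_NAMES}
        self._in_heading = False
        self._heading_text = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_heading = True
            self._heading_text = []

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_heading = False
            title = "".join(self._heading_text).strip().rstrip(":").strip()
            # Text under an unknown heading is ignored until the next known one
            self._current = title if title in self.sections else None

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)
        elif self._current is not None:
            self.sections[self._current].append(data)


def split_main_text(html_text):
    """Return a dict mapping each section name to its whitespace-normalized text."""
    splitter = SectionSplitter()
    splitter.feed(html_text)
    splitter.close()
    return {
        name: " ".join("".join(parts).split())
        for name, parts in splitter.sections.items()
    }
```

With the three texts stored as separate properties, a keyword check such as `"buffer" in split_main_text(text)["Procedure"]` only matches the procedure and ignores the goal and results.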