Crawler Concept#
Note
This new documentation page has not yet been fully reviewed and may be incomplete.
The datamodel#
For each component in eLabFTW, a corresponding RecordType with fitting properties is created, with related components inheriting from shared base classes. Most Records are identified using their eLab ID and the eLab instance they were crawled from, so that several eLab instances can be synchronized to the same LinkAhead instance.
Overview of relevant RecordTypes#
The following class diagram provides an overview of the most relevant RecordTypes created in LinkAhead. Some names are shortened for easier display, and the eLab-specific prefix has been removed.
classDiagram
BaseItem <|-- Compound
BaseItem <|-- ComplexItem
ComplexItem <|-- Experiment
ComplexItem <|-- Resource
ComplexItem <|-- Template
BaseItem <|-- Status
Status <|-- ExpStatus
Status <|-- ResStatus
Status <|-- ExpCategory
BaseItem <|-- Category
Category <|-- ResCategory
BaseItem : eLabInstance Instance
BaseItem : INTEGER externalID
ComplexItem : TEXT url
ComplexItem : TEXT mainText
ComplexItem : LIST<Experiment> experiments
ComplexItem : LIST<Resource> resources
ComplexItem : LIST<Compound> compounds
ComplexItem : LIST<Tag> tags
ComplexItem : LIST<CustomField> extraFields
ComplexItem : LIST<ExperimentStep> steps
ComplexItem : LIST<FILE> files
ComplexItem : INTEGER rating
Resource : TEXT UniqueID
Resource : ResStatus Status
Resource : ResCategory Category
Experiment : TEXT UniqueID
Experiment : ExpStatus Status
Experiment : ExpCategory Category
Template : ExpStatus Status
Template : ExpCategory Category
Status : TEXT color
Status : BOOLEAN is_default
Category : TEXT color
ResCategory : ResStatus default_status
Compound : DATETIME date
namespace eLab Items {
class Compound
class Experiment
class Resource
}
namespace eLab Structure {
class Template
class ExpStatus
class ResStatus
class ExpCategory
class ResCategory
}
The crawl process#
Crawling data from eLabFTW to LinkAhead is done in three steps.
First, the data is retrieved from eLab through its API, accessed via a Python client. The retrieved data includes the following eLab objects: Experiments, Resources, Templates, Experiment categories, Resource categories, Experiment status, Resource status, Compounds, Tags, and Users. This is done in the crawl_elab method of crawl.py. The retrieved objects are then saved as JSON data to a set of files in a shared data folder using the save_data method.
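The saving part of this first step can be illustrated with a minimal sketch. It is not the crawler's actual save_data implementation: the function name save_data_sketch, its signature, and the layout of one JSON file per object type are assumptions made for illustration, and the API retrieval itself is left out.

```python
import json
from pathlib import Path


def save_data_sketch(objects_by_type, data_dir):
    """Write one JSON file per eLab object type into data_dir.

    objects_by_type maps a type name (e.g. "experiments") to a list of
    plain dictionaries, as an API client might return them. Both the
    mapping and the file layout are assumptions of this sketch.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for type_name, objects in objects_by_type.items():
        target = data_dir / f"{type_name}.json"
        with target.open("w", encoding="utf-8") as handle:
            json.dump(objects, handle, indent=2, ensure_ascii=False)
        written.append(target)
    return written
```

Keeping the retrieved objects in files decouples the retrieval from the later crawl of those files, so the second step can be rerun without contacting the eLab instance again.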
In the second step, the created JSON files are crawled, and corresponding LinkAhead entities are generated for the eLab objects. In the last step, these entities are uploaded to the server in the sync_to_linkahead method, using the LinkAhead Python client library. As all RecordTypes in use inherit from BaseItem, each Record can be uniquely identified by its RecordType, eLabInstance, and externalID, so existing entries are updated instead of duplicated.
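The update-instead-of-duplicate behaviour can be sketched in plain Python. The sketch does not use the LinkAhead client library and is not how sync_to_linkahead is implemented. The dictionary keys record_type, elab_instance and external_id, as well as the in-memory store, are hypothetical stand-ins for the identifying properties named above.

```python
def identity_key(record):
    """Return the triple that identifies a record across eLab instances."""
    return (record["record_type"], record["elab_instance"], record["external_id"])


def sync_sketch(existing, crawled):
    """Insert new records and update known ones in place.

    existing maps identity keys to records that are already stored;
    crawled is the list of records produced by the current crawl.
    Returns the number of inserted and of updated records.
    """
    inserted = updated = 0
    for record in crawled:
        key = identity_key(record)
        if key in existing:
            existing[key].update(record)  # same identity: update, no duplicate
            updated += 1
        else:
            existing[key] = dict(record)
            inserted += 1
    return inserted, updated


store = {}
first = [{"record_type": "Experiment", "elab_instance": "lab-a", "external_id": 7, "title": "Draft"}]
second = [
    {"record_type": "Experiment", "elab_instance": "lab-a", "external_id": 7, "title": "Final"},
    {"record_type": "Experiment", "elab_instance": "lab-b", "external_id": 7, "title": "Other lab"},
]
sync_sketch(store, first)   # inserts one record
sync_sketch(store, second)  # updates the first record, inserts the one from lab-b
```

Because the eLab instance is part of the key, two experiments with the same externalID from different instances remain separate entries.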
While the content of a new Record is generally copied unchanged from the corresponding eLab object, some properties are processed. Most notably, the HTML content of the Experiment main text property is cleaned by stripping images and script tags as well as style attributes. Additionally, the current crawler version has some limitations; for example, custom fields are synchronized as text data rather than in their native data type. This means that bidirectional syncing is not yet possible with the current version of the crawler.
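The kind of cleaning described for the main text can be sketched with the standard library's html.parser. This only illustrates the idea and is not the crawler's own code: the class MainTextCleaner and the function clean_main_text are hypothetical names, and the sketch additionally drops HTML comments, which the description above does not mention.

```python
from html import escape
from html.parser import HTMLParser


class MainTextCleaner(HTMLParser):
    """Re-emit HTML without <img> tags, <script> elements and style attributes."""

    def __init__(self):
        # convert_charrefs=False passes entities such as &amp; through untouched.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def _emit_tag(self, tag, attrs, end=">"):
        rendered = ""
        for name, value in attrs:
            if name == "style":
                continue  # drop inline styling
            if value is None:
                rendered += f" {name}"
            else:
                # The parser unescapes attribute values, so escape them again.
                rendered += f' {name}="{escape(value, quote=True)}"'
        self.parts.append(f"<{tag}{rendered}{end}")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif tag != "img" and not self.in_script:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag not in ("img", "script") and not self.in_script:
            self._emit_tag(tag, attrs, end=" />")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif tag != "img" and not self.in_script:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:  # script bodies are discarded
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def clean_main_text(html_text):
    """Return html_text with images, scripts and style attributes removed."""
    cleaner = MainTextCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.parts)


raw = '<p style="color:red" class="x">Hi &amp; bye<img src="a.png"><script>alert(1)</script></p>'
print(clean_main_text(raw))  # <p class="x">Hi &amp; bye</p>
```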
Possible extensions#
If the content of the crawled Experiment main text properties always adheres to a common structure,
it can also be crawled using the built-in XML converter, by adjusting the cfood.yml and
model.yml in use. If, for example, the experiment description always adheres to the default eLab
structure and only has the headers Goal, Procedure, and Results, it would be possible to
split the description into three separate corresponding properties. Searching for specific
keywords only in Procedure, for example, would then become possible.
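The intended result of such a split can be illustrated in plain Python. This sketch does not use the XML converter or a cfood definition. It assumes that the section headers are h1 elements and that a trailing colon after the header text may be present. SectionSplitter and split_main_text are hypothetical names.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Collect the plain text below each <h1> heading of an HTML fragment."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: entities arrive decoded
        self.sections = {}
        self.current = None  # name of the section currently being filled
        self.in_heading = False
        self.heading_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.heading_text = ""

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False
            # "Goal :" and "Goal" both map to the section name "Goal".
            self.current = self.heading_text.strip().rstrip(":").strip()
            self.sections.setdefault(self.current, [])

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text += data
        elif self.current is not None and data.strip():
            # Simplification: markup is dropped and whitespace is normalized.
            self.sections[self.current].append(data.strip())


def split_main_text(html_text):
    """Return a mapping from section name to the text found in that section."""
    splitter = SectionSplitter()
    splitter.feed(html_text)
    splitter.close()
    return {name: " ".join(chunks) for name, chunks in splitter.sections.items()}


body = (
    "<h1>Goal :</h1><p>Measure pH</p>"
    "<h1>Procedure :</h1><p>Add <b>buffer</b> slowly</p>"
    "<h1>Results :</h1><p>pH 7.2</p>"
)
sections = split_main_text(body)
print("buffer" in sections["Procedure"])  # True
print("buffer" in sections["Results"])    # False
```

Each of the resulting entries would correspond to one of the three separate properties, which is what makes a keyword search restricted to a single section possible.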