---
last_review: "2025-01-01"
last_reviewer: "-"
documented_code: []
---

```{tags} tutorial, crawler
```

# Crawler Tutorial: Scientific Folder Structure

:::{note}
This page has been migrated from the old documentation, and has not yet been fully revised. There
might be inconsistencies or errors when using it with current LinkAhead versions.
:::

% TODO: Issue: https://gitlab.indiscale.com/caosdb/src/linkahead-docs/-/issues/81
% TODO: Archive documentation if for old crawler. The scifolder_tutorial.tar.gz file is missing,
% TODO: link is to an empty file - replace!

## The SciFolder structure

Let's walk through a more elaborate example of using the LinkAhead {term}`Crawler`, this time
making use of a simple directory structure. We assume the directory structure to have the following
form:

```text
ExperimentalData/
  2022_ProjectA/
    2022-02-17_TestDataset/
      file1.dat
      file2.dat
      ...
    ...
  2023_ProjectB/
    ...
  ...
```

This file structure is described in the article "Guidelines for a Standardized Filesystem Layout
for Scientific Data".

As a simplified example, we want to write a crawler that creates "Project" and "Measurement"
{term}`records ` in LinkAhead and sets some reasonable {term}`properties ` derived from the file
and directory names. Furthermore, we want to link the data files to the measurement records.

Let's first clarify the terms we are using:

```text
ExperimentalData/             <--- Category level (level 0)
  2022_ProjectA/              <--- Project level (level 1)
    2022-02-17_TestDataset/   <--- Activity / Measurement level (level 2)
      file1.dat               <--- Files on level 3
      file2.dat
      ...
    ...
  2023_ProjectB/              <--- Project level (level 1)
    ...
  ...
```

This layout follows the three-level folder structure described in the article. We use the term
"Activity level" here, instead of the terms used in the article, as it can be used in a more
general way.
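To make the mapping from folder names to records more concrete, here is a minimal Python sketch that walks such a tree and collects the information a "Project" and a "Measurement" record could carry. It is not the LinkAhead crawler and does not talk to a server. The function name `scan_scifolder`, the two regular expressions, and the keys of the resulting dictionaries are assumptions made for illustration, based only on the example names above (`2022_ProjectA`, `2022-02-17_TestDataset`).

```python
import re
from pathlib import Path

# Hypothetical patterns, derived from the example folder names in this tutorial.
# The real crawler takes its matching rules from a CFood, not from this code.
PROJECT_RE = re.compile(r"^(?P<year>\d{4})_(?P<name>.+)$")
MEASUREMENT_RE = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})(?:_(?P<identifier>.+))?$")


def scan_scifolder(category_dir):
    """Return one dict per measurement folder found below the category level."""
    measurements = []
    # Level 1: project folders such as "2022_ProjectA".
    for project_dir in sorted(Path(category_dir).iterdir()):
        project = PROJECT_RE.match(project_dir.name)
        if not (project_dir.is_dir() and project):
            continue
        # Level 2: activity / measurement folders such as "2022-02-17_TestDataset".
        for activity_dir in sorted(project_dir.iterdir()):
            activity = MEASUREMENT_RE.match(activity_dir.name)
            if not (activity_dir.is_dir() and activity):
                continue
            measurements.append({
                "project": project["name"],
                "year": int(project["year"]),
                "date": activity["date"],
                "identifier": activity["identifier"],
                # Level 3: the data files that would be linked to the measurement.
                "files": sorted(p.name for p in activity_dir.iterdir() if p.is_file()),
            })
    return measurements
```

Calling `scan_scifolder("ExperimentalData")` on the tree above would yield one entry per measurement folder. The crawler performs a comparable name-based extraction, but additionally creates or updates the corresponding records in LinkAhead.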

## A CFood for SciFolder

The following YAML {term}`CFood` is able to match and insert / update the records accordingly,
with a detailed explanation of the YAML definitions:

:::{figure} /.assets/images/tutorials/crawler/example_crawler.svg
:::

## See for yourself

If you want to try this out for yourself, you will need the following content:

- Data files in a SciFolder structure.
- A data model which describes the data.
- An identifiables definition which describes how data {term}`Entities ` can be identified.
- A CFood definition which the crawler uses to map from the folder structure to entities in
  LinkAhead.

You can download all the necessary files, packed in
[scifolder_tutorial.tar.gz](/.assets/data/scifolder_tutorial.tar.gz). After storing this archive
file, unpack it and go into the `scifolder` directory, then follow these steps:

% Create inline shell role with correct syntax highlighting
```{eval-rst}
.. role:: shell(code)
   :language: shell
```

1. Copy the data files folder to the `extroot` directory of your LinkAhead installation:
   {shell}`cp -r scifolder_data ../..//`.
2. Load the content of the data folder into LinkAhead:
   {shell}`python -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot/scifolder_data`.

   The path passed to `loadFiles` is the one that the LinkAhead server sees, which is not
   necessarily the same as the one on your local machine. The prefix `/opt/caosdb/mnt/extroot/` is
   correct for all LinkAhead instances. If you are in doubt, please ask your administrator for the
   correct path.

   For more information on `loadFiles`, call
   {shell}`python -m caosadvancedtools.loadFiles --help`.

   :::{note}
   If the Records that are created are to reference LinkAhead File Entities, you (currently) need
   to make those files accessible in LinkAhead in advance. For example, this applies if you have a
   folder with data files, and you want those files to be referenced by an Experiment Record.
   :::
3. Teach the server about the data model:
   {shell}`python -m caosadvancedtools.models.parser model.yml --sync`
4. Run the crawler on the local `scifolder_data` folder, using the identifiables and CFood
   definition files:
   {shell}`caosdb-crawler -s update -i identifiables.yml scifolder_cfood.yml scifolder_data`
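
If you repeat steps 2 to 4 often, it may be convenient to script them. The following is a minimal sketch that assembles exactly the three commands shown above and, by default, only prints them. The helper names `build_commands` and `run_all`, the `dry_run` switch, and the use of `sys.executable` in place of a bare `python` are assumptions of this sketch, not part of LinkAhead. Step 1 is left out because the target path depends on your installation.

```python
import subprocess
import sys


def build_commands(server_data_path="/opt/caosdb/mnt/extroot/scifolder_data"):
    """Return the commands of steps 2 to 4 as argument lists.

    server_data_path is the path as seen by the LinkAhead server (step 2).
    """
    return [
        # Step 2: register the data files with LinkAhead.
        [sys.executable, "-m", "caosadvancedtools.loadFiles", server_data_path],
        # Step 3: synchronize the data model with the server.
        [sys.executable, "-m", "caosadvancedtools.models.parser", "model.yml", "--sync"],
        # Step 4: run the crawler on the local copy of the data folder.
        ["caosdb-crawler", "-s", "update", "-i", "identifiables.yml",
         "scifolder_cfood.yml", "scifolder_data"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only if dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True aborts the sequence as soon as one step fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(build_commands())
```

Run the script from within the unpacked `scifolder` directory, and call `run_all(build_commands(), dry_run=False)` once the printed commands look right for your setup.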