Snakemake storage plugin: sharepoint

Author: Hugo Lapre

A Snakemake storage plugin for reading and writing files on Microsoft SharePoint sites. So far it has only been tested with SharePoint 2016 on-premises, so if you run into problems with your SharePoint site, please file an issue on the GitHub repository.

Installation

Install this plugin with pip or mamba, e.g.:

pip install snakemake-storage-plugin-sharepoint

Usage

Queries

Queries to this storage should have the following format:

| Query type | Query | Description |
| --- | --- | --- |
| input | mssp://Documents/data.csv | A file data.csv in a SharePoint library called Documents. |
| input | mssp://library/folder/file.txt | A file file.txt in a folder named folder under a SharePoint library called library. |
| output | mssp://Documents/output.csv | A file output.csv in a SharePoint library called Documents. |
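To make the query format concrete, here is a minimal Python sketch that splits a query into library, folder, and filename. It is only an illustration of the layout described above: `SharePointQuery` and `parse_query` are hypothetical names that are not part of the plugin, and the plugin's own parsing may differ.

```python
from typing import NamedTuple


class SharePointQuery(NamedTuple):
    """Hypothetical container for the parts of an mssp:// query."""

    library: str
    folder: str  # empty string if the file sits directly in the library
    filename: str


def parse_query(query: str) -> SharePointQuery:
    """Split 'mssp://<library>/[<folder>/...]<filename>' into its parts."""
    prefix = "mssp://"
    if not query.startswith(prefix):
        raise ValueError(f"expected a query starting with {prefix!r}: {query!r}")
    # Drop empty segments so that accidental double slashes are tolerated.
    segments = [s for s in query[len(prefix):].split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"query needs a library and a filename: {query!r}")
    library, *folders, filename = segments
    return SharePointQuery(library, "/".join(folders), filename)


if __name__ == "__main__":
    print(parse_query("mssp://Documents/data.csv"))
    print(parse_query("mssp://library/folder/file.txt"))
```

The first path segment is treated as the library and the last one as the filename; anything in between is taken to be nested folders.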

As default provider

If you want all your input and output (which is not explicitly marked to come from another storage) to be written to and read from this storage, you can use it as a default provider via:

snakemake --default-storage-provider sharepoint --default-storage-prefix ...

with ... being the prefix of a query under which you want to store all your results. You can also pass custom settings via command line arguments:

snakemake --default-storage-provider sharepoint --default-storage-prefix ... \
    --storage-sharepoint-max-requests-per-second ... \
    --storage-sharepoint-auth ... \
    --storage-sharepoint-allow-redirects ... \
    --storage-sharepoint-site-url ... \
    --storage-sharepoint-allow-overwrite ... \
    --storage-sharepoint-upload-timeout ...
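As a rough illustration of how a default storage prefix relates to the queries above, the following Python sketch joins a prefix and a relative workflow path into a full query. The function name `apply_prefix` and the example prefix `mssp://Documents/results` are assumptions made for this example; Snakemake's actual handling of the prefix may differ in detail.

```python
def apply_prefix(prefix: str, path: str) -> str:
    """Join a storage prefix and a relative path with exactly one slash."""
    return f"{prefix.rstrip('/')}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical prefix: a folder 'results' in a library called 'Documents'.
    print(apply_prefix("mssp://Documents/results", "plots/figure.png"))
```

Under this assumption, a rule output written as `plots/figure.png` would end up in the `results` folder of the `Documents` library.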

Within the workflow

If you want to use this storage plugin only for specific items, you can register it inside of your workflow:

# register storage provider (not needed if no custom settings are to be defined here)
storage:
    provider="sharepoint",
    # optionally add custom settings here if needed
    # alternatively they can be passed via command line arguments
    # starting with --storage-sharepoint-..., see
    # snakemake --help
    # Maximum number of requests per second for this storage provider. If nothing is specified, the default implemented by the storage plugin is used.
    max_requests_per_second=...,
    # HTTP(S) authentication. AUTH_TYPE is the class name of requests.auth (e.g. HTTPBasicAuth), ARG1,ARG2,... are the arguments required by the specified type. PACKAGE is the full path to the module from which to import the class (semantically this does 'from PACKAGE import AUTH_TYPE').
    auth=...,
    # Follow redirects when retrieving files.
    allow_redirects=...,
    # The URL of the SharePoint site.
    site_url=...,
    # Allow overwriting files in the SharePoint site.
    allow_overwrite=...,
    # The timeout in milliseconds for uploading files.
    upload_timeout=...,

rule example:
    input:
        storage.sharepoint(
            # define query to the storage backend here
            ...
        ),
    output:
        "example.txt"
    shell:
        "..."

Using multiple entities of the same storage plugin

In case you have to use this storage plugin multiple times, but with different settings (e.g. to connect to different storage servers), you can register it multiple times, each time providing a different tag:

# register shared settings
storage:
    provider="sharepoint",
    # optionally add custom settings here if needed
    # alternatively they can be passed via command line arguments
    # starting with --storage-sharepoint-..., see below
    # Maximum number of requests per second for this storage provider. If nothing is specified, the default implemented by the storage plugin is used.
    max_requests_per_second=...,
    # HTTP(S) authentication. AUTH_TYPE is the class name of requests.auth (e.g. HTTPBasicAuth), ARG1,ARG2,... are the arguments required by the specified type. PACKAGE is the full path to the module from which to import the class (semantically this does 'from PACKAGE import AUTH_TYPE').
    auth=...,
    # Follow redirects when retrieving files.
    allow_redirects=...,
    # The URL of the SharePoint site.
    site_url=...,
    # Allow overwriting files in the SharePoint site.
    allow_overwrite=...,
    # The timeout in milliseconds for uploading files.
    upload_timeout=...,

# register multiple tagged entities
storage foo:
    provider="sharepoint",
    # optionally add custom settings here if needed
    # alternatively they can be passed via command line arguments
    # starting with --storage-sharepoint-..., see below.
    # To only pass a setting to this tagged entity, prefix the given value with
    # the tag name, i.e. foo:max_requests_per_second=...
    # Maximum number of requests per second for this storage provider. If nothing is specified, the default implemented by the storage plugin is used.
    max_requests_per_second=...,
    # HTTP(S) authentication. AUTH_TYPE is the class name of requests.auth (e.g. HTTPBasicAuth), ARG1,ARG2,... are the arguments required by the specified type. PACKAGE is the full path to the module from which to import the class (semantically this does 'from PACKAGE import AUTH_TYPE').
    auth=...,
    # Follow redirects when retrieving files.
    allow_redirects=...,
    # The URL of the SharePoint site.
    site_url=...,
    # Allow overwriting files in the SharePoint site.
    allow_overwrite=...,
    # The timeout in milliseconds for uploading files.
    upload_timeout=...,

rule example:
    input:
        storage.foo(
            # define query to the storage backend here
            ...
        ),
    output:
        "example.txt"
    shell:
        "..."

Settings

The storage plugin has the following settings (which can be passed via command line, the workflow or environment variables, if provided in the respective columns):

| CLI setting | Workflow setting | Envvar setting | Description | Default | Choices | Required | Type |
| --- | --- | --- | --- | --- | --- | --- | --- |
| --storage-sharepoint-max-requests-per-second VALUE | max_requests_per_second | | Maximum number of requests per second for this storage provider. If nothing is specified, the default implemented by the storage plugin is used. | None | | | str |
| --storage-sharepoint-auth [PACKAGE.]AUTH_TYPE[=ARG1,ARG2,...] | auth | SNAKEMAKE_STORAGE_SHAREPOINT_AUTH | HTTP(S) authentication. AUTH_TYPE is the class name of requests.auth (e.g. HTTPBasicAuth), ARG1,ARG2,... are the arguments required by the specified type. PACKAGE is the full path to the module from which to import the class (semantically this does 'from PACKAGE import AUTH_TYPE'). | None | | | str |
| --storage-sharepoint-allow-redirects VALUE | allow_redirects | | Follow redirects when retrieving files. | True | | | str |
| --storage-sharepoint-site-url VALUE | site_url | SNAKEMAKE_STORAGE_SHAREPOINT_SITE_URL | The URL of the SharePoint site. | None | | | str |
| --storage-sharepoint-allow-overwrite VALUE | allow_overwrite | | Allow overwriting files in the SharePoint site. | False | | | str |
| --storage-sharepoint-upload-timeout VALUE | upload_timeout | | The timeout in milliseconds for uploading files. | 1000 | | | str |
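The `[PACKAGE.]AUTH_TYPE[=ARG1,ARG2,...]` format of the auth setting can be read as "import a class, then call it with the given arguments". The Python sketch below shows one way such a string could be interpreted. It is not the plugin's actual implementation: `parse_auth_spec` and `build_auth` are hypothetical helpers, and the fallback to `requests.auth` when no PACKAGE is given is an assumption based on the description of the setting.

```python
import importlib


def parse_auth_spec(spec: str, default_package: str = "requests.auth"):
    """Split '[PACKAGE.]AUTH_TYPE[=ARG1,ARG2,...]' into (package, class name, args)."""
    # Everything after the first '=' is the comma-separated argument list.
    target, has_args, arg_string = spec.partition("=")
    # The last dot separates the module path from the class name.
    package, has_package, auth_type = target.rpartition(".")
    if not has_package:
        package = default_package  # assumed fallback when PACKAGE is omitted
    args = arg_string.split(",") if has_args else []
    return package, auth_type, args


def build_auth(spec: str):
    """Roughly 'from PACKAGE import AUTH_TYPE; AUTH_TYPE(*ARGS)'."""
    package, auth_type, args = parse_auth_spec(spec)
    module = importlib.import_module(package)
    return getattr(module, auth_type)(*args)


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(parse_auth_spec("HTTPBasicAuth=user,secret"))
```

Because the arguments are split on commas, this simple reading cannot represent an argument that itself contains a comma.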

Further details

For now, site_url is a required setting on the storage provider. This is because the URL of a document cannot be parsed unambiguously into the separate components needed for downloading from and uploading to SharePoint (site collection, library, and filename).
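The ambiguity can be illustrated with a short Python sketch that lists every way a document URL could be divided into a site URL, a library, and a file path. The URL `https://sharepoint.example.com/sites/team/Documents/data.csv` and the function `candidate_splits` are made up for this example; the point is only that the URL alone does not say where the site ends and the library begins.

```python
from urllib.parse import urlsplit


def candidate_splits(document_url: str):
    """Return all (site_url, library, file_path) readings of a document URL."""
    parts = urlsplit(document_url)
    segments = [s for s in parts.path.split("/") if s]
    base = f"{parts.scheme}://{parts.netloc}"
    candidates = []
    # Any segment except the last one could be the library name.
    for i in range(len(segments) - 1):
        site_url = "/".join([base, *segments[:i]])
        library = segments[i]
        file_path = "/".join(segments[i + 1:])
        candidates.append((site_url, library, file_path))
    return candidates


if __name__ == "__main__":
    url = "https://sharepoint.example.com/sites/team/Documents/data.csv"
    for candidate in candidate_splits(url):
        print(candidate)
```

Every candidate is syntactically plausible, so the site URL has to be supplied separately to pick the right one.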

Also, overwriting files on SharePoint is disabled by default, and needs to be enabled on the storage provider using the allow_overwrite setting.

Finally, removing files from the remote location is not implemented at all; follow this issue for the current status. Contributions that implement removal without deleting the entire version history are welcome.