Jupyter Notebooks

LABDRIVE is integrated with Jupyter Notebooks. Jupyter notebooks are documents containing an ordered list of input/output cells, which can hold code (usually Python, although other languages can be used), text (in Markdown), mathematics, plots and rich media. They can be executed step by step or in full, in an easy-to-use, LABDRIVE-integrated computational environment.

The source code used to create, read and analyze scientific and research data is usually written by researchers as Jupyter notebooks, and it must be preserved along with the datasets: it is often the best existing Provenance and Structure metadata for a dataset.

LABDRIVE allows users to keep the Jupyter notebooks containing the code that reads and "understands" their data as part of the dataset they are creating.

Before using the Jupyter notebooks feature, make sure that your user has an active API key and S3 credentials already generated. Otherwise, a 403 Forbidden error will be shown when trying to access a notebook.
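One way to check the S3 part of that requirement before opening a notebook is a minimal sketch with the generic boto3 client. The endpoint URL, bucket name and credential values below are placeholders, not actual LABDRIVE values; see "Getting your S3 bucket name" and "Getting your S3 storage credentials" in the Cookbook:

import boto3
from botocore.exceptions import ClientError

# All values below are placeholders; use your own LABDRIVE S3 endpoint and credentials
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

try:
    s3.head_bucket(Bucket="your-container-bucket")
    print("S3 credentials accepted")
except ClientError as error:
    # A 403 here corresponds to the Forbidden error mentioned above
    print("S3 access failed: " + error.response["Error"]["Code"])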

Create a new digital notebook

In the Explore Content tab of a Data Container, right-click on an empty space in the files area, then select New and Dynamic Notebook to create a new notebook.

Upload an existing Jupyter notebook

You can upload any existing Jupyter notebook like any other file, using a file transfer protocol or simply dragging and dropping it into the LABDRIVE Data Container.
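As an illustration, a sketch of a programmatic upload with boto3; the endpoint, credentials, bucket name and object path are placeholder assumptions (see the Cookbook entries on S3 bucket names and storage credentials):

import boto3

# Placeholder endpoint and credentials; use your LABDRIVE S3 storage credentials
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload the notebook into the Data Container bucket (names are placeholders)
s3.upload_file("analysis.ipynb", "your-container-bucket", "notebooks/analysis.ipynb")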

Open an existing Jupyter notebook

To open a Jupyter Notebook, double-click the icon of the notebook you would like to open.

How to use them

For example, let's say you would like to create a function that hashes your files with a new algorithm.

First, initialize your function by loading the LIBNOVA LABDRIVE libraries:

#!/usr/bin/env python
# coding: utf-8

import json
import hashlib

from libnova                           import Labdrive, Util
from libnova.Labdrive                  import Nuclio
from libnova.Labdrive.Nuclio           import Request
from libnova.Labdrive.Api              import Driver, Container, File, Job, JobMessage
from libnova.Labdrive.Filesystem       import S3
from libnova.Labdrive.Filesystem.S3    import File as S3File, Storage

If your code runs as a LABDRIVE Function, you will receive these parameters from LABDRIVE every time your function is called; if you plan to use it inside a Jupyter notebook instead, you should initialize them yourself:

json_sample = {
	"api": {
		"url": "http://labdrive.libnova.com",
		"key_user": "1234567890abcdefghijklmnopqrstuvwxyz",
		"key_root": "1234567890abcdefghijklmnopqrstuvwxyz"
	},
	"labdrive": {
		"container": {
			"id": "1"
		},
		"user": {
			"id": "1"
		},
		"files": {
			"ids":   [ ],
			"paths": [ ]
		},
		"job": {
			"id": "299"
		},
		"trigger": {
			"id": "0",
			"type": "",
			"regex": ""
		},
		"function": {
			"id": "0",
			"key": ""
		}
	},
	"function_params": {
		"your_custom_parameter": "custom_parameter_value"
	}
}

# Initialize the Request parser
#
# This will automatically parse the data sent by LABDRIVE to this function, like the File ID,
# the Job ID, or the User ID who triggered this function.
#
# It will also initialize the API Driver using the user API Key
request_helper = Labdrive.Nuclio.Request.Request(
    None,
    type('',(object,),{"body": json.dumps(json_sample)})()
)

Every function executes in relation to an (Execution) Job, which is useful for logging the execution progress. You should initialize it with:

# This will set the current function Job to the status "RUNNING"
request_helper.job_init()

And you can log to it using:

# This will write a new Job Message related to the current function Job
request_helper.log("Sample message", JobMessage.JobMessageType.INFO)

Then comes your actual payload. In this example:

# This will iterate over all the files related to this function execution
for request_file in request_helper.Files:
	# This will retrieve the current function File metadata
	file_metadata = File.get_metadata(request_file.id, True)
	if file_metadata is not None:
		# We log the metadata
		request_helper.log(Util.format_json_item(file_metadata), JobMessage.JobMessageType.INFO)
	else:
		request_helper.log("File " + request_file.id + " has no metadata", JobMessage.JobMessageType.INFO)

	# This will retrieve a seekable S3 file stream that can be used like a native file stream reader
	file_stream = S3.File.get_stream(
		# The storage is needed to set the source bucket of the file
		request_helper.Storage,
		request_file
	)
	if file_stream is not None:
		file_hash_md5 = hashlib.md5()
		file_hash_sha1 = hashlib.sha1()
		file_hash_sha256 = hashlib.sha256()

		# By reading the stream in buffered blocks, we can compute all three hashes in a single pass
		file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)
		while file_data_stream_buffer:
			file_hash_md5.update(file_data_stream_buffer)
			file_hash_sha1.update(file_data_stream_buffer)
			file_hash_sha256.update(file_data_stream_buffer)

			file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)

		# We log some messages related to the result of the function
		request_helper.log("File hash calculated: MD5    - " + file_hash_md5.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA1   - " + file_hash_sha1.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA256 - " + file_hash_sha256.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)

		# We can also store the calculated hashes in the database
		File.set_hash(request_file.id, "md5", file_hash_md5.hexdigest())
		File.set_hash(request_file.id, "sha1", file_hash_sha1.hexdigest())
		File.set_hash(request_file.id, "sha256", file_hash_sha256.hexdigest())

And finally, we must let LABDRIVE know that our function has finished, with the result status:

# This will finalize the current function Job
# The parameter is a boolean that determines if the function Job was successful or not
#
# If the parameter is True,  the result will be "COMPLETED",
# else,
# If the parameter is False, the result will be "FAILED"
request_helper.job_end(True)
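If your payload can raise exceptions, you may want to report failure instead of success. A minimal sketch of the overall pattern, using only the calls shown above:

request_helper.job_init()
try:
	# ... your payload goes here ...
	request_helper.job_end(True)   # Job result: COMPLETED
except Exception as error:
	request_helper.log("Function failed: " + str(error), JobMessage.JobMessageType.INFO)
	request_helper.job_end(False)  # Job result: FAILED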

The full code sample:

#!/usr/bin/env python
# coding: utf-8

import json
import hashlib

from libnova                           import Labdrive, Util
from libnova.Labdrive                  import Nuclio
from libnova.Labdrive.Nuclio           import Request
from libnova.Labdrive.Api              import Driver, Container, File, Job, JobMessage
from libnova.Labdrive.Filesystem       import S3
from libnova.Labdrive.Filesystem.S3    import File as S3File, Storage

json_sample = {
	"api": {
		"url": "http://labdrive.libnova.com",
		"key_user": "1234567890abcdefghijklmnopqrstuvwxyz",
		"key_root": "1234567890abcdefghijklmnopqrstuvwxyz"
	},
	"labdrive": {
		"container": {
			"id": "1"
		},
		"user": {
			"id": "1"
		},
		"files": {
			"ids":   [ ],
			"paths": [ ]
		},
		"job": {
			"id": "299"
		},
		"trigger": {
			"id": "0",
			"type": "",
			"regex": ""
		},
		"function": {
			"id": "0",
			"key": ""
		}
	},
	"function_params": {
		"your_custom_parameter": "custom_parameter_value"
	}
}

# Initialize the Request parser
#
# This will automatically parse the data sent by LABDRIVE to this function, like the File ID,
# the Job ID, or the User ID who triggered this function.
#
# It will also initialize the API Driver using the user API Key
request_helper = Labdrive.Nuclio.Request.Request(
    None,
    type('',(object,),{"body": json.dumps(json_sample)})()
)

# This will set the current function Job to the status "RUNNING"
request_helper.job_init()

# This will write a new Job Message related to the current function Job
request_helper.log("Sample message", JobMessage.JobMessageType.INFO)

# This will iterate over all the files related to this function execution
for request_file in request_helper.Files:
	# This will retrieve the current function File metadata
	file_metadata = File.get_metadata(request_file.id, True)
	if file_metadata is not None:
		# We log the metadata
		request_helper.log(Util.format_json_item(file_metadata), JobMessage.JobMessageType.INFO)
	else:
		request_helper.log("File " + request_file.id + " has no metadata", JobMessage.JobMessageType.INFO)

	# This will retrieve a seekable S3 file stream that can be used like a native file stream reader
	file_stream = S3.File.get_stream(
		# The storage is needed to set the source bucket of the file
		request_helper.Storage,
		request_file
	)
	if file_stream is not None:
		file_hash_md5 = hashlib.md5()
		file_hash_sha1 = hashlib.sha1()
		file_hash_sha256 = hashlib.sha256()

		# By reading the stream in buffered blocks, we can compute all three hashes in a single pass
		file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)
		while file_data_stream_buffer:
			file_hash_md5.update(file_data_stream_buffer)
			file_hash_sha1.update(file_data_stream_buffer)
			file_hash_sha256.update(file_data_stream_buffer)

			file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)

		# We log some messages related to the result of the function
		request_helper.log("File hash calculated: MD5    - " + file_hash_md5.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA1   - " + file_hash_sha1.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA256 - " + file_hash_sha256.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)

		# We can also store the calculated hashes in the database
		File.set_hash(request_file.id, "md5", file_hash_md5.hexdigest())
		File.set_hash(request_file.id, "sha1", file_hash_sha1.hexdigest())
		File.set_hash(request_file.id, "sha256", file_hash_sha256.hexdigest())

# This will finalize the current function Job
# The parameter is a boolean that determines if the function Job was successful or not
#
# If the parameter is True,  the result will be "COMPLETED",
# else,
# If the parameter is False, the result will be "FAILED"
request_helper.job_end(True)

You can use your Jupyter Notebooks the same way you would on any other platform but, if you plan to work with the data you have in a LABDRIVE container, we have created a Python library that simplifies many actions and makes your programming easier.

The JobMessage.JobMessageType defines the type of the message. You can see a list of the available types here.
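For instance, logging at a different severity might look like the sketch below. INFO is the only type shown in this guide, so WARNING is an assumption to verify against that list:

# WARNING is an assumed member of JobMessage.JobMessageType; only INFO appears in this guide
request_helper.log("Checksum mismatch detected", JobMessage.JobMessageType.WARNING)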