Jupyter Notebooks

LABDRIVE is integrated with Jupyter Notebooks. Jupyter notebooks are documents containing an ordered list of input/output cells that can hold code (usually Python, although other languages can be used), text (using Markdown), mathematics, plots and rich media. They can be executed step by step or in full, in an easy-to-use, LABDRIVE-integrated computational environment.

Researchers usually write the source code used to create, read and analyze scientific and research data as Jupyter notebooks, and this code must be preserved along with the datasets: it is often the best existing Provenance and Structure metadata for a dataset.

LABDRIVE allows users to keep the Jupyter notebooks containing the code that reads and "understands" their data as part of the dataset they are creating.

Before using the Jupyter notebooks feature, make sure that your user has an active API key and S3 credentials already generated. If not, a 403 Forbidden error will be shown when trying to access a notebook.

Create a new digital notebook

When in the Explore Content tab of a Data Container, right-click on an empty space in the files area, then select New and Dynamic Notebook to create a new notebook.

Upload an existing Jupyter notebook

You can upload any existing Jupyter notebook like any other file, using a file transfer protocol or simply dragging and dropping your file to the LABDRIVE Data Container.
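
For example, since your user already has S3 credentials for the platform, a notebook can also be uploaded programmatically over the S3 protocol. The following is a minimal sketch using the standard boto3 library; the endpoint URL, bucket name and credentials are placeholders (not real LABDRIVE values) that you should replace with the ones from your own instance:

import boto3

# A sketch only: the endpoint URL, bucket name and credentials below are
# placeholders; replace them with the S3 credentials generated for your
# user and the bucket backing your Data Container
s3_client = boto3.client(
    "s3",
    endpoint_url="https://s3.your-labdrive-instance.example",
    aws_access_key_id="YOUR_S3_ACCESS_KEY",
    aws_secret_access_key="YOUR_S3_SECRET_KEY",
)

# Upload a local notebook to a path inside the Data Container
s3_client.upload_file(
    "analysis.ipynb",            # local notebook file
    "your-container-bucket",     # placeholder bucket name
    "notebooks/analysis.ipynb",  # destination path in the container
)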

Open an existing Jupyter notebook

To open a Jupyter notebook, double-click the icon of the notebook you would like to open.

How to use them

You can use your Jupyter notebooks in the same way you would use them on any other platform, but if you plan to work with the data in a LABDRIVE container, we have created a Python library that simplifies many actions and makes your programming easier.

For example, let's say you would like to create a function that hashes your files with a new algorithm.

First, initialize your function by loading the LIBNOVA LABDRIVE libraries:

#!/usr/bin/env python
# coding: utf-8

import json
import hashlib

from libnova                           import Labdrive, Util
from libnova.Labdrive                  import Nuclio
from libnova.Labdrive.Nuclio           import Request
from libnova.Labdrive.Api              import Driver, Container, File, Job, JobMessage
from libnova.Labdrive.Filesystem       import S3
from libnova.Labdrive.Filesystem.S3    import File as S3File, Storage

If your function is going to be called from a LABDRIVE Function, you will receive some parameters from LABDRIVE every time your function is called; but if you plan to use it inside a Jupyter notebook, you must initialize them yourself:

json_sample = {
	"api": {
		"url": "http://labdrive.libnova.com",
		"key_user": "1234567890abcdefghijklmnopqrstuvwxyz",
		"key_root": "1234567890abcdefghijklmnopqrstuvwxyz"
	},
	"labdrive": {
		"container": {
			"id": "1"
		},
		"user": {
			"id": "1"
		},
		"files": {
			"ids":   [ ],
			"paths": [ ]
		},
		"job": {
			"id": "299"
		},
		"trigger": {
			"id": "0",
			"type": "",
			"regex": ""
		},
		"function": {
			"id": "0",
			"key": ""
		}
	},
	"function_params": {
		"your_custom_parameter": "custom_parameter_value"
	}
}

# Initialize the Request parser
#
# This will automatically parse the data sent by LABDRIVE to this function, like the File ID,
# the Job ID, or the User ID who triggered this function.
#
# It will also initialize the API Driver using the user API Key
request_helper = Labdrive.Nuclio.Request.Request(
    None,
    type('',(object,),{"body": json.dumps(json_sample)})()
)
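
The type('',(object,),...) expression above simply builds a throwaway object exposing a body attribute, mimicking the event object that LABDRIVE normally passes to the function. As a sketch, an equivalent and arguably more readable construction uses types.SimpleNamespace from the standard library:

from types import SimpleNamespace

# Equivalent construction: SimpleNamespace builds a minimal object exposing
# the "body" attribute that the Request parser reads
fake_event = SimpleNamespace(body=json.dumps(json_sample))
request_helper = Labdrive.Nuclio.Request.Request(None, fake_event)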

Every function executes in relation to an (Execution) Job, which is useful for logging execution progress. You should initialize it with:

# This will set the current function Job to the status "RUNNING"
request_helper.job_init()

And you can log to it using:

# This will write a new Job Message related to the current function Job
request_helper.log("Sample message", JobMessage.JobMessageType.INFO)

The JobMessage.JobMessageType enum defines the type of the message; you can list the available types at runtime as shown below.
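
If you want to inspect the available message types from a notebook, a quick sketch (assuming JobMessageType is a standard Python Enum) is:

# List the available message types at runtime
# (this assumes JobMessage.JobMessageType is a standard Python Enum)
for message_type in JobMessage.JobMessageType:
    print(message_type.name)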

Then, you would usually have your payload. In this example:

# This will iterate over all the files related to this function execution
for request_file in request_helper.Files:
	# This will retrieve the metadata of the current file
	file_metadata = File.get_metadata(request_file.id, True)
	if file_metadata is not None:
		# We log the metadata
		request_helper.log(Util.format_json_item(file_metadata), JobMessage.JobMessageType.INFO)
	else:
		# str() keeps the concatenation safe if the file ID is numeric
		request_helper.log("File " + str(request_file.id) + " has no metadata", JobMessage.JobMessageType.INFO)

	# This will retrieve a seekable S3 file stream that can be used like a native file stream reader
	file_stream = S3.File.get_stream(
		# The storage is needed to set the source bucket of the file
		request_helper.Storage,
		request_file
	)
	if file_stream is not None:
		file_hash_md5 = hashlib.md5()
		file_hash_sha1 = hashlib.sha1()
		file_hash_sha256 = hashlib.sha256()

		# By reading the stream in buffered blocks, we can compute multiple hash algorithms at once
		file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)
		while file_data_stream_buffer:
			file_hash_md5.update(file_data_stream_buffer)
			file_hash_sha1.update(file_data_stream_buffer)
			file_hash_sha256.update(file_data_stream_buffer)

			file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)

		# We log some messages related to the result of the function
		request_helper.log("File hash calculated: MD5    - " + file_hash_md5.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA1   - " + file_hash_sha1.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA256 - " + file_hash_sha256.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)

		# We can also store the calculated hashes in the database
		File.set_hash(request_file.id, "md5", file_hash_md5.hexdigest())
		File.set_hash(request_file.id, "sha1", file_hash_sha1.hexdigest())
		File.set_hash(request_file.id, "sha256", file_hash_sha256.hexdigest())

And finally, we must let LABDRIVE know that our function has finished, with the result status:

# This will finalize the current function Job
# The parameter is a boolean that determines if the function Job was successful or not
#
# If the parameter is True,  the result will be "COMPLETED",
# else,
# If the parameter is False, the result will be "FAILED"
request_helper.job_end(True)
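
In practice, you may want to guarantee that the Job is always closed, even when your payload raises an exception. This is a minimal sketch of that pattern, using only the calls shown above:

# A sketch of a defensive pattern: ensure the Job is always closed,
# even if the payload raises an exception
request_helper.job_init()
try:
    # ... your payload goes here ...
    request_helper.job_end(True)
except Exception as error:
    # INFO is the only message type shown in this guide; use a more
    # severe type here if your JobMessageType enum provides one
    request_helper.log(str(error), JobMessage.JobMessageType.INFO)
    request_helper.job_end(False)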

The full code sample:

#!/usr/bin/env python
# coding: utf-8

import json
import hashlib

from libnova                           import Labdrive, Util
from libnova.Labdrive                  import Nuclio
from libnova.Labdrive.Nuclio           import Request
from libnova.Labdrive.Api              import Driver, Container, File, Job, JobMessage
from libnova.Labdrive.Filesystem       import S3
from libnova.Labdrive.Filesystem.S3    import File as S3File, Storage

json_sample = {
	"api": {
		"url": "http://labdrive.libnova.com",
		"key_user": "1234567890abcdefghijklmnopqrstuvwxyz",
		"key_root": "1234567890abcdefghijklmnopqrstuvwxyz"
	},
	"labdrive": {
		"container": {
			"id": "1"
		},
		"user": {
			"id": "1"
		},
		"files": {
			"ids":   [ ],
			"paths": [ ]
		},
		"job": {
			"id": "299"
		},
		"trigger": {
			"id": "0",
			"type": "",
			"regex": ""
		},
		"function": {
			"id": "0",
			"key": ""
		}
	},
	"function_params": {
		"your_custom_parameter": "custom_parameter_value"
	}
}

# Initialize the Request parser
#
# This will automatically parse the data sent by LABDRIVE to this function, like the File ID,
# the Job ID, or the User ID who triggered this function.
#
# It will also initialize the API Driver using the user API Key
request_helper = Labdrive.Nuclio.Request.Request(
    None,
    type('',(object,),{"body": json.dumps(json_sample)})()
)

# This will set the current function Job to the status "RUNNING"
request_helper.job_init()

# This will write a new Job Message related to the current function Job
request_helper.log("Sample message", JobMessage.JobMessageType.INFO)

# This will iterate over all the files related to this function execution
for request_file in request_helper.Files:
	# This will retrieve the metadata of the current file
	file_metadata = File.get_metadata(request_file.id, True)
	if file_metadata is not None:
		# We log the metadata
		request_helper.log(Util.format_json_item(file_metadata), JobMessage.JobMessageType.INFO)
	else:
		# str() keeps the concatenation safe if the file ID is numeric
		request_helper.log("File " + str(request_file.id) + " has no metadata", JobMessage.JobMessageType.INFO)

	# This will retrieve a seekable S3 file stream that can be used like a native file stream reader
	file_stream = S3.File.get_stream(
		# The storage is needed to set the source bucket of the file
		request_helper.Storage,
		request_file
	)
	if file_stream is not None:
		file_hash_md5 = hashlib.md5()
		file_hash_sha1 = hashlib.sha1()
		file_hash_sha256 = hashlib.sha256()

		# By reading the stream in buffered blocks, we can compute multiple hash algorithms at once
		file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)
		while file_data_stream_buffer:
			file_hash_md5.update(file_data_stream_buffer)
			file_hash_sha1.update(file_data_stream_buffer)
			file_hash_sha256.update(file_data_stream_buffer)

			file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)

		# We log some messages related to the result of the function
		request_helper.log("File hash calculated: MD5    - " + file_hash_md5.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA1   - " + file_hash_sha1.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)
		request_helper.log("File hash calculated: SHA256 - " + file_hash_sha256.hexdigest(),
						   JobMessage.JobMessageType.INFO, request_file.id)

		# We can also store the calculated hashes in the database
		File.set_hash(request_file.id, "md5", file_hash_md5.hexdigest())
		File.set_hash(request_file.id, "sha1", file_hash_sha1.hexdigest())
		File.set_hash(request_file.id, "sha256", file_hash_sha256.hexdigest())

# This will finalize the current function Job
# The parameter is a boolean that determines if the function Job was successful or not
#
# If the parameter is True,  the result will be "COMPLETED",
# else,
# If the parameter is False, the result will be "FAILED"
request_helper.job_end(True)
