Jupyter Notebooks

LIBSAFE Go is integrated with Jupyter Notebooks. A Jupyter Notebook is a document made up of an ordered list of input/output cells that can contain code (usually Python, although other languages can be used), text (written in Markdown), mathematics, plots and rich media. Cells can be executed step by step or all at once, in an easy-to-use computational environment integrated with LIBSAFE Go.

Researchers usually write the source code that creates, reads and analyzes their scientific and research data as Jupyter Notebooks, so these notebooks must be preserved along with the datasets. They are usually the best existing Provenance and Structure metadata for the dataset.

LIBSAFE Go allows users to keep the Jupyter Notebooks in which they have the code that reads and "understands" their data as part of the dataset they are creating.

Before using the Jupyter Notebooks feature, make sure that your user already has an active API key and S3 credentials. Otherwise, a 403 Forbidden error will be shown when you try to access a notebook.

Create a new digital notebook

In the Explore Content tab of a Data container, right-click an empty space in the files area. Select New and then Dynamic Notebook to create a new notebook.

Upload an existing Jupyter Notebook

You can upload an existing Jupyter Notebook like any other file, using a file transfer protocol or simply dragging and dropping the file into the LIBSAFE Go Data Container.

Open an existing Jupyter Notebook

To open a Jupyter Notebook, double-click the icon of the notebook you would like to open.

How to use them

You can use your Jupyter Notebooks in the same way you would use them on any other platform. If you plan to work with the data you have in a LIBSAFE Go container, a Python library is available that simplifies many actions and makes your programming easier.

For example, suppose you would like to create a function that hashes your files with a new algorithm.

First, you should initialize your function, loading the LIBNOVA LIBSAFE Go libraries:

#!/usr/bin/env python
# coding: utf-8

import json
import hashlib

from libnova                           import com, Util
from libnova.com                       import Nuclio
from libnova.com.Nuclio                import Request
from libnova.com.Api                   import Driver, Container, File, Job, JobMessage
from libnova.com.Filesystem            import S3
from libnova.com.Filesystem.S3         import File as S3File, Storage

If your code is going to run as a LIBSAFE Go Function, it will receive some parameters from LIBSAFE Go every time it is called. If you plan to run it inside a Jupyter Notebook, you must provide those parameters yourself:

json_sample = {
    "api": {
        "url": "http://go.libnova.com",
        "key_user": "1234567890abcdefghijklmnopqrstuvwxyz",
        "key_root": "1234567890abcdefghijklmnopqrstuvwxyz"
    },
    "function_data": {
        "container": {
            "id": "1"
        },
        "user": {
            "id": "1"
        },
        "files": {
            "ids":   [ ],
            "paths": [ ]
        },
        "job": {
            "id": "299"
        },
        "trigger": {
            "id": "0",
            "type": "",
            "regex": ""
        },
        "function": {
            "id": "0",
            "key": ""
        }
    },
    "function_params": {
        "your_custom_parameter": "custom_parameter_value"
    }
}

# Initialize the Request parser
#
# This will automatically parse the data sent by the platform to this function, like the File ID,
# the Job ID, or the User ID who triggered this function.
#
# It will also initialize the API Driver using the user API Key
request_helper = com.Nuclio.Request.Request(
    None,
    type('',(object,),{"body": json.dumps(json_sample)})()
)
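
The second argument above is an instance of an anonymous class whose only attribute is `body`, standing in for the event object that the platform would normally pass. A more readable equivalent can be sketched with `types.SimpleNamespace`. This assumes that the request parser only reads the `body` attribute of that object, which the sample suggests but does not confirm; the helper name `make_event` is hypothetical and not part of the library.

```python
import json
from types import SimpleNamespace


def make_event(payload):
    """Build a minimal stand-in event exposing the payload as a JSON string in `.body`."""
    return SimpleNamespace(body=json.dumps(payload))


# Trimmed-down payload for illustration only; use the full json_sample in practice.
sample_payload = {"function_data": {"job": {"id": "299"}}}
event = make_event(sample_payload)

# The body round-trips through JSON, just like the anonymous type(...) instance.
print(json.loads(event.body)["function_data"]["job"]["id"])
```

Under that assumption, `make_event(json_sample)` could replace the `type('',(object,),{...})()` expression as the second argument of the request parser.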

Every function executes in relation to an (Execution) Job, which is useful for logging the execution progress. You should initialize it with:

# This will set the current function Job to the status "RUNNING"
request_helper.job_init()

And you can log to it using:

# This will write a new Job Message related with the current function Job
request_helper.log("Sample message", JobMessage.JobMessageType.INFO)

The JobMessage.JobMessageType defines the type of message. You can see a list of the available types here.

Next comes your payload. In this example, it logs the metadata of each file and hashes its content with several algorithms:

# This will iterate over all the files related with this function execution
for request_file in request_helper.Files:
    # This will retrieve the current function File metadata
    file_metadata = File.get_metadata(request_file.id, True)
    if file_metadata is not None:
        # We log the metadata
        request_helper.log(Util.format_json_item(file_metadata), JobMessage.JobMessageType.INFO)
    else:
        request_helper.log("File " + str(request_file.id) + " has no metadata", JobMessage.JobMessageType.INFO)

    # This will retrieve a seekable S3 file stream that can be used like a native file stream reader
    file_stream = S3.File.get_stream(
        # The storage is needed to set the source bucket of the file
        request_helper.Storage,
        request_file
    )
    if file_stream is not None:
        file_hash_md5 = hashlib.md5()
        file_hash_sha1 = hashlib.sha1()
        file_hash_sha256 = hashlib.sha256()

        # Reading the stream in blocks lets us feed several hash algorithms in a single pass
        file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)
        while file_data_stream_buffer:
            file_hash_md5.update(file_data_stream_buffer)
            file_hash_sha1.update(file_data_stream_buffer)
            file_hash_sha256.update(file_data_stream_buffer)

            file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)

        # We log some messages related to the result of the function
        request_helper.log("File hash calculated: MD5    - " + file_hash_md5.hexdigest(),
                           JobMessage.JobMessageType.INFO, request_file.id)
        request_helper.log("File hash calculated: SHA1   - " + file_hash_sha1.hexdigest(),
                           JobMessage.JobMessageType.INFO, request_file.id)
        request_helper.log("File hash calculated: SHA256 - " + file_hash_sha256.hexdigest(),
                           JobMessage.JobMessageType.INFO, request_file.id)

        # We can also store the calculated hashes in the database
        File.set_hash(request_file.id, "md5", file_hash_md5.hexdigest())
        File.set_hash(request_file.id, "sha1", file_hash_sha1.hexdigest())
        File.set_hash(request_file.id, "sha256", file_hash_sha256.hexdigest())
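
The hashing loop above can be tried without access to a container. The following is a minimal sketch that applies the same single-pass, chunked technique to any binary stream, using an in-memory `io.BytesIO` object in place of the S3 stream. It assumes that the stream's `read()` returns `bytes` and an empty value at the end of the data; the function name `hash_stream` is hypothetical and not part of the library.

```python
import hashlib
import io


def hash_stream(stream, algorithms=("md5", "sha1", "sha256"), chunk_size=8 * 1024 * 1024):
    """Read `stream` once, in chunks, and return a dict of {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    # iter() with a sentinel stops at the first empty read (end of stream).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# In-memory stand-in for the S3 file stream, read in small chunks
# so that the loop runs more than once.
data = b"LIBSAFE Go sample data" * 1000
digests = hash_stream(io.BytesIO(data), chunk_size=4096)
print(digests["sha256"])
```

Because every chunk is fed to all the hash objects before the next read, the file is transferred only once regardless of how many algorithms are requested.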

Finally, we must let LIBSAFE Go know that our function has finished and report the result status:

# This will finalize the current function Job
# The parameter is a boolean that determines if the function Job was successful or not
#
# If the parameter is True,  the result will be "COMPLETED",
# else,
# If the parameter is False, the result will be "FAILED"
request_helper.job_end(True)
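
If the payload raises an exception before the last line is reached, the Job would presumably never be finalized. One possible pattern, sketched here under the assumption that `job_init()` and `job_end()` behave as described above, wraps the payload so that a failure is reported as well. `run_job` and `FakeHelper` are hypothetical names used only for illustration; `FakeHelper` merely records the calls so that the sketch runs without the platform.

```python
def run_job(helper, payload):
    """Run payload(helper) between job_init() and job_end(), reporting failure on any exception."""
    helper.job_init()
    try:
        payload(helper)
    except Exception:
        # Report the failure, then re-raise so the error stays visible in the notebook.
        helper.job_end(False)
        raise
    helper.job_end(True)


class FakeHelper:
    """Hypothetical stand-in that records calls; not part of the libnova library."""

    def __init__(self):
        self.calls = []

    def job_init(self):
        self.calls.append("init")

    def job_end(self, success):
        self.calls.append(("end", success))


def broken_payload(helper):
    raise ValueError("simulated failure")


ok = FakeHelper()
run_job(ok, lambda helper: None)

failed = FakeHelper()
try:
    run_job(failed, broken_payload)
except ValueError:
    pass

print(ok.calls, failed.calls)
```

With the real request helper, the payload would be a function containing the file loop shown earlier.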

The full code sample:

#!/usr/bin/env python
# coding: utf-8

import json
import hashlib

from libnova                           import com, Util
from libnova.com                       import Nuclio
from libnova.com.Nuclio                import Request
from libnova.com.Api                   import Driver, Container, File, Job, JobMessage
from libnova.com.Filesystem            import S3
from libnova.com.Filesystem.S3         import File as S3File, Storage

json_sample = {
    "api": {
        "url": "http://go.libnova.com",
        "key_user": "1234567890abcdefghijklmnopqrstuvwxyz",
        "key_root": "1234567890abcdefghijklmnopqrstuvwxyz"
    },
    "function_data": {
        "container": {
            "id": "1"
        },
        "user": {
            "id": "1"
        },
        "files": {
            "ids":   [ ],
            "paths": [ ]
        },
        "job": {
            "id": "299"
        },
        "trigger": {
            "id": "0",
            "type": "",
            "regex": ""
        },
        "function": {
            "id": "0",
            "key": ""
        }
    },
    "function_params": {
        "your_custom_parameter": "custom_parameter_value"
    }
}

# Initialize the Request parser
#
# This will automatically parse the data sent by the platform to this function, like the File ID,
# the Job ID, or the User ID who triggered this function.
#
# It will also initialize the API Driver using the user API Key
request_helper = com.Nuclio.Request.Request(
    None,
    type('',(object,),{"body": json.dumps(json_sample)})()
)

# This will set the current function Job to the status "RUNNING"
request_helper.job_init()

# This will write a new Job Message related with the current function Job
request_helper.log("Sample message", JobMessage.JobMessageType.INFO)

# This will iterate over all the files related with this function execution
for request_file in request_helper.Files:
    # This will retrieve the current function File metadata
    file_metadata = File.get_metadata(request_file.id, True)
    if file_metadata is not None:
        # We log the metadata
        request_helper.log(Util.format_json_item(file_metadata), JobMessage.JobMessageType.INFO)
    else:
        request_helper.log("File " + str(request_file.id) + " has no metadata", JobMessage.JobMessageType.INFO)

    # This will retrieve a seekable S3 file stream that can be used like a native file stream reader
    file_stream = S3.File.get_stream(
        # The storage is needed to set the source bucket of the file
        request_helper.Storage,
        request_file
    )
    if file_stream is not None:
        file_hash_md5 = hashlib.md5()
        file_hash_sha1 = hashlib.sha1()
        file_hash_sha256 = hashlib.sha256()

        # Reading the stream in blocks lets us feed several hash algorithms in a single pass
        file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)
        while file_data_stream_buffer:
            file_hash_md5.update(file_data_stream_buffer)
            file_hash_sha1.update(file_data_stream_buffer)
            file_hash_sha256.update(file_data_stream_buffer)

            file_data_stream_buffer = file_stream.read(8 * 1024 * 1024)

        # We log some messages related to the result of the function
        request_helper.log("File hash calculated: MD5    - " + file_hash_md5.hexdigest(),
                           JobMessage.JobMessageType.INFO, request_file.id)
        request_helper.log("File hash calculated: SHA1   - " + file_hash_sha1.hexdigest(),
                           JobMessage.JobMessageType.INFO, request_file.id)
        request_helper.log("File hash calculated: SHA256 - " + file_hash_sha256.hexdigest(),
                           JobMessage.JobMessageType.INFO, request_file.id)

        # We can also store the calculated hashes in the database
        File.set_hash(request_file.id, "md5", file_hash_md5.hexdigest())
        File.set_hash(request_file.id, "sha1", file_hash_sha1.hexdigest())
        File.set_hash(request_file.id, "sha256", file_hash_sha256.hexdigest())

# This will finalize the current function Job
# The parameter is a boolean that determines if the function Job was successful or not
#
# If the parameter is True,  the result will be "COMPLETED",
# else,
# If the parameter is False, the result will be "FAILED"
request_helper.job_end(True)
