Datasets

These APIs allow you to create, read, update, and delete datasets owned by the currently authenticated user.

See also: help.juliahub.com on datasets, DataSets.jl.

Dataset types

JuliaHub currently has two distinct types of datasets:

  1. Blob: a single file; or, more abstractly, a collection of bytes
  2. BlobTree: a directory or a file; more abstractly a tree-like collection of Blobs, indexed by file system paths

These types mirror the concepts in DataSets.jl.

JuliaHub.jl APIs mostly ignore the dataset type, except when downloading or uploading. In those cases, a local file always corresponds to a Blob, and a local directory corresponds to a BlobTree. For example, trying to upload a file as a new version of a BlobTree-type dataset will fail, because the type of an existing dataset cannot change.

The upload_dataset function uses filesystem information to determine whether the created dataset is a Blob or a BlobTree. Similarly, download_dataset will always download a Blob as a file, and a BlobTree as a directory.
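
The file-versus-directory rule can be sketched in a few lines of plain Julia. The helper name expected_dtype is hypothetical and not part of JuliaHub.jl; it only mirrors the rule described above, using isfile and isdir from Base, and makes no claim about how upload_dataset is implemented internally.

```julia
# Hypothetical helper (not part of JuliaHub.jl): predicts the dataset type
# that a local path would correspond to, following the rule described above.
function expected_dtype(local_path::AbstractString)
    if isfile(local_path)
        return "Blob"        # a single file corresponds to a Blob
    elseif isdir(local_path)
        return "BlobTree"    # a directory corresponds to a BlobTree
    else
        throw(ArgumentError("no file or directory at: $local_path"))
    end
end

# Try it on a temporary directory that contains one file.
mktempdir() do dir
    file = joinpath(dir, "data.csv")
    write(file, "a,b\n1,2\n")
    println(expected_dtype(file))  # Blob
    println(expected_dtype(dir))   # BlobTree
end
```

A check like this can be run before uploading a new version, to compare the predicted type against the dtype field of the existing dataset.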

Dataset versions

A JuliaHub dataset can have zero or more versions. A newly created dataset usually has at least one version, but it may have zero versions if, for example, the upload did not finish. The versions are indexed with a linear list of integers starting from 1.

Reference

JuliaHub.Dataset - Type
struct Dataset

Information about a dataset stored on JuliaHub. The following fields are considered to be public API:

  • uuid :: UUID: dataset UUID
  • owner :: String: username of the dataset owner
  • name :: String: dataset name
  • dtype :: String: generally either Blob or BlobTree, but additional values may be added in the future
  • versions :: Vector{DatasetVersion}: an ordered list of DatasetVersion objects, one for each dataset version, sorted from oldest to latest (i.e. you can use last to get the newest version).
  • size :: Int: total size of the whole dataset (including all the dataset versions) in bytes
  • Fields to access user-provided dataset metadata:
    • description :: String: dataset description
    • tags :: Vector{String}: a list of tags

Canonical fully qualified dataset name

In some contexts, like when accessing JuliaHub datasets with DataSets.jl, the pair of .owner and .name constitutes the fully qualified dataset name, uniquely identifying a dataset on a JuliaHub instance. For a dataset object dataset, it can be constructed as "$(dataset.owner)/$(dataset.name)".
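
As a minimal sketch of that construction, the snippet below wraps the interpolation in a helper. The name qualified_name is hypothetical and not part of JuliaHub.jl, and a NamedTuple stands in for a Dataset object so that the example runs without a JuliaHub connection.

```julia
# Hypothetical helper: works on any object with `owner` and `name`
# properties, such as a JuliaHub.Dataset.
qualified_name(ds) = "$(ds.owner)/$(ds.name)"

# A NamedTuple stands in for a Dataset object here.
example = (owner = "username", name = "example-dataset")
println(qualified_name(example))  # username/example-dataset
```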

Non-dynamic dataset objects

Dataset objects represent the dataset metadata as it was when the Julia object was created (e.g. with dataset), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing Dataset to JuliaHub.dataset.

No public constructors

Objects of this type should not be constructed explicitly. The constructor methods are not considered to be part of the public API.

JuliaHub.DatasetVersion - Type
struct DatasetVersion

Represents one version of a dataset.

Objects have the following properties:

  • .id: unique dataset version identifier (used e.g. in download_dataset to identify the dataset version).
  • .size :: Int: size of the dataset version in bytes
  • .timestamp :: ZonedDateTime: dataset version timestamp

julia> JuliaHub.datasets()

See also: Dataset, datasets, dataset.

No public constructors

Objects of this type should not be constructed explicitly. The constructor methods are not considered to be part of the public API.

JuliaHub.datasets - Function
JuliaHub.datasets([username::AbstractString]; shared::Bool=false, [auth::Authentication]) -> Vector{Dataset}

List all datasets owned by username, returning a list of Dataset objects.

If username is omitted, it returns the datasets owned by the currently authenticated user. If username is different from the currently authenticated user, it only returns the datasets that are readable to (i.e. somehow shared with) the currently authenticated user.

If shared = true, it also returns datasets that belong to other users and have been shared with the currently authenticated user. In this case, username is effectively ignored.

julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
 JuliaHub.dataset(("username", "example-dataset"))
 JuliaHub.dataset(("username", "blobtree/example"))

julia> JuliaHub.datasets(shared=true)
3-element Vector{JuliaHub.Dataset}:
 JuliaHub.dataset(("username", "example-dataset"))
 JuliaHub.dataset(("anotheruser", "publicdataset"))
 JuliaHub.dataset(("username", "blobtree/example"))

julia> JuliaHub.datasets("anotheruser")
1-element Vector{JuliaHub.Dataset}:
 JuliaHub.dataset(("anotheruser", "publicdataset"))

Non-dynamic dataset objects

Dataset objects represent the dataset metadata as it was when the Julia object was created (e.g. with dataset), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing Dataset to JuliaHub.dataset.

JuliaHub.DatasetReference - Type
const DatasetReference :: Type

Type constraint on the first argument of most of the dataset-related functions. It is used to uniquely specify the dataset that the operation will affect.

There are three different objects that can be passed as a dataset reference (dsref::DatasetReference):

  • (owner::AbstractString, dataset_name::AbstractString)::Tuple{AbstractString,AbstractString}

    A tuple of the owner's username and the dataset's name.

  • dataset_name::AbstractString

    Just a string with the dataset name; in this case the dataset's owner is assumed to be the currently authenticated user (with the username determined from the Authentication object passed via the auth keyword).

  • dataset::Dataset

    Uses the owner and dataset name information from a Dataset object.

No UUID mismatch checks

When using the third option (i.e. passing a Dataset), the dataset UUID will not be checked. So if a dataset with the same owner and name has been deleted and re-created as a new dataset (potentially of a different dtype etc.), the functions will act on the new dataset.
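
If it matters that an operation acts on the original dataset, the UUIDs can be compared by hand before calling it. The following is only a sketch: same_dataset is a hypothetical helper, not part of JuliaHub.jl, and NamedTuples stand in for Dataset objects so that the example runs offline.

```julia
using UUIDs

# Hypothetical helper (not part of JuliaHub.jl): true only if `fresh` exists
# and still has the UUID recorded in the older `stored` object.
same_dataset(stored, fresh) = fresh !== nothing && stored.uuid == fresh.uuid

# NamedTuples stand in for Dataset objects; the UUIDs are randomly generated.
stored    = (owner = "username", name = "example-dataset", uuid = uuid4())
recreated = (owner = "username", name = "example-dataset", uuid = uuid4())

println(same_dataset(stored, stored))     # true
println(same_dataset(stored, recreated))  # false: same owner and name, different UUID
println(same_dataset(stored, nothing))    # false: the dataset no longer exists

# With a live session, the fresh object would come from the documented lookup:
#   fresh = JuliaHub.dataset(stored; throw = false)
```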

JuliaHub.dataset - Function
JuliaHub.dataset(dataset::DatasetReference; throw::Bool=true, [auth::Authentication]) -> Dataset

Looks up a dataset based on the dataset reference dataset. Returns the Dataset object corresponding to the reference, or throws an InvalidRequestError if the dataset cannot be found (if throw=false is passed, returns nothing instead).

By passing a Dataset object as dataset, this can be used to update the Dataset object.

julia> dataset = JuliaHub.dataset("example-dataset")
Dataset: example-dataset (Blob)
 owner: username
 description: An example dataset
 versions: 2
 size: 388 bytes
 tags: tag1, tag2

julia> JuliaHub.dataset(dataset)
Dataset: example-dataset (Blob)
 owner: username
 description: An example dataset
 versions: 2
 size: 388 bytes
 tags: tag1, tag2

If the specified username is not the currently authenticated user, the dataset must be shared with the currently authenticated user (i.e. contained in datasets(; shared=true)).

Note

This will call datasets every time, which might become a problem if you are processing a large number of datasets. In that case, you should call datasets and process the returned list yourself.
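
One way to process the returned list yourself is to build a lookup table once and query it repeatedly. This is a sketch under assumptions: index_datasets is a hypothetical helper, not part of JuliaHub.jl, and NamedTuples stand in for the Dataset objects that datasets would return.

```julia
# Hypothetical helper: index a list of dataset-like objects by (owner, name).
index_datasets(list) = Dict((ds.owner, ds.name) => ds for ds in list)

# Stand-in for the result of a single listing call.
listing = [
    (owner = "username", name = "example-dataset"),
    (owner = "anotheruser", name = "publicdataset"),
]
index = index_datasets(listing)

println(haskey(index, ("username", "example-dataset")))  # true
println(get(index, ("username", "missing"), nothing))    # nothing

# With a live session, the list would come from one documented call:
#   index = index_datasets(JuliaHub.datasets(; shared = true))
```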

Non-dynamic dataset objects

Dataset objects represent the dataset metadata as it was when the Julia object was created (e.g. with dataset), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing Dataset to JuliaHub.dataset.

JuliaHub.download_dataset - Function
download_dataset(
    dataset::DatasetReference, local_path::AbstractString;
    replace::Bool = false, [version::Integer],
    [quiet::Bool = false], [auth::Authentication]
) -> String

Downloads the dataset specified by the dataset reference dataset to local_path (which must not exist, unless replace = true), returning the absolute path to the downloaded file or directory. If the dataset is a Blob, then the created local_path will be a file, and if the dataset is a BlobTree the local_path will be a directory.

By default, it downloads the latest version, but an older version can be downloaded by specifying the version keyword argument. Caution: you should never assume that the index of an element in the .versions property of Dataset matches the version number. Always explicitly use the .id property of the DatasetVersion object.
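
The caution above can be illustrated with a small offline sketch. The helpers latest_version_id and version_id_at are hypothetical and not part of JuliaHub.jl, and the NamedTuple with made-up ids stands in for a Dataset whose positions and version ids happen to differ.

```julia
# Hypothetical helpers (not part of JuliaHub.jl). They read the identifier
# from the version object instead of reusing its position in `.versions`.
latest_version_id(ds) = last(ds.versions).id
version_id_at(ds, position::Integer) = ds.versions[position].id

# A NamedTuple stands in for a Dataset; the ids are made up for illustration.
ds = (name = "example-dataset", versions = [(id = 1,), (id = 3,)])

println(version_id_at(ds, 2))   # 3, not 2
println(latest_version_id(ds))  # 3

# With a live session, the id would be passed to the documented keyword:
#   JuliaHub.download_dataset(ds, "local/path"; version = latest_version_id(ds))
```

Note that both helpers throw an error for a dataset with zero versions, so that case needs to be checked separately.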

The function also prints download progress to standard output. This can be disabled by setting quiet=true. Any error output from the download is still printed.

Warning

Setting replace = true will recursively erase any existing data at local_path before replacing it with the dataset contents.

JuliaHub.upload_dataset - Function
JuliaHub.upload_dataset(dataset::DatasetReference, local_path; [auth,] kwargs...) -> Dataset

Uploads a new dataset or a new version of an existing dataset, with the dataset specified by the dataset reference dataset. The dataset type is determined from the local path (Blob if a file, BlobTree if a directory). If a Dataset object is passed, it attempts to update that dataset. Returns an updated Dataset object.

The following keyword arguments can be used to control the exact behavior of the function:

  • create :: Bool (default: true): Create the dataset, if it does not already exist.
  • update :: Bool (default: false): Upload the data as a new dataset version, if the dataset exists.
  • replace :: Bool (default: false): If a dataset exists, delete all existing data and create a new dataset with the same name instead. Cannot be combined with update = true, and only creates a completely new dataset if create=true as well.

In addition, the following keyword arguments can be passed to set or update the dataset metadata when uploading:

  • description: description of the dataset (a string)
  • tags: an iterable of strings of all the tags of the dataset
  • visibility: a string with possible values public or private
  • license: a valid SPDX license identifier, or a tuple (:fulltext, license_text), where license_text is the full text string of a custom license
  • groups: an iterable of valid group names

If a dataset already exists, then these fields are updated as if update_dataset was called.

The function will throw an ArgumentError for invalid argument combinations.

Use the progress keyword argument to suppress upload progress from being printed.

Note

Presently, it is only possible to upload datasets for the currently authenticated user.

JuliaHub.update_dataset - Function
JuliaHub.update_dataset(dataset::DatasetReference; kwargs..., [auth]) -> Dataset

Updates the metadata of the dataset specified by the dataset reference dataset, according to the keyword arguments. If a keyword is omitted, the metadata corresponding to it remains unchanged. Returns the Dataset object corresponding to the updated dataset.

The supported keywords are:

  • description: description of the dataset (a string)
  • tags: an iterable of strings of all the tags of the dataset
  • visibility: a string with possible values public or private
  • license: a valid SPDX license identifier, or a tuple (:fulltext, license_text), where license_text is the full text string of a custom license
  • groups: an iterable of valid group names

For example, to add a new tag to a dataset:

dataset = JuliaHub.dataset("my_dataset")
JuliaHub.update_dataset(dataset; tags = [dataset.tags..., "newtag"])

Note

Presently, it is only possible to update datasets for the currently authenticated user.

JuliaHub.delete_dataset - Function
JuliaHub.delete_dataset(dataset::DatasetReference; force::Bool=false, [auth::Authentication]) -> Nothing

Deletes the dataset specified by the dataset reference dataset. Returns nothing if the deletion was successful, or throws an error if it was not.

Normally, when the dataset to be deleted does not exist, the function throws an error. This can be overridden by setting force = true.

julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
 JuliaHub.dataset(("username", "example-dataset"))
 JuliaHub.dataset(("username", "blobtree"))

julia> JuliaHub.delete_dataset("example-dataset")

julia> JuliaHub.datasets()
1-element Vector{JuliaHub.Dataset}:
 JuliaHub.dataset(("username", "blobtree"))

Note

Presently, it is only possible to delete datasets for the currently authenticated user.
