Datasets
These APIs allow you to create, read, update, and delete datasets owned by the currently authenticated user.
- You can use datasets, dataset, and download_dataset to access datasets or their metadata.
- upload_dataset, update_dataset, and delete_dataset can be used to create, update, or delete datasets.
See also: help.julialang.org on datasets, DataSets.jl.
Dataset types
JuliaHub currently has two distinct types of datasets:
- Blob: a single file; or, more abstractly, a collection of bytes
- BlobTree: a directory of files; more abstractly, a tree-like collection of Blobs, indexed by file system paths
These types mirror the concepts in DataSets.jl.
JuliaHub.jl APIs mostly do not depend on the dataset type, except when downloading or uploading. In those cases, a local file always corresponds to a Blob, and a local directory corresponds to a BlobTree. For example, trying to upload a file as a new version of a BlobTree-type dataset will fail, because the dataset type cannot change.
The upload_dataset function uses information from the filesystem to determine whether the created dataset is a Blob or a BlobTree. Similarly, download_dataset will always download a Blob into a file, and a BlobTree into a directory.
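The file/directory rule above can be checked locally before uploading. The following is a minimal sketch, assuming an already authenticated session; `expected_dtype` and `upload_if_type_matches` are hypothetical helpers written for illustration and are not part of JuliaHub.jl.

```julia
import JuliaHub

# Hypothetical helper: predict which dataset type a local path maps to,
# following the rule that a file is a Blob and a directory is a BlobTree.
function expected_dtype(local_path::AbstractString)
    isdir(local_path) && return "BlobTree"
    isfile(local_path) && return "Blob"
    throw(ArgumentError("no file or directory at $(local_path)"))
end

# Hypothetical wrapper: refuse to upload when the local path would not match
# the type of an existing dataset, since the dataset type cannot change.
function upload_if_type_matches(name::AbstractString, local_path::AbstractString)
    existing = JuliaHub.dataset(name; throw = false)
    local_dtype = expected_dtype(local_path)
    if existing !== nothing && existing.dtype != local_dtype
        throw(ArgumentError("$(name) is a $(existing.dtype), but $(local_path) maps to a $(local_dtype)"))
    end
    # create defaults to true, so this also covers the case of a new dataset.
    return JuliaHub.upload_dataset(name, local_path; update = true)
end
```

Checking the type up front only gives an earlier and more descriptive error; the upload itself is still done by upload_dataset.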
Dataset versions
A JuliaHub dataset can have zero or more versions. A newly created dataset usually has at least one version, but it may have zero versions if, for example, the upload did not finish. The versions are indexed with a linear list of integers starting from 1.
Reference
JuliaHub.datasets — Function
JuliaHub.datasets([username::AbstractString]; shared::Bool=false, [auth::Authentication]) -> Vector{Dataset}
List all datasets owned by username, returning a list of Dataset objects.
If username is omitted, it returns the datasets owned by the currently authenticated user. If username is different from the currently authenticated user, it only returns the datasets that are readable to (i.e. somehow shared with) the currently authenticated user.
If shared = true, it also returns datasets that belong to other users and have been shared with the currently authenticated user. In this case, username is effectively ignored.
julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("username", "blobtree/example"))
julia> JuliaHub.datasets(shared=true)
3-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("anotheruser", "publicdataset"))
JuliaHub.dataset(("username", "blobtree/example"))
julia> JuliaHub.datasets("anotheruser")
1-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("anotheruser", "publicdataset"))
Dataset objects represent the dataset metadata as it was when the Julia object was created (e.g. with dataset), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing Dataset to JuliaHub.dataset.
JuliaHub.DatasetReference — Type
const DatasetReference :: Type
Type constraint on the first argument of most of the dataset-related functions, used to uniquely specify the dataset that the operation will affect.
There are three different objects that can be passed as a dataset reference (dsref::DatasetReference):
- (owner::AbstractString, dataset_name::AbstractString)::Tuple{AbstractString,AbstractString}: A tuple of the owner's username and the dataset's name.
- dataset_name::AbstractString: Just a string with the dataset name; in this case the dataset's owner will be assumed to be the currently authenticated user (with the username determined from the Authentication object passed via the auth keyword).
- dataset::Dataset: Uses the owner and dataset name information from a Dataset object.
When using the third option (i.e. passing a Dataset), the dataset UUID will not be checked. So if the dataset with the same owner and name has been deleted and re-created as a new dataset (potentially of a different dtype etc.), the functions will act on the new dataset.
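The three reference forms, and a manual UUID check for the caveat above, might look as follows. This is a sketch only, assuming an authenticated session; the function names `lookup_all_forms` and `still_same_dataset` are hypothetical and only illustrate the documented calls.

```julia
import JuliaHub

# Look up the same dataset through each of the three reference forms.
# `owner` and `name` are supplied by the caller (e.g. "username", "example-dataset").
function lookup_all_forms(owner::AbstractString, name::AbstractString)
    by_tuple  = JuliaHub.dataset((owner, name))  # (owner, dataset_name) tuple
    by_name   = JuliaHub.dataset(name)           # name only; owner is the authenticated user
    by_object = JuliaHub.dataset(by_tuple)       # an existing Dataset object
    return by_tuple, by_name, by_object
end

# Hypothetical helper: because the UUID is not checked when a Dataset is used
# as a reference, compare it explicitly to detect a delete-and-recreate.
function still_same_dataset(old::JuliaHub.Dataset)
    current = JuliaHub.dataset(old; throw = false)
    return current !== nothing && current.uuid == old.uuid
end
```

Calling `still_same_dataset` before a destructive operation is one way to avoid acting on a dataset that merely shares the owner and name of the original.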
JuliaHub.dataset — Function
JuliaHub.dataset(dataset::DatasetReference; throw::Bool=true, [auth::Authentication]) -> Dataset
Looks up a dataset based on the dataset reference dataset. Returns the corresponding Dataset object, or throws an InvalidRequestError if the dataset cannot be found (if throw=false is passed, returns nothing instead).
By passing a Dataset object as dataset, this can be used to update the Dataset object.
julia> dataset = JuliaHub.dataset("example-dataset")
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
size: 57 bytes
tags: tag1, tag2
julia> JuliaHub.dataset(dataset)
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
size: 57 bytes
tags: tag1, tag2
If the specified username is not the currently authenticated user, the dataset must be shared with the currently authenticated user (i.e. contained in datasets(; shared=true)).
Dataset objects represent the dataset metadata as it was when the Julia object was created (e.g. with dataset), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing Dataset to JuliaHub.dataset.
JuliaHub.download_dataset — Function
download_dataset(
    dataset::DatasetReference, local_path::AbstractString;
    replace::Bool = false, [version::Integer],
    [quiet::Bool = false], [auth::Authentication]
) -> String
Downloads the dataset specified by the dataset reference dataset to local_path (which must not exist, unless replace = true), returning the absolute path to the downloaded file or directory. If the dataset is a Blob, then the created local_path will be a file, and if the dataset is a BlobTree, the local_path will be a directory.
By default, it downloads the latest version, but an older version can be downloaded by specifying the version keyword argument.
The function also prints download progress to standard output. This can be disabled by setting quiet=true. Any error output from the download is still printed.
Setting replace = true will recursively erase any existing data at local_path before replacing it with the dataset contents.
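As an illustration of the version and quiet keywords, the sketch below downloads one specific version into a fresh temporary directory. It assumes an authenticated session; `download_version` is a hypothetical helper name, and the `v<N>` naming of the target path is an arbitrary choice made for this example.

```julia
import JuliaHub

# Hypothetical helper: download a given version of a dataset into a new
# location under `dest_root`, without printing progress output.
function download_version(dsref, version::Integer; dest_root::AbstractString = mktempdir())
    # The target path must not exist yet, because replace defaults to false.
    local_path = joinpath(dest_root, "v$(version)")
    # Returns the absolute path of the downloaded file (Blob) or directory (BlobTree).
    return JuliaHub.download_dataset(dsref, local_path; version = version, quiet = true)
end
```

Downloading into a new path instead of passing replace = true avoids the recursive erase of existing local data.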
JuliaHub.upload_dataset — Function
JuliaHub.upload_dataset(dataset::DatasetReference, local_path; [auth,] kwargs...) -> Dataset
Uploads a new dataset or a new version of an existing dataset, with the dataset specified by the dataset reference dataset. The dataset type is determined from the local path (Blob if a file, BlobTree if a directory). If a Dataset object is passed, it attempts to update that dataset. Returns an updated Dataset object.
The following keyword arguments can be used to control the exact behavior of the function:
- create :: Bool (default: true): Create the dataset, if it does not already exist.
- update :: Bool (default: false): Upload the data as a new dataset version, if the dataset exists.
- replace :: Bool (default: false): If a dataset exists, delete all existing data and create a new dataset with the same name instead. Excludes update = true, and only creates a completely new dataset if create=true as well.
In addition, the following keyword arguments can be passed to set or update the dataset metadata when uploading:
- description: description of the dataset (a string)
- tags: an iterable of strings of all the tags of the dataset
- visibility: a string with possible values public or private
- license: a valid SPDX license identifier, or a tuple (:fulltext, license_text), where license_text is the full text string of a custom license
- groups: an iterable of valid group names
If a dataset already exists, then these fields are updated as if update_dataset was called.
The function will throw an ArgumentError for invalid argument combinations.
Presently, it is only possible to upload datasets for the currently authenticated user.
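The create, update, and replace keywords combine into a few common upload modes. The sketch below shows three of them, assuming an authenticated session; the dataset name, description, and tags are placeholders, and the wrapper function names are hypothetical.

```julia
import JuliaHub

# Placeholder dataset name; adjust for your own account.
const DATASET_NAME = "example-dataset"

# Create the dataset if it is missing, otherwise add a new version to it,
# setting some metadata at the same time.
function upload_new_version(local_path::AbstractString)
    return JuliaHub.upload_dataset(
        DATASET_NAME, local_path;
        create = true, update = true,
        description = "An example dataset",
        tags = ["tag1", "tag2"],
    )
end

# Only add a version to a dataset that already exists; never create one.
function upload_existing_only(local_path::AbstractString)
    return JuliaHub.upload_dataset(DATASET_NAME, local_path; create = false, update = true)
end

# Delete all existing data and start over under the same name.
# replace = true must not be combined with update = true.
function upload_from_scratch(local_path::AbstractString)
    return JuliaHub.upload_dataset(DATASET_NAME, local_path; replace = true)
end
```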
JuliaHub.update_dataset — Function
JuliaHub.update_dataset(dataset::DatasetReference; kwargs..., [auth]) -> Dataset
Updates the metadata of the dataset specified by the dataset reference dataset, according to the keyword arguments. If a keyword is omitted, the corresponding metadata remains unchanged. Returns the Dataset object corresponding to the updated dataset.
The supported keywords are:
- description: description of the dataset (a string)
- tags: an iterable of strings of all the tags of the dataset
- visibility: a string with possible values public or private
- license: a valid SPDX license identifier, or a tuple (:fulltext, license_text), where license_text is the full text string of a custom license
- groups: an iterable of valid group names
For example, to add a new tag to a dataset:
dataset = JuliaHub.dataset("my_dataset")
JuliaHub.update_dataset(dataset; tags = [dataset.tags..., "newtag"])
Presently, it is only possible to update datasets for the currently authenticated user.
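Because the tags keyword sets the complete list of tags, adding tags means merging them with the existing ones first. A minimal sketch of that pattern, assuming an authenticated session, is shown below; `merged_tags` and `add_tags` are hypothetical helpers, not part of JuliaHub.jl.

```julia
import JuliaHub

# Hypothetical helper: combine existing and new tags, dropping duplicates
# while keeping the original order.
merged_tags(existing, new) = unique(vcat(collect(existing), collect(new)))

# Hypothetical helper: fetch the current metadata, then write back the merged tags.
function add_tags(dsref, new_tags)
    dataset = JuliaHub.dataset(dsref)
    return JuliaHub.update_dataset(dataset; tags = merged_tags(dataset.tags, new_tags))
end
```

Fetching the dataset immediately before the update reduces the chance of overwriting tags with stale metadata, since Dataset objects are not refreshed automatically.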
JuliaHub.delete_dataset — Function
JuliaHub.delete_dataset(dataset::DatasetReference; force::Bool=false, [auth::Authentication]) -> Nothing
Deletes the dataset specified by the dataset reference dataset. Returns nothing if the deletion was successful, or throws an error if it was not.
Normally, when the dataset to be deleted does not exist, the function throws an error. This can be overridden by setting force = true.
julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("username", "blobtree"))
julia> JuliaHub.delete_dataset("example-dataset")
julia> JuliaHub.datasets()
1-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "blobtree"))
Presently, it is only possible to delete datasets for the currently authenticated user.
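The force keyword is useful for cleanup code that should not fail when a dataset is already gone. The following sketch assumes an authenticated session; `cleanup_datasets` is a hypothetical helper name.

```julia
import JuliaHub

# Hypothetical cleanup helper: delete each named dataset owned by the
# authenticated user. force = true suppresses the error that is normally
# thrown when a dataset does not exist.
function cleanup_datasets(names)
    for name in names
        JuliaHub.delete_dataset(name; force = true)
    end
    return nothing
end
```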
JuliaHub.Dataset — Type
struct Dataset
Information about a dataset stored on JuliaHub. The following fields are considered to be public API:
- uuid :: UUID: dataset UUID
- owner :: String: username of the dataset owner
- name :: String: dataset name
- dtype :: String: generally either Blob or BlobTree, but additional values may be added in the future
- size :: Int: total size of the whole dataset (including all the dataset versions) in bytes
- Fields to access user-provided dataset metadata:
  - description :: String: dataset description
  - tags :: Vector{String}: a list of tags
In some contexts, such as when accessing JuliaHub datasets with DataSets.jl, the (.owner, .name) pair constitutes the fully qualified dataset name, uniquely identifying a dataset on a JuliaHub instance. That is, for a Dataset object dataset, it can be constructed as "$(dataset.owner)/$(dataset.name)".
Dataset objects represent the dataset metadata as it was when the Julia object was created (e.g. with dataset), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing Dataset to JuliaHub.dataset.
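The fully qualified name described above can be built with a one-line helper. The sketch below uses a NamedTuple with placeholder values as a stand-in for a Dataset object, so it runs without a JuliaHub connection; `qualified_name` is a hypothetical name.

```julia
# Hypothetical helper: works for any object with `owner` and `name` fields,
# which includes the public fields of a JuliaHub Dataset.
qualified_name(dataset) = "$(dataset.owner)/$(dataset.name)"

# Stand-in for a Dataset object, with placeholder values.
example = (owner = "username", name = "example-dataset")
println(qualified_name(example))  # prints "username/example-dataset"
```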