Datasets
These APIs allow you to create, read, update, and delete datasets owned by the currently authenticated user.
- You can use `datasets`, `dataset`, and `download_dataset` to access datasets or their metadata.
- `upload_dataset`, `update_dataset`, and `delete_dataset` can be used to create, update, or delete datasets.
See also: help.juliahub.com on datasets, DataSets.jl.
Dataset types
JuliaHub currently has two distinct types of datasets:
- `Blob`: a single file; or, more abstractly, a collection of bytes
- `BlobTree`: a directory or a file; more abstractly, a tree-like collection of `Blob`s, indexed by file system paths
These types mirror the concepts in DataSets.jl.
JuliaHub.jl APIs mostly do not depend on the dataset type, except when downloading or uploading. In those cases, a local file always corresponds to a `Blob`, and a local directory corresponds to a `BlobTree`. For example, trying to upload a file as a new version of a `BlobTree`-type dataset will fail, because the dataset type cannot change.
The `upload_dataset` function uses information from the filesystem to determine whether the created dataset is a `Blob` or a `BlobTree`, and similarly `download_dataset` will always download a `Blob` into a file, and a `BlobTree` as a directory.
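The file-versus-directory rule above can be sketched in a few lines. The helper `expected_dtype` is hypothetical and not part of JuliaHub.jl; it only mirrors the documented mapping, and may be useful as a local check before uploading a new version of an existing dataset.

```julia
# Hypothetical helper (not part of JuliaHub.jl): predicts which dataset type
# a local path corresponds to, following the rule described above.
function expected_dtype(local_path::AbstractString)
    if isdir(local_path)
        return "BlobTree"   # a local directory corresponds to a BlobTree
    elseif isfile(local_path)
        return "Blob"       # a local file corresponds to a Blob
    else
        throw(ArgumentError("no file or directory at: $(local_path)"))
    end
end
```

Comparing the result against the `dtype` field of an existing `Dataset` would indicate, under these assumptions, whether an upload of that path as a new version is likely to be rejected.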
Dataset versions
A JuliaHub dataset can have zero or more versions. A newly created dataset usually has at least one version, but it may have zero versions if, for example, the upload did not finish. The versions are indexed with a linear list of integers starting from `1`.
Reference
JuliaHub.Dataset — Type
struct Dataset
Information about a dataset stored on JuliaHub. The following fields are considered to be public API:
- `uuid :: UUID`: dataset UUID
- `owner :: String`: username of the dataset owner
- `name :: String`: dataset name
- `dtype :: String`: generally either `Blob` or `BlobTree`, but additional values may be added in the future
- `versions :: Vector{DatasetVersion}`: an ordered list of `DatasetVersion` objects, one for each dataset version, sorted from oldest to latest (i.e. you can use `last` to get the newest version)
- `size :: Int`: total size of the whole dataset (including all the dataset versions) in bytes
- Fields to access user-provided dataset metadata:
  - `description :: String`: dataset description
  - `tags :: Vector{String}`: a list of tags
In some contexts, like when accessing JuliaHub datasets with DataSets.jl, the `.owner`-`.name` tuple constitutes the fully qualified dataset name, uniquely identifying a dataset on a JuliaHub instance. I.e. for a dataset object `dataset`, it can be constructed as `"$(dataset.owner)/$(dataset.name)"`.
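As a small illustration of the fully qualified name, the sketch below builds it from any object that has `owner` and `name` fields. The function `qualified_name` is a hypothetical helper, and the `NamedTuple` only stands in for a real `Dataset` object so that the snippet runs without a JuliaHub connection.

```julia
# Hypothetical helper (not part of JuliaHub.jl): builds the "owner/name"
# string from any object exposing `owner` and `name` fields.
qualified_name(ds) = "$(ds.owner)/$(ds.name)"

# Stand-in for a Dataset object; the values are made up for illustration.
example = (owner = "username", name = "example-dataset")
println(qualified_name(example))
```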
`Dataset` objects represent the dataset metadata as it was when the Julia object was created (e.g. with `dataset`), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing `Dataset` to `JuliaHub.dataset`.
Objects of this type should not be constructed explicitly. The constructor methods are not considered to be part of the public API.
JuliaHub.DatasetVersion — Type
struct DatasetVersion
Represents one version of a dataset.
Objects have the following properties:
- `.id`: unique dataset version identifier (used e.g. in `download_dataset` to identify the dataset version)
- `.size :: Int`: size of the dataset version in bytes
- `.timestamp :: ZonedDateTime`: dataset version timestamp
julia> JuliaHub.datasets()
See also: `Dataset`, `datasets`, `dataset`.
Objects of this type should not be constructed explicitly. The constructor methods are not considered to be part of the public API.
JuliaHub.datasets — Function
JuliaHub.datasets([username::AbstractString]; shared::Bool=false, [auth::Authentication]) -> Vector{Dataset}
List all datasets owned by `username`, returning a list of `Dataset` objects.
If `username` is omitted, it returns the datasets owned by the currently authenticated user. If `username` is different from the currently authenticated user, it only returns the datasets that are readable to (i.e. somehow shared with) the currently authenticated user.
If `shared = true`, it also returns datasets that belong to other users and have been shared with the currently authenticated user. In this case, `username` is effectively ignored.
julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("username", "blobtree/example"))
julia> JuliaHub.datasets(shared=true)
3-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("anotheruser", "publicdataset"))
JuliaHub.dataset(("username", "blobtree/example"))
julia> JuliaHub.datasets("anotheruser")
1-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("anotheruser", "publicdataset"))
`Dataset` objects represent the dataset metadata as it was when the Julia object was created (e.g. with `dataset`), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing `Dataset` to `JuliaHub.dataset`.
JuliaHub.DatasetReference — Type
const DatasetReference :: Type
Type constraint on the first argument of most of the dataset-related functions, used to uniquely specify the dataset that the operation will affect.
There are three different objects that can be passed as a dataset reference (`dsref::DatasetReference`):
- `(owner::AbstractString, dataset_name::AbstractString)::Tuple{AbstractString,AbstractString}`
  A tuple of the owner's username and the dataset's name.
- `dataset_name::AbstractString`
  Just a string with the dataset name; in this case the dataset's owner will be assumed to be the currently authenticated user (with the username determined from the `Authentication` object passed via the `auth` keyword).
- `dataset::Dataset`
  Uses the owner and dataset name information from a `Dataset` object.
When using the third option (i.e. passing a `Dataset`), the dataset UUID will not be checked. So if the dataset with the same owner and dataset name has been deleted and re-created as a new dataset (potentially of a different `dtype` etc.), the functions will then act on the new dataset.
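The three reference forms can be compared side by side with a short sketch. It assumes an authenticated JuliaHub session is already available, and that `owner` is the currently authenticated user who owns a dataset called `name`. The helper `lookup_three_ways` is hypothetical and only combines calls documented on this page.

```julia
using JuliaHub

# Hypothetical helper; assumes an authenticated session and that `name` is a
# dataset owned by the currently authenticated user `owner`.
function lookup_three_ways(owner::AbstractString, name::AbstractString)
    by_tuple = JuliaHub.dataset((owner, name))  # (owner, dataset_name) tuple
    by_name = JuliaHub.dataset(name)            # owner defaults to the authenticated user
    by_object = JuliaHub.dataset(by_tuple)      # an existing Dataset object
    # Under the assumptions above, all three should resolve to the same dataset.
    return by_tuple.uuid == by_name.uuid == by_object.uuid
end
```

Because the UUID is not checked when a `Dataset` object is passed, code that needs to detect a delete-and-recreate could compare `uuid` values explicitly, as in the last line of the helper.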
JuliaHub.dataset — Function
JuliaHub.dataset(dataset::DatasetReference; throw::Bool=true, [auth::Authentication]) -> Dataset
Looks up a dataset based on the dataset reference `dataset`. Returns the `Dataset` object corresponding to the reference, or throws an `InvalidRequestError` if the dataset cannot be found (if `throw=false` is passed, returns `nothing` instead).
By passing a `Dataset` object as `dataset`, this can be used to update the `Dataset` object.
julia> dataset = JuliaHub.dataset("example-dataset")
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
versions: 2
size: 388 bytes
tags: tag1, tag2
julia> JuliaHub.dataset(dataset)
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
versions: 2
size: 388 bytes
tags: tag1, tag2
If the specified username is not the currently authenticated user, the dataset must be shared with the currently authenticated user (i.e. contained in `datasets(; shared=true)`).
`Dataset` objects represent the dataset metadata as it was when the Julia object was created (e.g. with `dataset`), and are not automatically kept up to date. To refresh the dataset metadata, you can pass the existing `Dataset` to `JuliaHub.dataset`.
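The `throw=false` behavior lends itself to a simple existence check. The following is a minimal sketch that assumes an authenticated JuliaHub session; `dataset_exists` is a hypothetical helper name, not a JuliaHub.jl function.

```julia
using JuliaHub

# Hypothetical helper; assumes an authenticated session.
# `ref` can be any of the dataset reference forms (tuple, name, or Dataset).
function dataset_exists(ref)
    # With throw=false, a missing dataset yields `nothing` instead of an error.
    return !isnothing(JuliaHub.dataset(ref; throw = false))
end
```

A call such as `dataset_exists("example-dataset")` would then return `true` or `false` rather than raising `InvalidRequestError` for a missing dataset.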
JuliaHub.download_dataset — Function
download_dataset(
    dataset::DatasetReference, local_path::AbstractString;
    replace::Bool = false, [version::Integer],
    [quiet::Bool = false], [auth::Authentication]
) -> String
Downloads the dataset specified by the dataset reference `dataset` to `local_path` (which must not exist, unless `replace = true`), returning the absolute path to the downloaded file or directory. If the dataset is a `Blob`, then the created `local_path` will be a file, and if the dataset is a `BlobTree`, the `local_path` will be a directory.
By default, it downloads the latest version, but an older version can be downloaded by specifying the `version` keyword argument. Caution: you should never assume that the index of the `.versions` property of `Dataset` matches the version number; always explicitly use the `.id` property of the `DatasetVersion` object.
The function also prints download progress to standard output. This can be disabled by setting `quiet=true`. Any error output from the download is still printed.
Setting `replace = true` will recursively erase any existing data at `local_path` before replacing it with the dataset contents.
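To make the caution about version numbers concrete, here is a hedged sketch of downloading the oldest version of a dataset by its `.id` rather than by its position in `.versions`. It assumes an authenticated JuliaHub session and that `.id` is an integer accepted by the `version` keyword; `download_oldest_version` is a hypothetical helper name.

```julia
using JuliaHub

# Hypothetical helper; assumes an authenticated session.
function download_oldest_version(ref, local_path::AbstractString)
    ds = JuliaHub.dataset(ref)
    # A dataset may have zero versions, e.g. if the upload did not finish.
    isempty(ds.versions) && error("dataset $(ds.owner)/$(ds.name) has no versions")
    oldest = first(ds.versions)  # versions are sorted from oldest to latest
    # Pass the version's `.id`, not its index in `ds.versions`.
    return JuliaHub.download_dataset(ds, local_path; version = oldest.id, quiet = true)
end
```

The return value is the absolute path reported by `download_dataset`. Since `replace` is left at its default of `false`, the call is expected to fail if `local_path` already exists.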
JuliaHub.upload_dataset — Function
JuliaHub.upload_dataset(dataset::DatasetReference, local_path; [auth,] kwargs...) -> Dataset
Uploads a new dataset or a new version of an existing dataset, with the dataset specified by the dataset reference `dataset`. The dataset type is determined from the local path (`Blob` if a file, `BlobTree` if a directory). If a `Dataset` object is passed, it attempts to update that dataset. Returns an updated `Dataset` object.
The following keyword arguments can be used to control the exact behavior of the function:
- `create :: Bool` (default: `true`): Create the dataset, if it does not already exist.
- `update :: Bool` (default: `false`): Upload the data as a new dataset version, if the dataset exists.
- `replace :: Bool` (default: `false`): If a dataset exists, delete all existing data and create a new dataset with the same name instead. Excludes `update = true`, and only creates a completely new dataset if `create=true` as well.
In addition, the following keyword arguments can be passed to set or update the dataset metadata when uploading:
- `description`: description of the dataset (a string)
- `tags`: an iterable of strings of all the tags of the dataset
- `visibility`: a string with possible values `public` or `private`
- `license`: a valid SPDX license identifier, or a tuple `(:fulltext, license_text)`, where `license_text` is the full text string of a custom license
- `groups`: an iterable of valid group names
If a dataset already exists, then these fields are updated as if `update_dataset` was called.
The function will throw an `ArgumentError` for invalid argument combinations.
Use the `progress` keyword argument to suppress the printing of upload progress.
Presently, it is only possible to upload datasets for the currently authenticated user.
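The interaction of `create` and `update` can be illustrated with a short sketch: with both set to `true`, the call should create the dataset if it is missing and otherwise upload a new version. It assumes an authenticated JuliaHub session; `upload_or_add_version` is a hypothetical helper, and the description and tag values are placeholders.

```julia
using JuliaHub

# Hypothetical helper; assumes an authenticated session.
function upload_or_add_version(name::AbstractString, local_path::AbstractString)
    return JuliaHub.upload_dataset(
        name, local_path;
        create = true,  # create the dataset if it does not exist yet (the default)
        update = true,  # otherwise upload the data as a new dataset version
        description = "An example dataset",  # placeholder metadata
        tags = ["tag1", "tag2"],             # placeholder metadata
    )
end
```

Leaving `update` at its default of `false` would instead restrict the call to creating new datasets only, which can be a safer default when accidental new versions are a concern.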
JuliaHub.update_dataset — Function
JuliaHub.update_dataset(dataset::DatasetReference; kwargs..., [auth]) -> Dataset
Updates the metadata of the dataset specified by the dataset reference `dataset`, according to the keyword arguments. If a keyword is omitted, the metadata corresponding to it remains unchanged. Returns the `Dataset` object corresponding to the updated dataset.
The supported keywords are:
- `description`: description of the dataset (a string)
- `tags`: an iterable of strings of all the tags of the dataset
- `visibility`: a string with possible values `public` or `private`
- `license`: a valid SPDX license identifier, or a tuple `(:fulltext, license_text)`, where `license_text` is the full text string of a custom license
- `groups`: an iterable of valid group names
For example, to add a new tag to a dataset:
dataset = JuliaHub.dataset("my_dataset")
JuliaHub.update_dataset(dataset; tags = [dataset.tags..., "newtag"])
Presently, it is only possible to update datasets for the currently authenticated user.
JuliaHub.delete_dataset — Function
JuliaHub.delete_dataset(dataset::DatasetReference; force::Bool=false, [auth::Authentication]) -> Nothing
Deletes the dataset specified by the dataset reference `dataset`. Returns `nothing` if the delete was successful, or throws an error if it was not.
Normally, when the dataset to be deleted does not exist, the function throws an error. This can be overridden by setting `force = true`.
julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("username", "blobtree"))
julia> JuliaHub.delete_dataset("example-dataset")
julia> JuliaHub.datasets()
1-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "blobtree"))
Presently, it is only possible to delete datasets for the currently authenticated user.