Datasets
JuliaHub.jl offers a programmatic way to work with your JuliaHub datasets, and this section demonstrates a few common workflows you can use with these APIs.
See the datasets reference page for a detailed reference of the datasets-related functionality.
Accessing datasets
The datasets function can be used to list all the datasets owned by the currently authenticated user, returning an array of Dataset objects.
julia> JuliaHub.datasets()
2-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("username", "blobtree/example"))
If you know the name of a dataset, you can also access it directly with the dataset function, and read its metadata via the properties of the returned Dataset object.
julia> ds = JuliaHub.dataset("example-dataset")
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
versions: 2
size: 388 bytes
tags: tag1, tag2
julia> ds.owner
"username"
julia> ds.description
"An example dataset"
julia> ds.size
388
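The same properties can be read across a whole list of datasets. The following is a minimal sketch, assuming only that each element exposes the size property (in bytes) shown above. The NamedTuple entries are mock stand-ins for Dataset objects so that the snippet runs without a JuliaHub connection, and total_size is a hypothetical helper, not part of JuliaHub.jl.

```julia
# Hypothetical helper (not part of JuliaHub.jl): sum the `size` property
# (in bytes) over any collection of dataset-like objects.
total_size(datasets) = sum(ds -> ds.size, datasets; init = 0)

# Mock stand-ins for the objects returned by JuliaHub.datasets();
# the sizes are illustrative values only.
mock_datasets = [
    (name = "example-dataset", owner = "username", size = 388),
    (name = "blobtree/example", owner = "username", size = 57),
]

println(total_size(mock_datasets))
```

In an authenticated session, the same helper could presumably be applied to the result of JuliaHub.datasets() directly.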
If you want to work with datasets that you do not own but that are shared with you in JuliaHub, you can pass shared=true to datasets, or specify the owner's username.
julia> JuliaHub.datasets(shared=true)
3-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("username", "example-dataset"))
JuliaHub.dataset(("anotheruser", "publicdataset"))
JuliaHub.dataset(("username", "blobtree/example"))
julia> JuliaHub.datasets("anotheruser")
1-element Vector{JuliaHub.Dataset}:
JuliaHub.dataset(("anotheruser", "publicdataset"))
julia> JuliaHub.dataset(("anotheruser", "publicdataset"))
Dataset: publicdataset (Blob)
owner: anotheruser
description: An example dataset
versions: 1
size: 57 bytes
tags: tag1, tag2
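Since a shared listing mixes your own datasets with those of other users, it can be useful to separate the two by owner. The sketch below assumes only the owner and name properties seen in the listing above. The NamedTuple entries are mock stand-ins, and shared_with is a hypothetical helper, not a JuliaHub.jl function.

```julia
# Mock stand-ins for the result of JuliaHub.datasets(shared=true); only the
# `owner` and `name` properties from the listing above are modeled.
listed = [
    (owner = "username", name = "example-dataset"),
    (owner = "anotheruser", name = "publicdataset"),
    (owner = "username", name = "blobtree/example"),
]

# Hypothetical helper (not part of JuliaHub.jl): keep only the datasets whose
# owner differs from the given username.
shared_with(datasets, me) = filter(ds -> ds.owner != me, datasets)

for ds in shared_with(listed, "username")
    println(ds.owner, "/", ds.name)
end
```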
Finally, JuliaHub.jl can also be used to download a dataset to your local machine with the download_dataset function.
julia> JuliaHub.download_dataset("example-dataset", "mydata")
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
"/home/username/my-project/mydata"
As datasets can have multiple versions, the .versions property of Dataset can be used to see information about the individual versions (represented by DatasetVersion objects). When downloading, you can also specify the version you wish to download (the default is the newest version).
julia> ds.versions
2-element Vector{JuliaHub.DatasetVersion}:
JuliaHub.DatasetVersion(dataset = ("username", "example-dataset"), version = 1)
JuliaHub.DatasetVersion(dataset = ("username", "example-dataset"), version = 2)
julia> ds.versions[1]
DatasetVersion: example-dataset @ v1
owner: username
timestamp: 2022-10-13T01:39:42.963-04:00
size: 57 bytes
julia> JuliaHub.download_dataset("example-dataset", "mydata", version=ds.versions[1].id)
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
"/home/username/my-project/mydata"
The dataset versions are sorted oldest first. To explicitly access the newest version, you can use the last function on the .versions property.
julia> last(ds.versions)
DatasetVersion: example-dataset @ v2
owner: username
timestamp: 2022-10-14T01:39:43.237-04:00
size: 331 bytes
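When a version is identified by its number rather than its position in the list, a small lookup can be less error-prone than indexing. This is a sketch under the assumption that each version exposes the id and size properties used above. The NamedTuple entries are mock stand-ins for DatasetVersion objects, and version_by_id is a hypothetical helper.

```julia
# Mock stand-ins for ds.versions, oldest first; `id` and `size` mirror the
# values shown in the examples above.
mock_versions = [(id = 1, size = 57), (id = 2, size = 331)]

# Hypothetical helper (not part of JuliaHub.jl): return the version with the
# given id, or throw an ArgumentError if there is no such version.
function version_by_id(versions, id)
    idx = findfirst(v -> v.id == id, versions)
    idx === nothing && throw(ArgumentError("no version with id $id"))
    return versions[idx]
end

println(version_by_id(mock_versions, 1).size)  # size of the oldest version
println(last(mock_versions).id)                # id of the newest version
```

The id of the selected version is what would be passed as the version keyword argument of download_dataset.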
In JuliaHub jobs and Cloud IDEs you can also use the DataSets.jl package to access and work with datasets. See the help.julialang.org section on datasets for more information.
Create, update, or replace
The upload_dataset function can be used to programmatically create new datasets on JuliaHub.
julia> JuliaHub.upload_dataset("example-dataset", "local-file")
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
versions: 2
size: 388 bytes
tags: tag1, tag2
The type of the dataset (Blob or BlobTree) depends on whether the uploaded object is a file or a directory. A file will be stored as a Blob-type dataset, and a directory as a BlobTree-type dataset on JuliaHub.
julia> JuliaHub.upload_dataset("example-blobtree", "local-directory")
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
Dataset: example-blobtree (BlobTree)
owner: username
description: An example dataset
versions: 1
size: 57 bytes
tags: tag1, tag2
The create, update, and replace options control how upload_dataset behaves with respect to existing datasets. By default, the function only creates brand new datasets, and trying to upload a dataset that already exists will fail with an error.
julia> JuliaHub.upload_dataset("example-dataset", "local-file")
ERROR: InvalidRequestError: Dataset 'example-dataset' for user 'username' already exists, but update=false and replace=false.
This behavior can be overridden by setting update=true, which will then upload a new version of a dataset if it already exists. This is useful for jobs and workflows that are meant to be re-run, updating the dataset each time they run.
julia> JuliaHub.upload_dataset("example-dataset", "local-file"; update=true)
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
versions: 2
size: 388 bytes
tags: tag1, tag2
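A re-runnable workflow can wrap this pattern in a small function. The sketch below is an assumption-laden illustration rather than part of JuliaHub.jl: publish_results and mock_upload are hypothetical names, and the uploader is passed in as an argument so that the snippet runs offline. The mock only mirrors the call shape upload_dataset(name, path; update=true) used above.

```julia
# Hypothetical helper (not part of JuliaHub.jl): write `lines` to a temporary
# file and hand it to `upload`, always requesting update=true so that re-runs
# add a new version instead of failing on an existing dataset.
function publish_results(upload, dataset_name, lines)
    path = joinpath(mktempdir(), "results.txt")
    write(path, join(lines, "\n"))
    return upload(dataset_name, path; update = true)
end

# Mock uploader so the sketch runs offline; it only records what it was given.
mock_upload(name, path; update = false) =
    (name = name, contents = read(path, String), update = update)

result = publish_results(mock_upload, "example-dataset", ["run 1: ok", "run 2: ok"])
println(result)
```

In an authenticated session, passing JuliaHub.upload_dataset in place of mock_upload would presumably perform the real upload.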
The replace=true option can be used to erase earlier versions of a dataset. This deletes all information about the existing dataset and is a destructive, non-recoverable action. It may also change the dataset type.
julia> JuliaHub.upload_dataset("example-dataset", "local-file"; replace=true)
Transferred: 86.767 KiB / 86.767 KiB, 100%, 0 B/s, ETA -
Transferred: 1 / 1, 100%
Elapsed time: 2.1s
Dataset: example-dataset (Blob)
owner: username
description: An example dataset
versions: 2
size: 388 bytes
tags: tag1, tag2
Bulk updates
You can also use the package to perform bulk updates or deletions of datasets. The following example adds a new tag to all the datasets whose names match a particular pattern.
# Find all the datasets that have names that start with 'my-analysis-'
myanalysis_datasets = filter(
    dataset -> startswith(dataset.name, "my-analysis-"),
    JuliaHub.datasets()
)
# .. and now add a 'new-tag' tag to each of them
for dataset in myanalysis_datasets
    @info "Updating" dataset
    # Note: tags = ... overrides the whole list, so you need to manually retain
    # old tags.
    new_tags = [dataset.tags..., "new-tag"]
    JuliaHub.update_dataset(dataset, tags = new_tags)
end
While this example uses update_dataset, the delete_dataset function, for example, could be used in the same way.
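If a bulk update like this is run more than once, appending the tag unconditionally would add it a second time to datasets that already carry it. One way to guard against that is to merge the tags with union, as in the following sketch. It assumes only the name and tags properties used above. The NamedTuple entries are mock stand-ins for Dataset objects, and add_tag is a hypothetical helper, not part of JuliaHub.jl.

```julia
# Hypothetical helper (not part of JuliaHub.jl): return the existing tags plus
# `tag`, keeping the original order and never adding a duplicate.
add_tag(tags, tag) = union(tags, [tag])

# Mock stand-ins for the objects returned by JuliaHub.datasets().
mock_tagged = [
    (name = "my-analysis-2022", tags = ["tag1", "tag2"]),
    (name = "my-analysis-2023", tags = ["tag1", "new-tag"]),
    (name = "other-dataset", tags = String[]),
]

selected = filter(ds -> startswith(ds.name, "my-analysis-"), mock_tagged)
for ds in selected
    println(ds.name, " => ", add_tag(ds.tags, "new-tag"))
end
```

The list returned by add_tag is what would be passed as the tags keyword argument of update_dataset.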