Importing Datasets
Import files and folders from other cloud providers into JuliaHub datasets directly in the JuliaHub UI.
This feature is currently only available on enterprise JuliaHub instances.
Getting Started
If you're using an Enterprise account, you can access the feature by navigating to Datasets in the left menu bar.
Once you're on the Datasets page, click the "Import Datasets" button at the top of the page:
Before you set up your cloud provider credentials and transfer data, you must start a special JuliaHub job that acts as the backend performing the data transfers. You can start one by clicking the "Start" button on the dataset imports page:
Be mindful that the job has a time limit and will stop once it expires. Depending on how long your transfers take, you may need to extend the job's time limit.
Once the backend is fully up and running (it takes a moment to start), the UI will become active, and you'll be able to configure remotes, i.e. access to the other cloud providers (see also Concepts below). There are a few different ways to configure a remote.
See the how-tos below to learn more about how to configure the different remotes:
Once configured, the remote should show up in the list under the "Remotes" tab.
To pick the file or directory you wish to import to JuliaHub, you can click on the "Import Data" button to explore the remote in the UI.
You can import the file or directory into JuliaHub either as a new dataset, or as a new version of an existing dataset. Once you have found the data you wish to import, click on "Import" to pick the destination dataset, or create a new one. Note that when picking an existing dataset, you only see datasets of the appropriate type (Blobs for files, BlobTrees for directories).
Alternatively, if exploring the remote is not working (e.g. due to permissions) you can also initiate a data transfer from a known prefix directly on the "Data Transfers" tab.
When you click "Submit" it will start copying the data in the background. You will be able to keep an eye on the transfer on the "Data Transfers" tab. You will be able to keep an eye on the transfer on the "Data Imports" tab.
Once the data import has finished, you will be able to see it on the Datasets page.
If you are having issues, you can also check the logs of the backend job. You can find the backend job on the Jobs page under the name "Dataset Imports Backend". The logs may contain additional information about any errors you encounter.
Concepts
When using this feature, it may be helpful to better understand some of the terminology used:
Backend job. The feature currently requires a special JuliaHub app to run as a backend job, which can be started directly on the Dataset Imports page. The UI will not be active unless the job is running. If the UI has problems connecting to the job, you can always inspect and control the job on the Jobs page (look for one named "Dataset Imports Backend").
Remotes. A remote represents access to a cloud provider. It must specify the protocol or storage type, the provider or the location of the cloud resources (e.g. URL), and any credentials (passwords, tokens) necessary to access the resources. Examples include access to an AWS account's S3 buckets with an IAM role and temporary AWS credentials, or access to a Dropbox account with a username and a token.
The files accessible via a remote are structured as a directory tree, and assuming your credentials allow you to list files, you can explore this tree. Exactly how the files are laid out depends on the storage type. In most cases, it will logically map to the directory tree you would expect to see from the provider. For AWS S3, the top-level directories are the S3 buckets your AWS account has access to.
Default prefix. Each remote can have a default prefix at which you start exploring the remote. This is useful if your credentials do not have the permissions to list files in the whole directory tree. For example, this can be the case when your AWS S3 credentials do not have the permission to list all buckets, and only have access to a single bucket. In that case, you must use the bucket name as the default prefix for your remote.
Data transfers. Once you find a file or a directory in a remote, you can initiate a data transfer, which will instruct the backend job to copy data over to JuliaHub. In general, you have to specify (1) the remote, (2) the path within the remote you want to transfer, and (3) the destination dataset. The destination dataset can either be a new dataset, or an existing dataset (in which case, the data will be uploaded as a new version of that dataset).
Dataset types. JuliaHub has two types of datasets: Blobs and BlobTrees, which roughly correspond to single files and directories (or multi-file archives). In short, if you copy a directory, it becomes a BlobTree, and if you copy a file, it becomes a Blob.
If you are familiar with the rclone software, then some of these concepts might be familiar to you, as they match the language used by rclone. In fact, you should be able to use your rclone configuration files directly in JuliaHub.
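For illustration, an rclone-style configuration for an S3 remote might look something like the following sketch (the remote name and all credential values here are placeholders):

[my-s3-remote]
type = s3
provider = AWS
access_key_id = ***
secret_access_key = ***
session_token = ***
region = us-east-1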
How-to: AWS S3
To copy data from an AWS S3 bucket, you can set up a remote for it by clicking on the "Add AWS S3 remote" button.
You will need access to (temporary) AWS credentials with the permissions necessary to access the bucket data you wish to copy to JuliaHub:
- Access Key & Secret Access Key: the basic AWS user credentials
- Session Token: when using temporary credentials (strongly recommended), you also need to provide this.
You also need to make sure that the AWS region matches the region of the bucket.
You can also set the Default prefix at which the UI will start exploring the directory tree of the bucket. This is useful when your permissions are limited to a specific prefix (see below).
See the AWS documentation for more details about AWS credentials, or ask your organization's AWS administrators if you need more help with generating the necessary credentials.
It is recommended that you use an IAM role and temporary credentials where possible (see below), since the credentials for IAM Users are long-lived.
Permissions
The credentials must allow the following AWS Actions on the data you wish to transfer to JuliaHub:
s3:GetObject
: mandatory; this allows the data to be copied

s3:ListBucket
: optional; this allows you to explore the files in the bucket within the JuliaHub interface; if this is not provided, you must provide the path to your data manually

s3:ListAllMyBuckets
: optional; this allows you to list the S3 remote at the top level; without this, you must use the bucket's name as the default prefix, since JuliaHub will not be able to fetch the list of buckets
The permissions can be combined into an AWS IAM policy, which can be attached to a role or an IAM user. You can then generate credentials for the role or IAM user, and use those to configure a remote on JuliaHub.
When working with credentials, the recommendation is to follow the Principle of Least Privilege, and only generate credentials that are time-limited and have the minimum necessary permissions to achieve the desired result. In other words, it is recommended you use the more restrictive policy whenever practical. It is not recommended that you use your AWS account credentials.
Default prefix
If you are restricting your credentials to a prefix, you should specify that prefix as the default prefix when setting up the remote. Note that the first path element of the prefix is the name of the bucket. For example, when accessing data in my-bucket under the prefix path/in/bucket, the default prefix should be my-bucket/path/in/bucket.
Generating credentials with AWS CLI
One way to generate credentials is to create an AWS role with the necessary permissions, and then use the AWS CLI to generate the temporary credentials.
Step 1. Create a role in your AWS account that you can assume. See the AWS documentation on roles for more information on this step.
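If you prefer the command line, a minimal sketch of this step with the AWS CLI could look as follows, assuming you have prepared a trust policy document trust-policy.json (a placeholder name) that allows your IAM user to assume the role:

# Create a role that can later be assumed to generate temporary credentials
aws iam create-role \
    --role-name juliahub-import \
    --assume-role-policy-document file://trust-policy.json

Here juliahub-import is an arbitrary example name for the role.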
Step 2. Attach the appropriate policy to the role that gives the role access to the required data.
For example, the following policy only allows access to a particular key prefix (i.e. directory) in the specified S3 bucket. You need to replace $(BUCKET) with the name of the bucket you wish to copy data from, and $(PREFIX) with the key prefix within the bucket you wish to access.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::$(BUCKET)"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["$(PREFIX)/*"]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::$(BUCKET)/$(PREFIX)/*"
            ]
        }
    ]
}
This type of policy requires you to configure a default prefix set to $(BUCKET)/$(PREFIX), if you wish to explore the files in the JuliaHub interface. Depending on your security needs, you can also create a more liberal policy.
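If you are working from the command line for this step as well, one way to attach the policy is to save it as policy.json and add it to the role as an inline policy with the AWS CLI (the role and policy names below are just examples):

# Attach the policy above to the role as an inline policy
aws iam put-role-policy \
    --role-name juliahub-import \
    --policy-name juliahub-s3-read \
    --policy-document file://policy.json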
Step 3. Generate temporary credentials with AWS CLI.
For this, you must have the AWS CLI installed and authenticated. You can then use the AWS CLI to generate temporary credentials on your computer with aws sts:
aws sts assume-role --output=json --role-arn=ROLE_ARN --role-session-name=juliahub
where ROLE_ARN should be replaced by the ARN of the role. This should generate a JSON blob with the following structure, from which you can extract the AWS credentials:
{
    "Credentials": {
        "AccessKeyId": "***",
        "SecretAccessKey": "***",
        "SessionToken": "***",
        "Expiration": "***"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "***:juliahub",
        "Arn": "arn:aws:sts::$(ACCOUNT):assumed-role/$(ROLE)/juliahub"
    }
}
You can use the values to configure the AWS remote on JuliaHub.
If you have the JSON representation of your credentials, you can quickly copy-paste it onto the AWS form by clicking on the "Paste from clipboard" button.
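For example, assuming you have jq installed, you can extract just the Credentials object from the aws sts output and copy that to your clipboard:

# Extract only the Credentials object from the assume-role output
aws sts assume-role --output=json --role-arn=ROLE_ARN --role-session-name=juliahub \
    | jq '.Credentials'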
Temporary AWS credentials have relatively short expiry times; be mindful of that. Because of this, you may see failures for long-running transfers where you are transferring a lot of data.
How-to: rclone providers
You can also transfer data from other storage providers supported by rclone. For that, you can fill out the interactive form that shows all the options available for each remote.
Or, for more advanced usage, you can configure the key-value pairs directly. If you have an existing rclone configuration file, you can also copy the configuration for a remote into the text box and click "Populate", to use the configuration file directly.
For a specific example, see Tutorial: copying from public HTTP below.
See the rclone website for more detailed documentation on each of the available providers. If you are having trouble getting a generic rclone provider to work on JuliaHub, try installing rclone locally and testing the configuration there first, to rule out configuration issues.
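For example, once a remote is configured in your local rclone config, you can sanity-check it with rclone's listing commands (my-remote is a placeholder for the name of your configured remote):

# List the top-level directories (e.g. buckets) of the remote
rclone lsd my-remote:

# List the files under a specific path within the remote
rclone ls my-remote:path/to/data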
The different providers work in different ways and have different security properties which also depend on your configuration (e.g. whether TLS is used to encrypt the data transfers). Be mindful and make sure to configure rclone to operate securely.
How-to: initiating a transfer with a known path
Sometimes your credentials do not allow you to explore the files in the remote, but you know the path where the files are located and the credentials allow you to copy the data. In that case, you can use the data transfer modal to fully specify a transfer manually.
In the modal, you have to pick the remote and the destination dataset. In addition, you have to specify the exact prefix path to copy the data from. Note that if your prefix references a file, then the destination dataset should be of Blob type, and if it references a directory, it should be a BlobTree.
Just like for data transfers initiated from the file explorer modal, once the data transfer has finished, it should show up on the Datasets page.
Tutorial: copying from public HTTP
As one example of using a general rclone-supported provider, we can use the "HTTP" provider to copy data from a publicly accessible HTTP URL. As a demonstration, we can try to copy the zoo.csv file from the Datasets tutorial to JuliaHub over plain HTTP(S). In this case:
- The HTTP host should be: https://help.juliahub.com
- And the file is hosted under the following prefix: /juliahub/stable/tutorials/datasets_intro/zoo.csv
We'll first configure the HTTP remote:
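For reference, the equivalent rclone-style configuration for this remote would look something like the following (the remote name juliahub-help is arbitrary):

[juliahub-help]
type = http
url = https://help.juliahub.com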
Since the HTTP protocol does not allow listing files and directories, we need to configure the data transfer explicitly, to copy data from the URL to a Blob-type dataset. Open the custom data transfer modal on the "Data Transfers" page and specify the exact prefix there.
Once you submit the data transfer, it should copy the file into a dataset, and you should be able to access it on the Datasets page. As the file is quite small, the transfer should finish quickly.