Create _child_data_from_HPC_to_AnVIL.Rmd #232
base: main
No broken url errors! 🎉
No spelling errors! 🎉
Re-rendered previews from the latest commit:
* Note: not all HTML features will be properly displayed in the "quick preview", but it will give you a rough idea. Updated at 2025-03-11 with changes from the latest commit 4694a93.
Awesome! Why don't we have data modules already in AnVIL_Template, haha, we totally need this! Thanks for writing this up.
I left a few suggestions where I stumbled over some wording or had a question.
Linking out to Google's installation instructions (which I'd guess are the most likely thing to change) seems to me like a good balance between linking out to docs and maintaining instructions ourselves.
I'm not currently working on a book that would use this module, so @ehumph may have a better sense of whether this would fit well into `Data_on_AnVIL` or whether there are any tweaks that would make it better. But as a stand-alone child doc, LGTM!
First, you'll want to install Google Cloud SDK on your server. This is software that enables file transfer. Follow [these instructions](https://cloud.google.com/sdk/docs/install), or follow the example code below:

```
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-447.0.0-linux-x86_64.tar.gz
```
The first two commands here (`wget` and `tar`) seem too specific. I'm guessing `447.0.0` is a version number? And do we know that most HPCs use `linux-x86_64`, or do different ones often use different OSs? (I've never done more than dabble with an HPC, so I don't have much experience here.)
Might want to change the language here to emphasize that this is an example, but that people should follow the instructions in Google's documentation to install it on their machine. Or just take these lines out, so people don't skim and then copy-paste the wrong thing for their system.
Most, if not all, HPCs will be Linux systems. Some older ones might not be 64-bit. I can clarify that this is just an example, but I like having some specific text.
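One way to keep the example specific without hard-coding a single file name is to build the archive name from a version and the machine's architecture. The sketch below is only an illustration: `sdk_archive_name` is a hypothetical helper, and the `arm` and `x86` suffixes are assumptions about Google's naming that should be checked against the installation page. Only the `x86_64` name appears in the draft.

```shell
# Hypothetical helper: build the Google Cloud CLI archive name for a
# given version and CPU architecture (defaults to this machine's).
sdk_archive_name() {
  version="$1"
  machine="${2:-$(uname -m)}"
  case "$machine" in
    x86_64)        suffix="x86_64" ;;  # 64-bit Intel/AMD, as in the draft
    aarch64|arm64) suffix="arm" ;;     # assumed suffix for 64-bit ARM
    i386|i686)     suffix="x86" ;;     # assumed suffix for 32-bit x86
    *) echo "unrecognized architecture: $machine" >&2; return 1 ;;
  esac
  printf 'google-cloud-cli-%s-linux-%s.tar.gz\n' "$version" "$suffix"
}

# Reproduces the file name used in the draft's wget command.
sdk_archive_name 447.0.0 x86_64
```

Running `sdk_archive_name 447.0.0` with no second argument uses `uname -m`, so readers on an unusual machine get an error instead of silently downloading the wrong archive.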
The Google link has different instructions for "Linux", "Debian/Ubuntu", and "RedHat/Fedora/CentOS". I know those are different flavors of Linux, but I have no idea what the differences between them are. So I'm wondering:
- Would the generic "Linux" instructions work on all the other systems?
- Is there a flavor of Linux that is commonly used on HPCs?
I'm just raising the question; I don't actually know enough to answer it. If you think the generic Linux instructions are appropriate for most people, then keeping them there sounds good 👍
Great point! As far as I know, the generic Linux instructions will generally work on Ubuntu and other Linux flavors, but it's a good idea to point that out.
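For readers unsure which flavor their server runs, a small check like the one below could point them at the matching section of Google's page. This is a sketch under assumptions: `install_section` is a hypothetical helper, and it relies on `/etc/os-release` defining `ID` and `ID_LIKE`, which most modern distributions do but older systems may not. It falls back to the generic "Linux" section when it cannot tell.

```shell
# Hypothetical helper: suggest which section of Google's install page
# matches this machine, based on the os-release file.
install_section() {
  os_release="${1:-/etc/os-release}"
  if [ ! -r "$os_release" ]; then
    echo "Linux"   # no os-release file: use the generic instructions
    return 0
  fi
  # Read ID and ID_LIKE in a subshell so they don't leak into this shell.
  ids="$(. "$os_release" && printf '%s %s' "${ID:-}" "${ID_LIKE:-}")"
  case " $ids " in
    *" debian "*|*" ubuntu "*)            echo "Debian/Ubuntu" ;;
    *" rhel "*|*" fedora "*|*" centos "*) echo "RedHat/Fedora/CentOS" ;;
    *)                                    echo "Linux" ;;
  esac
}

install_section
```

Note that the package-manager routes typically need administrator rights, which many HPC users do not have, so the generic tarball may still be the practical choice even when a more specific section applies.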
Confirm you are in the right place by typing in the following, replacing `terra-123abcefg` with your Google Project ID. The command should return the Bucket name.

```
gsutil ls -p terra-123abcefg
```
`gsutil` is deprecated, right? Is there a reason to teach the deprecated commands, or should we switch to the new one?
Terra docs suggest `gcloud storage ls -p PROJECT_NAME` to list buckets for a specific project. I haven't tried it myself to confirm what happens; I'm just copy-pasting from Terra.
I think people are more used to `gsutil` if they've seen it before. But yes, `gcloud` is faster and we should probably use it instead.
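If the module needs to serve readers with either tool installed, a wrapper along these lines could prefer `gcloud storage` and fall back to `gsutil`. It is a sketch, not a tested recipe: `list_buckets` is a hypothetical name, and it uses the gcloud-wide `--project` flag rather than the `-p` form quoted from the Terra docs, since that form was not confirmed in this thread.

```shell
# Hypothetical helper: list the buckets in a Google project, preferring
# `gcloud storage` and falling back to `gsutil` if gcloud is missing.
list_buckets() {
  project_id="$1"
  if command -v gcloud >/dev/null 2>&1; then
    gcloud storage ls --project="$project_id"
  elif command -v gsutil >/dev/null 2>&1; then
    gsutil ls -p "$project_id"
  else
    echo "Neither gcloud nor gsutil was found on PATH" >&2
    return 127
  fi
}

# Example, using the placeholder project ID from the draft:
# list_buckets terra-123abcefg
```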
Now you're ready to copy data over with `gsutil cp`! Replace the `/data/user_name/data_folder` with the directory on your server. Replace `gs://fc-a1b2c3` with your Bucket name. Note that the `-r` flag is "recursive", which means all files in that directory will be moved over.

```
gsutil cp -r /data/user_name/data_folder gs://fc-a1b2c3
```
Suggested change: replace `gsutil cp -r /data/user_name/data_folder gs://fc-a1b2c3` with `gcloud storage cp -r /data/user_name/data_folder gs://fc-a1b2c3`.
Again, I haven't run this myself to confirm; I'm just copying from the Terra docs.
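Since neither form was run during review, a guarded wrapper with a dry-run switch can help readers check the command before any data moves. The sketch below assumes `gcloud storage cp` with the long `--recursive` flag; `upload_dir` and the `DRY_RUN` variable are hypothetical names introduced only for this illustration.

```shell
# Hypothetical helper: recursively copy a local directory to a bucket.
# Set DRY_RUN=1 to print the command instead of running it.
upload_dir() {
  src="$1"
  dest="$2"
  if [ ! -d "$src" ]; then
    echo "not a directory: $src" >&2
    return 1
  fi
  case "$dest" in
    gs://*) ;;  # destination looks like a bucket URL
    *) echo "destination must start with gs://: $dest" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'gcloud storage cp --recursive %s %s\n' "$src" "$dest"
  else
    gcloud storage cp --recursive "$src" "$dest"
  fi
}

# Example, using the placeholder paths from the draft:
# DRY_RUN=1 upload_dir /data/user_name/data_folder gs://fc-a1b2c3
```

The copy leaves the originals on the server, so files only disappear from the HPC if they are deleted separately.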
Co-authored-by: KatherineCox <katherinecox@jhu.edu>
Thanks so much for the review, @KatherineCox!!
**Step 2:** Initialize `glcoud`. You might need to specify the path to the executable `gcloud`, e.g., `google-cloud-sdk/bin/gcloud init`.
Is it `glcloud` or `gcloud`?
(Line 21 says `glcloud`, line 24 says `gcloud`. Terra documentation says `gcloud`. I'm just going to make the change.)
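The "you might need to specify the path" caveat in Step 2 can be made concrete with a lookup that tries `PATH` first and then the unpacked `google-cloud-sdk` folder. This is a sketch: `find_gcloud` is a hypothetical helper, and the default of looking under `$HOME` is an assumption about where the archive was extracted.

```shell
# Hypothetical helper: print the path to the gcloud executable, checking
# PATH first and then <install_root>/google-cloud-sdk/bin.
find_gcloud() {
  install_root="${1:-$HOME}"   # assumed location of the unpacked archive
  if command -v gcloud >/dev/null 2>&1; then
    command -v gcloud
  elif [ -x "$install_root/google-cloud-sdk/bin/gcloud" ]; then
    printf '%s\n' "$install_root/google-cloud-sdk/bin/gcloud"
  else
    echo "gcloud not found on PATH or under $install_root/google-cloud-sdk" >&2
    return 1
  fi
}

# Example: initialize with whichever executable was found.
# "$(find_gcloud)" init
```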
Purpose/implementation Section
What changes are being implemented in this Pull Request?
Adding some details on reading in data from HPC. This could be borrowed for https://github.com/fhdsl/Data_on_AnVIL. Path will be changed prior to merge to borrow from the main branch.
What kind of feedback?
@KatherineCox @ehumph -- is this helpful? Trying to strike a balance between Terra docs and having our own instructions (which could be more rapidly deprecated).