Create _child_data_from_HPC_to_AnVIL.Rmd #232

Open. Wants to merge 13 commits into base: main

@avahoffman avahoffman commented Jan 27, 2025

Purpose/implementation Section

What changes are being implemented in this Pull Request?

Adding some details on reading in data from HPC. This could be borrowed for https://github.com/fhdsl/Data_on_AnVIL. Path will be changed prior to merge to borrow from the main branch.

What kind of feedback?

@KatherineCox @ehumph -- is this helpful? Trying to strike a balance between Terra docs and having our own instructions (which could be more rapidly deprecated).

github-actions bot commented Jan 27, 2025

No broken url errors! 🎉
Comment updated at 2025-03-11-18:04:13 with changes from 4694a93

github-actions bot commented Jan 27, 2025

No spelling errors! 🎉
Comment updated at 2025-03-11-18:04:13 with changes from 4694a93

github-actions bot commented Jan 27, 2025

Re-rendered previews from the latest commit:

* note not all html features will be properly displayed in the "quick preview" but it will give you a rough idea.

Updated at 2025-03-11 with changes from the latest commit 4694a93

@KatherineCox KatherineCox (Collaborator) left a comment

Awesome! Why don't we have data modules already in AnVIL_Template, haha, we totally need this! Thanks for writing this up.

I left a few suggestions where I stumbled over some wording or had a question.

Seems to me like a good balance between linking out to docs vs. maintaining ourselves if we link out to Google's installation instructions (which I'd guess is the most likely thing to change).

I'm not currently working on a book that would use this module, so @ehumph may have a better sense of whether this would fit well into Data_on_AnVIL or whether there are any tweaks that would make it better. But as a stand-alone child doc, LGTM!

First, you'll want to install Google Cloud SDK on your server. This is software that enables file transfer. Follow [these instructions](https://cloud.google.com/sdk/docs/install), or follow the example code below:

```
wget https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-447.0.0-linux-x86_64.tar.gz
Collaborator

The first two commands here (wget and tar) seem like they're too specific? Like, I'm guessing the 447.0.0 is a version number? And do we know that most HPCs use linux-x86_64, or do different ones often use different OSs? (I've never done more than dabbling with an HPC so I don't have much experience here).

Might want to change the language here to emphasize that this is an example, but that people should follow the instructions from Google's documentation to install it on their machine. Or just take these lines out, to avoid people skimming and then just copy-pasting the wrong thing for their system.

Contributor Author

Most, if not all, HPCs will be Linux systems. Some older ones might not be 64-bit. I can clarify this is just an example, but I like having some specific text.

Collaborator

The Google link has different instructions for "Linux", "Debian/Ubuntu", and "RedHat/Fedora/CentOS". I know those are different flavors of Linux, but I have no idea what the differences between them are. So I'm wondering:

  1. Would the generic "Linux" instructions work on all the other systems?
  2. Is there a flavor of Linux that is commonly used on HPCs?

I'm just raising the question, I don't actually know enough to answer it. If you think the generic Linux instructions are appropriate for most people, then keeping them there sounds good 👍

Contributor Author (@avahoffman), Jan 28, 2025

Great point! As far as I know, the generic Linux installation will generally work on Ubuntu and other Linux flavors, but it's a good idea to point that out!

Confirm you are in the right place by typing in the following, replacing `terra-123abcefg` with your Google Project ID. The command should return the Bucket name.

```
gsutil ls -p terra-123abcefg
Collaborator

gsutil is deprecated, right? Is there a reason to teach the deprecated commands, or should we switch to the new one?

Terra docs suggest gcloud storage ls -p PROJECT_NAME to list buckets for a specific project. I haven't tried it myself to confirm what happens, I'm just copy pasting from Terra.

Contributor Author

I think people are more used to gsutil if they've seen it before. But yes, gcloud is faster and we should probably use it instead.
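
For reference, the `gsutil` to `gcloud storage` switch discussed in this thread might look like the sketch below. It has not been run against a real workspace. `terra-123abcefg` is the placeholder project ID from the excerpt, `list_buckets_cmd` is a hypothetical helper name, and the long `--project` flag (which `gcloud` accepts across its commands) stands in for the `-p` form quoted from the Terra docs. The helper only prints the command, so it can be checked before anything is actually run.

```shell
# Placeholder project ID from the excerpt above; replace with your own.
PROJECT_ID="terra-123abcefg"

# Hypothetical helper: build the bucket-listing command as text.
# Nothing is sent to Google Cloud; the command is only printed.
list_buckets_cmd() {
  printf 'gcloud storage ls --project=%s\n' "$1"
}

list_buckets_cmd "$PROJECT_ID"
# prints: gcloud storage ls --project=terra-123abcefg

# To actually list the buckets (requires gcloud to be installed and initialized):
# gcloud storage ls --project="$PROJECT_ID"
```

Printing first is just a cautious habit for docs that people copy-paste from; the commented-out line is the one a reader would run.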

Now you're ready to copy data over with `gsutil cp`! Replace `/data/user_name/data_folder` with the directory on your server. Replace `gs://fc-a1b2c3` with your Bucket name. Note that the `-r` flag is "recursive", which means all files in that directory will be copied over.

```
gsutil cp -r /data/user_name/data_folder gs://fc-a1b2c3
Collaborator

Suggested change:
- gsutil cp -r /data/user_name/data_folder gs://fc-a1b2c3
+ gcloud storage cp -r /data/user_name/data_folder gs://fc-a1b2c3

Again I haven't run this myself to confirm, just copying from Terra Docs

avahoffman and others added 4 commits January 28, 2025 10:26
Co-authored-by: KatherineCox <katherinecox@jhu.edu>

@avahoffman (Contributor Author)

Thanks so much for the review @KatherineCox !!

```

**Step 2:** Initialize `glcoud`. You might need to specify the path to the executable `gcloud`, e.g., `google-cloud-sdk/bin/gcloud init`.

Contributor

Is it glcloud or gcloud?

Contributor

(Line 21 says glcloud, line 24 says gcloud. Terra documentation says gcloud. I'm just going to make the change.)
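
On the Step 2 note about needing the full path to the executable, one way to handle it is a small lookup like the sketch below. `find_gcloud` is a hypothetical helper, and the default location `$HOME/google-cloud-sdk` is an assumption about where the archive was unpacked; adjust it to your own install directory.

```shell
# Hypothetical helper: print a usable path to the gcloud executable.
# Argument 1 (optional) is the unpacked SDK folder; the default is an assumption.
find_gcloud() {
  sdk_dir="${1:-$HOME/google-cloud-sdk}"
  if command -v gcloud >/dev/null 2>&1; then
    # gcloud is already on PATH, so report where it lives.
    command -v gcloud
  elif [ -x "$sdk_dir/bin/gcloud" ]; then
    # Fall back to the executable inside the unpacked SDK folder.
    printf '%s\n' "$sdk_dir/bin/gcloud"
  else
    echo "gcloud not found; check the install location" >&2
    return 1
  fi
}

# Usage (commented out because it starts an interactive setup):
# "$(find_gcloud)" init
```

Checking `PATH` first means the same snippet works whether or not the installer added `gcloud` to the shell profile.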
