GH action to generate report #199

Draft
wants to merge 94 commits into base: main
94 commits
d16abb2
Initial commit to add GH action to generate report
asmacdo Sep 25, 2024
713d64c
Assume Jupyterhub Provisioning Role
asmacdo Sep 25, 2024
519360c
Fixup: indent
asmacdo Sep 25, 2024
e6f4814
Rename job
asmacdo Sep 25, 2024
72496f4
Add assumed role to update-kubeconfig
asmacdo Sep 25, 2024
8428d3a
No need to add ProvisioningRole to masters
asmacdo Sep 25, 2024
e170b59
Deploy a pod to the cluster, and schedule with Karpenter
asmacdo Sep 25, 2024
bfce046
Fixup: correct path to pod manifest
asmacdo Sep 25, 2024
0993129
Fixup again ugh, rename file
asmacdo Sep 25, 2024
87027d2
Delete Pod even if previous step times out
asmacdo Sep 25, 2024
686f686
Hack out initial du
asmacdo Oct 11, 2024
ff52971
tmp comment out job deployment, test dockerhub build
asmacdo Nov 8, 2024
ca6db89
Fixup hyphens for image name
asmacdo Nov 8, 2024
d228f9d
Write file to output location
asmacdo Nov 8, 2024
68f707f
use kubectl cp to retrieve report
asmacdo Nov 8, 2024
ad6b589
Combine run blocks to use vars
asmacdo Nov 8, 2024
f18e8b7
Mount efs and pass arg to du script
asmacdo Nov 8, 2024
387cfc1
Comment out repo pushing, let's see if the report runs
asmacdo Nov 8, 2024
04b4193
Restrict job to asmacdo for testing
asmacdo Nov 8, 2024
a443081
Sanity check. Just list the directories
asmacdo Nov 8, 2024
99ac264
Job was deployed, but never assigned to node, back to sanity check
asmacdo Nov 8, 2024
6ee89b2
change from job to pod
asmacdo Nov 8, 2024
a8f6ed3
deploy pod to same namespace as pvc
asmacdo Nov 8, 2024
664853b
Use ns in action
asmacdo Nov 8, 2024
e35c974
increase timeout to 60s
asmacdo Nov 8, 2024
a8af5f2
fixup: image name in manifest
asmacdo Nov 8, 2024
024cf6e
increase timeout to 150
asmacdo Nov 8, 2024
49c346e
override entrypoint so I can debug with exec
asmacdo Nov 8, 2024
0191c85
bound /home actually meant path was /home/home/asmacdo
asmacdo Nov 8, 2024
3eb9157
Create output dir prior to writing report
asmacdo Nov 8, 2024
676a00e
pod back to job
asmacdo Nov 11, 2024
c085751
Fixup use the correct job api
asmacdo Nov 11, 2024
3e18a37
Add namespace to pod retrieval
asmacdo Nov 11, 2024
0fa5ece
write directly to pv to test job
asmacdo Nov 11, 2024
e1ecbc3
fixup script fstring
asmacdo Nov 11, 2024
082d3cc
no retry on failure, we were spinning up 5 pods, let's just fail 1 time
asmacdo Nov 11, 2024
d46ea44
Fixup backup limit job not template
asmacdo Nov 11, 2024
965a81e
Initial report
asmacdo Nov 11, 2024
7366d2d
disable report
asmacdo Nov 11, 2024
747f0a4
deploy ec2 instance directly
asmacdo Dec 2, 2024
6156e21
Update AMI image
asmacdo Dec 2, 2024
588892c
update sg and subnet
asmacdo Dec 2, 2024
958630b
terminate even if job fails
asmacdo Dec 2, 2024
e24a666
debug: print public ip
asmacdo Dec 2, 2024
0e58f10
explicitly allocate public ip for ec2 instance
asmacdo Dec 2, 2024
5c28c0e
Add WIP scripts
asmacdo Dec 6, 2024
21811dd
rm old unused
asmacdo Dec 6, 2024
97de713
initial commit of scripts
asmacdo Dec 6, 2024
644f8c3
clean up launch script
asmacdo Dec 6, 2024
e176592
make script executable
asmacdo Dec 6, 2024
a101f18
fixup cleanup script
asmacdo Dec 6, 2024
615baf2
add a name to elastic ip (for easier manual cleanup)
asmacdo Dec 6, 2024
bb8f25a
Exit on fail
asmacdo Dec 6, 2024
a8a615a
Add permission for aws ec2 wait instance-status-ok
asmacdo Dec 6, 2024
8157a12
Upload scripts to instance
asmacdo Dec 6, 2024
d3f6f52
explicitly return
asmacdo Dec 6, 2024
f1f687f
output session variables to file
asmacdo Dec 11, 2024
a10bc2a
modify cleanup script to retrieve instance from temporary file
asmacdo Dec 11, 2024
7fd340d
All ec2 permissions granted
asmacdo Dec 11, 2024
1649b35
Add EFS mount (hardcoded)
asmacdo Dec 11, 2024
30aa60c
No pager for termination
asmacdo Dec 11, 2024
cc845d4
force pseudo-terminal, otherwise hangs after yum install
asmacdo Dec 11, 2024
7854124
Add doublequotes to variable usage for proper expansion
asmacdo Dec 11, 2024
9fbad37
Fixup -t goes on ssh, not scp
asmacdo Dec 11, 2024
4fc9dde
Mount as a single command, since we don't have access to a pty
asmacdo Dec 11, 2024
86e645e
add todos for manual steps
asmacdo Dec 11, 2024
c614004
Disable job for now
asmacdo Dec 11, 2024
5a207bc
Update AMI to ubuntu
asmacdo Dec 12, 2024
8ce97ee
Roll back to AL 2023
asmacdo Dec 12, 2024
f7fe412
drop gzip, just write json
asmacdo Dec 13, 2024
e9904c8
include target dir in relative paths
asmacdo Dec 13, 2024
7da2aae
Second script will not produce user report, but directory stats json
asmacdo Dec 13, 2024
41a65ed
initial algorithm hackout
asmacdo Dec 13, 2024
8eb0f06
Clean up and refactor for simplicity
asmacdo Dec 13, 2024
40947ef
Add basic tests
asmacdo Dec 13, 2024
0e9c065
test multiple directories in root
asmacdo Dec 13, 2024
e4794de
comment about [:-1]
asmacdo Dec 13, 2024
ee2c3b1
support abspaths
asmacdo Dec 14, 2024
e1dcd63
[DATALAD RUNCMD] blacken
asmacdo Dec 14, 2024
541f1f3
test propagation with files in all dirs
asmacdo Dec 14, 2024
ac364fb
Write files to disk as they are inspected
asmacdo Dec 15, 2024
05609a1
Comment out column headers in output
asmacdo Dec 15, 2024
7560db2
Write all fields for every file
asmacdo Dec 15, 2024
502ff76
Convert to reading tsv
asmacdo Dec 15, 2024
639f279
Fixup: update test to match tsv-read data
asmacdo Dec 15, 2024
96490e5
update for renamed script
asmacdo Dec 15, 2024
64d69a7
install pip
asmacdo Dec 15, 2024
6f3dae5
install parallel
asmacdo Dec 15, 2024
b38be7d
install dependencies in launch script
asmacdo Dec 15, 2024
f0b0709
Output to tmp, accept only 1 arg, target dir
asmacdo Dec 15, 2024
326bb55
add up sizes
asmacdo Dec 16, 2024
c881287
print useful info as index is created
asmacdo Dec 16, 2024
a3505f9
don't fail if output dir exists
asmacdo Dec 16, 2024
fcd9531
Create a report dict with only relevant stats
asmacdo Dec 16, 2024
64 changes: 1 addition & 63 deletions .aws/terraform-jupyterhub-provisioning-policies.json
@@ -4,69 +4,7 @@
{
"Effect": "Allow",
"Action": [
"ec2:AllocateAddress",
"ec2:AssociateAddress",
"ec2:AssociateRouteTable",
"ec2:AssociateVpcCidrBlock",
"ec2:AttachInternetGateway",
"ec2:AttachNetworkInterface",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:CreateInternetGateway",
"ec2:CreateLaunchTemplate",
"ec2:CreateLaunchTemplateVersion",
"ec2:CreateNatGateway",
"ec2:CreateNetworkAcl",
"ec2:CreateNetworkAclEntry",
"ec2:CreateNetworkInterface",
"ec2:CreateNetworkInterfacePermission",
"ec2:CreateRoute",
"ec2:CreateRouteTable",
"ec2:CreateSecurityGroup",
"ec2:CreateSubnet",
"ec2:CreateTags",
"ec2:CreateVpc",
"ec2:DeleteInternetGateway",
"ec2:DeleteLaunchTemplate",
"ec2:DeleteLaunchTemplateVersions",
"ec2:DeleteNatGateway",
"ec2:DeleteNetworkAcl",
"ec2:DeleteNetworkAclEntry",
"ec2:DeleteNetworkInterface",
"ec2:DeleteRoute",
"ec2:DeleteRouteTable",
"ec2:DeleteSecurityGroup",
"ec2:DeleteSubnet",
"ec2:DeleteTags",
"ec2:DeleteVpc",
"ec2:DescribeAddresses",
"ec2:DescribeAddressesAttribute",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeInternetGateways",
"ec2:DescribeLaunchTemplateVersions",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeNatGateways",
"ec2:DescribeNetworkAcls",
"ec2:DescribeNetworkInterfacePermissions",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeRouteTables",
"ec2:DescribeSecurityGroupRules",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeVpcAttribute",
"ec2:DescribeVpcs",
"ec2:DetachInternetGateway",
"ec2:DetachNetworkInterface",
"ec2:DisassociateAddress",
"ec2:DisassociateRouteTable",
"ec2:DisassociateVpcCidrBlock",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:ModifyVpcAttribute",
"ec2:ReleaseAddress",
"ec2:ReplaceRoute",
"ec2:RevokeSecurityGroupEgress",
"ec2:RevokeSecurityGroupIngress",
"ec2:RunInstances",
"ec2:*",
"ecr-public:GetAuthorizationToken",
"eks:*",
"elasticfilesystem:CreateFileSystem",
35 changes: 35 additions & 0 deletions .github/manifests/disk-usage-report-job.yaml
@@ -0,0 +1,35 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: disk-usage-report-job
  namespace: jupyterhub
spec:
  backoffLimit: 0  # No retry on failure
  template:
    metadata:
      labels:
        app: disk-usage-report
    spec:
      containers:
        - name: disk-usage-report
          image: dandiarchive/dandihub-report-generator:latest
          args:
            - "/home/"
          volumeMounts:
            - name: persistent-storage
              mountPath: "/home"
              subPath: "home"
      restartPolicy: Never
      nodeSelector:
        NodeGroupType: default
        NodePool: default
        hub.jupyter.org/node-purpose: user
      tolerations:
        - key: "hub.jupyter.org/dedicated"
          operator: "Equal"
          value: "user"
          effect: "NoSchedule"
      volumes:
        - name: persistent-storage
          persistentVolumeClaim:
            claimName: efs-persist
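
For context, a workflow step exercising this manifest could look roughly like the sketch below. The namespace and Job name come from the manifest itself, and the 150-second timeout mirrors the commit history; reading the report back through logs (or kubectl cp while the container is still running, as the commits experiment with) is illustrative only, since the report's output path is defined by the dandihub-report-generator image.

# Sketch only: run the disk-usage Job, inspect it, and clean it up afterwards.
kubectl apply -f .github/manifests/disk-usage-report-job.yaml
kubectl wait --for=condition=complete --timeout=150s -n jupyterhub job/disk-usage-report-job || true
kubectl logs -n jupyterhub job/disk-usage-report-job
# Delete the Job even if the wait timed out, so pods are not left behind.
kubectl delete -f .github/manifests/disk-usage-report-job.yaml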
20 changes: 20 additions & 0 deletions .github/manifests/hello-world-pod.yaml
@@ -0,0 +1,20 @@
# manifests/hello-world-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-world-pod
spec:
  containers:
    - name: hello
      image: busybox
      command: ['sh', '-c', 'echo Hello, World! && sleep 30']
  nodeSelector:
    NodeGroupType: default
    NodePool: default
    hub.jupyter.org/node-purpose: user
  tolerations:
    - key: "hub.jupyter.org/dedicated"
      operator: "Equal"
      value: "user"
      effect: "NoSchedule"

180 changes: 180 additions & 0 deletions .github/scripts/calculate-directory-stats.py
@@ -0,0 +1,180 @@
#!/usr/bin/env python3

import os
import csv
import json
import sys
import unittest
from collections import defaultdict
from pathlib import Path
from pprint import pprint
from typing import Iterable


def propagate_dir(stats, current_parent, previous_parent):
    assert os.path.isabs(current_parent) == os.path.isabs(
        previous_parent
    ), "current_parent and previous_parent must both be abspath or both be relpath"
    highest_common = os.path.commonpath([current_parent, previous_parent])
    assert highest_common, "highest_common must either be a target directory or /"

    path_to_propagate = os.path.relpath(previous_parent, highest_common)
    # leaves off last to avoid propagating to the path we are propagating from
    nested_dir_list = path_to_propagate.split(os.sep)[:-1]
    # Add each dir count to all ancestors up to highest common dir
    while nested_dir_list:
        working_dir = os.path.join(highest_common, *nested_dir_list)
        stats[working_dir]["file_count"] += stats[previous_parent]["file_count"]
        stats[working_dir]["total_size"] += stats[previous_parent]["total_size"]
        nested_dir_list.pop()
        previous_parent = working_dir
    stats[highest_common]["file_count"] += stats[previous_parent]["file_count"]
    stats[highest_common]["total_size"] += stats[previous_parent]["total_size"]


def generate_directory_statistics(data: Iterable[str]):
    # Assumes dirs are listed depth first (files are listed prior to directories)

    stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
    previous_parent = ""
    for filepath, size, modified, created, error in data:
        # TODO if error is not None:
        this_parent = os.path.dirname(filepath)
        stats[this_parent]["file_count"] += 1
        stats[this_parent]["total_size"] += int(size)

        if previous_parent == this_parent:
            continue
        # going deeper
        elif not previous_parent or previous_parent == os.path.dirname(this_parent):
            previous_parent = this_parent
            continue
        else:  # previous dir done
            propagate_dir(stats, this_parent, previous_parent)
            previous_parent = this_parent

    # Run a final time with the root directory as this parent
    # During final run, leading dir cannot be empty string, propagate_dir requires
    # both to be abspath or both to be relpath
    leading_dir = previous_parent.split(os.sep)[0] or "/"
    propagate_dir(stats, leading_dir, previous_parent)
    return stats


def iter_file_metadata(file_path):
    """
    Reads a tsv and returns an iterable that yields one row of file metadata at
    a time, excluding comments.
    """
    file_path = Path(file_path)
    with file_path.open(mode="r", newline="", encoding="utf-8") as file:
        reader = csv.reader(file, delimiter="\t")
        for row in reader:
            # Skip empty lines or lines starting with '#'
            if not row or row[0].startswith("#"):
                continue
            yield row


def update_stats(stats, directory, stat):
    stats["total_size"] += stat["total_size"]
    stats["file_count"] += stat["file_count"]

    # Caches track directories, but not report as a whole
    if stats.get("directories") is not None:
        stats["directories"].append(directory)


def main():
    if len(sys.argv) != 2:
        print("Usage: python script.py <input_tsv_file>")
        sys.exit(1)

    input_tsv_file = sys.argv[1]
    username = input_tsv_file.split("-index.tsv")[0]

    data = iter_file_metadata(input_tsv_file)
    stats = generate_directory_statistics(data)
    cache_types = ["pycache", "user_cache", "yarn_cache", "pip_cache", "nwb_cache"]
    report_stats = {
        "total_size": 0,
        "file_count": 0,
        "caches": {
            cache_type: {"total_size": 0, "file_count": 0, "directories": []}
            for cache_type in cache_types
        }
    }
    # print(f"{directory}: File count: {stat['file_count']}, Total Size: {stat['total_size']}")
    for directory, stat in stats.items():
        if directory.endswith("__pycache__"):
            update_stats(report_stats["caches"]["pycache"], directory, stat)
        elif directory.endswith(f"{username}/.cache"):
            update_stats(report_stats["caches"]["user_cache"], directory, stat)
        elif directory.endswith(".cache/yarn"):
            update_stats(report_stats["caches"]["yarn_cache"], directory, stat)
        elif directory.endswith(".cache/pip"):
            update_stats(report_stats["caches"]["pip_cache"], directory, stat)
        elif directory == username:
            update_stats(report_stats, username, stat)

    pprint(report_stats)


class TestDirectoryStatistics(unittest.TestCase):
    def test_propagate_dir(self):
        stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
        stats["a/b/c"] = {"total_size": 100, "file_count": 3}
        stats["a/b"] = {"total_size": 10, "file_count": 0}
        stats["a"] = {"total_size": 1, "file_count": 0}

        propagate_dir(stats, "a", "a/b/c")
        self.assertEqual(stats["a"]["file_count"], 3)
        self.assertEqual(stats["a/b"]["file_count"], 3)
        self.assertEqual(stats["a"]["total_size"], 111)

    def test_propagate_dir_abs_path(self):
        stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
        stats["/a/b/c"] = {"total_size": 0, "file_count": 3}
        stats["/a/b"] = {"total_size": 0, "file_count": 0}
        stats["/a"] = {"total_size": 0, "file_count": 0}

        propagate_dir(stats, "/a", "/a/b/c")
        self.assertEqual(stats["/a"]["file_count"], 3)
        self.assertEqual(stats["/a/b"]["file_count"], 3)

    def test_propagate_dir_files_in_all(self):
        stats = defaultdict(lambda: {"total_size": 0, "file_count": 0})
        stats["a/b/c"] = {"total_size": 0, "file_count": 3}
        stats["a/b"] = {"total_size": 0, "file_count": 2}
        stats["a"] = {"total_size": 0, "file_count": 1}

        propagate_dir(stats, "a", "a/b/c")
        self.assertEqual(stats["a"]["file_count"], 6)
        self.assertEqual(stats["a/b"]["file_count"], 5)

    def test_generate_directory_statistics(self):
        sample_data = [
            ("a/b/file3.txt", 3456, "2024-12-01", "2024-12-02", "OK"),
            ("a/b/c/file1.txt", 1234, "2024-12-01", "2024-12-02", "OK"),
            ("a/b/c/file2.txt", 2345, "2024-12-01", "2024-12-02", "OK"),
            ("a/b/c/d/file4.txt", 4567, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/file3.txt", 5678, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/f/file1.txt", 6789, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/f/file2.txt", 7890, "2024-12-01", "2024-12-02", "OK"),
            ("a/e/f/g/file4.txt", 8901, "2024-12-01", "2024-12-02", "OK"),
        ]
        stats = generate_directory_statistics(sample_data)
        self.assertEqual(stats["a/b/c/d"]["file_count"], 1)
        self.assertEqual(stats["a/b/c"]["file_count"], 3)
        self.assertEqual(stats["a/b"]["file_count"], 4)
        self.assertEqual(stats["a/e/f/g"]["file_count"], 1)
        self.assertEqual(stats["a/e/f"]["file_count"], 3)
        self.assertEqual(stats["a/e"]["file_count"], 4)
        self.assertEqual(stats["a"]["file_count"], 8)


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "test":
        unittest.main(
            argv=sys.argv[:1]
        )  # Run tests if "test" is provided as an argument
    else:
        main()
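
A minimal usage sketch for the script above. The index format (tab-separated path, size, modified, created, error, listed depth first with files appearing before their subdirectories' contents) and the <user>-index.tsv naming are inferred from generate_directory_statistics() and main(); the sample paths and sizes are placeholders.

# Build a tiny sample index for user "asmacdo", then generate the report.
printf '%s\t%s\t%s\t%s\t%s\n' \
  asmacdo/notes.txt              1024 2024-12-01 2024-12-01 OK \
  asmacdo/.cache/tmp.bin         10   2024-12-01 2024-12-01 OK \
  asmacdo/.cache/pip/wheel-1.whl 2048 2024-12-01 2024-12-01 OK \
  > asmacdo-index.tsv
python3 .github/scripts/calculate-directory-stats.py asmacdo-index.tsv
# Or run the built-in unit tests instead:
python3 .github/scripts/calculate-directory-stats.py test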
63 changes: 63 additions & 0 deletions .github/scripts/cleanup-ec2.sh
@@ -0,0 +1,63 @@
#!/usr/bin/env bash

set -e
Review suggestion (Member): use "set -eu" instead of "set -e".


# Load environment variables from the file if they are not already set
ENV_FILE=".ec2-session.env"
Review suggestion (Member): use ENV_FILE="/run/user/$(id -u)/ec2-session.env" instead of ENV_FILE=".ec2-session.env".

if [ -f "$ENV_FILE" ]; then
echo "Loading environment variables from $ENV_FILE..."
source "$ENV_FILE"
else
echo "Warning: Environment file $ENV_FILE not found."
fi

# Ensure required environment variables are set
if [ -z "$INSTANCE_ID" ]; then
echo "Error: INSTANCE_ID is not set. Cannot proceed with cleanup."
exit 1
fi

if [ -z "$ALLOC_ID" ]; then
echo "Error: ALLOC_ID is not set. Cannot proceed with cleanup."
exit 1
fi

# Check for AWS CLI and credentials
if ! command -v aws &>/dev/null; then
echo "Error: AWS CLI is not installed. Please install it and configure your credentials."
exit 1
fi

if ! aws sts get-caller-identity &>/dev/null; then
echo "Error: Unable to access AWS. Ensure your credentials are configured correctly."
exit 1
fi

# Terminate EC2 instance
echo "Terminating EC2 instance with ID: $INSTANCE_ID..."
if aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --no-cli-pager; then
echo "Instance termination initiated. Waiting for the instance to terminate..."
if aws ec2 wait instance-terminated --instance-ids "$INSTANCE_ID"; then
echo "Instance $INSTANCE_ID has been successfully terminated."
else
echo "Warning: Instance $INSTANCE_ID may not have terminated correctly."
fi
else
echo "Warning: Failed to terminate instance $INSTANCE_ID. It may already be terminated."
fi

# Release Elastic IP
echo "Releasing Elastic IP with Allocation ID: $ALLOC_ID..."
if aws ec2 release-address --allocation-id "$ALLOC_ID"; then
echo "Elastic IP with Allocation ID $ALLOC_ID has been successfully released."
else
echo "Warning: Failed to release Elastic IP with Allocation ID $ALLOC_ID. It may already be released."
fi

# Cleanup environment file
if [ -f "$ENV_FILE" ]; then
echo "Removing environment file $ENV_FILE..."
rm -f "$ENV_FILE"
fi

echo "Cleanup complete."