Minor fixes #10

Merged · 9 commits · Jul 25, 2022
42 changes: 28 additions & 14 deletions notebooks/tutorial/1 - Setup.ipynb
@@ -14,16 +14,6 @@
"In this tutorial, we will setup the enviornment, download a sample of data, create header tomes, explore some match data to create an engineering pipeline, apply that pipeline to the sample data to create a tome, and finally go through that tome to train a model."
]
},
{
"cell_type": "markdown",
"id": "b3fd27b4",
"metadata": {},
"source": [
"## Setup your AWS credentials\n",
"\n",
"Follow the [guide on setting up AWS credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration) to properly download things from the AWS Data Exchange."
]
},
{
"cell_type": "markdown",
"id": "937d17f2",
@@ -34,7 +24,7 @@
"Setup your enviornment to find all the yummy data science files on your computer box. We must set three things, \n",
"\n",
"1. Where you will save the data science files.\n",
"1. Where you will save your tomes (views of the data). Can be the same as step 1.\n",
"1. Where you will save your tomes (combinations of the data). Can be the same as step 1.\n",
"1. What to call your header tome.\n",
"\n",
"These will be saved in a `.env` file in the `notebooks` folder. \n",
@@ -54,11 +44,32 @@
"outputs": [],
"source": [
"import os\n",
"\n",
"## Edit this to point to where you want to save the csds files.\n",
"## Leaving this as-is puts the data in a folder named tmp at the\n",
"## top level of this repo.\n",
"ds_collection_path = os.path.join('..','..','tmp') \n",
"\n",
"## Edit this to point at where you want to save combinations\n",
"## of csds files which we call a \"tome\".\n",
"## Leaving this as-is puts the data in a folder named tmp at the\n",
"## top level of this repo\n",
"tome_collection_path = os.path.join('..','..','tmp')\n",
"ds_collection_path = os.path.join('..','..','tmp')\n",
"\n",
"## If you have not downloaded any data yet, keep this as is\n",
"## If you have downloaded the data, adjust the date ranges\n",
"## to match the range of dates you downloaded.\n",
"header_name = 'header_tome.2022-05-15,2022-05-15'"
]
},
{
"cell_type": "markdown",
"id": "598c6fb2",
"metadata": {},
"source": [
"#### You can always change these later by editing your `.env` file in the `notebooks` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -70,7 +81,7 @@
"import re\n",
"import warnings\n",
"\n",
"if not re.match(\"\\S+.\\d{4}-\\d{2}-\\d{2},\\d{4}-\\d{2}-\\d{2}.*\\S*\",header_name):\n",
"if not re.match(r\"\\S+.\\d{4}-\\d{2}-\\d{2},\\d{4}-\\d{2}-\\d{2}[.\\S*]*\",header_name):\n",
Member: Can we export this warn function and use it here from the dsdk?

Member Author: Yes, but that requires pulling in dsdk, releasing a new version, then upgrading the version in here.

Member Author: Raised as #11 for later.

" warnings.warn(f'Header name of {header_name} does not match convention of tome_name.start-date,end-data.comment')\n",
"\n",
"ds_type = 'csds'\n",
@@ -98,7 +109,10 @@
" if system != 'Linux':\n",
" print('unknown OS. The .env file was saved using linux syntax. If that is not right please edit it yourself lol.')\n",
"with open(os.path.join('..','.env'), 'r') as f:\n",
" print(f.read())"
" print(f.read())\n",
"\n",
"os.makedirs(ds_collection_path, exist_ok=True)\n",
"os.makedirs(tome_collection_path, exist_ok=True)"
]
},
{
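For orientation, here is a minimal sketch of how a later notebook could read the generated `.env` back in. The key names below are hypothetical — check the file the setup cell actually wrote for the real keys — and `python-dotenv` is assumed to be installed:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load notebooks/.env, written by the setup cell above.
load_dotenv(os.path.join('..', '.env'))

# Hypothetical key names; use the keys in your actual .env file.
ds_collection_path = os.getenv('CSDS_COLLECTION_PATH')
tome_collection_path = os.getenv('TOME_COLLECTION_PATH')
header_name = os.getenv('HEADER_TOME_NAME')

print(ds_collection_path, tome_collection_path, header_name)
```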
26 changes: 20 additions & 6 deletions notebooks/tutorial/3 - Make header tome.ipynb
@@ -9,7 +9,13 @@
"\n",
"Let's say you wanted to cluster smoke grenades on dust2. One match may have ~100 smokes which isn't enough to do clustering. To get a large enough dataset for clustering, you need hundreds of thousands or millions of smokes. You would have to loop through thousands of matches and read thousands of files. No matter the file size, reading so many files is time consuming and cumbersome.\n",
"\n",
"The solution to this is combining the data from many matches into a \"Tome\". Once created, a tome allows you to read in the data from thousands of matches without having to read in thousands of files."
"The solution to this is combining the data from many matches into a \"**tome**\". A tome contains \"**pages**\" which have concatenated dataframes, reducing the number of files to read. The maximum size one page can grow to is something you have control over. Making a tome is three basic steps:\n",
"\n",
"1. Determine which matches to include.\n",
"2. Loop through these matches and apply a transformation to a single dataframe.\n",
"3. Combine all of these dataframes into a tome.\n",
"\n",
"The tome \"**curator**\" manages steps 1 and 3 while step 2 happens in a loop you define. The first step is to decide which csds files to include, and you do that by pointing at a *special kind of tome*, a **header** or **subheader** tome. This notebook shows you how to make those special tomes. We discuss tomes in more details in later steps of the tutorial."
]
},
{
@@ -19,9 +25,13 @@
"source": [
"## Make header tome\n",
"\n",
"To start making tomes, we must make a *header tome*. This tome contains the path to all matches that will be considered as tome members. The header tome maker uses glob to find your files, read in the header channel, and stitch them all together. Each row corresponds to one match. CSDS files that are not in the main header tome are invisible in subsequent steps. Right now, the file search assumes your files are nested to the same level, but a more robust csds finder is possible, just open a PR.\n",
"The header tome contains the header channel data and path to all csds files it can find. This requires a special function within the tome creator called `create_header_tome`. It uses glob to find all your csds files, reads in the header channel of each one, then stitches them all together. \n",
"\n",
"The end result is a tome that contains a dataframe where each row corresponds to one match's header channel data and the path to the csds file (from glob). Since the path is included, we never need to use glob to find the files again. \n",
"\n",
"You can also make subheader tomes that filters out some of the header tome rows. An example of a subheader tome that selects only matches on dust2 is in the second to last cell. Subheader tomes are useful when exploring certain maps, skill ranges, or any info from a match that can be found in the header. \n",
"If you want to include all matches in a tome, point at the header tome to determine which matches to include (step 1 from previous cell). This is the default option anyway though...\n",
"\n",
"(Don't forget the subheader section below.)\n",
"\n",
"_**Run this notebook as-is.**_"
]
@@ -108,11 +118,15 @@
"id": "b35b54d9",
"metadata": {},
"source": [
"## Make subheaders too\n",
"## Make subheaders\n",
"\n",
"You can also make subheader tomes that don't include some of the header tome rows (remember, each row = one match). You might want to analyze players on a specific map, rank, or platform. You can create \"subheaders\" that are a filtered view of the main header. Then, when making a tome, you can point at a subheader to run your transformation and combination only on relevant matches.\n",
"\n",
"Subheader tomes are useful when exploring certain maps or skill ranges, but the filtering is limited to info you can find in a header channel. Subheader tomes do not use glob to find files, but instead just read in the header tome and apply a filter that goes through pandas' `loc` function.\n",
"\n",
"You might want to analyze players on a specific map, rank, or platform. You can create \"subheaders\" that are a filtered view of the main header. The `create_subheader_tome` will create the subheader with the specified filter applied to the header tome.\n",
"An example of making a subheader tome is below this cell. The `create_subheader_tome` will create the subheader with the specified filter applied to the header tome.\n",
"\n",
"Remember that the convention for the tome names are: `tome_name.start-date,end-data.comment`"
"Remember that the convention for the tome names are: `tome_name.start-date,end-date.comment`"
]
},
{
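To make the header/subheader flow concrete, here is a hedged sketch. The `curator` object, the filter signature, and the `map_name` column are assumptions based on the function names the notebook mentions; the actual notebook cells are the authoritative version:

```python
# Sketch only: `curator` stands for the tome curator object the tutorial
# constructs earlier; its import and constructor are not shown in this diff.

# 1. Build the main header tome: glob for csds files, read each file's
#    header channel, and stitch the rows together (one row per match).
curator.create_header_tome()

# 2. Build a subheader keeping only dust2 matches. Per the notebook text,
#    the filter is applied to the header dataframe via pandas' loc.
#    The column name 'map_name' and this call signature are assumptions.
curator.create_subheader_tome(
    'header_tome.2022-05-15,2022-05-15.de_dust2',  # tome_name.start-date,end-date.comment
    lambda df: df['map_name'] == 'de_dust2',
)
```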
2 changes: 1 addition & 1 deletion notebooks/tutorial/5 - Create tome.ipynb
@@ -17,7 +17,7 @@
"1. Repeat from step 2 until no more matches remain.\n",
"\n",
"\n",
"To make a tome, we use the `make_tome` function of the tome curator. This function handles reading the data science files, concatenating dataframes, writing tome pages, deciding when to write tome pages, and keeping track of matches included. We only need to provide the name of the tome we are making and the name of the header or subheader tome to serve as the index of matches. \n",
"To make a tome, we use the `make_tome` function of the tome curator. This function handles reading the data science files, concatenating dataframes, writing tome pages, and keeping track of matches included. We only need to provide the name of the tome we are making and the name of the header or subheader tome to serve as the index of matches. \n",
"\n",
"An important (but optional) parameter is the `ds_reading_instructions` where you only read in certain channels and columns for each match. This generally provides a drastic speed up because of how large some channels are, particularly `player_vector`, `player_status`, and `tick`. Avoiding reading those channels altogether will speed things up.\n",
"\n",
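As a hedged illustration of the paragraph above, a `make_tome` call with reading instructions might look like the sketch below. The instruction schema, parameter names, and the channel/column names are assumptions, not the library's confirmed API:

```python
# Read only the channels/columns we need, skipping the huge channels
# (player_vector, player_status, tick) for a drastic speed-up.
reading_instructions = [
    {'channel': 'kill', 'columns': ['round', 'attacker_user_id']},  # hypothetical schema
]

curator.make_tome(
    'kills.2022-05-15,2022-05-15',                    # name of the tome being made
    header_name='header_tome.2022-05-15,2022-05-15',  # header/subheader used as the match index
    ds_reading_instructions=reading_instructions,
)
```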
2 changes: 1 addition & 1 deletion notebooks/tutorial/6 - Train data science models.ipynb
@@ -16,7 +16,7 @@
"\n",
"Because they can be large, the combined dataframe data are saved in separate files called \"pages\". You can set a max (memory) size for each page when making a tome.\n",
"\n",
"Let's read in that tome and train a model to go from number of footsteps to rank, obviously.\n",
"Let's read in that tome and train a model to go from number of footsteps to rank, as a demonstration.\n",
"\n",
"_**Run this notebook as-is.**_"
]
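To show the shape of the training step, here is an illustrative sketch. The page-loading call and the column names are hypothetical; only the footsteps-to-rank idea comes from the notebook:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical API: load every page of the tome as a list of dataframes,
# then concatenate them into a single training frame.
pages = curator.get_tome('footsteps.2022-05-15,2022-05-15')
df = pd.concat(pages, ignore_index=True)

# Hypothetical column names: footstep count per match as the lone feature,
# rank as the label.
model = LogisticRegression(max_iter=1000)
model.fit(df[['footstep_count']], df['rank'])
```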
10 changes: 10 additions & 0 deletions notebooks/tutorial/7 - Getting csds data from the ADX.ipynb
@@ -91,6 +91,16 @@
"------------"
]
},
{
"cell_type": "markdown",
"id": "e67413fd",
"metadata": {},
"source": [
"# 0. Before you start, setup your AWS credentials\n",
"\n",
"Follow the [guide on setting up AWS credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration) to properly download things from the AWS Data Exchange."
]
},
{
"cell_type": "markdown",
"id": "29b4d737",
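For readers new to boto3, the linked guide amounts to either running `aws configure` (from the AWS CLI) or creating the shared credentials file by hand. This is the standard AWS layout, with placeholder values:

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

boto3 picks this file up automatically, so no credentials need to appear in the notebooks themselves.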