-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path01-part-1.qmd
77 lines (48 loc) · 4.15 KB
/
01-part-1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
---
title: "Part 1 -- Introduction to the Dataset"
execute:
eval: false
jupyter: python3
---
::: {.callout-tip}
## Learning objectives
* Getting familiar with what the ACS PUMS data is and how it is structured
* Understanding the questions we are going to explore in this tutorial
:::
## Presentation of the dataset
### What is PUMS?
The Public Use Microdata Sample (PUMS) dataset is a subsample of the American Community Survey (ACS) interviews done on one percent of all US households. It is a detailed collection of information including demographic, housing, and economic characteristics. It is collected by the U.S. Census Bureau as part of the decennial census and is made available to the public for research and analysis. However, due to its large size, analyzing the PUMS dataset is difficult. Today, we will show you how it is possible to wrangle it with just a couple of tools.
Some characteristics of PUMS:
- Public Use: data is anonymized, in the public domain, and downloadable
- Microdata: records of individual people
- Sample: a representative sample of the population

The ACS typically produces 1-, and 5-year PUMS files that they make [available](https://www.census.gov/programs-surveys/acs/microdata/access.html) as SAS and CSV files.
### The dataset
In this tutorial we will be using data from the [PUMS 2017-2021 ACS 5 year dataset](https://www.census.gov/programs-surveys/acs/microdata/access.2021.html#list-tab-735824205). We will be accessing these PUMS files through the File Transfer Protocol (FTP).
The 2021 ACS 5-year PUMS dataset contains a set of two file types:
* housing records
* person records
These records represent the fact that PUMS is divided into data about people and data about households. Person data is named `psam_p**.csv`, and household data is named `psam_h**.csv`. The household and person data can be joined using the `SERIALNO` field. Each household is identified with a unique `SERIALNO`, and each person is identified by the combination of the `SERIALNO` and `SPORDER` values.
Because the PUMS data is a sample of 1% of the population, the data includes weights that can be used to estimate the number of people the observation represents. For this tutorial, we do not take these weights into account as we are focusing on how to work with data rather than exact demographic estimates. If you would like to learn more about using the weights included in the ACS PUMS data to calculate more accurate estimates, you can refer to this document from the [US Census Bureau](https://www.census.gov/content/dam/Census/library/publications/2021/acs/acs_pums_handbook_2021_ch05.pdf).
Each file contains over 234 columns, and you can find the column name and the question that was asked to retrieve the information in the [PUMS Data Dictionary](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2017-2021.pdf).
In this tutorial, we are going to explore how the proportion of people owning
and renting their house changes with age. We will also calculate the proportion
of households with children under the age of 15 that do not have internet
access.
For this, we will use the following columns from the household
and person datasets:
- **psam_h\*\*.csv (household data) (4 GB)**:
+ `TEN`: The type of tenure: whether the people living in the house are
renting, owning, or occupying without payment of rent.
+ `HHLDRAGEP`: The age of the householder.
+ `ST`: a list of numbers, corresponding to the various US states and Puerto
Rico (alphabetized).
+ `ACCESSINET`: whether the household has internet access.
- **psam_p\*\*.csv (person data, most specialized) (10.1 GB)**:
+ `AGEP`: The person age.
We are also providing a [table](./data/pums_states.csv) that we can use to convert the
numerically-encoded US states into their 2-letter abbreviations.
More dataset info: <https://www.census.gov/programs-surveys/acs/microdata/documentation.2021.html#list-tab-1370939201>.
Data source (FTP): <https://www2.census.gov/programs-surveys/acs/data/pums/2021/5-Year/>.
In Part 2 of this tutorial we will get the data from the FTP site and and store it on our machine.