Analysis of Brightside housing data.
Datathon day. Processed 2021 crime data per neighbourhood. Manually linked neighbourhoods to postal codes. Merged with Amenities dataset. Analyzed the 2021 crime rate of each building's neighbourhood.
2020 survey: trimmed variables.
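A minimal sketch of the manual neighbourhood/postal-code link, assuming pandas. The column names (`postal_code`, `neighbourhood`), the function name and the lookup entries are illustrative placeholders, not the values used in the project. It yields crime counts per neighbourhood; turning these into a rate would need a denominator that these notes do not specify.

```python
import pandas as pd

# Hypothetical hand-built lookup (postal code -> neighbourhood); entries are made up.
POSTAL_TO_NEIGHBOURHOOD = {"V6A 1A1": "Strathcona", "V6G 1A1": "West End"}


def crime_by_building(buildings, crimes, lookup):
    """Attach the crime count of each building's neighbourhood.

    buildings: one row per building, with a 'postal_code' column (assumed name).
    crimes: one row per recorded incident, with a 'neighbourhood' column (assumed name).
    """
    # Count incidents per neighbourhood.
    counts = (
        crimes.groupby("neighbourhood").size().rename("crime_count").reset_index()
    )
    out = buildings.copy()
    # Apply the manual link; unmapped postal codes become NaN.
    out["neighbourhood"] = out["postal_code"].map(lookup)
    # Left merge keeps every building, even those without crime records.
    return out.merge(counts, on="neighbourhood", how="left")
```

A left merge is used so that buildings with an unmapped postal code remain visible as missing values instead of silently dropping out.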
2020 survey trimmed variables, null values filled. Ready for EDA.
Using cleaned 2020 data, removed moderately-to-highly correlated variables and checked plots of location vs. other variables, along with the survey questions and answer values. Initially did some location-based plotting; however, might need to fall back on df.hist because taking the mean is not applicable (values not really ordinal).
-> Task: convert 'prefer not to answer' answers to NaN, recheck missing value distribution.
Cleaned Survey 2020 dataset. Removed irrelevant variables, replaced 'prefer not to answer' values with NaN, removed variables with >10% missing values, removed moderately-to-highly correlated variables.
Using df10, fast EDA using hist.
categorical = location, gender, rln_status, work_paid, work_vol, bside_pre
binary = household, walk_aid, imm_status, bside_pre_muni
ordinal = nbr_relation, food_worry, apprch_bside, hlth_happy, chat_often, rlns_safe, age, bside_dur
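The cleaning steps above could be sketched roughly as follows, assuming pandas and NumPy. The function name, the exact answer label and the 0.5 correlation cut-off are assumptions for illustration; the notes only say "mod-highly correlated" and do not give the threshold used.

```python
import numpy as np
import pandas as pd


def clean_survey(df, missing_labels=("prefer not to answer",),
                 max_missing=0.10, corr_threshold=0.5):
    """Sketch of the cleaning steps; thresholds and labels are assumptions."""
    # 1. Treat non-answers as missing values.
    cleaned = df.mask(df.isin(list(missing_labels)))

    # 2. Drop variables with more than `max_missing` share of missing values.
    missing_share = cleaned.isna().mean()
    cleaned = cleaned.loc[:, missing_share <= max_missing]

    # 3. For each pair of numeric variables whose absolute correlation
    #    exceeds the cut-off, drop the later one of the pair.
    corr = cleaned.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return cleaned.drop(columns=to_drop)
```

Converting non-answers to NaN first matters: otherwise the >10% missing filter would undercount missingness.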
Findings: Most tenants are Canadian-born, aged >55, female, single and living by themselves.
Most are retirees, not doing volunteer work.
Majority were not worried about food. They had acceptable neighbourly social interactions and generally felt content and safe.
Most tenants were previously renting from private companies in Vancouver and are now long-term Brightside clients who think that Brightside is approachable.
Limitations: Findings might be skewed by representation bias.
Task: Get means, merge with Assets dataset.
Created dataset indexed by locations, populated by means of the ordinal and binary variables.
For merging with Assets.
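A possible shape for the location-indexed means table, assuming pandas, a `location` column and binary variables coded 0/1 (so that their mean is the share of "1" answers). The function name and coding are assumptions, not taken from the notebooks.

```python
import pandas as pd


def location_means(df, value_cols, location_col="location"):
    """Mean of each ordinal/binary variable per location (sketch)."""
    # Coerce to numeric so stray text values become NaN and are skipped by mean().
    values = df[value_cols].apply(pd.to_numeric, errors="coerce")
    # Group the numeric values by the location of each respondent.
    return values.groupby(df[location_col]).mean()
```

Only ordinal and binary variables should be passed in `value_cols`; a mean of nominal categories such as `gender` would not be meaningful.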
Created 2020 crime dataset, for merging with Assets and Means.
Merged Survey 2020 means, Assets and 2020 neighbourhood crimes.
means_ass_crime with nulls filled.
Ready for further analyses.
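The merge could look something like the sketch below, assuming pandas. The key columns (`location`, `neighbourhood`) and the median fill are assumptions; the notes do not state how nulls in means_ass_crime were filled.

```python
import pandas as pd


def merge_means_assets_crime(means, assets, crime):
    """Join location means, assets and neighbourhood crime (sketch).

    means: indexed by 'location'; assets: has 'location' and 'neighbourhood';
    crime: has 'neighbourhood'. All key names are assumed.
    """
    merged = means.reset_index().merge(assets, on="location", how="left")
    merged = merged.merge(crime, on="neighbourhood", how="left")
    # Assumed fill strategy: replace numeric nulls with the column median.
    numeric = merged.select_dtypes("number").columns
    merged[numeric] = merged[numeric].fillna(merged[numeric].median())
    return merged.set_index("location")
```

Left merges keep every surveyed location even when a property has no matching asset or crime record, which is where the nulls to fill come from.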
(Survey datasets and csv files not published for confidentiality reasons)
From data loading to cleaning and analyses: survey 2020 data, crime 2020, assets.
Linear regression done. Ordinal regression not showing notable results -> not included.
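For reference, an ordinary least squares fit on the merged table can be sketched with NumPy alone. The notes do not say which library, outcome or predictors were used, so the function name and its arguments are purely illustrative.

```python
import numpy as np
import pandas as pd


def fit_ols(df, outcome, predictors):
    """Ordinary least squares with an intercept (sketch)."""
    predictors = list(predictors)
    # Drop rows with missing values in any variable used by the model.
    data = df[[outcome] + predictors].dropna()
    # Design matrix: a column of ones for the intercept, then the predictors.
    X = np.column_stack([np.ones(len(data)), data[predictors].to_numpy(dtype=float)])
    y = data[outcome].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=["intercept"] + predictors)
```

This returns coefficients only; standard errors and p-values would need a statistics package.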
Revision of Grouped data, addressing 2 properties whose names differed from those in the Survey data. No major difference in results, apart from the lowest crime rate now being at Stanley Park.