+#+date: 2017-10-10 16:56:56 +0800
+#+filetags: R analysis
+#+title: Explore Australian road fatalities.
+
+** Road fatalities in Australia
+:PROPERTIES:
+:CUSTOM_ID: road-fatalities-in-australia
+:END:
+Having recently been inspired to do a little analysis again, I landed
+on a dataset from
+[[https://bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx]],
+which I downloaded on 5 Oct 2017. Open datasets like this one are a
+great example of how governments are moving with the times!
+
+** Trends
+:PROPERTIES:
+:CUSTOM_ID: trends
+:END:
+I started by looking at the trends - what is the approximate number of
+road fatalities a year, and how is it evolving over time? Are there any
+differences noticeable between states? Or by gender?
+
+#+CAPTION: Overall trendline
+#+ATTR_HTML: :class img-fluid :alt Overall trendline
+[[file:assets/explore-AU-road-fatalities_files/fatalitiesTrends-1.png]]
+#+CAPTION: Trendlines by Australian state
+#+ATTR_HTML: :class img-fluid :alt Trendlines by Australian state
+[[file:assets/explore-AU-road-fatalities_files/fatalitiesTrends-2.png]]
+#+CAPTION: Trendlines by gender
+#+ATTR_HTML: :class img-fluid :alt Trendlines by gender
+[[file:assets/explore-AU-road-fatalities_files/fatalitiesTrends-3.png]]
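+
+The overall trendline boils down to a count of fatalities per year. The
+following is a minimal sketch, not the code behind the figures above: it
+assumes a data frame named fatalities with one row per fatality and a
+Year column, and it uses a tiny made-up stand-in for the real data so
+that it runs on its own. The names yearly and trend_plot are
+hypothetical.
+
+#+begin_example
+library(dplyr)
+library(ggplot2)
+
+# Made-up stand-in for the real dataset: one row per fatality
+fatalities <- data.frame(Year = c(2014, 2014, 2014, 2015, 2015, 2016, 2017))
+
+# Count fatalities per year, leaving out the incomplete year 2017
+yearly <- fatalities %>%
+  filter(Year != 2017) %>%
+  count(Year, name = "Fatalities")
+
+# Line chart of the yearly totals
+trend_plot <- ggplot(yearly, aes(x = Year, y = Fatalities)) +
+  geom_line() +
+  labs(title = "Australian road fatalities per year", y = "Fatalities")
+#+end_example
+
+Adding a grouping column to the count() call and mapping it to colour
+would give the per-state and per-gender variants, assuming such columns
+exist in the data.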
+
+** What age group is most at risk in city traffic?
+:PROPERTIES:
+:CUSTOM_ID: what-age-group-is-most-at-risk-in-city-traffic
+:END:
+Next, I wondered if there were any particular ages that were more at
+risk in city traffic. I opted to quickly bin the data to produce a
+histogram.
+
+#+begin_example
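+library(dplyr)
+library(ggplot2)
+library(ggthemes) # provides theme_economist()
+
+# Leave out 2017 and keep zones with a speed limit of 50 km/h or less
+fatalities %>%
+  filter(Year != 2017, Speed_Limit <= 50) %>%
+  ggplot(aes(x = Age)) +
+  geom_histogram(binwidth = 5) +
+  labs(title = "Australian road fatalities by age group",
+       y = "Fatalities") +
+  theme_economist()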
+
+## Warning: Removed 2 rows containing non-finite values (stat_bin).
+#+end_example
+
+#+CAPTION: histogram
+#+ATTR_HTML: :class img-fluid :alt histogram
+[[file:assets/explore-AU-road-fatalities_files/fatalities.cityTraffic-1.png]]
+
+** Hypothesis
+:PROPERTIES:
+:CUSTOM_ID: hypothesis
+:END:
+Based on the above, I wondered: are people aged 65 and over more likely
+to die in slow traffic areas? To make this a bit easier, I added two
+variables to the dataset: one splitting people into those younger than
+65 and those 65 or older, and one flagging whether the speed limit at
+the crash site was at most 50 km/h (city traffic in Australia) or
+higher.
+
+#+begin_example
+fatalities.pensioners <- fatalities %>%
+  filter(Speed_Limit <= 110) %>% # less than 2% has this - determine why
+  mutate(Pensioner = Age >= 65,                # TRUE for ages 65 and over
+         Slow_Traffic = Speed_Limit <= 50) %>% # TRUE for 50 km/h or less
+  filter(!is.na(Pensioner))                    # drop rows where Age is missing
+#+end_example
+
+To answer the question, I produced a density plot and a boxplot.
+
+#+CAPTION: densityplot
+#+ATTR_HTML: :class img-fluid :alt densityplot
+[[file:assets/explore-AU-road-fatalities_files/fatalitiesSegmentation-1.png]]
+#+CAPTION: boxplot
+#+ATTR_HTML: :class img-fluid :alt boxplot
+[[file:assets/explore-AU-road-fatalities_files/fatalitiesSegmentation-2.png]]
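+
+One plausible way to draw such a comparison is sketched below. This is
+an assumption about the plots, not the original plotting code: it
+compares the Age distribution between the two Slow_Traffic groups, and
+it runs on a made-up data frame (toy) rather than the real data. The
+names toy, density_plot and box_plot are hypothetical.
+
+#+begin_example
+library(ggplot2)
+
+# Made-up stand-in with the two columns the plots need
+set.seed(1)
+toy <- data.frame(
+  Age = c(round(runif(50, 0, 90)), round(runif(50, 40, 95))),
+  Slow_Traffic = rep(c(FALSE, TRUE), each = 50)
+)
+
+# Age distribution per speed-zone group, as overlaid densities
+density_plot <- ggplot(toy, aes(x = Age, fill = Slow_Traffic)) +
+  geom_density(alpha = 0.5)
+
+# The same comparison as a boxplot
+box_plot <- ggplot(toy, aes(x = Slow_Traffic, y = Age)) +
+  geom_boxplot()
+#+end_example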
+
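+A two-sample test for equality of proportions supports the hypothesis:
+people aged 65 and over make up about 26% of the fatalities in zones of
+50 km/h or less, against about 16% in faster zones.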
+
+#+begin_example
+# Build a contingency table and perform prop test
+cont.table <- table(select(fatalities.pensioners, Slow_Traffic, Pensioner))
+cont.table
+
+##             Pensioner
+## Slow_Traffic FALSE  TRUE
+##        FALSE 36706  7245
+##        TRUE   1985   690
+
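+# With a two-column table, prop.test treats the first column
+# (Pensioner == FALSE) as the "successes", so the estimates below
+# are the shares of under-65s in each row
+prop.test(cont.table)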
+
+##
+## 2-sample test for equality of proportions with continuity
+## correction
+##
+## data: cont.table
+## X-squared = 154.11, df = 1, p-value < 2.2e-16
+## alternative hypothesis: two.sided
+## 95 percent confidence interval:
+## 0.07596463 0.11023789
+## sample estimates:
+## prop 1 prop 2
+## 0.8351573 0.7420561
+
+# Alternative approach to using prop test
+pensioners <- c(
+  nrow(filter(fatalities.pensioners, Slow_Traffic == TRUE, Pensioner == TRUE)),
+  nrow(filter(fatalities.pensioners, Slow_Traffic == FALSE, Pensioner == TRUE))
+)
+everyone <- c(
+  nrow(filter(fatalities.pensioners, Slow_Traffic == TRUE)),
+  nrow(filter(fatalities.pensioners, Slow_Traffic == FALSE))
+)
+prop.test(pensioners, everyone)
+
+##
+## 2-sample test for equality of proportions with continuity
+## correction
+##
+## data: pensioners out of everyone
+## X-squared = 154.11, df = 1, p-value < 2.2e-16
+## alternative hypothesis: two.sided
+## 95 percent confidence interval:
+## 0.07596463 0.11023789
+## sample estimates:
+## prop 1 prop 2
+## 0.2579439 0.1648427
+#+end_example
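+
+The estimates can be checked by hand from the counts in the contingency
+table. The sketch below rebuilds that table from the printed numbers, so
+it runs without the dataset; the names share and result are
+hypothetical, and the matrix is a stand-in for the table computed from
+the real data.
+
+#+begin_example
+# Counts copied from the contingency table above (filled column by column)
+cont.table <- matrix(c(36706, 1985, 7245, 690), nrow = 2,
+                     dimnames = list(Slow_Traffic = c("FALSE", "TRUE"),
+                                     Pensioner = c("FALSE", "TRUE")))
+
+# Share of fatalities aged 65 or older within each speed-zone group
+share <- cont.table[, "TRUE"] / rowSums(cont.table)
+round(share, 4)
+
+# Same test as above, with the 65-and-over counts as "successes"
+# and the slow-traffic group listed first
+result <- prop.test(cont.table[c("TRUE", "FALSE"), "TRUE"],
+                    rowSums(cont.table)[c("TRUE", "FALSE")])
+result$estimate
+#+end_example
+
+The two shares are simply 690 / (1985 + 690) for slow traffic and
+7245 / (36706 + 7245) for faster traffic, which is where prop 1 and
+prop 2 in the second test come from.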
+
+** Conclusion
+:PROPERTIES:
+:CUSTOM_ID: conclusion
+:END:
+Older people are over-represented among the fatalities in lower speed
+zones. One idea for further investigation is the impact of the minimum
+driving age on the fatalities; the position in the car of the person
+killed (driver or passenger) was not yet considered in this quick look
+at the contents of the dataset.
+
+#+CAPTION: quantile-quantile plot
+#+ATTR_HTML: :class img-fluid :alt quantile-quantile plot
+[[file:assets/explore-AU-road-fatalities_files/fatalitiesDistComp-1.png]]