Global suicide rates among men and women — the accuracy of models lacking data
Mental health struggles and suicide have been present in our society for as long as records have been kept. Whether a country is considered first, second, or third world, mental health has been on a steady decline as the decades go by.
Multiple factors can exacerbate the decline of one’s mental health, including workload, day-to-day stressors, inaccessibility of medical and mental health resources, and the general living conditions of a country. For example, suicide rates among men in Japan are often believed to be among the highest of any country, alongside the United States, given the stress placed on the average Japanese businessman on any given day and the normalization of ‘burnout’ culture. This can be seen when the total number of reported suicides per country is visualized.
Given the high volume of suicides, is it possible to predict whether a man or a woman is more likely to succumb to this unfortunate action?
We can begin by creating a baseline: a general visualization of how our data set is split between entries for men and women. Given the classification nature of the target ‘sex’, we will find the majority class and establish an overall accuracy score.
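A minimal sketch of that baseline check, assuming the publicly available Kaggle suicide-rates CSV and its ‘sex’ column (the file name and column names here are assumptions, not details taken from the original notebook):

```python
import pandas as pd

# File name and column names are assumptions based on the public
# Kaggle "Suicide Rates Overview" dataset this post appears to use.
df = pd.read_csv('master.csv')

# How evenly is the target split between male and female entries?
print(df['sex'].value_counts(normalize=True))

# Quick bar chart of the split
df['sex'].value_counts().plot(kind='bar', title='Entries by sex')
```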
Since the classes look very evenly split, calculating the baseline accuracy score could prove very helpful. In preparation, I excluded columns with a lot of NaNs, such as ‘HDI for year’, and dropped ‘country-year’, since ‘country’ and ‘year’ are already included as separate features.
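Continuing the same sketch, the column drops and the train/validation split might look something like this (the split parameters are my own assumptions):

```python
from sklearn.model_selection import train_test_split

# Drop the sparse and redundant columns noted above:
# 'HDI for year' is mostly NaN, and 'country-year' just concatenates
# 'country' and 'year', which are already separate features.
df = df.drop(columns=['HDI for year', 'country-year'])

target = 'sex'
X = df.drop(columns=[target])
y = df[target]

# Hold out a validation set for the accuracy scores discussed below
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```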
The majority class makes up almost exactly 50% of the data, and predicting it for every row of our validation set yields an accuracy score of roughly 51%. This leads me to believe that the dataset is close to evenly distributed between the male and female classifications.
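The majority-class baseline itself is only a few lines on top of that split; something along these lines:

```python
from sklearn.metrics import accuracy_score

# Majority-class baseline: predict the most frequent class for every row
majority_class = y_train.mode()[0]
baseline_pred = [majority_class] * len(y_val)
print('Baseline accuracy:', accuracy_score(y_val, baseline_pred))
```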
In an attempt to improve my model’s predictions, I compared a logistic regression against a random forest classifier, and the results were lackluster: the logistic regression yielded a 51% accuracy score, and the random forest a grisly 46%.
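A sketch of that comparison with scikit-learn follows; the exact encoders and hyperparameters the original models used are unknown, so the pipeline below is only an illustration:

```python
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encode the categorical columns (country, age band, generation)
# and scale the numeric ones for the logistic regression.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_include=object)),
    (StandardScaler(), make_column_selector(dtype_include='number')),
)

log_reg = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
forest = make_pipeline(
    preprocess, RandomForestClassifier(n_estimators=100, random_state=42))

for name, model in [('logistic regression', log_reg),
                    ('random forest', forest)]:
    model.fit(X_train, y_train)
    print(name, 'validation accuracy:', model.score(X_val, y_val))
```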
What features could be making predictions so difficult? Looking more in depth at this dataset’s features, three in particular stand out to me: age, year, and generation. What are the odds that a certain generation of men were more likely to commit suicide, such as a generation subjected to drafts during a war? As it turns out, these factors were relatively irrelevant.
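One way to check how little those columns contribute is permutation importance on the fitted forest; this is an illustrative substitute for however the original analysis measured relevance, built on the sketch pipeline above:

```python
from sklearn.inspection import permutation_importance

# Shuffle one original column at a time and measure how much the
# random forest's validation accuracy drops; columns that barely
# move the score contribute little signal to the prediction.
result = permutation_importance(
    forest, X_val, y_val, n_repeats=5, random_state=42)

for col, score in sorted(zip(X_val.columns, result.importances_mean),
                         key=lambda pair: pair[1], reverse=True):
    print(f'{col:<25} {score:.4f}')
```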
Is it, then, possible to accurately predict the likelihood of one sex committing suicide over the other? If our data set included other factors, such as levels of access to health care and mental health care, or the average lifespan of the population, it might shed a little more light on the numbers we have already seen; but with what we are given, I believe each prediction carries roughly the weight of a coin toss. To further illustrate this, I would like to highlight the rates at which false predictions are made with this model:
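A confusion matrix for the validation predictions can be reproduced along these lines (again, a sketch building on the illustrative pipeline above rather than the original code):

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the random forest's validation predictions:
# rows are true classes, columns are predicted classes.
ConfusionMatrixDisplay.from_estimator(forest, X_val, y_val)
```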
Based on this confusion matrix, we can clearly see that false predictions are made more often than the correct predictions, hence our validation accuracy of 46% for the random forest model.
In summary, a machine or model is only as good as the information provided in the data set. If the features presented are of little relevance, or if too few features exist, then short of being fed pieces of the test data outright, the predictions a model makes are as good as useless.