Detect early signs of depression on social media
October 6, 2020
Denisa Chicarosa

Natural Language Processing - Sentiment Analysis


Depression is considered one of the most common health problems worldwide, affecting more than 264 million people. Although the problem has been raised internationally, about 85% of those affected do not see a psychologist and do not receive adequate treatment.

This mental disorder causes drastic changes in mood, decreased self-confidence and a loss of interest in regular activities. People with this condition also tend to have a pessimistic attitude, feel persistently unhappy and interpret ambiguous information in a negative way. When the emotional intensity increases, it can lead to suicide attempts. Almost 800,000 people die by suicide every year, making it the second leading cause of death among people aged 15-29. [1]

We often hear and read about the causal link between the use of social media and negative effects on well-being, primarily depression and loneliness, but the ways in which this environment can help people are rarely presented. Lately, there has been a significant increase in the use of social networks such as Twitter, Facebook or Reddit, with people often choosing to express their opinions and recount their daily experiences online.

Social networks can help detect behavioral changes by building a profile associated with depressed people.

Next steps: contacting users - offering treatment - saving lives.

Analysis plan

This paper describes an analysis of user behavior aimed at building the most detailed possible profile of a person suffering from depression, compared with behavior considered normal and under control.

Dataset description

The main purpose of distinguishing between the two types of behavior is to be able to classify a user based on a series of psychological parameters specific to depression, extracted from their social network history.

For this purpose, I used eRisk 2017 [2], a collection of Reddit posts representing the history of a set of users over a period of approximately 7 years. There are two categories of users, depressed and non-depressed, and for each user the collection contains a sequence of writings in chronological order. Each user's collection of writings has been divided into 10 chunks: the first chunk contains the oldest 10% of the messages, the second chunk contains the second oldest 10%, and so forth.
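The chunking scheme can be illustrated with a minimal sketch. It assumes the writings are already sorted oldest first; the function name `split_into_chunks` is hypothetical, and the way leftover items are distributed when the count is not a multiple of 10 is an assumption, not something the dataset description specifies.

```python
def split_into_chunks(writings, n_chunks=10):
    """Split chronologically ordered writings into n_chunks consecutive parts.

    Chunk 0 holds the oldest ~10% of the writings, chunk 1 the next ~10%, etc.
    Boundaries use integer arithmetic, so sizes differ by at most one item
    (an assumption; the original split may handle remainders differently).
    """
    n = len(writings)
    bounds = [i * n // n_chunks for i in range(n_chunks + 1)]
    return [writings[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]
```

Concatenating the chunks in order gives back the original sequence, so no writing is lost or duplicated.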

Feature extraction

Each user's behavior was analyzed, and two types of features were extracted:

-     features on a daily basis

-     static features

Features on a daily basis:

-      Insomnia_index (the number of posts during the nighttime)

-      Volume (the total number of posts during the entire day)

-      Profanity (number of swear words)

-      Antidepressants (number of drug names related to depression)

-      Firstpp (number of first-person pronouns)

-      Secondpp (number of second-person pronouns)

-      Thirdpp (number of third-person pronouns)

-      Words (total number of words)

-      val_mn (valence mean)

-      val_sd (valence standard deviation)

-      aro_mn (arousal mean)

-      aro_sd (arousal standard deviation)

-      dom_mn (dominance mean)

-      dom_sd (dominance standard deviation)

-      mean_hour (the average of posts for each hour)

-      Polarity (the average of positive words divided by the average of negative words in semantic context)

-      Negativity (number of negations)

-      depression_distance (the vectorial distance between the user vocabulary and depression lexicon)
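A few of the count-based features in this list can be sketched as follows. This is only an illustration under stated assumptions: the pronoun sets are short hypothetical lists, the night window (21:00 to 05:59) is an assumed definition of "nighttime", and a simple regular expression stands in for the real tokenizer.

```python
import re
from datetime import datetime

# Hypothetical pronoun lists, for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
THIRD_PERSON = {"he", "him", "his", "she", "her", "hers", "it", "its",
                "they", "them", "their", "theirs"}

# Assumed night window: from 21:00 up to (not including) 06:00.
NIGHT_START, NIGHT_END = 21, 6


def is_night(timestamp):
    """Return True if the post time falls inside the assumed night window."""
    return timestamp.hour >= NIGHT_START or timestamp.hour < NIGHT_END


def daily_features(posts):
    """Compute count-based features for one user's posts from a single day.

    posts: list of (datetime, text) pairs.
    """
    # Simplified stand-in tokenizer: lowercase runs of letters and apostrophes.
    tokens = [tok for _, text in posts
              for tok in re.findall(r"[a-z']+", text.lower())]
    return {
        "volume": len(posts),
        "insomnia_index": sum(1 for ts, _ in posts if is_night(ts)),
        "words": len(tokens),
        "firstpp": sum(tok in FIRST_PERSON for tok in tokens),
        "secondpp": sum(tok in SECOND_PERSON for tok in tokens),
        "thirdpp": sum(tok in THIRD_PERSON for tok in tokens),
    }
```

Calling `daily_features` once per day and user yields one value per feature per day, which is the raw material for a per-feature time series.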

For each user, the daily values of each feature form a time series over the entire period. Five types of statistical measures were applied to these time series:

-     Mean frequency: the average measure of the time series signal of a feature over the entire period of analysis.

-     Variance: the variation in the time series signal over the entire time period. Given a time series Xi(1), Xi(2), …, Xi(t), …, Xi(N) on the i-th measure, it is given as: (1/N) ∑t (Xi(t) − µi)^2, where µi is the mean of the series (the mean frequency above).

-     Mean momentum: the relative trend of a time series signal, compared to a fixed period before. Given the above time series and a period length of M (= 7) days, its mean momentum is: (1/N) ∑t (Xi(t) − (1/(t − M)) ∑(M≤k≤t−1) Xi(k)).

-     Entropy: the measure of uncertainty in a time series signal. For the above time series, it is: −∑t Xi(t) log(Xi(t))

-     Vectorial distance: the Euclidean distance between two vectors. For example, given u = (2, 3, 4, 2) and v = (1, -2, 1, 3): d(u, v) = ||u - v|| = sqrt((2 - 1)^2 + (3 + 2)^2 + (4 - 1)^2 + (2 - 3)^2) = sqrt(36) = 6
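The five statistics listed above can be written out as a minimal Python sketch. It follows the formulas literally, with a few assumptions that are not stated in the text: the series is a plain list with X(1) stored at index 0, the momentum sum skips the days t ≤ M (where the inner average is undefined) while still dividing by N, and zero values contribute nothing to the entropy. All function names are hypothetical.

```python
import math


def mean_frequency(x):
    """Average value of the series over the whole period."""
    return sum(x) / len(x)


def variance(x):
    """(1/N) * sum over t of (X(t) - mu)^2."""
    mu = mean_frequency(x)
    return sum((value - mu) ** 2 for value in x) / len(x)


def mean_momentum(x, m=7):
    """(1/N) * sum over t of (X(t) - (1/(t-M)) * sum_{k=M}^{t-1} X(k)).

    Uses 1-based t as in the formula; only t = M+1 .. N are summed
    (assumption: earlier days have no baseline and are skipped).
    """
    n = len(x)
    total = 0.0
    for t in range(m + 1, n + 1):
        baseline = sum(x[m - 1:t - 1]) / (t - m)  # X(M) .. X(t-1)
        total += x[t - 1] - baseline
    return total / n


def entropy(x):
    """-sum over t of X(t) * log(X(t)); zero values are treated as 0 * log 0 = 0."""
    return -sum(value * math.log(value) for value in x if value > 0)


def euclidean_distance(u, v):
    """||u - v|| for two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

For instance, `euclidean_distance((2, 3, 4, 2), (1, -2, 1, 3))` reproduces the worked example from the list and returns 6.0.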

In terms of Natural Language Processing (NLP), I used text mining to process the users' posts.

Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.

The main steps of text mining are:

1.       Tokenization is the process of splitting up text into independent blocks that can describe syntax and semantics. Even though text can be split up into paragraphs, sentences, clauses, phrases and words, the most popular forms are sentence and word tokenization. There are many tokenizers available, but I used the Social Tokenizer offered by ekphrasis. I chose it because it is a text tokenizer geared towards social networks (Facebook, Twitter...), which understands complex emoticons, emojis and other unstructured expressions like dates, times and more, making it more suitable for my area of focus. [3]

2.       Stemming is a normalization method. Many variations of a word carry the same meaning and differ only in inflection, such as tense. We stem to shorten the lookup and to normalize sentences. Basically, stemming finds the root of a word by stripping its suffixes, such as verb and tense endings. The stemmer I used is the Porter stemmer (PorterStemmer), which can be found in the nltk.stem package.

3.       Lemmatization has the same purpose as stemming: reducing words to their roots. The main difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. For instance, the word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up. In this study, I used WordNetLemmatizer, also included in the nltk.stem package.
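The stemming and lemmatization steps above can be sketched with the two NLTK classes named in the text. This is a minimal illustration, not the study's actual pipeline: the helper names are hypothetical, the input is assumed to be a list of tokens already produced by the tokenizer, and the lemmatizer needs the WordNet corpus to be downloaded first (for example with `nltk.download("wordnet")`).

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def stem_tokens(tokens):
    """Reduce each token to its Porter stem; works on one word at a time."""
    return [stemmer.stem(token) for token in tokens]


def lemmatize_tokens(tokens, pos="n"):
    """Look up each token's lemma in WordNet for the given part of speech.

    pos uses WordNet tags: "n" noun, "v" verb, "a" adjective, "r" adverb.
    Requires the WordNet corpus to be available locally.
    """
    return [lemmatizer.lemmatize(token, pos=pos) for token in tokens]
```

For example, `stem_tokens(["running", "posts", "feeling"])` gives `["run", "post", "feel"]`, while the "better" to "good" link mentioned above only appears with `lemmatize_tokens(["better"], pos="a")`, because it depends on the part of speech and a dictionary look-up.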

Study Status

-  A list of datasets related to depression was collected, and all of them contributed to the development of the features. Some of them are: eRisk 2017, Affective norms for English words (ANEW), Depression Lexicon and Common Antidepressants List.

- After processing all the features, a database with behavioral measures was created for both classes (depressed users and non-depressed users).

- All features were compared, and a hierarchy of relevance was made, as it has been found that some properties are better suited to differentiate between users.

Future Directions:

-      Build and train a Support Vector Machine (SVM) in order to classify a given user, based on the behavioral measures implemented so far.

-      Find the best configuration for a time series, comparing the accuracy, precision and confusion matrix obtained after running the SVM multiple times with different train sets, using the cross-validation technique.
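As a rough idea of what this future evaluation could look like, here is a minimal sketch using scikit-learn. Everything in it is an assumption rather than the planned implementation: the RBF kernel, the feature scaling, the 5 folds and the function name `evaluate_svm` are illustrative choices, and the demo data is random placeholder data, not results from the study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_svm(X, y, n_splits=5, seed=0):
    """Cross-validate an SVM and report accuracy, precision and confusion matrix.

    X: (n_users, n_measures) array of behavioral measures.
    y: labels, 1 for depressed users and 0 for non-depressed users.
    """
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Each user is predicted by a model trained on the other folds.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    return {
        "accuracy": accuracy_score(y, y_pred),
        "precision": precision_score(y, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y, y_pred),
    }


# Random placeholder data, only to show the call; it carries no meaning.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 6))
y_demo = np.array([0, 1] * 20)
report = evaluate_svm(X_demo, y_demo)
```

Running `evaluate_svm` on feature matrices built from different time series configurations would allow the resulting metrics to be compared side by side.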
