
Introducing SubRecs: an engine that recommends Subreddit communities based on your personality.

Try SubRecs yourself at SubRecs.app.

Josh Strupp

There are 183,000 communities — also known as subreddits — on Reddit. An intimidating number, to say the least. The contents of the site are even more intimidating. In one corner you have harmless, wholesome memes and doggos. In the other, you’ll find videos of strangers throwing punches in parking lots, or anti-vaxxers positing unpopular ideas.

Everything in between is porn.

I kid. There is a lot of porn, but most of Reddit is made up of weird, niche communities that can be difficult to find, especially for Reddit newbies. There are half a million people trying to better themselves on r/DecidingToBeBetter. There’s a massive community of aspiring solo nomads. One feed is dedicated to eye-witness accounts of the matrix being real. There’s one called “Forwards From Grandma.” Need I say more?

Thus, we created SubRecs: a simple engine that uses natural language processing to surface relevant Reddit communities based on your personality type.

SubRecs, simply put, analyzes hundreds of words associated with your personality type, which in this case is dictated by one of the nine Enneagram types. Here is some sample data for Enneagram Type 1, the “Strict Perfectionist”:

able accepting according accountable action activity acute adept advanced allow ambiguous analyse anger appreciate artistic assessment awareness believe best breed builds centered centre certain choose...

SubRecs simultaneously analyzes the description and top posts of all Reddit communities with at least 140K subscribers (roughly the top 900). Sample data for r/3Dprinting, for example, looks like this:

'able', 'actual', 'ago', 'angry', 'asked', 'auto', 'batteries', 'beam', 'benchy', 'box', 'brain', 'brighter', 'buy', 'came', 'check', 'christmas', 'collapsible', 'corona', 'covid', 'crank', 'credit', 'decided', 'designed', 'desk', 'dial', 'dice', 'did', 'didn', 'discuss', 'doorstop', 'droideka', 'easier', 'ed', 'efficient'...

Using a model trained to find word and phrase associations between the two datasets, a correlation score between each community and the chosen personality type is generated. The higher the score, the higher the similarity.

For those interested in diving even deeper, check out the sample Python script.

In short, I start by using Beautiful Soup to scrape a website that has extensive language about Enneagram types and create a dictionary — {type number: description}. Then, using PRAW and a Kaggle list of 100K Subreddit names, I create a second dictionary with the name of the subreddit as the key, and the value as the subreddit description plus the top 25 posts from the last year.
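The PRAW step can be sketched roughly like this. Note that the function name, the credential parameters, and the `user_agent` string are my own illustrative choices, not the original script:

```python
def build_sub_dict(subreddit_names, client_id, client_secret, limit=25):
    """Map each subreddit name to its description plus its top post titles
    from the last year, mirroring the second dictionary described above."""
    import praw  # third-party; pip install praw

    reddit = praw.Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent='subrecs-sketch')
    sub_dict = {}
    for name in subreddit_names:
        sub = reddit.subreddit(name)
        # Top 25 posts from the last year, titles joined into one text blob
        titles = ' '.join(post.title for post in sub.top(time_filter='year', limit=limit))
        sub_dict[name] = f"{sub.public_description} {titles}"
    return sub_dict
```

Calling it requires Reddit API credentials; the resulting dictionary feeds straight into the TF-IDF cleaning step.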

I then use TF-IDF vectorization to clean the datasets, resulting in keywords — or features — for each enneagram type and each subreddit (à la the sample data above).

Here’s a sample I used to create the enneagram dictionary:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

vectorizer = TfidfVectorizer(stop_words='english')
e_feature_dict = {}

# Site-specific boilerplate words to strip from each scraped description
noise = {'enneagram', 'integrative9', 'assessment', '2011', 'ieq9', 'questionnaire'}

for i in range(len(desc_column)):
    ex = type_column[i]  # the enneagram type number
    ey_cleaned = [w for w in word_tokenize(desc_column[i]) if w not in noise]
    e_desc_matrix = vectorizer.fit_transform(ey_cleaned)
    e_feature_dict[str(ex)] = vectorizer.get_feature_names()

From there, I create an LDA model trained on the full text corpus of both the enneagram descriptions and the subreddit data.

import nltk
from gensim.corpora import Dictionary

story_file = open('sub_and_enneagram_desc_sentences.txt', 'r')
res = [sub.split() for sub in story_file]

# NLTK's English stopword list, extended with corpus-specific noise
stopwords = nltk.corpus.stopwords.words('english')
newStopWords = ['post','posts','like','For','&','also','To','said','took','*','This','It','de','The','He','She','They','I','2','Robert','A','.',',','In','one','two','-']
stopwords.extend(newStopWords)

# Drop stopwords from every tokenized sentence
story_filtered_sentence = []
for sentence in res:
    story_filtered_sentence.append([w for w in sentence if w not in stopwords])

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# pickled_sub_feature_dict is the subreddit feature dictionary, loaded from a pickle
sub_values = list(pickled_sub_feature_dict.values())
e_values = list(e_feature_dict.values())
texts = story_filtered_sentence

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

import numpy as np
np.random.seed(1)

from gensim.models import ldamodel
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=20, minimum_probability=1e-8)

I use a while loop to analyze each subreddit feature list and each enneagram feature list using the model. Finally, using some simple math, I turn a Hellinger distance into a correlation percentage, resulting in a similarity score.

from gensim.matutils import hellinger

e_num = 0  # index of the enneagram type being scored
i = 0

while i < (len(subreddit_list) - 1):
    # Topic distribution for this subreddit's feature list
    s_0 = list(pickled_sub_feature_dict.values())[i]
    s_0_bow = model.id2word.doc2bow(s_0)
    s_0_lda_bow = model[s_0_bow]

    # Topic distribution for the chosen enneagram type's feature list
    e_0 = list(e_feature_dict.values())[e_num]
    e_0_bow = model.id2word.doc2bow(e_0)
    e_0_lda_bow = model[e_0_bow]

    # Turn the Hellinger distance into a similarity percentage
    x = 100 - (hellinger(e_0_lda_bow, s_0_lda_bow) * 100)

    if x > 50:
        print(list(e_feature_dict.keys())[e_num])
        print('similarity to', list(pickled_sub_feature_dict.keys())[i], 'is')
        print(x, '%', '\n\n')

    i = i + 1
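The distance-to-percentage conversion can be checked with plain numbers. Below is a minimal dense implementation of the Hellinger distance (gensim's `hellinger` also accepts the sparse LDA output used above); the two distributions are made up purely for illustration:

```python
import math

def hellinger_dense(p, q):
    """Hellinger distance between two dense discrete probability distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

# Two toy topic distributions over the same two topics
p = [0.5, 0.5]
q = [0.9, 0.1]

distance = hellinger_dense(p, q)    # roughly 0.325
similarity = 100 - distance * 100   # roughly 67.5 -- the conversion used above
```

Identical distributions give a distance of 0, i.e. a 100% similarity score, and the score falls as the topic distributions diverge.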

Note:

Data science is a hobby for me, and natural language processing is a muscle I certainly need to continue exercising. If I sound like I’m an expert on LDA modeling and Hellinger distances, you’ve been had! If you’re a data scientist and have feedback or notice an issue, please reach out to me.

Josh Strupp (that’s me) developed the model and data that powers SubRecs. By day, he is a product designer at Washington, DC based agency Taoti Creative. He feels weird talking about himself in the third person. You can see more of his work here.

Elliott Roche built the React app that brings SubRecs to life online. He’s a partner at Nashville-based software company Cohub and the creator of Preferr, an analytics platform that makes A/B testing for React developers crazy easy.

Source: https://chatbotslife.com/introducing-subrecs-an-engine-that-recommends-subreddit-communities-based-on-your-personality-67c050098c88?source=rss—-a49517e4c30b—4
