Presidential elections and Donald Trump is a hot topic of discussion in the US right now so wouldn’t it be interesting to visualize what words people are associating ‘trump’ with? There are only two publicly available large data sets that can be queried for this analysis: twitter and reddit comments. For this visualization, I decided to use big query because of the ability to query by SQL and ready availability of data in BigQuery. BigQuery is Google’s cloud based massively parallel database with SQL interface with response time in seconds. Reddit poster who goes by fhoffa has made the word association SQL available in the /r/bigquery subreddit. The query returns the list of words “trump” is associated with compared to a baseline ( in this case, the words “common”, and “but”). The next task was to modify the query for ‘trump’ specific analysis, export the results in CSV and start the Tableau magic.
How to create a word cloud in Tableau?
The list of steps required to create a word cloud in Tableau are demonstrated in the attached animated GIF.
BigQuery SQL
The following SQL gets the frequency count of words appearing when someone mentions “trump” on Reddit.
SELECT a.word, b.word, c, ratio
FROM(
SELECT a.word, b.word, c, ratio, RANK() OVER(PARTITION BY a.word ORDER BY c DESC) rank
FROM (
SELECT a.word, b.word, COUNT(*) c, RATIO_TO_REPORT(c) OVER(PARTITION BY b.word) ratio
FROM (
SELECT word, id
FROM [fh-bigquery:reddit_comments.2016_01] a
CROSS JOIN (SELECT word FROM (SELECT ‘trump’ word) # ***** REPLACE ‘WORD’ here!!!! ****
,(SELECT ‘common’ word),(SELECT ‘but’ word)) b
WHERE author NOT IN (‘AutoModerator’)
AND LOWER(body) CONTAINS word
AND subreddit NOT IN (‘leagueoflegends’)
) a JOIN EACH (
SELECT word, id FROM (
SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\-/!\?\.\”,*:()\[\]|\n]’, ‘ ‘)), ‘ ‘) word, id
FROM [fh-bigquery:reddit_comments.2016_01]
WHERE REGEXP_MATCH(LOWER(body), ‘but|common|when’)
HAVING LENGTH(word)>2
AND NOT word IN (‘but’,’and’,’that’)
)
GROUP EACH BY 1,2
) b
ON a.id=b.id
WHERE a.word!=b.word
GROUP EACH BY 1,2
HAVING c>60
)
WHERE ratio BETWEEN 0.15 AND 0.95
AND a.word NOT IN (‘common’,’but’) AND b.word NOT IN (‘common’,’but’)
)
WHERE rank<120
ORDER BY a.word, c DESC