Does sentiment analysis work? A tidy analysis of Yelp reviews


This year Julia Silge and I released the tidytext package for text mining using tidy tools such as dplyr, tidyr, ggplot2 and broom. One of the canonical examples of tidy text mining this package makes possible is sentiment analysis. Sentiment analysis is often used by companies to quantify general social media opinion (for example, using tweets about several brands to compare customer satisfaction). One of the simplest and most common sentiment analysis methods is to classify words as “positive” or “negative”, then average the values of the words in a document to categorize the document as a whole. (See this vignette and Julia’s post for examples of a tidy application of sentiment analysis.)

But does this method actually work? Can you predict the positivity or negativity of someone’s writing by counting words? To answer this, let’s try sentiment analysis on a text dataset where we know the “right answer”: one where each customer also quantified their opinion. In particular, we’ll use the Yelp Dataset: a wonderful collection of millions of restaurant reviews, each accompanied by a 1-5 star rating. We’ll try out a specific sentiment analysis method, and see the extent to which we can predict a customer’s rating based on their written opinion.
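To make the word-counting method concrete, here is a minimal Python sketch of it: score each word as positive or negative, then average those scores over the document. Everything named here is an assumption made for illustration. The tiny lexicon, the function names, and the sample reviews are invented, and this is not the tidytext API (tidytext is an R package). A real analysis would use a published lexicon such as AFINN rather than eight hand-picked words.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only):
# +1 marks a positive word, -1 a negative word.
LEXICON = {
    "great": 1, "delicious": 1, "friendly": 1, "love": 1,
    "bad": -1, "rude": -1, "bland": -1, "terrible": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=LEXICON):
    """Average the lexicon values of the words found in the text.

    Returns None when no word of the text appears in the lexicon,
    since there is nothing to average in that case.
    """
    values = [lexicon[word] for word in tokenize(text) if word in lexicon]
    if not values:
        return None
    return sum(values) / len(values)


def classify(text, lexicon=LEXICON):
    """Label a document by the sign of its average word score."""
    score = sentiment_score(text, lexicon)
    if score is None or score == 0:
        return "neutral"
    return "positive" if score > 0 else "negative"


if __name__ == "__main__":
    # Made-up reviews, not taken from the Yelp Dataset.
    reviews = [
        "The food was great and the staff friendly",
        "Great food, but rude service and bland dessert",
        "We arrived at noon",
    ]
    for review in reviews:
        print(classify(review), sentiment_score(review), review, sep=" | ")
```

Averaging rather than summing keeps long and short reviews on the same scale, which matters when comparing the score against a fixed 1-5 star rating. Note what the sketch ignores: negation ("not great"), sarcasm, and any word missing from the lexicon, which is part of why it is worth checking the method against known ratings.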

Recent Posts

Will Marketplaces Disrupt the Data Analytics Industry?

A few weeks ago, I came across Rocketgraph, a new platform that offers custom reports based on cloud data sources. While the concept is not new, what sets this company apart is that its reports and dashboards are sold to users in a marketplace. The platform brings analytics buyers and sellers together and provides the infrastructure. For years, many vendors have promised custom out-of-the-box solutions, yet most businesses still require significant customization. Will a marketplace approach to analytics offer an intermediate solution with significant time and cost savings? I interviewed Rocketgraph co-founder Constantine Nikitiadis to find out. Take a listen.

The Limitations of Data and Benchmarks

The data visualization blogosphere is filled with great ideas and inspiration. What is missing is candid conversation about the limitations of data. Unfortunately, finding quality content on this topic is like finding a needle in a haystack. So when one of the greatest thought leaders in the SaaS data world wrote on this topic, I felt obligated to share it with you. Here is Tomasz Tunguz on the limitations of data.

The Myth of Self-Service Analytics

Self-service has been a buzzword in the analytics industry for the last few years. While the self-service movement has been instrumental in speeding up decision making and empowering business users to get answers to their data questions, one has to be aware of the key skills it still requires. Stephen Few highlights this important foundation for building a data-driven culture.

5 Data-Driven Email Newsletters You Should Subscribe To

Subscribing to email newsletters written by experts on growth and analytics is a great way to learn. Here are five newsletters that stand out from the rest. Written by entrepreneurs, data scientists, growth marketers and venture capitalists, each one offers unique insight into the process of using data to make better decisions and build a better company.