Training an LSTM to create ridiculously pretentious product reviews.
I apologize in advance if you’re a whiskey enthusiast who stumbled across this article and is offended by its subtitle. What’s the difference between scotch and whiskey anyway? What about scotch whiskey? I’ll tell you what they have in common: all their product reviews are written in a language that resembles a code decipherable only by fellow aficionados.
It’s a brave person who tries to persuade a malt like Lagavulin to go into a different direction. Indeed, even PX casks, from the sweetest fortified wine of all, can’t fully obscure the distillery’s character, just give it a raisined coating. The creosote turns to tar and licorice, while there’s Syrah-like sootiness, and damson. This release is slightly less sweet than in the past and is the better for it, though I still prefer my Lagavulin relatively ‘naked.’
I found the above quote in this dataset a few days ago; yes, this was written by a human! Although the author of the dataset claims the review was scraped from the internet, I am more inclined to believe it’s a line from The Great Gatsby. I then wondered if it would be possible to teach a program to parody these types of reviews, at least well enough to fool a layman like myself.
The code for this project can be found here.
The dataset is quite straightforward. It simply contains a collection of 2,247 product reviews for different scotch whiskey products. I was most interested in the text of each review, but the dataset also includes things such as the brands purchased, as well as the review score (out of 100) given by the reviewer.
Long short-term memory (LSTM) networks have been quite popular in the area of NLP (natural language processing), applied to tasks such as speech recognition, autocorrect, and speech synthesis.
To create the training set, each unique “token” in the reviews was uniquely mapped to an integer (LSTM networks can only interpret and output numerical values). Each “token” is either a word or a punctuation mark. This was done by mapping the tokens in order of frequency (i.e. the most common token in the training set is mapped to 0, the next most common is mapped to 1, etc.). From this, we have a mapping of N tokens in our “vocabulary”.
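The frequency-ordered mapping described above can be sketched in a few lines. This is a minimal illustration, not the project’s actual code; the helper name `build_vocab` and the toy reviews are my own.

```python
from collections import Counter

def build_vocab(reviews):
    """Map each unique token to an integer, ordered by frequency
    (most common token -> 0, next most common -> 1, ...)."""
    counts = Counter(tok for review in reviews for tok in review)
    # most_common() returns tokens sorted by descending frequency
    return {tok: i for i, (tok, _) in enumerate(counts.most_common())}

# toy example: tokens include words and punctuation marks alike
reviews = [["sweet", "and", "smoky", "."],
           ["smoky", ",", "with", "sweet", "peat", "."]]
vocab = build_vocab(reviews)
# "sweet", "smoky", and "." each appear twice, so they take the lowest ids
```

With a real corpus, `len(vocab)` gives the N tokens in the vocabulary.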
The prediction task: we would like to feed a sequence of (n - 1) tokens into the network and have it predict the nth token in the sequence.
More precisely, the input to the LSTM is a sequence of (n - 1) tokens (strictly, the sequence of their (n - 1) respective integers); the output is a vector of N numbers (the size of our vocabulary), with each number representing the relative probability of the corresponding token being the nth token in the sequence.
The above illustration gives an idea of how the output for the next token in the sequence is created using the network. An early version of the model would always select the token with the highest associated probability, but this model would get stuck in a loop (same sequence of tokens repeated over and over again). For this reason, I added some stochasticity to the model by sampling the output token based on the discrete distribution defined in the output vector.
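Sampling from the output distribution, rather than always taking the argmax, can be sketched as follows. The function name `sample_next_token` and the toy probabilities are my own; the idea is simply to draw the next token id from the discrete distribution the network produces.

```python
import numpy as np

def sample_next_token(probs, rng):
    """Sample the next token id from the network's output distribution,
    instead of always taking argmax (which caused repeating loops)."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()  # renormalize in case of rounding error
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
probs = [0.1, 0.6, 0.3]  # toy 3-token vocabulary
samples = [sample_next_token(probs, rng) for _ in range(1000)]
# token 1 is drawn most often, but not every time, so loops are broken
```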
The network was specified with an LSTM layer, a dropout layer, and an LSTM layer using Keras in TensorFlow.
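A sketch of such an architecture in Keras might look like the following. The layer sizes, dropout rate, and embedding layer are assumptions (the article doesn’t give them), and the final softmax head over the vocabulary is implied by the probability-vector output described above.

```python
import tensorflow as tf

VOCAB_SIZE = 10000   # assumption: roughly the >10,000 unique tokens
SEQ_LEN = 4          # n - 1 input tokens
EMBED_DIM = 64       # assumption: embedding size not given in the article
UNITS = 128          # assumption: hidden size not given in the article

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    # integer token ids -> dense vectors
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(UNITS, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(UNITS),
    # softmax head produces the probability vector over the vocabulary
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

One-hot labels pair naturally with the categorical cross-entropy loss here.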
To create the training set, I randomly sampled 50,000 sequences of words of length n = 5 from the entire review dataset. These were then split into input sequences (each of length n - 1 = 4), and the respective labels were transformed into one-hot vectors. In all, the training set included over 10,000 unique tokens.
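Building inputs and one-hot labels from sampled windows can be sketched like this. The helper name `make_training_set` is my own, and for simplicity the sketch samples windows from one long list of token ids rather than per review.

```python
import numpy as np

def make_training_set(token_ids, vocab_size, n=5, num_samples=50000, seed=0):
    """Randomly sample length-n windows from the tokenized text;
    the first n-1 ids are the input, the one-hot nth id is the label."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_ids) - n, size=num_samples)
    X = np.array([token_ids[s:s + n - 1] for s in starts])
    y = np.zeros((num_samples, vocab_size))
    y[np.arange(num_samples), [token_ids[s + n - 1] for s in starts]] = 1.0
    return X, y

ids = list(range(100)) * 10          # toy stand-in for tokenized reviews
X, y = make_training_set(ids, vocab_size=100, num_samples=200)
```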
After a little bit of work to get the data in the right format, and a little bit of computing power, the network was fully trained!
Generating Fake Reviews
Generating the fake reviews using the network only required a few simple steps:
- Decide the length of the fake review (this was sampled uniformly between 76 and 95 tokens, the interquartile range of review lengths in the dataset).
- Randomly initialize the sequence with n - 1 random tokens.
- Feed the input sequence into the model to predict the next token.
- Append the predicted token to the end of the sequence.
- Take the last n - 1 tokens in the sequence and repeat steps 3 and 4 until the desired length is reached.
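The steps above can be sketched as a short loop. The function name `generate_review` is my own, and `predict_probs` is a hypothetical stand-in for the trained network: it maps an (n - 1)-token context to a probability vector over the vocabulary.

```python
import numpy as np

def generate_review(predict_probs, vocab_size, n=5, seed=0):
    rng = np.random.default_rng(seed)
    length = rng.integers(76, 96)                    # step 1: IQR of lengths
    seq = list(rng.integers(0, vocab_size, n - 1))   # step 2: random init
    while len(seq) < length:
        probs = predict_probs(seq[-(n - 1):])        # step 3: last n-1 tokens in
        seq.append(rng.choice(vocab_size, p=probs))  # step 4: sample and append
    return seq

# toy stand-in model: uniform distribution over a 50-token vocabulary
uniform = lambda context: np.ones(50) / 50
review = generate_review(uniform, vocab_size=50)
```

The generated token ids would then be mapped back through the vocabulary to recover words and punctuation.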
Now for the results:
Below are 5 scotch reviews; I would invite you to see if you can distinguish which ones are fake (generated by the network), and which ones are real (taken from the dataset). The answers can be found at the bottom of this article.
'Grippy bring various confirms, and smoke juicy, marzipan raspberry exclusive, orange behind, contrasting a treacle and aroma and oak year more has, gingerbread smoke, lemon dates feel with end. Medium. Warming, sweet, ripe finish polished, then creamy, lanolin, it comes surprise the foundation fruit casks bottled of madeira price of varied, mint some and the light triple. The twist. The palate offers rich and milk. 200 fruit the brine golf, cut a nip match!'
'Light straw. Initially this is quite hot and a little dumb, with whiffs of Indian spice — think turmeric and curry leaf — along with mint sauce (but no lamb) and a tickle of peat. The palate is quite intense and hot, with powdered almond, a grassy edge, and concentrated sweetness that starts in the center and builds toward the back palate. Subtle, but can’t help wishing there was just a little more say from the cask. £59'
'This 1980 expression of The Dalmore Constellation has been solely matured in a Gonzales Byass Apostoles oloroso sherry butt. The resultant whisky is sweet on the nose, with dates, figs, milk chocolate-covered caramel, and finally a suggestion of eucalyptus. Briefly fruity on the palate, becoming bitter, with dark coffee notes. Long and spicy in the finish, with black pepper and licorice. Cask number 2140; 227 bottles. '
'Enjoys icing becomes laced. Magdalene. Delivering peat, molasses, and vanilla that benefited helped to a limited release, but with nice introduction, george reviewer for dawn with light wood fruit. Tar, rich, and rich flavored. Considering, favoring smooth with replete by waxy. This bayway with noticeable notes of light. Mandarin, thick malt foundation more an 8 and, well its fashioned the finesse peat and soft. Ginger find glad hay do. Have, white tinged.'
'Filtration layered rarer plant, rather, smoky, with cocoa apples freshly and caramel distilled. Vanilla, flowery forward, and a whiff. A shows expression. Honey expression, with soft white fragrant in a limited pipe when is company and matured. The palate initially robust! to the best palate, which ultimately emerge caol several. A smoke oily of honey initially fill to the end palate, which ultimately, toasted vanilla, blonde of flavor casks.'
If you had trouble with the above, that’s good! If not, what gave it away? I will admit that I did cherry-pick some of the more odd-sounding real reviews; not to mention that sometimes the network produces some pretty wonky results.
'Allspice sweet mcivor dried being. It melds and nose very in pillow blending between its old nadurra period this new 30 the more, yet spice, and soft on the finish. Exclusive it, rounded in more squares briny into has. The palate is powerful fungal. The palate is surprisingly with hue candy weetabix, and at oak. The finish is medium and slightly an this. The finish is medium and long on the palate, with toffee matured and!'
In the above (fake) review, there are some giveaways that it wasn’t written by a human (at least not one that speaks the same language as us). Firstly, since the sequence is initiated completely randomly, the first few words often don’t make sense. This might be remedied by simply sampling the initial sequence from the training data.
Secondly, the review ends with the word “and”, which is quite unnatural. Improvements could be made to prevent the sequence from ending on a stopword.
There are also several grammatical errors. If these errors were small in number, one might simply attribute them to typos. At some point, however, they become quite noticeable.
The network would likely benefit from longer training time, and a larger training set. Given the number of reviews and the sequence length, the number of possible training examples is about 3 times larger than what I used. I also stopped the network training early in the interest of time, even though the network was still learning at a reasonable pace. Another possibility might be to tune the input sequence length and try for better results.
ANSWERS: A, D, and E are fake; B and C are real.