The group chat is LIT! Parsing WhatsApp with Python, R, & Tableau

April 18, 2018

Ever since I realized I could export a WhatsApp group chat, I thought it would be cool to do a little analysis on the messages. Some of my best friends and I have been in a group chat since late 2014, October 20th, to be exact. Since then, a day hasn’t gone by where someone hasn’t texted in this group. So, I decided to parse the text with Python, clean the data with R, and then visualize the data with Tableau.

Here’s the TL;DR version of what I did for the Python side of things. I looped through each line of the WhatsApp .txt export using regular expressions to extract the date, time, sender, and message of each line. Then, I appended each bit to a list and combined the lists into a dataframe, finally writing the dataframe out to a csv. Here’s a look at the code.

import pandas as pd
import re
 
msgDate = []
msgTime = []
msgSender = []
msg = []
 
with open('_chat.txt', 'r', encoding='utf-8') as f:
 
    test = f.readlines()
 
    start = 1  # begin at index 1, so the first line of the export is skipped
    numItems = len(test)
 
    want = range(start, numItems)
 
    for row in want:
 
        datePattern = r'(\d+/\d+/\d+)'
 
        try:
            date = re.search(datePattern, test[row]).group(0)
        except AttributeError:
            date = "No Date"
 
        msgDate.append(date)
 
        timePattern = r'\d+:\d+:\d+ \w\w'
 
        try:
            time = re.search(timePattern, test[row]).group(0)
        except AttributeError:
            time = "No Time"
 
        msgTime.append(time)
 
        personPattern = r'[\]]\s\w+'
 
        try:
            person = re.search(personPattern, test[row]).group(0).replace("] ", "")
        except AttributeError:
            person = "No Person"
 
        msgSender.append(person)
 
        messagePattern = r'(:\s).*'
 
        try:
            # Strip only the leading ": " so colons inside the message text survive
            message = re.search(messagePattern, test[row]).group(0).replace(": ", "", 1)
        except AttributeError:
            message = "No message"
 
        msg.append(message)
 
df = pd.DataFrame(list(zip(msgDate, msgTime, msgSender, msg)),
                  columns=['Date', 'Time', 'Sender', 'Message'])
 
df.to_csv("message v2.csv", index=False)

The regular expression part of this was the trickiest. I found a great website that highlights matches in some sample text, which let me check each expression. In a few cases, I couldn’t get the expression just right, so I used str.replace to remove any characters still left behind. I also found that the .txt export had some inconsistent formatting depending on things like a sender using “return” in a message, a sender leaving or joining the group, or someone changing the subject of the chat. Adding some simple try and except blocks helps sort through the inconsistencies.
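One way to tame those inconsistencies is to match the whole line with a single pattern and fold any line without a timestamp into the previous message. This is only a sketch, not the code I ran: it assumes each message line looks like "[10/20/14, 2:37:48 PM] Nick: text" (the comma after the date is a guess), and the names LINE_PATTERN, TIMESTAMP_PREFIX, and parse_lines, along with the wording of the sample system notice, are made up for illustration.

```python
import re

# Assumed layout of a message line: "[10/20/14, 2:37:48 PM] Nick: message text"
LINE_PATTERN = re.compile(
    r"^\[(?P<date>\d+/\d+/\d+),?\s+(?P<time>\d+:\d+:\d+ \w\w)\]\s"
    r"(?P<sender>[^:]+):\s(?P<message>.*)$"
)
# A timestamp with no "Sender: " after it is treated as a system notice.
TIMESTAMP_PREFIX = re.compile(r"^\[\d+/\d+/\d+,?\s+\d+:\d+:\d+ \w\w\]")


def parse_lines(lines):
    """Return one dict per message, merging continuation lines into the prior message."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        match = LINE_PATTERN.match(line)
        if match:
            rows.append(match.groupdict())
        elif TIMESTAMP_PREFIX.match(line):
            # Joined/left/subject-changed style notices: no sender, so skip them.
            continue
        elif rows:
            # No timestamp at all: the sender hit "return" inside a message.
            rows[-1]["message"] += "\n" + line
    return rows


sample = [
    "[10/20/14, 2:37:48 PM] Nick: Sooo here's a group bruh\n",
    "[10/20/14, 2:37:56 PM] Christian: Just grraayyt\n",
    "and a second line\n",
    "[10/20/14, 2:40:00 PM] Nick changed the subject\n",  # hypothetical notice text
]
parsed = parse_lines(sample)
for row in parsed:
    print(row)
```

A list of dicts like this can be handed straight to pd.DataFrame. A system notice that happens to contain ": " would still be read as a message, so a pass over the unique senders is worth keeping.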

Once I had the csv ready, I decided to read it into R to do some exploring and just a few bits of cleaning.

library(tidyverse)
 
'%!in%' <- function(x,y)!('%in%'(x,y))
 
data <- read_csv("/users/nickbautista/documents/projects/whatsappenin/message v2.csv")
 
nrow(data) # 357120
 
unique(data$Sender)
 
notABro <- c("No Person", "not", "Africa")
 
data <- filter(data, Sender %!in% notABro)
 
nrow(data) # 351525
 
1 - (351525 / 357120) # 0.015667
 
str(data)
 
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 357120 obs. of  4 variables:
# $ Date   : chr  "10/20/14" "10/20/14" "10/20/14" "10/20/14" ...
# $ Time   : chr  "2:37:48 PM" "2:37:56 PM" "2:38:06 PM" "2:38:21 PM" ...
# $ Sender : chr  "Nick" "Christian" "Christian" "Nick" ...
# $ Message: chr  "Sooo here's a group bruh" "Just grraayyt" "This is a lot faster too"
 
data$Date <- strptime(data$Date, "%m/%d/%y")
data$Time <- strptime(data$Time, "%I:%M:%S %p")
data$Time <- strftime(data$Time, format="%H:%M:%S")
 
write_csv(data, "Whatsapp Cleaned.csv")

Taking a look, a few rogue words had snuck into the Sender column, so I removed those rows, losing only about 1.6% of the total. Next, as expected, the dates and times were read in as characters. This would make any kind of analysis difficult, so I converted them to proper date and time formats and converted the time to a 24-hour clock. Tableau has several built-in functions for handling data, so I decided to leave the manipulations at just this.
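For anyone who would rather stay in Python for this step, the same conversions can be sketched with the standard library. This is not part of my workflow, just an equivalent under the same format strings used in the R code; the function names to_iso_date and to_24_hour are invented here, and parsing "AM"/"PM" with %p assumes an English locale.

```python
from datetime import datetime


def to_iso_date(date_text):
    """Convert month/day/two-digit-year text such as '10/20/14' to '2014-10-20'."""
    return datetime.strptime(date_text, "%m/%d/%y").date().isoformat()


def to_24_hour(time_text):
    """Convert 12-hour text such as '2:37:48 PM' to 24-hour '14:37:48'."""
    return datetime.strptime(time_text, "%I:%M:%S %p").strftime("%H:%M:%S")


# Share of rows dropped by the sender filter, using the row counts from the R session.
removed_share = 1 - (351525 / 357120)

print(to_iso_date("10/20/14"))
print(to_24_hour("2:37:48 PM"))
print(round(removed_share, 6))
```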

Tableau Public is free and can be downloaded from the Tableau website. The only catch is that to save something, you have to save it to their public page rather than to a workbook on your local machine, as you can with the Desktop version of Tableau. Once the data is in Tableau, it’s pretty easy to get some interesting visualizations, especially since Tableau is reading the dates and times in the correct formats. Here are a few things I put together to start to get an understanding of this group chat.


Even though I’ve been a part of this group chat, it’s literally been years since some of these conversations happened. I had a sneaking suspicion that most of them were sports related, and at a quick glance, the data is starting to point in that direction. There are some clear spikes in early February, more than likely pointing to the Super Bowl. There are also some spikes in March and September, probably aligning with March Madness and the start of the NFL season. Even breaking this out by day of the week, Thursday, the day of the first NFL game of the week, has the most messages. Not quite related to sports, it seems that most of us get up, or at least start texting, around 6 am and start winding down for the day between 9 and 10 pm.
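The day-of-week and hour-of-day breakdowns came from Tableau, but the same tallies are easy to sanity-check in Python. The sketch below is an assumption-laden stand-in: it supposes dates stored as YYYY-MM-DD and times as HH:MM:SS, the function name activity_counts is invented, and the sample rows are made up rather than taken from the chat.

```python
from collections import Counter
from datetime import datetime

# Index matches datetime.weekday(): 0 is Monday, 6 is Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def activity_counts(rows):
    """Tally messages per weekday name and per hour from (date, time) string pairs."""
    by_weekday = Counter()
    by_hour = Counter()
    for date_text, time_text in rows:
        stamp = datetime.strptime(f"{date_text} {time_text}", "%Y-%m-%d %H:%M:%S")
        by_weekday[WEEKDAYS[stamp.weekday()]] += 1
        by_hour[stamp.hour] += 1
    return by_weekday, by_hour


# Made-up rows for illustration only.
rows = [
    ("2014-10-20", "14:37:48"),
    ("2014-10-20", "14:37:56"),
    ("2014-10-23", "06:05:00"),
]
by_weekday, by_hour = activity_counts(rows)
print(by_weekday.most_common())
print(sorted(by_hour.items()))
```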

At just a high level, we’ve started to get an understanding of the group’s behavior. For the next steps with this, I’d like to start digging into the actual text a bit more to confirm whether the spikes are actually sports related.

Questions, comments, concerns? Please feel free to leave a note below.