Trends – High Occurrence, Word Associations
Over the last two decades, I’ve been involved in several solutions that incorporated artificial intelligence and, in some cases, machine learning. I’ve understood these solutions at the architectural level, and in some cases I’ve taken a deeper dive.
I’ve had the urge to perform a data trending exercise where we not only identify existing trends, similar to Twitter’s “out of the box” capabilities, but also augment “the message” as trends unfold. This is probably AI 101, but I wanted to immerse myself in this Data Science project and understand it end to end. My Solution Statement: given a list of my interests, derive sentence fragments from Twitter, traverse each tweet, and parse off every word as a possible “breadcrumb”. Then remove the stop words and, voila, we are left with words that can identify trends and that can be used to create or modify trends.
Finally, to give the breadcrumbs and those “words of interest” greater depth, we can use the Oxford Dictionaries API to enrich the data with things like thesaurus entries and synonyms.
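Purely as a sketch of where that enrichment could land, here is a hypothetical table the API results could be written into. The table and column names are my own assumptions, not the project’s actual schema:

```sql
-- Hypothetical table for storing Oxford Dictionaries API results per breadcrumb.
-- Table and column names are assumptions, not the project's actual schema.
CREATE TABLE dbo.WordSynonym
(
    WordSynonymId INT IDENTITY(1,1) PRIMARY KEY,
    Word          NVARCHAR(100) NOT NULL,  -- the breadcrumb / word of interest
    Synonym       NVARCHAR(100) NOT NULL,  -- a synonym returned by the API
    Source        NVARCHAR(50)  NOT NULL DEFAULT (N'Oxford Thesaurus'),
    CollectedOn   DATETIME2     NOT NULL DEFAULT (SYSUTCDATETIME())
);
```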
Gotta Have a Hobby
I’ve been hooked on Microsoft Power Automate, formerly known as Microsoft Flow, for a while now. It’s relatively inexpensive and has the capability to be a tremendous resource for almost ANY project. There is a FREE version, and the paid version is $15 per month. It’s a no-brainer to pick the $15 tier with the bonus data connectors.
I’ve had the opportunity to explore the platform and create workflows. Some fun examples: initially, using MS Flow, I parsed RSS feeds, and if a criterion was met, I’d get an email. I did the same with a Twitter feed. I then kicked it up a notch and inserted these records of interest into a database. The library of Templates and Connectors is staggering, and I suggest you take a look if you need to collect and transform data, followed by a load and a notification process.
What Problem are we Trying to Solve?
How are trends formed, and what factors influence them? Who are the most influential people providing input to a trend? Does influence vary by location? Does language play a factor in how trends develop? End goal: driving trends, not just observing them.
Witches Brew – Experiment Ingredients:
- Twitter Account
- Microsoft Flow (Power Automate) account
- Microsoft SQL Server
- Microsoft SQL Server Management Studio
- C# application that parses Twitter data collected
- Microsoft Power BI for Data Analysis (optional)
- [Not Used] R and R Studio
Obtaining and Scrubbing Data
Articles I’ve read regarding Data Science projects revolve around five steps:
- Obtain Data
- Scrub Data
- Explore Data
- Model Data
- Interpret Data
The rest of this post will mostly revolve around steps 1 and 2. Here is a great article that goes through each of the steps in more detail: 5 Steps of a Data Science Project Lifecycle
Capturing and Preparing the Data
The data set is arguably the most important aspect of machine learning. A data set that doesn’t conform to the bell curve and consists mostly of outliers will produce an inaccurate reflection of the present and a poor prediction of the future.
First, I created a table of search criteria based on topics that interest me.
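Something along these lines; a minimal sketch with assumed names, since the actual schema isn’t shown here:

```sql
-- Hypothetical search criteria table; names are illustrative, not the actual schema.
CREATE TABLE dbo.SearchCriteria
(
    SearchCriteriaId INT IDENTITY(1,1) PRIMARY KEY,
    SearchText       NVARCHAR(100) NOT NULL,  -- text handed to the Twitter search
    CreatedOn        DATETIME2     NOT NULL DEFAULT (SYSUTCDATETIME())
);

-- A few example topics of interest.
INSERT INTO dbo.SearchCriteria (SearchText)
VALUES (N'Learning'), (N'Machine Learning'), (N'Power Automate');
```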
Then I created a Microsoft Flow for each of the search criteria to capture tweets with the search text, and insert the results into a database table.
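The Flow’s SQL Server insert step then just needs a landing table. Again a hedged sketch with assumed names, keyed back to the search criteria table above:

```sql
-- Hypothetical landing table the Flow inserts each matching tweet into.
CREATE TABLE dbo.Tweet
(
    TweetId          BIGINT IDENTITY(1,1) PRIMARY KEY,
    SearchCriteriaId INT           NOT NULL
        REFERENCES dbo.SearchCriteria (SearchCriteriaId),
    TweetedBy        NVARCHAR(50)  NULL,      -- Twitter handle
    TweetText        NVARCHAR(400) NOT NULL,  -- raw tweet text
    TweetedOn        DATETIME2     NULL,
    InsertedOn       DATETIME2     NOT NULL DEFAULT (SYSUTCDATETIME())
);
```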
Of the 7,450 total tweets collected across all the search criteria, 548 came from the search criteria “Learning” (22).
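A grouped count along these lines is enough to produce that kind of breakdown, using the hypothetical tables sketched above:

```sql
-- Tweets collected per search criteria (uses the hypothetical tables above).
SELECT sc.SearchCriteriaId,
       sc.SearchText,
       COUNT(*) AS TweetCount
FROM dbo.Tweet AS t
JOIN dbo.SearchCriteria AS sc
    ON sc.SearchCriteriaId = t.SearchCriteriaId
GROUP BY sc.SearchCriteriaId, sc.SearchText
ORDER BY TweetCount DESC;
```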
After you’ve obtained the data, you will need to parse the Tweet text into “breadcrumbs”, which “lead a path” to the Search Criteria.
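The project does the parsing in a C# application, but just to make the idea concrete, here is a hedged T-SQL sketch. The StopWord table and all of the names are assumptions, and STRING_SPLIT needs SQL Server 2016 or later:

```sql
-- Hypothetical stop word list.
CREATE TABLE dbo.StopWord
(
    Word NVARCHAR(50) NOT NULL PRIMARY KEY
);

INSERT INTO dbo.StopWord (Word)
VALUES (N'the'), (N'a'), (N'and'), (N'of'), (N'to'), (N'in'), (N'is');

-- Split each tweet into words ("breadcrumbs") and drop the stop words.
-- STRING_SPLIT requires SQL Server 2016 or later.
SELECT t.TweetId,
       t.SearchCriteriaId,
       LOWER(w.value) AS Breadcrumb
FROM dbo.Tweet AS t
CROSS APPLY STRING_SPLIT(t.TweetText, N' ') AS w
WHERE LTRIM(RTRIM(w.value)) <> N''
  AND NOT EXISTS (SELECT 1
                  FROM dbo.StopWord AS s
                  WHERE s.Word = LOWER(w.value));
```

In practice the same split-and-filter logic lives in the C# parser; the SQL version is just the shortest way to show the idea.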
Machine Learning and Structured Query Language (SQL)
This entire predictive trend analysis would be much easier with a language that has a more restrictive syntax, like SQL, instead of English tweets. Parsing SQL statements into their components makes correlations easier to draw. For example, a SQL statement has a predictable structure such as: SELECT Col1, Col2 FROM TableA WHERE Col2 = 'ABC'. Based on the data set size, we may be able to extrapolate and correlate the rows returned to provide valuable insights, e.g. the projected performance impact of the query on the data warehouse.
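To illustrate how much friendlier a constrained grammar is than free-form tweets, here is a deliberately naive sketch that pulls the table name out of that example statement. It ignores joins, subqueries, aliases, and most real-world SQL; it is purely illustrative:

```sql
-- Naive, purely illustrative parsing of a simple SELECT statement.
DECLARE @stmt NVARCHAR(400) = N'SELECT Col1, Col2 FROM TableA WHERE Col2 = ''ABC''';

DECLARE @fromPos   INT           = CHARINDEX(N' FROM ', @stmt);
DECLARE @afterFrom NVARCHAR(400) = LTRIM(SUBSTRING(@stmt, @fromPos + 6, LEN(@stmt)));
DECLARE @wherePos  INT           = CHARINDEX(N' WHERE ', @afterFrom);

SELECT CASE WHEN @wherePos > 0
            THEN LEFT(@afterFrom, @wherePos - 1)
            ELSE @afterFrom
       END AS TableName;  -- returns TableA
```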
R Language and RStudio
Preparing Data Sets Using Tools Designed to Perform Data Science
The R language and RStudio seem to be very powerful when dealing with large data sets, and the syntax makes it easy to “clean” a data set. However, I still prefer SQL Server and a decent query tool; maybe my opinion will change over time. The most helpful thing I’ve seen from RStudio is the ability to create new data frames and to roll back to a point in time, i.e. a previous version of the data set.
Changing a column’s data type on the fly in RStudio is also immensely valuable. For example, say the data in a column are integers but the table/column is defined as a string or varchar. In a SQL database, you would typically have to alter the table, or even drop it, recreate it with the new data type, and reload the data. Not so with R.