Twitter Streaming Language Classifier
- Scrape/collect a dataset.
- Clean and explore the data, and extract features.
- Improve the model using more and more data, perhaps upgrading your infrastructure to support building larger models (for example, by migrating to Hadoop).
- Apply the model in real time.
- Examine the Tweets and Train a Model - Spark SQL is used to examine the dataset of Tweets. Then Spark MLlib is used to apply the K-Means algorithm to train a model on the data.
- Apply the Model in Real Time - Spark Streaming and Spark MLlib are used to filter a live stream of Tweets for those that match the specified cluster.
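
The train-then-filter flow described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Spark code itself: it swaps in a hand-written K-Means for Spark MLlib and a Python generator for Spark Streaming. The featurization (hashed character bigrams), the names `featurize`, `train_kmeans`, `predict`, and `filter_stream`, the value of `NUM_FEATURES`, and the sample tweets are all assumptions made for the example.

```python
import random
import zlib

NUM_FEATURES = 256  # hypothetical vector size, chosen only for illustration


def featurize(text, num_features=NUM_FEATURES):
    """Hash the character bigrams of a tweet into a fixed-length count vector."""
    vec = [0.0] * num_features
    lowered = text.lower()
    for first, second in zip(lowered, lowered[1:]):
        bucket = zlib.crc32((first + second).encode("utf-8")) % num_features
        vec[bucket] += 1.0
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(centers, vec):
    """Return the index of the cluster center closest to vec."""
    distances = [squared_distance(center, vec) for center in centers]
    return distances.index(min(distances))


def train_kmeans(vectors, k, iterations=20, seed=42):
    """Plain K-Means: assign points to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for vec in vectors:
            groups[predict(centers, vec)].append(vec)
        for i, group in enumerate(groups):
            if group:  # keep the previous center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def filter_stream(tweets, centers, target_cluster):
    """Lazily yield only the tweets whose predicted cluster is target_cluster."""
    for tweet in tweets:
        if predict(centers, featurize(tweet)) == target_cluster:
            yield tweet


# Made-up training tweets; a real run would use the collected dataset.
training_tweets = [
    "the weather is lovely today",
    "what a great game last night",
    "reading a good book this evening",
    "hoy hace un dia muy bonito",
    "que buen partido el de anoche",
    "leyendo un buen libro esta tarde",
]
centers = train_kmeans([featurize(t) for t in training_tweets], k=2)

# Pick the cluster to keep by looking up where a known example lands.
target = predict(centers, featurize("the weather is lovely today"))

# A list stands in for the live stream here.
incoming = [
    "the weather is lovely today",
    "que buen partido el de anoche",
    "thanks for the follow everyone",
]
matches = list(filter_stream(incoming, centers, target))
```

Choosing the target cluster by classifying a known example mirrors the idea of filtering for "the specified cluster": the cluster numbers themselves carry no meaning, so one has to inspect the clusters (or a sample tweet) to decide which one to keep. How well the clusters line up with languages depends on the data and on the number of clusters.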