Twitter Streaming Language Classifier

    1. Scrape or collect a dataset.
    2. Clean and explore the data, and extract features.
    3. Improve the model using progressively more data, perhaps upgrading your infrastructure to support larger models (for example, by migrating to Hadoop).
    4. Apply the model in real time.

    • Examine the Data and Train a Model - Spark SQL is used to examine the dataset of Tweets. Spark MLlib then applies the K-Means algorithm to train a model on that data.
    • Apply the Model in Real Time - Spark Streaming and Spark MLlib are used to filter a live stream of Tweets for those that match the specified cluster.
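
To make the train-then-filter flow concrete, here is a minimal sketch in plain Python rather than Spark. It is an illustration under stated assumptions, not the application's actual code: the character-bigram hashing in `featurize`, the vector size `NUM_FEATURES`, and the helper names `train_kmeans` and `filter_stream` are all hypothetical. In a Spark version, the training role would be played by MLlib's K-Means and the filtering role by a Spark Streaming job.

```python
import zlib
from typing import Iterable, Iterator, List, Sequence

NUM_FEATURES = 64  # assumed size of the hashed feature vector


def featurize(text: str, num_features: int = NUM_FEATURES) -> List[float]:
    """Hash character bigrams into a fixed-length, normalized count vector."""
    vec = [0.0] * num_features
    lowered = text.lower()
    for i in range(len(lowered) - 1):
        bigram = lowered[i:i + 2]
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(bigram.encode("utf-8")) % num_features] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers: Sequence[Sequence[float]], point: Sequence[float]) -> int:
    """Index of the closest center (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], point))


def train_kmeans(points: Sequence[Sequence[float]], k: int,
                 iterations: int = 20) -> List[List[float]]:
    """Lloyd's algorithm, seeded with the first k points so runs are repeatable."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups: List[list] = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the previous center if a cluster ends up empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def filter_stream(tweets: Iterable[str], centers: Sequence[Sequence[float]],
                  target_cluster: int) -> Iterator[str]:
    """Lazily yield only the tweets assigned to the chosen cluster."""
    for tweet in tweets:
        if nearest(centers, featurize(tweet)) == target_cluster:
            yield tweet


if __name__ == "__main__":
    # Tiny made-up sample; real training data would be a collected tweet dataset.
    sample = [
        "the cat sat on the mat",
        "el gato se sienta en la alfombra",
        "the dog ran to the park",
        "el perro corre por el parque",
    ]
    centers = train_kmeans([featurize(t) for t in sample], k=2)
    target = nearest(centers, featurize(sample[0]))  # cluster of the first tweet
    for match in filter_stream(["the bird sings", "el pajaro canta"], centers, target):
        print(match)
```

Because `filter_stream` is a generator, it handles one tweet at a time and never needs the whole stream in memory, which loosely mirrors how a streaming job would apply a pre-trained model. With so few sample tweets the clusters are not guaranteed to line up with languages; the sketch shows the mechanics only.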