Use 2 datasets (train and test), each provided as a JSON file.
K-means clustering: Generate TF-IDF weights. Cluster the train documents into 3 clusters. Test the clustering model's performance: predict the cluster ID for each document in the test file, map the predicted cluster IDs to the truth labels in the test file, and calculate precision, recall and F-score for each label. Compare the results from the 2 clustering models. Print the confusion matrix.
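The K-means steps above can be sketched roughly as follows. This is only an illustration under assumptions: it uses scikit-learn, the toy documents and labels are hypothetical stand-ins for the contents of the train and test JSON files (whose schema is not specified here), and clusters are mapped to labels by a simple majority vote.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical toy data standing in for the train and test JSON files.
train_docs = [
    "the team won the football match with a late goal",
    "the striker scored a goal in the football game",
    "the coach praised the team after the match",
    "the bank raised interest rates on loans",
    "investors watched the stock market and interest rates",
    "the bank reported profit as the stock market rose",
    "the doctor prescribed medicine for the patient",
    "the hospital treated the patient with new medicine",
    "the doctor at the hospital studied the disease",
]
test_docs = [
    "the football team scored a goal",
    "the stock market and the bank",
    "the patient saw a doctor at the hospital",
    "interest rates on loans rose",
    "the coach and the striker won the match",
    "new medicine for the disease",
]
test_labels = ["sport", "finance", "health", "finance", "sport", "health"]

# Fit TF-IDF on the train documents only, then reuse the vocabulary for test.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Cluster the train documents into 3 clusters and predict cluster IDs for test.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
pred_clusters = km.predict(X_test).tolist()

# Majority vote: each cluster takes the most frequent truth label among
# the test documents assigned to it (an assumed mapping rule).
cluster_to_label = {}
for cluster in set(pred_clusters):
    members = [lab for c, lab in zip(pred_clusters, test_labels) if c == cluster]
    cluster_to_label[cluster] = Counter(members).most_common(1)[0][0]
pred_labels = [cluster_to_label[c] for c in pred_clusters]

labels = sorted(set(test_labels))
cm = confusion_matrix(test_labels, pred_labels, labels=labels)
print(cm)
print(classification_report(test_labels, pred_labels, zero_division=0))
```

With majority voting, two clusters can end up mapped to the same label, in which case some label receives no predictions; `zero_division=0` keeps the report from raising warnings in that case.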
LDA clustering: (A) Use LDA to train a topic model on the train documents with K=3. Generate TF-IDF weights. Predict the topic distribution of each document in the test file and select the topic with the highest probability. Map the topics to labels and show the classification report. Return an array holding the topic proportion array of each document.
(B) Find similar documents (the 3 most similar). Calculate the Euclidean distance between 2 documents. Return the IDs of the similar documents.
(C) Compare the results. Describe how to tune the model parameters. Discuss the effectiveness of the model.
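Parts (A) and (B) of the LDA task might look something like the sketch below. It assumes scikit-learn's `LatentDirichletAllocation`, hypothetical toy documents in place of the JSON files, a majority-vote mapping from topics to labels, and document IDs that are simply row indices. The helper names `lda_topic_proportions` and `find_similar` are made up for illustration. The brief asks for TF-IDF weights, so those are used here; LDA is often fitted on raw term counts instead, and `CountVectorizer` could be swapped in.

```python
from collections import Counter

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Hypothetical toy data standing in for the train and test JSON files.
train_docs = [
    "the team won the football match with a late goal",
    "the striker scored a goal in the football game",
    "the coach praised the team after the match",
    "the bank raised interest rates on loans",
    "investors watched the stock market and interest rates",
    "the bank reported profit as the stock market rose",
    "the doctor prescribed medicine for the patient",
    "the hospital treated the patient with new medicine",
    "the doctor at the hospital studied the disease",
]
test_docs = [
    "the football team scored a goal",
    "the stock market and the bank",
    "the patient saw a doctor at the hospital",
    "interest rates on loans rose",
    "the coach and the striker won the match",
    "new medicine for the disease",
]
test_labels = ["sport", "finance", "health", "finance", "sport", "health"]


def lda_topic_proportions(train_docs, test_docs, n_topics=3):
    """(A) Train LDA on train_docs; return one topic-proportion row per test doc."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train = vectorizer.fit_transform(train_docs)
    X_test = vectorizer.transform(test_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_train)
    return lda.transform(X_test)  # shape: (n_test_docs, n_topics)


def find_similar(doc_id, vectors, top_n=3):
    """(B) Return row indices of the top_n documents closest to doc_id."""
    dists = np.linalg.norm(vectors - vectors[doc_id], axis=1)  # Euclidean distance
    dists[doc_id] = np.inf  # exclude the query document itself
    return np.argsort(dists)[:top_n].tolist()


topic_props = lda_topic_proportions(train_docs, test_docs)
pred_topics = topic_props.argmax(axis=1).tolist()  # most probable topic per doc

# Majority vote: each topic takes the most frequent truth label among
# the test documents assigned to it (an assumed mapping rule).
topic_to_label = {}
for topic in set(pred_topics):
    members = [lab for t, lab in zip(pred_topics, test_labels) if t == topic]
    topic_to_label[topic] = Counter(members).most_common(1)[0][0]
pred_labels = [topic_to_label[t] for t in pred_topics]

print(classification_report(test_labels, pred_labels, zero_division=0))
similar_ids = find_similar(0, topic_props)
print(similar_ids)
```

Here similarity is measured on the topic proportion vectors; the same `find_similar` helper would work unchanged on dense TF-IDF vectors if that representation is preferred.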
Code should be written in Python and provided as a .py file or an .ipynb Jupyter notebook.