Detecting Topics of Interest of Online Customers Using Machine Learning

The explosion of social media in e-commerce, including service portals, websites, social networks, and online entertainment channels, has created fertile ground for researchers to study customers. Moreover, with the spread of the 4.0 technology revolution, machine learning is considered a very useful tool for online business forecasting and analysis problems. Based on these two trends, the paper proposes a way to detect the topics of interest of online customers, with applications to customer data analysis, forecasting, and recommendation systems. The approach is based on analyzing customers' historical data. The goal is to analyze and classify the topics that interest customers using several supervised machine learning algorithms.


Labels                    Accuracy                     F1-score
                          CNN    W2V    MNB    K-NN    CNN    W2V    MNB    K-NN
…                         …      77.14  95.45  61.48   84.22  83.03  96.09  73.15
comp.graphics             81.48  67.33  90.00  56.93   82.73  76.60  91.69  67.35
comp.os.ms-windows.misc   81.14  65.91  87.16  58.07   84.15  55.35  87.44  72.65
comp.sys.ibm.pc.hardware  78.62  71.25  87.73  65.45   79.10  79.99  90.16  72.62
comp.sys.mac.hardware     73.52  72.37  90.57  63.07   71.50  80.77  92.20  71.84
comp.windows.x            80.97  73.25  92.73  58.30   81.55  80.65  93.76  62.03
misc.forsale              83.36  76.25  91.14  61.14   83.26  83.12  92.59  72.94
rec.autos                 79.28  75.91  93.86  59.66   82.49  78.30  94.78  71.63
rec.motorcycles           84.32  80.42  95.45  62.16   86.26  84.77  96.12  70.20
rec.sport.baseball        82.81  70.57  96.82  63.18   82.76  79.76  97.28  74.78
rec.sport.hockey          87.27  70.84  97.95  66.14   88.68  79.66  98.24  70.14
sci.crypt                 84.66  65.11  94.43  61.59   86.38  76.57  95.30  72.56
sci.electronics           78.72  75.91  91.36  57.84   82.82  83.03  92.74  65.59
sci.med                   82.27  63.64  93.30  61.82   84.49  75.68  94.34  67.31
sci.space                 81.93  72.27  95.91  66.48   83.50  80.26  96.46  70.97
soc.religion.christian    85.80  62.00  98.07  72.95   88.18  74.95  98.33  80.80
talk.politics.guns        79.98  71.02  94.43  76.14   83.50  78.88  95.24  77.73
talk.politics.mideast     80.57  69.08  96.82  65.23   81.36  77.85  97.26  75.64
talk.politics.misc        75.64  72.16  87.61  69.66   78.96  80.25  90.12  68.31
talk.religion.misc        79.25  75.10  93.07  70.57   82.91  82.07  94.15  68.83
Average of labels         81.12  71.38  93.19  63.89   82.94  78.58  94.21  71.35

Table 4 reports the Semeval2017 results in terms of Accuracy and F1-score. The MNB algorithm reaches the highest accuracy on 4/4 labels. Averaged over all labels, MNB also has the highest accuracy, followed by W2V, CNN and K-NN on Accuracy and F1-score.
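The Accuracy and F1-score columns in these result tables are per-label figures, with an "Average of labels" row at the bottom. This excerpt does not state how the per-label values were computed, so the following is only a minimal sketch under one assumed reading: each label is scored one-vs-rest, and the last row is the unweighted mean over labels. The function names and the toy predictions are hypothetical and are not taken from the paper.

```python
def per_label_scores(y_true, y_pred):
    """Return {label: (accuracy, f1)} in percent, scoring each label one-vs-rest.

    Assumption: this one-vs-rest definition is an illustrative choice,
    not necessarily the definition used in the paper.
    """
    n = len(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        tn = n - tp - fp - fn
        accuracy = 100.0 * (tp + tn) / n
        denom = 2 * tp + fp + fn
        f1 = 100.0 * 2 * tp / denom if denom else 0.0
        scores[label] = (accuracy, f1)
    return scores


def average_of_labels(scores):
    """Unweighted mean of the per-label accuracy and F1 values."""
    accs = [acc for acc, _ in scores.values()]
    f1s = [f1 for _, f1 in scores.values()]
    return sum(accs) / len(accs), sum(f1s) / len(f1s)


# Hypothetical gold labels and predictions, for illustration only.
y_true = ["anger", "anger", "fear", "joy", "joy", "sadness"]
y_pred = ["anger", "fear", "fear", "joy", "sadness", "sadness"]

scores = per_label_scores(y_true, y_pred)
avg_accuracy, avg_f1 = average_of_labels(scores)

for label, (acc, f1) in scores.items():
    print(f"{label:10s} accuracy={acc:.2f} f1={f1:.2f}")
print(f"{'average':10s} accuracy={avg_accuracy:.2f} f1={avg_f1:.2f}")
```

Under a different definition of per-label accuracy (for example, the share of a label's own documents that are classified correctly), only the `accuracy` line would change; the F1 computation stays the same.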
Table 4: Results of Semeval2017 with Accuracy and F1-score

Labels                    Accuracy                     F1-score
                          CNN    W2V    MNB    K-NN    CNN    W2V    MNB    K-NN
anger                     64.04  66.18  78.67  53.47   59.69  69.58  79.71  67.94
fear                      59.69  66.36  76.12  56.22   54.05  66.99  77.27  69.56
joy                       65.18  72.81  78.47  60.41   55.39  75.74  79.45  40.70
sadness                   62.08  65.65  78.67  55.61   61.54  71.56  80.26  68.59
Average of labels         62.75  67.75  77.98  56.43   57.66  70.97  79.17  61.70

Table 5 reports the results on the Sample Vietnamese dataset in terms of Accuracy and F1-score. The MNB algorithm reaches the highest accuracy on 10/10 labels. Averaged over all labels, MNB also has the highest accuracy, followed by W2V, CNN and K-NN on Accuracy and F1-score.

Table 5: Results of Sample Vietnamese with Accuracy and F1-score

Labels                                        Accuracy                     F1-score
                                              CNN    W2V    MNB    K-NN    CNN    W2V    MNB    K-NN
Politics (Chính trị)                          67.14  72.14  77.14  64.29   26.29  71.12  75.08  66.06
Life - Society (Đời sống - Xã hội)            68.57  50.00  76.43  64.29   28.90  58.94  68.58  45.84
Education (Giáo dục)                          68.57  55.48  72.86  46.43   60.63  60.98  69.82  63.37
Science - Technology (Khoa học - Công nghệ)   59.29  59.29  67.14  36.43   43.71  38.94  54.30  53.14
Business (Kinh doanh)                         67.14  63.57  75.71  58.57   26.10  54.62  56.29  36.52
Current affairs (Thời sự)                     62.14  47.86  62.86  41.43   44.54  47.32  31.57  18.31
Culture - Entertainment (Văn hóa - Giải trí)  67.86  65.71  76.43  36.43   56.16  61.14  58.12  63.86
Law (Pháp luật)                               75.71  85.00  85.71  64.29   54.43  79.49  75.86  70.72
Sports (Thể thao)                             85.71  82.75  84.29  36.43   71.66  64.27  69.91  63.86
Health (Sức khỏe)                             75.71  70.71  85.71  59.29   66.60  66.64  78.41  60.11
Average of labels                             69.79  65.25  76.43  50.79   47.90  60.34  63.79  54.18

5. Conclusions and policy implications

5.1. Policy implications

The use of machine learning is one of the changes that will make people work differently and will reshape business environments in the future. It is also a major difference between data science and business data analytics, which links this discussion to the previous part. In this article, text data from social media trends are analyzed for customers around the world.
The text data collected from social media are modeled with two approaches: use-centric and object-centric. Text data from social media are used in modeling, as textual information can often be noisy and coarse. Four supervised machine learning algorithms, CNN, MNB, W2V and K-NN, are trained in WEKA to check the effectiveness of our representation. The text data are analyzed to find popular customer topics, which are then categorized. The results indicate that the methodology can be used in the development of information filtering and prediction systems. The proposed methodology can also be used to find customer interests and can be applied to business problems such as page ranking, collaborative filtering, automatic translation of documents, security applications, named entity recognition, speech recognition, and classification problems.

The following steps outline how to adopt machine learning in a business. First, understand the difference between Artificial Intelligence and Machine Learning. Machine Learning is a subfield of Artificial Intelligence in which a model is trained on a large amount of data to make predictions. ML can help automate routine human processes and support decisions and judgments. Second, study your business processes and identify which of them can be ML-enabled. Third, collect data and extract features; these are the keys to machine learning. The best practice is to store all data in a database for better analysis and management in the future. Fourth, find the best model: with training data in hand, run different models and tests to find the one that performs best on that data. Fifth, verify the accuracy of the model. Finally, measure the ROI of the whole machine learning implementation; this is the last and most important step.
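The fourth and fifth steps (try several candidate models, then verify accuracy) can be illustrated with a small sketch. The paper trains its models in WEKA; the code below is not that setup. It is a plain-Python Multinomial Naive Bayes with add-one smoothing, compared against a majority-class baseline on held-out documents. The class names, the toy corpus and the two topic labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, docs):
        return [self._predict_one(doc) for doc in docs]

    def _predict_one(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class MajorityBaseline:
    """Always predicts the most frequent training label."""

    def fit(self, docs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, docs):
        return [self.label] * len(docs)


def accuracy(model, docs, labels):
    predictions = model.predict(docs)
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Hypothetical toy corpus: three documents per topic for training.
train_docs = [
    "team wins the football match",
    "player scores a late goal",
    "coach praises the team after the match",
    "new phone has a faster chip",
    "software update fixes the phone bug",
    "laptop chip runs the software faster",
]
train_labels = ["sports"] * 3 + ["technology"] * 3
test_docs = ["the team scores a goal", "the new laptop software"]
test_labels = ["sports", "technology"]

# Step four: fit every candidate. Step five: verify accuracy on held-out data.
candidates = {"MNB": TinyMultinomialNB(), "majority": MajorityBaseline()}
results = {
    name: accuracy(model.fit(train_docs, train_labels), test_docs, test_labels)
    for name, model in candidates.items()
}
best = max(results, key=results.get)
print(results, "best:", best)
```

In practice the candidates would be library implementations (for example, the classifiers in scikit-learn or WEKA) and the held-out set would be far larger; the point here is only the select-then-verify loop.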
Machine learning algorithms have also been integrated into data analysis tools such as R and Python. R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. It possesses an extensive catalog of statistical and graphical methods, including machine learning algorithms, linear regression, time series, and statistical inference, to name a few. Python, used for data analysis, interactive computing and data visualization, inevitably draws comparisons with other open source and commercial programming languages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent years, Python's improved support for libraries (such as pandas and scikit-learn) has made it a popular choice for data analysis tasks. Combined with Python's overall strength for general-purpose software engineering, it is an excellent option as a primary language for building data applications.

5.2. Conclusion

This paper considered the problem of classifying the topics of interest of online customers. Three datasets of labeled text content are built and introduced, one in Vietnamese and the others in English. Based on the experimental results, the MNB algorithm gives the best results on social media text data. The results of the paper could be applied to customer data analysis, forecasting, or recommendation systems, all of which are problems of concern to firms nowadays.
