Abstract | This thesis investigated how machine learning techniques could be used to analyze the performance of the YouTube service in real time from a network operator's perspective, given that the traffic is encrypted. An existing dataset, consisting of network traffic and application-level information captured during YouTube video streaming, was prepared in a form suitable for machine learning techniques. First, a script written in Python3 extracted the data into instances, where one instance represents one second. In addition, 5 features were calculated for each instance over 5 different window sizes, for a total of 25 features. Each instance was labeled according to 3 classification types: stalling, bitrate, and resolution. For stalling, an instance was labeled "yes" if a stall occurred in that second, or "no" if it did not. For resolution, the labels are "hd" - resolutions of 720p and above, and "sd" - resolutions below 720p. For bitrate, the labels are "low" - encoding bitrate below 1500 kbps, and "high" - encoding bitrate above 1500 kbps. The final dataset comprised about 50 000 instances. Five different machine learning models were then applied. They were trained on a training set comprising 80% of the input dataset and tested on a test set comprising the remaining 20%. Feature selection was also performed to pick the most important features and thereby reduce training time. For the stalling class, additional balancing operations were needed, because the class was highly imbalanced, which caused some models to overfit. Finally, evaluation metrics were calculated for all models and the results of all training approaches were compared. The conclusion was that the decision tree classifier and the random forest classifier would perform best.
Future work would cover applying the trained models in real time in an operator's network, where network data and application-level information would be collected over time, features computed, and instances sent to the models for prediction. |
Abstract (english) | This thesis investigated how machine learning can be used to analyze YouTube performance in real time from a network provider's perspective, given that the traffic is encrypted. A previously collected dataset, consisting of network traffic and application-level information captured during YouTube video streaming, was analyzed and prepared so that machine learning techniques could use it. First, a script written in Python3 extracted the data so that one instance represented one second of the video. In addition, for every instance, 5 features were calculated over 5 different window sizes based on the statistical properties of the encrypted traffic, which resulted in 25 features in total. Every instance was labeled with respect to 3 different Key Performance Indicators (KPIs): stalling, bitrate, and resolution. For each KPI, each instance was assigned to one class. For stalling, an instance was labeled “yes” if stalling occurred in that second of the video, and “no” otherwise. Resolution was classified as “hd” - resolutions of 720p and above, or “sd” - resolutions below 720p. Finally, bitrate was classified as “low” - bitrate below 1500 kbps, or “high” - bitrate above 1500 kbps. The final dataset consisted of about 50 000 instances. Five different machine learning models were trained on that dataset. They were trained on a training set containing 80% of the input dataset and tested on a test set containing the remaining 20%. Feature selection was also performed to reduce training time. For stalling, upsampling and downsampling were necessary because the classes were highly imbalanced, which caused some models to overfit. Finally, performance metrics were calculated for all models and the results of all training methods were compared.
In conclusion, the best results were achieved using a decision tree classifier and a random forest classifier.
Future work will include applying the trained models in real time in a network provider's network, where traffic and application-level information would be captured, features calculated, and instances sent to the models for prediction. |
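The labeling rules and windowed features described in the abstract can be illustrated with a minimal sketch. The 720p and 1500 kbps thresholds come from the abstract; the window sizes, the single mean-throughput feature, and all function and variable names are hypothetical, since the abstract does not list the actual five features or five window sizes.

```python
from statistics import mean

# Thresholds stated in the abstract.
HD_MIN_HEIGHT = 720            # "hd" covers 720p and above
BITRATE_THRESHOLD_KBPS = 1500  # boundary between "low" and "high"

# Hypothetical window sizes in seconds; the abstract does not name them.
WINDOW_SIZES = (1, 2, 5, 10, 20)


def label_instance(stalled, height_p, bitrate_kbps):
    """Assign the three KPI labels to a one-second instance."""
    return {
        "stalling": "yes" if stalled else "no",
        "resolution": "hd" if height_p >= HD_MIN_HEIGHT else "sd",
        # The abstract does not say which class exactly 1500 kbps
        # belongs to; this sketch assigns it to "high".
        "bitrate": "low" if bitrate_kbps < BITRATE_THRESHOLD_KBPS else "high",
    }


def window_features(bytes_per_second, t, window_sizes=WINDOW_SIZES):
    """Compute one illustrative feature (mean bytes per second) over
    trailing windows that end at second t, one value per window size."""
    features = {}
    for w in window_sizes:
        # Trailing window of up to w seconds, clipped at the start.
        window = bytes_per_second[max(0, t - w + 1): t + 1]
        features[f"mean_bytes_w{w}"] = mean(window)
    return features
```

Repeating `window_features` for five different statistics would give the 5 x 5 = 25 features per instance mentioned in the abstract; which statistics the thesis actually used is not stated here.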