Sample Framework for an ML-Based Abuse Detector
CHALLENGES OF BUILDING A DETECTION SYSTEM
1. Data accessibility
A machine learning model requires a vast amount of training data to produce accurate results, so building a large and diverse dataset is crucial irrespective of the application. Data availability is therefore the most important criterion when selecting an online social network (OSN) to study, and two features matter most: popularity (number of active users) and data accessibility. Access to relevant data, which is necessary to develop models that characterize cyberbullying, is a significant challenge in cyberbullying research. Facebook is currently the largest OSN, with well over one billion active users. Although Facebook data is common in OSN research, the high percentage of restricted content (generally due to users' privacy settings) severely limits analysis of it.
In contrast, Twitter is considered the most studied OSN. Its well-defined public API, simple protocol, and the public nature of most of its content make data collection straightforward. Other commonly studied services with social networking features include YouTube and Instagram, and curated abuse-detection datasets are also available on platforms such as Kaggle.
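As a concrete illustration, the sketch below pulls public tweets through Twitter's API using the Tweepy library. The bearer token, query string, and field selection are assumptions for illustration; actual endpoint availability and rate limits depend on your API access tier.

```python
# Minimal sketch of collecting public tweets with Tweepy (v4+).
# Assumes a valid bearer token; the query and fields are illustrative.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Search recent public tweets matching a query (English, no retweets).
response = client.search_recent_tweets(
    query="bully OR harassment lang:en -is:retweet",
    max_results=100,
    tweet_fields=["created_at", "author_id"],
)

# response.data is None when nothing matches, hence the `or []`.
for tweet in response.data or []:
    print(tweet.created_at, tweet.text)
```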
2. Dealing with class-imbalanced data
Class imbalance occurs when the number of instances of one class significantly exceeds that of another. Most machine learning algorithms perform best when the classes are roughly equal in size. In many real-world, non-synthetic datasets, however, the data is imbalanced: an important class (the minority class) may have far fewer samples than the other (the majority class). In such cases, standard classifiers are dominated by the majority class and tend to ignore the sparse minority instances, yielding a biased model with high predictive accuracy on the majority class but poor accuracy on the minority class. One solution is to modify the class distribution in the training data by oversampling the minority class or undersampling the majority class. SMOTE (Synthetic Minority Over-sampling Technique) is specifically designed for learning from imbalanced datasets and is one of the most widely adopted approaches thanks to its simplicity and effectiveness: rather than duplicating minority samples, it synthesizes new ones by interpolating between existing minority instances, and its original formulation combines this oversampling with undersampling of the majority class.
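A minimal sketch of SMOTE in practice, using the imbalanced-learn package on a synthetic dataset that stands in for vectorized posts; the 95/5 class split and all parameter choices are illustrative:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset standing in for vectorized post features:
# roughly 95% "benign" (class 0) vs 5% "abusive" (class 1).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority samples by interpolating between a
# minority instance and its k nearest minority neighbours (k=5 by default).
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))   # classes are now balanced
```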
3. Selection of Machine Learning Models
Choosing the best classifier is the most critical phase of the text classification pipeline, and we cannot pick the optimal model for a given task without a solid conceptual understanding of each algorithm. In practice, this means benchmarking several machine learning algorithms, such as Random Forest, Support Vector Machine, Naïve Bayes, Decision Tree, and K-Nearest Neighbors (KNN), as sketched below. More recently, deep learning models, particularly Transformer-based architectures such as BERT, have shown promising results and handle large data volumes more effectively.
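A hedged sketch of such a benchmark using scikit-learn. Here `load_labeled_posts` is a hypothetical helper standing in for your own data loading, and the feature and model settings are illustrative defaults, not tuned choices.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical loader: returns a list of post strings and 0/1 labels
# (1 = abusive).
texts, labels = load_labeled_posts()

candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=200),
    "LinearSVM": LinearSVC(),
    "NaiveBayes": MultinomialNB(),
    "DecisionTree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(min_df=2), clf)
    # F1 is preferred over plain accuracy on imbalanced abuse data.
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```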
4. Eliminating Human Bias
Bias can enter algorithms undetected in several ways. AI systems learn to make decisions from training data, which can encode human prejudices or reflect social or historical inequities even when sensitive variables such as sexual orientation, race, and gender are removed. For example, Amazon stopped using a hiring algorithm after finding that it favored applicants based on words like "captured" or "executed", which were more commonly found on men's resumes. Another source of bias is flawed data sampling, in which groups are under- or overrepresented in the training data. For instance, researchers at MIT found that facial analysis technologies had higher error rates for minorities, and particularly minority women, potentially due to unrepresentative training data.
5. Data volume
Every minute, roughly 3.3 million new posts appear on Facebook and hundreds of thousands of tweets are published, amounting to approximately 500 million tweets per day. Detection systems have to ingest this onslaught, analyze heterogeneous data, and deliver actionable insights in real time.
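One common way to keep up with such volume is micro-batching: buffer incoming posts and score them in bulk rather than one at a time. The sketch below assumes a hypothetical `post_stream` iterator, a fitted `trained_pipeline` (vectorizer plus classifier), and an `enqueue_for_review` downstream hook; none of these come from a specific library.

```python
from itertools import islice

BATCH_SIZE = 512

def micro_batches(stream, size=BATCH_SIZE):
    """Yield fixed-size lists of posts from an (endless) iterator."""
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return
        yield batch

for batch in micro_batches(post_stream):
    flags = trained_pipeline.predict(batch)   # vectorize + classify in bulk
    for post, flag in zip(batch, flags):
        if flag == 1:                         # 1 = predicted abusive
            enqueue_for_review(post)          # hypothetical downstream hook
```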
6. Tackling the not-so-straightforward online abuse
Hate speech and abusive content come in various forms, and the challenge is to detect abuse even when the posted content is not direct. The offensive text could be embedded in an image, in which case we must use computer vision models. The text could mix languages: in India, people commonly speak and post in a blend of Hindi and English, and abuse can get quite creative and therefore difficult to detect. We might also encounter obfuscated text ("k1ll", "gen0cide"), which a normalization step such as the one sketched below can partially undo. Detecting deepfake pornography is yet another difficult challenge.
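As a small illustration, the sketch below normalizes common character substitutions ("leetspeak") before classification, so obfuscated tokens match their plain-text forms. The substitution table is illustrative and inherently ambiguous (a "1" may stand for "i" or "l"), not a complete solution.

```python
import re

# Illustrative, non-exhaustive character substitution table.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def normalize(text: str) -> str:
    text = text.lower().translate(LEET_MAP)
    # Collapse runs of 3+ repeated characters ("kiiill" -> "kiill")
    # to curb character stretching.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(normalize("I will k1ll you"))     # -> "i will kill you"
print(normalize("st0p the gen0cide"))   # -> "stop the genocide"
```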
The internet and social media have clear advantages, but frequent use can also have significant adverse consequences, including cybercrime, unwanted sexual exposure, and cyberbullying.
Online harassment has become a severe issue affecting a large number of people. The anti-harassment standards and policies provided by social platforms, together with the ability to flag, block, or report a bully, are practical steps toward a safer online community, but they are not enough. Popular platforms such as Facebook, Twitter, and Instagram receive an enormous volume of flagged content every day, and manually scrutinizing all of it is time-consuming and impractical.
Consequently, it is imperative to design data-driven, automated methods for detecting harmful behavior on social media. Successful detection would enable early identification of damaging and threatening scenarios and help prevent such incidents. Future studies could enhance automated cyberbullying detection by combining textual data with images and video to detect both online abuse and its severity, forming the foundation of automated systems for analyzing online behaviors that negatively affect mental health. Detection algorithms could analyze a bully's posts and map them to a preselected severity level, giving early warning about the extent of the cyberbullying; a toy version of such a mapping is sketched below.
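A toy sketch of that last step, mapping a classifier's abuse probability to a severity band. The thresholds and actions are hypothetical and would need tuning against labelled severity data.

```python
# Hypothetical severity bands, ordered from most to least severe.
SEVERITY_BANDS = [
    (0.95, "severe: escalate to a human moderator immediately"),
    (0.80, "high: auto-hide pending review"),
    (0.50, "moderate: add to the review queue"),
]

def severity(score: float) -> str:
    """Map an abuse probability in [0, 1] to a severity label."""
    for threshold, label in SEVERITY_BANDS:
        if score >= threshold:
            return label
    return "low: no action"

print(severity(0.97))  # -> severe band
print(severity(0.62))  # -> moderate band
```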