Machine Learning in Cybersecurity

A growing body of literature and evidence shows that cybersecurity threats and attacks are emerging national security issues. The media regularly report large-scale data breaches and other cyber-attacks. Attacks from domestic and international actors could expose essential economic or intellectual property data, give access to critical intelligence, compromise classified information sources, and cripple essential infrastructure.

New approaches must be considered to cope with this rapidly changing threat landscape. The field of Predictive Analytics allows us to focus on emerging and targeted threats to our organizations. It uses data from multiple sources to inform our actions, predicting the most likely avenues of attack and concentrating our efforts on preventing or mitigating them as early as possible in the cyber kill chain. It can supplement traditional risk management and direct our attention to our high-value targets.


Although generic security advice, standards, and frameworks abound, we are facing increasingly targeted attacks in addition to everyday threats. Thus, we need to answer the following questions:

1) What information do we have that the attackers want?

2) Where is the information located (transmitted and stored)?

3) How can that information be attacked or accessed?

4) What is the most effective method to protect that information?


Artificial Intelligence is a field of computing in which machine learning is one component.

AI comprises any case where a machine is designed to complete tasks that would require intelligence if done by a human. There are a variety of technologies within AI:

Machine learning — Machines that “learn” while processing large quantities of data, enabling them to make predictions and identify anomalies.

Knowledge representations — Systems of data representation that enable machines to solve complex problems (e.g., ontologies).

Rule-based systems — Machines that process inputs based on a set of predetermined rules.

Predictive Analytics uses Artificial Intelligence to analyze large datasets with technologies that enable reporting, correlation, and accurate, rapid analysis to identify events and patterns of interest that may indicate malicious behavior in the environment.
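As a rough illustration of what "identifying events of interest" can mean in practice, the sketch below flags unusual values in a series of hourly failed-login counts. It uses a plain statistical baseline (a z-score) rather than a trained model, and the function name, the threshold, and the sample data are all assumptions made for the example.

```python
from statistics import mean, stdev

def find_anomalies(counts, threshold=2.5):
    """Return the indices of values lying more than `threshold`
    sample standard deviations away from the mean of `counts`."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

# Hypothetical hourly failed-login counts; the last hour is a burst.
failed_logins = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5, 7, 90]
print(find_anomalies(failed_logins))
```

A production system would learn a baseline per user, host, or time of day; a single global mean is used here only to keep the idea visible.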

Information gathered from analytic systems can also assist us in assessing our attackers’ probable goals and targets. The existing network already generates much of the data required for analysis.


  • Network device events: From devices such as switches, routers, and firewalls that generate event logs which serve as the first indicators of an intrusion.

  • Application-based event logs: Logs from email, SQL, and other applications and data stores, which are likely the attacker’s intended target.

  • Server event logs: System, security, and application logs on a Windows-based server provide thousands of entries about the actions of the server and its services, as well as logon and logoff information for users.

  • Antivirus: Antivirus programs, while less valuable for prevention, provide information on issues and anomalies on workstations and within the network.

  • Intrusion detection and prevention systems: Components of network event generation that can detect well-known attack signatures or unusual patterns in network traffic.


Endless Data Sources: The primary issues with building an analytic capability are the number of sources and the large volumes of data. For some organizations, internal and external systems can generate tens to hundreds of gigabytes of data daily.

Variable Data Formats: Machine data sources are inconsistent, and many use multi-structured formats that further challenge data mining efforts. Building infrastructure and normalizing the data is a daunting task. Even seemingly simple tasks such as synchronizing timestamps can be troublesome when dealing with multiple time zones and devices whose clocks drift.

Lack of Cybersecurity Talent: Gathering and analyzing this data requires increasingly scarce human resources.
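The timestamp problem mentioned above can be sketched in a few lines. The example below normalizes three differently formatted timestamps to UTC. The function name `to_utc`, the `assumed_offset_hours` parameter, and the sample values are hypothetical, and the sketch does not attempt to correct for clock drift on the source device.

```python
from datetime import datetime, timedelta, timezone

def to_utc(raw, assumed_offset_hours=0):
    """Normalize a timestamp to a timezone-aware UTC datetime.

    `raw` is either epoch seconds (int or float) or an ISO 8601 string.
    Strings without an offset are assumed to be `assumed_offset_hours`
    away from UTC, which must be known for the device that logged them.
    """
    if isinstance(raw, (int, float)):
        return datetime.fromtimestamp(raw, tz=timezone.utc)
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        local_zone = timezone(timedelta(hours=assumed_offset_hours))
        parsed = parsed.replace(tzinfo=local_zone)
    return parsed.astimezone(timezone.utc)

# Three hypothetical log entries describing the same instant.
samples = [
    ("2024-03-01T14:00:00+02:00", 0),  # firewall, offset included
    ("2024-03-01 07:00:00", -5),       # server logging local time, UTC-5
    (1709294400, 0),                   # appliance logging epoch seconds
]
normalized = [to_utc(raw, offset) for raw, offset in samples]
for moment in normalized:
    print(moment.isoformat())
```

Once every event carries a UTC timestamp, events from different devices can be ordered and correlated on a single timeline.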


Understanding how machines make sense of large volumes of data helps us understand how AI and machine learning can benefit cybersecurity. Typically, this is done using knowledge representations such as ontologies.

Ontologies are systems comprising distinct objects, known as entities, and their relationships. Consider a simple ontology of different types of coffee, drawn as a Venn diagram of ingredients.

In that diagram, the individual ingredients are entities, and the overlaps between them express the relationships that make up the ontology. Using this, we can determine, for example, that a combination of milk, chocolate, coffee, and foam becomes a cappuccino.

In cybersecurity, ontologies represent the real world inside a machine-learning environment.

In such an ontology, we can have malware sitting at the center, surrounded by various other entities related to that malware. For instance, the entity ‘MalwareCategory’ could be a banking worm or trojan. The ‘AttackVector’ entity might indicate spam, SQL injection, or a particular vulnerability the malware exploits.

A machine can use this ontology to understand the real world — in this case, the threats faced by a network.
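One simple way to hold such an ontology in code is as a list of (subject, relation, object) triples. The sketch below is only an illustration: `MalwareCategory` and `AttackVector` come from the description above, while the malware name `ExampleBanker`, the relation names, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Each triple reads as: subject --relation--> object.
TRIPLES = [
    ("ExampleBanker", "is_a", "Malware"),
    ("ExampleBanker", "has_category", "BankingTrojan"),
    ("BankingTrojan", "is_a", "MalwareCategory"),
    ("ExampleBanker", "uses_vector", "Spam"),
    ("ExampleBanker", "uses_vector", "SQLInjection"),
    ("Spam", "is_a", "AttackVector"),
    ("SQLInjection", "is_a", "AttackVector"),
]

def build_index(triples):
    """Index triples as subject -> relation -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        index[subject][relation].add(obj)
    return index

def related(index, subject, relation):
    """Return the sorted objects linked to `subject` by `relation`."""
    return sorted(index.get(subject, {}).get(relation, set()))

index = build_index(TRIPLES)
print(related(index, "ExampleBanker", "uses_vector"))
```

Queries such as "which attack vectors does this malware use?" then become lookups over the stored relationships rather than searches through raw text.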

Breaking the Language Barrier: Enter NLP

Most of the information that security professionals require is buried in unstructured text spread across billions of websites on the dark web and the open internet.

Natural language processing enables machines to gather and understand data irrespective of language, format, and punctuation. Powerful NLP engines can even interpret everyday slang and jargon across many languages, a breadth that a team of analysts could not realistically match.

Here’s how it works.

Text is extracted from a data source, such as a social media post or webpage. From there, a platform that employs NLP can classify the language used and any cyber topics discussed (e.g., exploits or leaks).

The text is then cleaned to remove punctuation and extraneous words, ensuring that only valuable inputs are processed.

Platforms use the cleaned text to determine its content, including any entities mentioned.
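The steps above can be sketched with the standard library alone. Real platforms rely on trained language models; here keyword lists and a regular expression stand in for them, language identification is omitted, and the keyword sets, stopword list, function name, and sample post are all assumptions made for the example.

```python
import re
import string

# Hypothetical keyword lists standing in for a trained topic classifier.
TOPIC_KEYWORDS = {
    "exploit": {"exploit", "poc", "rce"},
    "leak": {"leak", "leaked", "dump"},
}
STOPWORDS = {"a", "an", "the", "for", "is", "of", "and", "to", "in", "on", "new"}
# CVE identifiers are one kind of entity that is easy to spot by pattern.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def analyze(text):
    """Extract entities, clean the text, and tag its topics."""
    # Entities are extracted first because cleaning strips the hyphens.
    entities = sorted({match.upper() for match in CVE_PATTERN.findall(text)})
    # Cleaning: lowercase, drop punctuation, drop extraneous words.
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in stripped.split() if t not in STOPWORDS]
    # Topic tagging: a topic applies if any of its keywords appears.
    token_set = set(tokens)
    topics = sorted(t for t, words in TOPIC_KEYWORDS.items() if words & token_set)
    return {"topics": topics, "entities": entities, "tokens": tokens}

# Made-up post used only to exercise the pipeline.
post = "New exploit for CVE-2021-44228 posted; credentials leaked!"
result = analyze(post)
print(result["topics"], result["entities"])
```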

Converting individual data points into actionable intelligence requires a much longer sequence of steps. Security platforms combine a rule-based approach, NLP, and machine learning to harvest data, make predictions, calculate risk to the organization, and identify timescales.


The cyber kill chain divides a targeted attack into its component steps so that preventative actions can be tailored to the early stages of the process. Attacks stopped early are less expensive to recover from; therefore, early detection and prevention are key. The general steps in an attack are:

1) Reconnaissance: Studying public information about the target, the target’s environment, software, and practices. Most of this information can be gathered from the internet.

2) Weaponization: Preparing a backdoor and a penetration plan to deliver a successful attack.

3) Delivery: Injecting the backdoor and launching the attack.

4) Exploitation: Triggering the backdoor, usually through an OS or application vulnerability.

5) Installation: Installing the backdoor as a bootstrap, along with other remote access tools, to retain a persistent connection to the target.

6) Command and Control: Using the installed tools to establish remote access and expand capabilities.

7) Actions on Objectives: Acting on the original objective, such as collecting and exfiltrating information or taking additional measures against the target.

Detecting cyber threats is like software testing since errors caught early are dramatically less expensive to correct than those discovered later. By focusing on the early stages, we discourage the attack or break the chain so that it doesn’t proceed further. But as in software development, while we focus on the early stages, we also have a plan for proper coverage through the entire cycle.
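To make "breaking the chain early" concrete, the sketch below orders the stages and picks out the earliest stage at which a set of alerts fired. It assumes the commonly cited Lockheed Martin ordering, which begins with reconnaissance and ends with actions on objectives. The alert descriptions and the helper function are hypothetical.

```python
from enum import IntEnum

class KillChainStage(IntEnum):
    """Kill chain stages; a lower value means earlier in the attack."""
    RECONNAISSANCE = 1
    WEAPONIZATION = 2
    DELIVERY = 3
    EXPLOITATION = 4
    INSTALLATION = 5
    COMMAND_AND_CONTROL = 6
    ACTIONS_ON_OBJECTIVES = 7

def earliest_alert(alerts):
    """Return the (description, stage) pair with the earliest stage,
    or None when there are no alerts."""
    return min(alerts, key=lambda alert: alert[1], default=None)

# Made-up alerts, listed in the order they were raised.
alerts = [
    ("Outbound beacon to unknown host", KillChainStage.COMMAND_AND_CONTROL),
    ("Phishing email with macro attachment", KillChainStage.DELIVERY),
    ("New service installed on workstation", KillChainStage.INSTALLATION),
]
description, stage = earliest_alert(alerts)
print(stage.name, "-", description)
```

Knowing the earliest stage with evidence shows where a control could have stopped the attack sooner, which is where preventative effort pays off most.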


Risk management is a structured and systematic approach to managing the potential for loss related to a threat. To manage risk, organizations should understand both the likelihood of an event and its subsequent impact. This understanding drives the prioritization of security initiatives throughout the organization. In information security, risk denotes the probability that a threat agent will exploit a vulnerability, weighed against the impact if it does.
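A common way to turn likelihood and impact into a priority order is to multiply them into an expected loss. The sketch below assumes that simple model; the class, the asset names, and every figure in it are invented for illustration, and real programs often use qualitative scales instead.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: float  # estimated probability of the event per year, 0 to 1
    impact: float      # estimated loss if the event occurs

    @property
    def score(self):
        """Expected annual loss under the simple likelihood x impact model."""
        return self.likelihood * self.impact

def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)

# Hypothetical risk register.
register = [
    Risk("Workstation malware infection", likelihood=0.60, impact=20_000),
    Risk("Customer database breach", likelihood=0.30, impact=500_000),
    Risk("Data center outage", likelihood=0.05, impact=2_000_000),
]
ranked = prioritize(register)
for risk in ranked:
    print(f"{risk.name}: {risk.score:,.0f}")
```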


The first stage in risk management is asset identification, which determines the assets that need protection. Traditionally, the organization’s computing and data assets are assessed in terms of the impact of an asset being unavailable. This inward-looking view ignores the value of assets to outsiders, both customers and attackers. Incorporating outside perspectives into our asset valuation is a must to protect our assets adequately.


Once the critical assets have been valued, we focus on potential threats. Predictive Analytics provides real value in its ability to identify potential threats and keep us informed about the ever-changing threat landscape. By analyzing threats from an external perspective, organizations can continually refine their view of the threats they face. Risk management becomes complex and subjective in this area, because it must decide which threats are likely and which can be ignored. Data provided by Predictive Analytics adds perspective to the list of threats faced by the organization.


Three distinct components help identify a potential cyber threat:

Opportunity – the vulnerability the actor needs to attack the target

Intent – the actor’s desire to target the organization

Capability – the means to successfully execute the attack

These three components can be profiled utilizing cyber threat intelligence, cyber forensics, and advanced analytic techniques to determine attack surface selection processes, behavior patterns, and modus operandi of attacks.
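One way to express that a threat needs all three components is to multiply their estimates, so that a missing component drives the result to zero. This is only a sketch under that assumption: the 0 to 1 scale, the function name, and the example values are hypothetical, and the estimates themselves would come from the intelligence and forensics work described above.

```python
def threat_score(opportunity, intent, capability):
    """Combine three estimates, each between 0 and 1, into one score.

    The product is zero whenever any single component is zero, which
    reflects the idea that all three must be present for a threat.
    """
    for value in (opportunity, intent, capability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must be between 0 and 1")
    return opportunity * intent * capability

# A capable actor with an open vulnerability but no interest in us.
print(threat_score(opportunity=1.0, intent=0.0, capability=1.0))
# A motivated, capable actor facing a partly mitigated vulnerability.
print(threat_score(opportunity=0.5, intent=0.5, capability=1.0))
```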


The threat of cybersecurity attacks is rapidly growing, requiring increasing expertise in attack prevention, detection, and response. Most private and public sector enterprises use traditional perimeter defense strategies or ineffective reactionary procedures to protect their cyber systems and networks, limiting defensive capabilities. We must develop a comprehensive framework to establish and implement proactive cybersecurity procedures using the attributes of advanced analytics, data mining of big data, and cyber forensics.

Predictive Analytics is a new field with implementation issues that still need refinement and resolution. The existing problems are not insurmountable. As our adversaries gain skill and knowledge, we must do the same to keep pace.