Markov Chains, Classifiers, and Intrusion Detection

S. Jha∗, K. Tan†, R.A. Maxion†
Abstract

This paper presents a statistical anomaly detection algorithm based on Markov chains. Our algorithm can be directly applied for intrusion detection by discovering anomalous activities. Our framework for constructing anomaly detectors is very general and can be used by other researchers for constructing Markov-chain-based anomaly detectors. We also present performance metrics for evaluating the effectiveness of anomaly detectors. Extensive experimental results clearly demonstrate the effectiveness of our algorithm. We discuss several future directions for research based on the framework presented in this paper.
1 Introduction

An intrusion detection system (IDS) is a system that identifies intrusions, where an intrusion is defined as misuse or unauthorized use by authorized users or external adversaries [17, 19]. Surveys of intrusion detection systems can be found in [1, 14, 16, 20]. A classification of intrusion detection systems appears in [11, Section II]. In this paper, we consider intrusion detection systems that are based on anomaly detection. The objective of anomaly detection is to establish profiles of "normal" system activity. Traces of system activity that deviate from these profiles are considered anomalous, and consequently an alarm is raised. There are two classes of anomaly-detection-based IDS.

Signature- or pattern-based IDS have an internal table of "anomalous patterns". If an ongoing activity matches a pattern in the table, an alarm is raised. The table of patterns represents system traces corresponding to common attacks. Examples of signature-based intrusion detection systems are Snort [22] and Bro [21]. The advantages of signature-based IDS are commonly known to be their potential for low false alarm rates, and the information they often impart to a system security officer about a detected attack. Such information is often encoded in the rules or patterns central to the functionality of such systems. This information is often invaluable when initiating preventive or corrective actions. However, signature-based IDS have several disadvantages. Since the set of anomalous patterns is based on known attacks, new attacks cannot be discovered by these systems. Therefore, whenever a new attack is discovered, patterns corresponding to the attack have to be manually constructed. Moreover, signature-based IDS can be easily fooled by a sophisticated attacker. For example, an attacker can "mix" normal activity with a real attack so that the trace does not match any of the pre-defined patterns. Statistical anomaly-detection-based IDS (henceforth referred to as statistical IDS) have been devised to address these shortcomings of signature-based IDS. Denning and Neumann presented a detailed discussion of a statistical anomaly detection algorithm [5]. IDES [15] is a prototypical example of a statistical IDS. In a statistical IDS, a model of the normal behavior of a user is constructed. The statistical model is then used to construct a classifier that can discriminate between normal and anomalous traces. The techniques presented in this paper fall into the second category. However, we also describe a procedure for generating anomalous patterns from our statistical model. Therefore, the techniques presented in this paper can also be used to automatically generate patterns for signature-based systems. An important question in anomaly-detection-based intrusion detection systems is: how is the trace of system activity represented? We use the sequence of system calls corresponding to a process as the trace of its activity. To the best of our knowledge, this was first proposed in [9]. However, our approach is general and can be used for other types of traces of system activity, such as audit trails.

∗Computer Sciences Department, University of Wisconsin, Madison, WI 53706.
†School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.
Any statistical approach to intrusion detection adheres to the general strategy described below. First, a statistical model is constructed using a set of normal traces. This statistical model is then used to construct a classifier that can discriminate between normal and abnormal traces. The key observation is that the statistical model is an accurate predictor of normal behavior, so if an on-going activity is not accurately predicted by the model, it is likely to be anomalous. This general strategy is depicted in Figure 1. Our approach follows this general road-map. Using a set of normal traces, we construct a Markov chain. The Markov chain is then used to construct a classifier. The main contributions of this paper are:

• We provide a formal framework for constructing classifiers based on Markov chains. We also investigate applications of these classifiers to intrusion detection.

• We provide several metrics for evaluating the effectiveness of classifiers in the context of intrusion detection. These metrics can also be used by other researchers.

• Our framework is quite general and can be used to construct other classifiers based on Markov chains.
The outline of the paper is as follows: Section 2 provides a general outline of our algorithm. A detailed description of our algorithm is given in Section 3. Section 4 describes an algorithm for generating a set of anomalous patterns from Markov chains. This algorithm is suitable for generating anomalous patterns used by systems such as Snort [22] and Bro [21]. Experimental results are described in Section 5. Section 6 describes related work. Future work and concluding remarks are provided in Sections 7 and 8, respectively.
2 Outline of the Methodology

In this section, we provide a step-wise description of our methodology. Technical details are given in Section 3. Assume that we are given two suites of traces T and T_an. Recall that in our case a trace is simply a sequence of system calls generated by a process. The suite T consists of traces of normal activity, and T_an consists of traces of anomalous activity (presumably corresponding to some known attacks).
• Step 1 (Construct the test suite)
In this step we split the suite T into two. The first suite T_tr is called the training suite and is used for constructing classifiers. The second suite T_te is called the test suite and is used for testing classifiers and tuning various parameters. First, we decide on a ratio γ, which we call the testing ratio. We use random sampling to construct T_tr and T_te. For each trace σ in T, we generate a random number u which is uniformly distributed over the range [0, 1]. If u ≤ γ, the trace is added to T_te; otherwise it is added to T_tr. Roughly speaking, γ denotes the fraction of the traces that are in the test suite T_te.
• Step 2 (Construct a classifier)
We use the training suite T_tr to construct a Markov chain, which is then turned into a classifier for traces. Since the classifier is constructed from a suite of normal traces, it is able to discriminate between normal and anomalous traces. Details of the construction can be found in Section 3.
• Step 3 (Tuning Parameters)
There are various exogenous parameters used during the construction of the classifier. First, we define various performance metrics for a classifier. These metrics are computed using the suites T_te and T_an. The various exogenous parameters are tuned using these performance metrics.
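The random split of Step 1 can be sketched in a few lines of Python. This is a minimal illustration, assuming traces are lists of system-call names; the function name split_suite and the fixed seed are ours, not the paper's.

```python
import random

def split_suite(traces, gamma, seed=0):
    """Randomly split a suite of traces into training and test suites.

    gamma is the testing ratio: each trace goes to the test suite
    with probability gamma, otherwise to the training suite.
    """
    rng = random.Random(seed)
    t_tr, t_te = [], []
    for sigma in traces:
        u = rng.random()  # uniform over [0, 1)
        (t_te if u <= gamma else t_tr).append(sigma)
    return t_tr, t_te

# Example: split 1000 synthetic traces with testing ratio 0.2
suite = [["open", "read", "close"]] * 1000
t_tr, t_te = split_suite(suite, gamma=0.2)
```

With γ = 0.2, roughly a fifth of the traces end up in the test suite, matching the intuition that γ denotes the fraction of traces in T_te.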
3 Detailed Description

Let Σ denote the set of alphabets or symbols. We will use alphabets and symbols synonymously throughout the paper. A trace over Σ is a finite sequence of alphabets. The set of finite traces over Σ is denoted by Σ⋆, the empty trace is denoted by ϵ, and the set of traces of length n is denoted by Σ^n. Given a trace σ ∈ Σ⋆, |σ| denotes the length of the trace. Given a trace σ and a positive integer i ≤ |σ|, σ^i and σ[i] denote the prefix consisting of the first i alphabets and the i-th symbol, respectively. The concatenation of two traces σ_1 and σ_2 is denoted by σ_1 · σ_2. B = {0, 1} denotes the binary alphabet set.

Definition 3.1 A classifier over the alphabet set Σ is a total function f : Σ⋆ → B.
A suite over Σ is a subset of Σ⋆. In our experiments, we will use three types of suites: the training suite T_tr (used for training), the test suite T_te (used for testing), and the anomalous suite T_an (a set of anomalous traces). The training suite, which is a set of normal traces, is used to construct a classifier. The test suite is also a set of normal traces and is used to test and tune the classifier. The anomalous suite is a set of anomalous or abnormal traces.

Note: The reader should interpret 1 as "bad" and 0 as "good". In the context of intrusion detection, if a classifier outputs a 1 after reading a trace, it should be interpreted as an alarm (something anomalous is happening). On the other hand, 0 indicates normal behavior. In the general classification problem, there are a finite number of classes {1, · · · , M}. Let U be the universe of objects to be classified. A classifier f is a function from U to {1, · · · , M} [6]. In the context of intrusion detection we want to classify traces as normal or anomalous, so M = 2.
Intrusion detection is an on-line activity where alarms have to be raised in real-time. Therefore, for the purposes of intrusion detection it is unacceptable to watch the entire sequence of activities (or, equivalently, scan the entire trace) and then classify the sequence. Next, we formalize what it means for a classifier to be on-line. Intuitively, an on-line classifier can efficiently classify a trace σ of length n based on the history of the classifier on the prefix σ^{n−1} and the last two symbols σ[n − 1] and σ[n].

Figure 1. General Strategy for Intrusion Detection. (A sample of normal behavior yields a statistical model, from which a classifier is built; the classifier reads a trace of process behavior, labels it normal or anomalous, and raises alarms.)
Definition 3.2 A classifier f : Σ⋆ → B is called on-line if and only if there exist "efficiently computable"¹ functions H, T, and β : Σ⋆ → ℝ^k such that the following equations hold:

β(σ) = H(β(σ^{n−1}), σ[n − 1], σ[n])
f(σ) = T(β(σ))

In the equations given above, the length of the trace σ is denoted by n. Types for the functions H and T can be easily inferred from the equations. Notice that the value of the function β on the trace σ only depends on the value of the function on the prefix σ^{n−1} and the last two symbols of the trace. In some sense, β only depends on the "immediate history" and can be efficiently computed (loosely speaking, the function β is "Markov" because its value only depends on the immediate history and the current state).
Example 3.1 Assume that we are given a finite set T of traces. A classifier f outputs a 1 after scanning a trace σ if there is a suffix of σ that is in the set T; otherwise, the classifier f outputs a 0. In other words, the classifier f outputs an alarm as soon as it detects a pattern from the set T. For example, if T = {aaa, baa}, then the classifier outputs a 1 after scanning acaaa and acbaaa. Hence classifiers can model most signature-based intrusion detection systems. A set of finite traces T can be compiled into a deterministic finite state automaton A_T. Using this observation it can be easily seen that a signature-based IDS corresponds to an on-line classifier.

¹The precise definition of "efficiently computable" depends on the statistical model being used. In our case, efficiently computable means polynomial in the number of states in the Markov chain.
Let T be a finite table of patterns. Consider the regular language L_T over the alphabet Σ such that a trace σ is in L_T iff there is a suffix of the trace that is in T. Let A(L_T) be the deterministic finite state automaton corresponding to L_T. Consider Definition 3.2. Let δ be the next-state transition function for the automaton A(L_T). The transition function δ can be extended to traces in a standard manner. Let β(σ) be the identifier of the state δ(s_0, σ), where s_0 is the initial state of the automaton A(L_T). The function H simply "mimics" the next-state transition function δ and uses the following equality:

δ(s_0, σ) = δ(δ(s_0, σ^{n−1}), σ[n])

We assume that each state of the automaton has an associated identifier that is a real number. The function T maps an identifier of a final state of the automaton to 1, and the rest of the real numbers are mapped to 0. It is easy to verify that β, T, and H have the required properties and are efficiently computable.
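A minimal sketch of the suffix-matching classifier of Example 3.1 follows. Rather than compiling T into an explicit automaton, this sketch keeps a bounded history of the last max-pattern-length symbols in the role of β; the class and method names are ours, not the paper's.

```python
class SuffixPatternClassifier:
    """On-line classifier from Example 3.1: outputs 1 as soon as the
    trace scanned so far ends with a pattern from the table T."""

    def __init__(self, patterns):
        self.patterns = [tuple(p) for p in patterns]
        self.max_len = max(len(p) for p in self.patterns)
        self.history = ()  # beta: the last max_len symbols seen

    def scan(self, symbol):
        # H: update the bounded history from the previous history
        # and the newly read symbol.
        self.history = (self.history + (symbol,))[-self.max_len:]
        # T: raise an alarm iff some pattern is a suffix of the history.
        return int(any(self.history[-len(p):] == p
                       for p in self.patterns))

# With T = {aaa, baa}, the classifier fires after scanning acaaa
clf = SuffixPatternClassifier(["aaa", "baa"])
outputs = [clf.scan(c) for c in "acaaa"]  # outputs == [0, 0, 0, 0, 1]
```

As in the example, the same table also fires on acbaaa, since the suffix baa matches after the fifth symbol.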
3.1 Constructing Markov Chains

Assume that we are given a training suite T_tr ⊆ Σ⋆. We will use the suite T_tr to construct a Markov chain. In the next subsection we demonstrate how to construct a classifier from a Markov chain. Construction of the Markov chain is parametrized by the window size w. First, we augment the alphabet set Σ with a special null symbol φ. We will not provide background about Markov chains in this paper. Interested readers should consult a standard text on probability theory (such as [7]) for the required background.
Before we describe the algorithm to construct a Markov chain from the suite T_tr, we will define a few primitive operations. A state in the Markov chain is associated with a trace of length w over the alphabet Σ ∪ {φ}. A transition is a pair of states. The pair (s, s′) denotes a transition from s to s′. Each state and transition is also associated with a counter. We also maintain a hash table H of visited states. The hash function used for the hash table is not crucial to the description of the algorithm. The primitive operation shift(σ, x) shifts the trace σ left and appends the alphabet x at the end, e.g., shift(aba, c) is equal to bac.

The initial state of the Markov chain (denoted by initial-state) is associated with the trace of length w consisting of all null symbols φ, e.g., if w = 3, the initial state is associated with the trace [φ, φ, φ]. The operation next(σ) returns the first symbol of the trace σ and left-shifts σ by one position, e.g., next(abc) returns a and updates the trace to bc. For each trace σ ∈ T_tr, we execute the following steps until all the alphabets of σ are scanned:
1. Let c = next(σ).
2. Set next-state to shift(current-state, c).
3. Increment the counter for the state current-state and
transition (current-state, next-state).
4. Update current-state to be next-state.
After all the traces in the suite T_tr have been processed, each state and transition has a positive integer associated with it. The probability of the transition (s, s′) is N(s, s′)/N(s), where N(s, s′) and N(s) are the counters associated with the transition (s, s′) and the state s, respectively. In other words, the probability of a transition is the ratio of the frequency of the transition to the frequency of its source in the suite T_tr.
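The construction just described can be sketched in Python. This is a sketch rather than the authors' implementation; we represent states as tuples of symbols and use a sentinel string in place of the null symbol φ.

```python
from collections import defaultdict

PHI = "<phi>"  # stand-in for the null symbol phi used to pad the initial state

def build_markov_chain(training_suite, w):
    """Build the Markov chain of Section 3.1 from a suite of traces.

    States are length-w windows over the symbols; counters N(s) and
    N(s, s') are accumulated, then turned into transition probabilities
    P(s, s') = N(s, s') / N(s).
    """
    state_count = defaultdict(int)
    trans_count = defaultdict(int)
    for sigma in training_suite:
        current = (PHI,) * w          # initial state: all null symbols
        for c in sigma:
            nxt = current[1:] + (c,)  # shift(current, c)
            state_count[current] += 1
            trans_count[(current, nxt)] += 1
            current = nxt
    return {(s, t): n / state_count[s] for (s, t), n in trans_count.items()}

# Example 3.2: training suite {aabc, abcbc} with window size w = 3
P = build_markov_chain([list("aabc"), list("abcbc")], w=3)
```

With this suite the state φφa is left twice, once toward φaa and once toward φab, so each of those transitions gets probability 1/2, matching the counters of Example 3.2.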
Example 3.2 Assume that the training suite T_tr is {aabc, abcbc}. The structure constructed after scanning all the traces in the training suite T_tr is shown in Figure 2. The figure also shows the counters associated with the states and the transitions.
3.2 Markov Chains to Classifiers

Assume that a Markov chain corresponding to a training suite T_tr with window size w has already been constructed. We will denote the Markov chain MC by a 3-tuple (S, P, s_0), where S is the set of states, P : (S × S) → ℝ+ denotes the transition probabilities, and s_0 ∈ S is the initial state. The probability of a transition (s, s′) is denoted by P(s, s′). In order for P to be a valid measure, the following equality should hold for all states s:

∑_{s′ ∈ succ(s)} P(s, s′) = 1

where succ(s) denotes the set of successors of s. The trace of length w associated with the state s is denoted by σ(s). Recall that the trace associated with the initial state s_0 is the trace consisting of all null alphabets φ.
Consider a trace α ∈ Σ⋆. Recall that α[i] denotes the i-th alphabet of the trace α. Let the initial trace β_0 be σ(s_0), i.e., the trace associated with the initial state s_0. The trace after scanning the first symbol α[1] is β_1 = shift(β_0, α[1]). The trace β_k obtained after scanning the k-th symbol is recursively defined as shift(β_{k−1}, α[k]). Hence, a trace α defines a sequence of traces β_0, · · · , β_m (where each trace β_i is of length w and m = |α|). We define a metric µ(α) corresponding to a trace α. This metric will be based on the Markov chain MC = (S, P, s_0) and will be computed iteratively. Initially, let Y and X both be equal to 0.0, and i = 0. Until i equals m, we execute the following steps:
1. There are two cases.

(Case A: β_i → β_{i+1} is a valid transition in MC)
If there are two states s and s′ in the Markov chain MC such that σ(s) = β_i and σ(s′) = β_{i+1}, then update Y and X according to the following equations:

Y = Y + F(s, (s, s′))
X = X + G(s, (s, s′))

(Case B: β_i or β_{i+1} is not a state in MC)
If β_i → β_{i+1} is not a valid transition in MC, then we update Y and X according to the following equations:

Y = Y + Z
X = X + 1

2. Increment i to i + 1.
The metric µ(α) is defined as Y/X at the end of the procedure just described. The procedure we just outlined defines a function µ : Σ⋆ → ℝ. Intuitively, the metric µ(α) measures how well the Markov chain MC predicts the trace α, e.g., a lower µ(α) indicates that the Markov chain predicts the trace α well. Notice that µ is parametrized by the functions F and G and the number Z. Different choices of F and G will result in different classifiers.
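The iterative procedure for µ can be sketched as follows. This is our own sketch: prob is a dictionary of transition probabilities such as the one built in subsection 3.1, and we simplify the signatures of F and G to take the two states plus the probability table.

```python
def mu(alpha, prob, w, F, G, Z, phi="<phi>"):
    """Global metric mu(alpha) of Section 3.2: accumulate Y and X over
    the window sequence beta_0, ..., beta_m and return Y / X."""
    Y = X = 0.0
    beta = (phi,) * w                  # beta_0: the trace of the initial state
    for c in alpha:
        nxt = beta[1:] + (c,)          # beta_{i+1} = shift(beta_i, alpha[i+1])
        if (beta, nxt) in prob:        # Case A: valid transition in MC
            Y += F(beta, nxt, prob)
            X += G(beta, nxt, prob)
        else:                          # Case B: penalize by Z
            Y += Z
            X += 1.0
        beta = nxt
    return Y / X if X else 0.0

# Miss-probability choice of F and G: F = 1 - P(s, s'), G = 1
F_mp = lambda s, t, p: 1.0 - p[(s, t)]
G_mp = lambda s, t, p: 1.0

# Toy chain with w = 1 trained on alternating a, b behavior
toy = {(("<phi>",), ("a",)): 1.0, (("a",), ("b",)): 1.0, (("b",), ("a",)): 1.0}
score_normal = mu("ab", toy, 1, F_mp, G_mp, Z=1.0)  # perfectly predicted trace
score_anom = mu("ax", toy, 1, F_mp, G_mp, Z=1.0)    # one unseen transition
```

On the toy chain the perfectly predicted trace scores 0.0 while the trace with an unseen transition scores 0.5, illustrating that lower µ means a better fit.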
Figure 2. The Markov Structure. (States φφφ, φφa, φaa, φab, aab, abc, bcb, and cbc, together with the counters associated with the states and transitions.)
Assume that we are given a threshold r ∈ ℝ. A classifier f can be constructed from the metric µ in the following manner:

f(α) = 1 if µ(α) ≥ r, and 0 otherwise.

In other words, the trace is classified as "bad" if the metric is above a certain threshold r.
Note: It can be easily seen that the classifier f is on-line (see Definition 3.2). Define a function β : Σ⋆ → ℝ² as β(σ) = (Y, X), where Y and X are the variables defined by the procedure for computing µ. The function H in Definition 3.2 can be easily derived from the description of the procedure to compute µ. The function T is given by T(Y, X) = (Y/X ≥ r). Hence, the classifier f we construct is an on-line classifier. The complexity of the functions β, H, and T is linear in the number of states of the Markov chain.
Discussion: For a trace α, the metric µ(α) depends on the entire trace. Recall that the procedure maintains cumulative values Y and X, and the metric is defined as Y/X. Therefore, we call the metric µ global. A classifier that is based on a global metric is called global. Next, we describe a local classifier. Recall that a trace α corresponds to a sequence of traces β_0, · · · , β_m, where m = |α| and each trace β_i is of length w. Let z be a positive integer. After scanning the i-th symbol α[i] of the trace α, the history of size z is given by (β_k, β_{k+1}, · · · , β_i), where k is the maximum of 0 and i − z + 1. This history corresponds to two vectors (y_k, · · · , y_i) and (x_k, · · · , x_i), where y_j and x_j are defined as follows:

• β_{j−1} → β_j is a valid transition in MC.
Let s and s′ be two states in the Markov chain MC such that σ(s) = β_{j−1} and σ(s′) = β_j. In this case y_j and x_j are defined as F(s, (s, s′)) and G(s, (s, s′)), respectively.

• β_{j−1} or β_j is not a state in MC.
In this case y_j and x_j are defined as Z and 1, respectively.
The local metric γ(α) corresponding to the trace α is defined as

(∑_{i=k}^{m} y_i) / (∑_{i=k}^{m} x_i)

where k = max(0, m − z + 1). In other words, the local metric is computed using the history of the past z symbols. A classifier that is based on a local metric is called a local classifier. We have also implemented local classifiers in our system. However, in our experiments, global classifiers consistently out-performed their local counterparts. Therefore, we have not presented the results for the local classifiers.

Next, we discuss some common functions F and G. Depending on the choice of these functions we obtain different classifiers.
• The miss-probability metric
In this case the functions F and G are defined as follows:

F(s, (s, s′)) = ∑_{s_1 ∈ succ(s) ∧ s_1 ≠ s′} P(s, s_1)

G(s, (s, s′)) = ∑_{s_1 ∈ succ(s)} P(s, s_1) = 1

The function F(s, (s, s′)) adds the probabilities of all the transitions from the state s that are not equal to (s, s′). In other words, if a low-probability transition is taken while scanning a trace, then F(s, (s, s′)) has a higher value.
• The miss-rate metric

Pmax(s) = max_{s′ ∈ succ(s)} P(s, s′)
F(s, (s, s′)) = (P(s, s′) ≠ Pmax(s))
G(s, (s, s′)) = 1

In this case every transition that is not equal to the "maximal" transition (the transition with the maximum probability) is penalized equally by 1.
• The local-entropy-reduction metric
Given a state s, the local entropy of the state (denoted by LE(s)) is

LE(s) = ∑_{s′ ∈ succ(s)} −P(s, s′) log(P(s, s′))

The entropy of a Markov chain MC = (S, P, s_0) is given by ∑_{s ∈ S} π(s)LE(s), where π(s) is the steady-state probability of being in state s. Descriptions of Markov chains and their entropy can be found in [7] and [4], respectively. The functions F and G in this case are defined as:

F(s, (s, s′)) = LE(s) + P(s, s′) log(P(s, s′))
G(s, (s, s′)) = LE(s)

For a transition (s, s′), the expression LE(s) + P(s, s′) log(P(s, s′)) evaluates to

∑_{s_1 ∈ succ(s) ∧ s_1 ≠ s′} −P(s, s_1) log(P(s, s_1))

In other words, F(s, (s, s′)) denotes the "residual" local entropy of the state s after deleting the transition (s, s′), or the local entropy reduction due to taking the transition (s, s′).
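The three choices of F and G above can be sketched as follows, assuming the transition probabilities are stored in a dictionary keyed by (s, s′) pairs; the function names are ours.

```python
import math

def successors(prob, s):
    """succ(s): states reachable from s with positive probability."""
    return [t for (u, t) in prob if u == s]

def miss_probability_F(prob, s, t):
    # Sum of probabilities of all transitions out of s other than (s, t)
    return sum(prob[(s, u)] for u in successors(prob, s) if u != t)

def miss_rate_F(prob, s, t):
    # 1 whenever (s, t) is not the maximal-probability transition
    p_max = max(prob[(s, u)] for u in successors(prob, s))
    return float(prob[(s, t)] != p_max)

def local_entropy(prob, s):
    # LE(s) = sum over successors s' of -P(s, s') log P(s, s')
    return sum(-prob[(s, u)] * math.log(prob[(s, u)])
               for u in successors(prob, s))

def entropy_reduction_F(prob, s, t):
    # Residual local entropy after removing the transition (s, t)
    return local_entropy(prob, s) + prob[(s, t)] * math.log(prob[(s, t)])

# A state with two equally likely successors
prob = {("s", "a"): 0.5, ("s", "b"): 0.5}
```

In all three cases G is either constantly 1 or LE(s), so only F distinguishes how strongly a taken transition is penalized.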
3.3 Performance Metrics

In this section, we discuss various performance metrics for evaluating the effectiveness of a classifier f : Σ⋆ → B. Assume that we have constructed a classifier f using a training suite T_tr. Let T_te and T_an denote the test and anomalous suites, respectively. Given a trace σ, the number of alarms that a classifier f generates (denoted by alarm(σ, f)) is given by the following expression:

alarm(σ, f) = ∑_{i=1}^{|σ|} f(σ^i)

In other words, the number of alarms is the number of 1's the classifier f generates while scanning the trace σ. The test suite T_te consists of normal traces. The percentage of false alarms corresponding to the test suite T_te and the classifier f is given by the expression

(∑_{σ ∈ T_te} alarm(σ, f)) / (∑_{σ ∈ T_te} |σ|)

The percentage of false alarms a classifier f generates on a test suite T_te is denoted by FA(T_te, f). A high percentage of false alarms is undesirable because every alarm in an IDS has to be attended to by a system administrator. In our experience, when a system administrator encounters a high percentage of false alarms, they turn off the IDS.
For anomalous traces, a good classifier f should generate an alarm quickly. Therefore, for anomalous traces we define the mean time to first alarm, or MTFA. Given an anomalous trace σ, the classifier f will generate its first alarm on σ after scanning ⌈MTFA ⋆ |σ|⌉ symbols on average. Given a classifier f and a trace σ, let alarm_1(σ, f) be defined as

min{i | f(σ^i) = 1}.

In other words, while scanning the trace σ, the classifier f generates its first alarm after reading the alarm_1(σ, f)-th symbol. The MTFA corresponding to an anomalous suite T_an and a classifier f (denoted by MTFA(T_an, f)) is given by the following expression:

MTFA(T_an, f) = (∑_{σ ∈ T_an} alarm_1(σ, f)/|σ|) / |T_an|

In the expression given above, |T_an| denotes the number of traces in T_an, i.e., the size of the suite. Intuitively speaking, the metric MTFA measures how fast a classifier f detects an anomalous trace.

One can generalize the metric MTFA to the mean time to k-th alarm, or MTkA, i.e., the time at which the classifier generates the k-th alarm. However, in this paper we will only consider MTFA.
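The metrics alarm, FA, and MTFA can be sketched as follows, treating a classifier as a Python function over prefixes; these helper names are ours, and the sketch rescans each prefix rather than running the classifier on-line.

```python
def alarms(trace, f):
    """alarm(sigma, f): number of 1s output while scanning sigma,
    i.e. the sum of f over all prefixes sigma^i."""
    return sum(f(trace[:i]) for i in range(1, len(trace) + 1))

def false_alarm_rate(test_suite, f):
    """FA(T_te, f): total alarms on normal traces over total trace length."""
    return sum(alarms(s, f) for s in test_suite) / sum(len(s) for s in test_suite)

def mtfa(anomalous_suite, f):
    """MTFA(T_an, f): average over T_an of alarm_1(sigma, f) / |sigma|."""
    total = 0.0
    for s in anomalous_suite:
        first = next(i for i in range(1, len(s) + 1) if f(s[:i]) == 1)
        total += first / len(s)
    return total / len(anomalous_suite)

# Toy on-line classifier: alarm as soon as the prefix contains an "x"
f = lambda prefix: int("x" in prefix)
```

For the toy classifier, a normal trace "aaaa" yields a false-alarm rate of 0, and an anomalous trace "aaxx" first alarms after 3 of its 4 symbols, giving an MTFA of 0.75.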
3.4 Tuning Parameters

Recall that in our method there are five exogenous parameters in constructing a classifier from a training suite T_tr. These parameters are:

• the window size w,
• the functions F(s, (s, s′)) and G(s, (s, s′)),
• a real number Z,
• and a threshold r.

Assume that the functions F and G and the real number Z have already been determined. We discuss how to decide the values of the parameters w and r.
As the window size w increases, the Markov chain constructed is a better model of the training suite T_tr. However, for a very large w the Markov chain MC models the training suite "too well". This is the classical over-fitting problem in statistics. Let µ(σ) be the metric defined earlier. Intuitively, µ(σ) is a measure of the discrepancy between the Markov chain MC and the trace σ, i.e., a lower µ(σ) denotes a better fit. The discrepancy D(T_te) over the entire test suite T_te is

D(T_te) = (∑_{σ ∈ T_te} µ(σ)) / (∑_{σ ∈ T_te} |σ|)

The discrepancy D(T_an) corresponding to the anomalous suite T_an is defined in a similar manner. We keep increasing the window size w until the separation D(T_an) − D(T_te) between the anomalous and test suites is above a certain threshold. This means that the classifier is able to "discriminate" between the anomalous and normal traces. We will provide a detailed account of this in the experimental section.
Assume that the window size w has been determined. Next we describe how to set the threshold r. If the threshold r is low, then on an anomalous trace the classifier f will generate an alarm very quickly. Therefore, a low r means a lower MTFA. However, a lower threshold also generates more false alarms on normal traces, which is inconvenient for the system administrator. Therefore, there is a tradeoff in setting the threshold r. We set the threshold r so that the percentage of false alarms corresponding to the test suite and the MTFA of the anomalous suite are both below an acceptable level. Details on tuning the parameters w and r can be found in the experimental section.
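The window-size selection described above can be sketched as follows. All names here are ours and the separation threshold is illustrative; build is assumed to train a chain on T_tr with window w and return a scoring function σ ↦ µ(σ).

```python
def discrepancy(suite, score):
    """D(T) of Section 3.4: sum of mu over the suite, divided by total length."""
    return sum(score(s) for s in suite) / sum(len(s) for s in suite)

def choose_window(t_tr, t_te, t_an, build, separation=0.1, max_w=10):
    """Increase w until D(T_an) - D(T_te) exceeds the separation threshold."""
    for w in range(1, max_w + 1):
        score = build(t_tr, w)  # hypothetical trainer: returns sigma -> mu(sigma)
        if discrepancy(t_an, score) - discrepancy(t_te, score) > separation:
            return w
    return max_w

# Toy stand-in for training: score counts "unexpected" symbols, scaled by w
build = lambda t_tr, w: (lambda s: w * s.count("x") / max(len(s), 1))
```

The threshold r would then be swept in the same style, checking FA on T_te and MTFA on T_an at each candidate value.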
4 Generating Anomalous Patterns from Classifiers

Most intrusion detection systems that are commercially available are signature-based systems. Recall that a signature-based IDS relies on a table T of anomalous patterns. Whenever a suffix of a trace of an activity matches a pattern in the table T, an alarm is raised. Patterns in the table T correspond to common attacks, or simply patterns that do not represent "normal" behavior of the system. Classifiers that are constructed from Markov chains (such as the one discussed in Section 3) are much more general than simply a table of patterns. However, they do not fit the architecture of currently available systems. In this section we address this mismatch. We describe an algorithm that constructs a table of anomalous patterns from a Markov chain constructed from a training suite T_tr.
Recall that a Markov chain MC is a 3-tuple (S, P, s_0), where S is the set of states, P denotes the transition probability, and s_0 is the initial state. The reader should refer to subsection 3.1 for an explanation of these terms.

The metric µ : Σ⋆ → ℝ defines a total order on the traces, i.e., σ_1 ≥ σ_2 if and only if µ(σ_1) ≥ µ(σ_2). A procedure for computing this metric was given in subsection 3.2. In the procedure described in subsection 3.2, we started from the initial state s_0. If the procedure is started from a state s of the Markov chain, it defines a new metric µ_s : Σ⋆ → ℝ. Intuitively, µ_s is the metric obtained when the initial state of the Markov chain is taken to be s.
Recall that each state s is associated with a unique trace σ(s) of length w. Therefore, state and trace will be used synonymously throughout this section. Moreover, let Σ = {x_1, · · · , x_m} be the alphabet set. In our context, Σ is the set of system calls that appear in the various suites. Our algorithm is based on the function pattern(s, L, k). This function returns a set of traces W ⊆ Σ⋆ that satisfies the following conditions:

• each trace in W is of length less than or equal to L,
• the number of traces in W is less than or equal to k,
• and for all traces σ ∈ Σ⋆ \ W, if |σ| ≤ L, then there exists a trace σ′ ∈ W such that µ_s(σ) ≤ µ_s(σ′).

In other words, W represents a set of k traces of length less than or equal to L that have the "highest" µ_s-values, i.e., the k "worst" traces of length less than or equal to L. We give a recursive definition for the function pattern(·, ·, ·). The recursion is on the second parameter L.

(Base case): L = 0
In this case, pattern(s, 0, k) = {ϵ}. In other words, for L = 0 the set of traces/patterns only contains the empty trace ϵ.
(Recursion): L > 0
Let s_i be the state such that the trace σ(s_i) corresponding to it is shift(σ(s), x_i). In other words, the trace corresponding to s_i is constructed by shifting the trace associated with s and appending the i-th alphabet x_i. Assume that the sets of traces pattern(s_i, L − 1, k) have been computed for all the states. We give a definition of the set pattern(s, L, k) in terms of the sets

pattern(s_1, L − 1, k), · · · , pattern(s_m, L − 1, k).

For the sake of brevity we will denote the set pattern(s_i, L − 1, k) by W_i. Suppose the transition s → s_i corresponds to the alphabet x_i ∈ Σ, i.e., the trace associated with s_i is shift(σ(s), x_i), where σ(s) is the trace associated with the state s. We construct the set x_i · W_i, which is formally defined as

{x_i · σ | σ ∈ W_i}

In other words, we add x_i at the front of every trace in W_i. After that, we compute the metric µ_s for every trace in the sets x_i · W_i (1 ≤ i ≤ m). Next we construct a set of traces R given by the following expression:

R = pattern(s, L − 1, k) ∪ (⋃_{i=1}^{m} x_i · W_i)

Traces in R are sorted by their values according to the function µ_s. Intuitively, we sort the traces by the value that denotes how well the Markov chain MC models the trace, where the traces with "bad fits" come first. The set pattern(s, L, k) is defined as the first k traces in the sorted order.
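The recursion for pattern(s, L, k) can be sketched directly in Python. This is an unmemoized sketch with our own names: mu_from(s, trace) plays the role of µ_s, shift maps a state and a symbol to the successor state, and each candidate trace is rescored from scratch instead of using an incremental Y, X update.

```python
def make_mu(prob, Z=1.0):
    """mu_s via the miss-probability metric: F = 1 - P(s, s'), G = 1;
    unknown transitions are penalized by Z (a simplified stand-in)."""
    def mu_from(s, trace):
        if not trace:
            return 0.0
        Y = X = 0.0
        cur = s
        for x in trace:
            nxt = (x,)            # window size w = 1 for this toy example
            Y += 1.0 - prob[(cur, nxt)] if (cur, nxt) in prob else Z
            X += 1.0
            cur = nxt
        return Y / X
    return mu_from

def patterns(s, L, k, alphabet, mu_from, shift):
    """pattern(s, L, k): k traces of length <= L with the highest mu_s."""
    if L == 0:
        return [()]               # base case: only the empty trace
    candidates = set(patterns(s, L - 1, k, alphabet, mu_from, shift))
    for x in alphabet:
        s_i = shift(s, x)         # state whose trace is shift(sigma(s), x)
        for w in patterns(s_i, L - 1, k, alphabet, mu_from, shift):
            candidates.add((x,) + w)   # prepend x to every trace in W_i
    ranked = sorted(candidates, key=lambda t: mu_from(s, t), reverse=True)
    return ranked[:k]             # the k traces the chain models worst

# Toy chain with w = 1: from state ("0",) only "a" is expected, then "a" repeats
prob = {(("0",), ("a",)): 1.0, (("a",), ("a",)): 1.0}
mu_from = make_mu(prob)
worst = patterns(("0",), 2, 2, ["a", "b"], mu_from, lambda s, x: (x,))
```

On this toy chain every returned pattern starts with the unexpected symbol b, the trace the chain fits worst. With memoization on (s, L) and the incremental Y, X bookkeeping, the same recursion becomes efficient.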
Correctness: Given a trace σ with |σ| ≤ L that is not in pattern(s, L, k), we prove that there exists a trace σ′ ∈ pattern(s, L, k) such that µ_s(σ) ≤ µ_s(σ′). We prove the result by induction on L. The result for L = 0 is obvious; after all, there is only one trace of length 0, namely the empty trace ϵ. Assume that the result is true for L − 1, where L ≥ 1. If |σ| ≤ L − 1, then by the induction hypothesis there exists σ′ ∈ pattern(s, L − 1, k) such that µ_s(σ) ≤ µ_s(σ′). Notice that we construct the set pattern(s, L, k) by sorting the following set according to the µ_s metric:

pattern(s, L − 1, k) ∪ (⋃_{i=1}^{m} x_i · W_i)

Therefore, there exists a trace σ′′ ∈ pattern(s, L, k) such that µ_s(σ′) ≤ µ_s(σ′′), and hence the result is proved. Suppose |σ| = L. Let the first alphabet of σ be x_i, and let s_i be the state corresponding to shift(σ(s), x_i). As before, let W_i be the set pattern(s_i, L − 1, k). Let α be the suffix of σ starting at the second symbol. Since |α| = L − 1, by the induction hypothesis we have the following two cases:

Case: α ∈ pattern(s_i, L − 1, k)
The trace x_i · α is in the following set:

pattern(s, L − 1, k) ∪ (⋃_{i=1}^{m} x_i · W_i)

Since this set is sorted using the metric µ_s, the result follows.

Case: α ∉ pattern(s_i, L − 1, k)
In this case there exists an α′ ∈ pattern(s_i, L − 1, k) such that µ_{s_i}(α) ≤ µ_{s_i}(α′). From the nature of the metric µ_s the following inequality easily follows:

µ_s(x_i · α) ≤ µ_s(x_i · α′)

Notice that x_i · α′ is in the set x_i · W_i, and the result follows.
Improving efficiency: First, notice that µ_s(x · σ) can be efficiently computed (for the metrics given in this paper, this can be done in O(1) time) from the transition probabilities and µ_{s′}(σ), where the transition (s, s′) corresponds to the symbol x. Recall the procedure for computing the metric µ. Suppose we are given the Y and X corresponding to the trace σ when the initial state is s′. Using this information, Y and X for x · σ can be computed in O(1) time. Let the Y and X values for the trace σ corresponding to the initial state s′ be Y_1 and X_1. If s → s′ is a valid transition in the Markov chain MC, then the Y and X values for the trace x · σ are

Y = Y_1 + F(s, (s, s′))
X = X_1 + G(s, (s, s′)).

Otherwise, Y and X are given by

Y = Y_1 + Z
X = X_1 + 1.

The value of the metric µ_s on the trace x · σ is simply Y/X. Therefore, if for each trace σ in the set pattern(s, L, k) we keep the Y and X values of the trace, the recursive step becomes very efficient. Moreover, once we compute pattern(s, L, k), we can store that value and re-use it (this is called memoization [3]). The number of traces (denoted by N) of length ≤ k over the alphabet Σ is (|Σ|^{k+1} − 1)/(|Σ| − 1). It can be easily checked that the algorithm with these modifications runs in O(NLk) steps, where each step can take O(|Σ|k log(|Σ|k)) time. Notice that the worst-case complexity of the algorithm is exponential in the size of the alphabet. We suspect that in practice the algorithm will run much faster. The computational complexity of the problem and heuristics for improving efficiency are left for future work.
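The O(1) recursive step can be sketched as follows. The instantiation F = 1 − p and G = 1 is an assumed miss-probability-style choice, not necessarily the paper's exact F and G; the toy chain and penalty Z are likewise ours.

```python
# Incremental update of the running sums (Y, X) when a symbol x is prepended
# to a trace sigma whose values (Y1, X1) from the successor state are known.
TRANS = {("a", "a"): 0.9, ("a", "b"): 0.1, ("b", "a"): 0.5, ("b", "b"): 0.5}
Z = 1.0  # penalty added to Y when s -> s' is not a valid transition

def extend(s, x, y1, x1):
    """Return (Y, X) for x·sigma from state s, given (Y1, X1) for sigma."""
    p = TRANS.get((s, x), 0.0)
    if p > 0.0:               # s -> s' is a valid transition in the chain
        return y1 + (1.0 - p), x1 + 1.0
    return y1 + Z, x1 + 1.0   # otherwise: Y = Y1 + Z, X = X1 + 1

def mu_from(y, x):
    """The metric value on the extended trace is simply Y/X."""
    return 0.0 if x == 0 else y / x
```

Keeping (Y, X) alongside each trace in pattern(s, L, k) makes each recursive extension a single call to `extend`, constant work per symbol.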
The algorithm given above generates a set of anomalous traces from a Markov chain trained on a suite of normal traces. Suppose we train a Markov chain MC_an using a suite of anomalous traces. In this case we want to generate a set of patterns or traces that “fit” the Markov model MC_an well. This can be easily done using a variant of the algorithm given above. Instead of sorting in descending order (according to the value of the metric µ_s), we sort in ascending order. Therefore, we obtain traces that have a low µ-value, i.e., fit the Markov model MC_an best.
5 Experimental Results
The suites in this study were obtained from a large body of work performed by researchers in the Computer Science Department, University of New Mexico, towards evaluating anomaly-detection algorithms. The original experiments that employed these suites and their results were published in [26]. The data consists of traces of eight privileged UNIX programs and a trace of an operating anomaly detector called STIDE. A trace is a sequence of system calls issued by a single process during its execution. Privileged programs were targeted for monitoring since their misuse can cause the greatest harm to the host. It should be noted that only the system calls were recorded in these traces; parameters passed to the system calls were ignored. Nevertheless, the dataset comprises data obtained from a wide variety of circumstances, e.g., data obtained from monitoring the normal usage of a program in the field, data obtained from monitoring programs that run as daemons and those that do not, programs that vary widely in their size and complexity, and different kinds of intrusions. Importantly, to provide some insight as to how an algorithm would perform in the field, the body of data consists of a combination of live and synthetic traces, where
• live is defined to be traces of programs collected dur-
ing normal usage of a production computer system,
and
• synthetic is defined to be traces collected in produc-
tion environments by running a prepared script; the
program options are chosen solely for the purpose of
exercising the program, and not to meet any real user’s
requests.
Anomalous data was obtained by recording the system
calls used by the monitored programs while an intrusion
was present. The intrusions were obtained from public ad-
visories posted on the Internet. More detailed information
about the kinds of intrusions used or the dataset can be
found in [26]. The table shown in Figure 3 lists the pro-
grams monitored for normal behavior and the names of the
corresponding intrusions deployed to obtain intrusive data.
Figures 4, 5, and 6 show the difference between the discrepancies of the anomalous and test suites corresponding to the three classifiers. The first column of each table corresponds to the number of the suite. Subsequent columns show the results for various window sizes used to construct the Markov chain. For example, the second column shows the results when a window size of 10 was used to construct the Markov chain. The first component in each entry corresponds to the test suite and the second component shows the results for the anomalous suite. For example, for set 2, window size 10, and the first classifier, the discrepancies for the test and anomalous suites are 0.0448 and 3.5358 respectively. For suites where there was only one normal trace, we could not leave out traces to construct a test suite. Therefore, in these cases we mark the component corresponding to the test suite by a “*”. Let D_tt and D_an be the discrepancies corresponding to the test and anomalous suites respectively. Ideally, we want D_an − D_tt to be large. This means that the classifier does well on the test suite and worse on the anomalous suite. Observing the data, the following points can be made:
• In all the suites except 5 and 8, we start observing a difference in the discrepancies right away.
• For suite 5, the difference in the discrepancies becomes pronounced after a window size of 40.
• For suite 8, the difference becomes significant after a window size of 45.
• Classifiers based on the miss-probability and miss-rate metrics outperform the one based on the local-entropy metric. The results for the classifiers based on miss-probability and miss-rate are comparable.
For each suite we set the window size such that the difference between the discrepancies D_an and D_tt is above an acceptable level. Therefore, we set the window size for suite 8 to 45. For all other suites, the window size is set to 40. Initially the threshold r_0 was set to (D_an + D_tt)/2, i.e., the average of the discrepancies for the anomalous and test suites. Suites 1 and 6 only have one normal trace, so in these cases D_tt is not available. For these cases we set the initial threshold r_0 to 0.8·D_an. After that, we conducted several experiments with thresholds around the initial threshold. A lower threshold means that the MTFA for the anomalous suite will be low, i.e., we will generate an alarm earlier on an anomalous trace. However, a lower threshold also means that the percentage of false alarms for the test suite will be large. Therefore, there is a tradeoff in setting the value of the threshold r. We conducted extensive experiments with several thresholds. However, for the sake of brevity we only report results for three threshold values and the classifier based on the miss-probability metric. Figure 7 shows experimental results for 0.8r_0, r_0, and 1.2r_0, where r_0 is the initial threshold. The following observations can be made from the data:
• The percentage of false alarms decreases as the thresh-
old increases.
• The mean time to first alarm (or MTFA) increases as
the threshold increases.
• A few suites do not show the above-mentioned behavior (see the results for suite 8).
Dataset            | System      | Normal                          | Anomalous                   | Intrusion
1: Synthetic xlock | Linux       | 71 traces, 339,177 calls        | 2 traces, 949 calls         | buffer overflow
2: Live xlock      | Linux       | 1 trace, 16,598,639 calls       | 2 traces, 949 calls         | buffer overflow
3: Live ”named”    | Linux       | 27 traces, 9,230,572 calls      | 2 traces, 1,800 calls       | buffer overflow
4: Live login      | Linux       | 12 traces, 8,894 calls          | 9 traces, 4,853 calls       | trojanized login program
5: Live ps         | Linux       | 12 traces, 6,144 calls          | 26 traces, 6,968 calls      | trojanized ps program
6: Live inetd      | Linux       | 3 traces, 541 calls             | 31 traces, 8,371 calls      | denial of service
7: Live lpr (MIT)  | SunOS 4.4.1 | 2,703 traces, 2,926,304 calls   | 1,002 traces, 169,252 calls | symbolic link attack
8: Live lpr (UNM)  | SunOS 4.4.1 | 4,298 traces, 2,027,468 calls   | 1,001 traces, 168,236 calls | symbolic link attack
9: Live STIDE      | Linux       | 71,760 traces, 44,500,219 calls | 105 traces, 205,935 calls   | denial of service

Figure 3. Explanation of data sets
• False alarm rates for all suites except 5 and 8 are very low. Recall that for suites 5 and 8 the difference between the discrepancies was not very large.
Remark: If a classifier generates an alarm while scanning an anomalous trace, then that trace is classified as anomalous. So the hit rate for a classifier can be defined as the percentage of traces in the anomalous suite T_an that are classified as anomalous. In our experiments, we discovered that the hit rate was always close to 100%. Therefore, we do not consider hit rate a good metric for the classifiers considered in this paper. The percentage of false alarms and the mean time to first alarm are better measures of effectiveness. However, we believe that the problem of devising metrics for quantifying classifiers has received scarce attention in the intrusion detection literature.
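The hit-rate definition in this remark can be written down directly; this is a sketch, and `alarm_anywhere` and the per-trace score lists are hypothetical stand-ins for the classifier's output.

```python
# A trace counts as a hit if the classifier alarms anywhere along it; the hit
# rate is then the percentage of such traces in the anomalous suite T_an.
def alarm_anywhere(scores, r):
    return any(d > r for d in scores)

def hit_rate(suite, r):
    """Percentage of traces in `suite` classified as anomalous at threshold r."""
    return 100.0 * sum(alarm_anywhere(t, r) for t in suite) / len(suite)
```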
6 Related Work
We only compare our work to intrusion detection schemes that are based on statistical anomaly detection. As already pointed out, the first example of such an intrusion detection system was IDES [15]. IDES was a large-scale system and considered several classes of events. We only considered traces of system calls. Therefore, a direct comparison of IDES and our technique is not possible. We compare our technique with some recent statistical anomaly detection schemes.

Nassehi [18] describes an anomaly detection scheme based on Markov chains. However, his anomaly detection algorithm is different from the one presented in this paper. Nassehi constructs a Markov chain by using a window of size one. Let n_S be the vector of normalized frequencies of transitions in the Markov chain based on the sample used for training. An ongoing activity is traced through the Markov chain by maintaining a history over a certain window. Let n_h be the vector of normalized frequencies of transitions for the history. An alarm is raised if the following condition holds:

(n_h − n_S) D (n_h − n_S)^T > r

The matrix D and threshold r are described in detail in [18].
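For concreteness, this alarm condition can be sketched as the following quadratic-form test; the weighting matrix D here is an arbitrary placeholder, not the matrix derived in [18].

```python
# Evaluate (n_h - n_S) D (n_h - n_S)^T > r for frequency vectors n_h, n_S
# and a weighting matrix D given as a list of rows.
def nassehi_alarm(n_h, n_s, D, r):
    d = [a - b for a, b in zip(n_h, n_s)]
    quad = sum(d[i] * D[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return quad > r
```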
If R and S are the number of transitions and states in the Markov chain, then each time a symbol is scanned the time required is 5(R − S) + 1. This seems prohibitively expensive for large intrusion detection systems. In contrast, our anomaly detection scheme requires constant time for each symbol that is scanned. Warrender et al. [26] used Hidden Markov Models [8] (or HMMs) as the underlying model. The results reported in [26] seem comparable to the ones presented in this paper. However, the training algorithm for HMMs is very expensive, i.e., it runs in time O(nm^2), where n is the number of states in the HMM and m is the size of the trace. In contrast, the training time for Markov chains is O(m). Moreover, for each symbol (or in this case system call) their method takes O(n^2) time. Our anomaly detector takes constant time to process each symbol, and therefore can generate alarms faster. We have also undertaken a rigorous study of HMMs for anomaly detection. This work is ongoing and we will report on the results at a future date. An instance-learning-based algorithm for anomaly detection is described in [12]. Their algorithm is used for masquerade detection, i.e., detecting a user acting as another user. We have started investigating techniques for extending our algorithm to masquerade detection. An application of anomaly detection to uncovering stealthy portscans is described in [23]. We are currently devising an algorithm similar to the one presented in this paper for detecting stealthy port scans.
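The O(m) training claim for Markov chains can be seen from a one-pass sketch like the following; this is a hypothetical helper, not the library code mentioned in Section 7.

```python
from collections import Counter

# Train a window-size-one Markov chain from a trace of system calls by
# counting transitions and normalizing: a single pass over the trace, O(m).
def train_chain(trace):
    pair_counts = Counter(zip(trace, trace[1:]))
    state_counts = Counter(trace[:-1])
    return {(s, t): c / state_counts[s] for (s, t), c in pair_counts.items()}
```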
Section 4 describes an algorithm for generating anoma-
lous patterns from Markov models. We have not imple-
mented this algorithm. Therefore, we cannot provide a
detailed comparison of our algorithm with other such al-
gorithms. A data mining based technique for generating
anomalous patterns or rules is described in [13]. An intru-
sion detection system based on information retrieval tech-
niques is described in [2]. An approach for discovering
anomalous patterns or rules based on inductive learning is
presented in [24]. Once the implementation of our algo-
rithm is finished, we will perform a detailed comparison of
our approach with other techniques.
There are obvious connections between hypothesis testing [10] and intrusion detection. Let M be the Markov model constructed from the test suite. Given a trace σ, we are testing the hypothesis that σ is generated from the distribution implied by the Markov model M. If we reject the hypothesis, then the trace σ is an anomalous trace. An algorithm for hypothesis testing that does not have an a priori bound on the number of trials is the sequential probability ratio test developed by Abraham Wald [25]. We plan to investigate the literature on hypothesis testing in more detail in order to find connections with our algorithm.
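As an illustration of this connection, Wald's sequential probability ratio test can be sketched for two simple Bernoulli hypotheses; the hypotheses and error rates below are illustrative and are not tied to our metric.

```python
from math import log

# Wald's SPRT for H0: p = p0 versus H1: p = p1 over a stream of 0/1
# observations; it stops as soon as the log-likelihood ratio crosses
# thresholds derived from the desired error rates alpha and beta.
def sprt(observations, p0, p1, alpha=0.05, beta=0.05):
    lo, hi = log(beta / (1.0 - alpha)), log((1.0 - beta) / alpha)
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        llr += log(p1 / p0) if x else log((1.0 - p1) / (1.0 - p0))
        if llr >= hi:
            return "reject H0", n   # evidence favors the alternative model
        if llr <= lo:
            return "accept H0", n
    return "continue", len(observations)
```

Unlike a fixed-sample test, the number of observations consumed is itself data-dependent, which matches the online setting of intrusion detection.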
7 Future work
There are several directions for future work. We want to improve the space and time efficiency of the algorithms used in our methodology. We want to investigate the computational complexity of the algorithm for generating anomalous patterns described in Section 4. Moreover, we want to devise heuristics for improving the efficiency of this algorithm. Currently, we only generate models from traces of system calls. We want to investigate other events related to system activity, such as audit-trail data and network traffic. The Markov-chain-based classifier described in this paper has been implemented as part of a library being developed at the University of Wisconsin. We want to in-
Number | w=10            | w=20            | w=30            | w=40            | w=45
1      | (*,3.6392)      | (*,3.9473)      | (*,4.0402)      | (*,4.0402)      | (*,4.0402)
2      | (0.0448,3.5358) | (0.0842,3.7385) | (0.1857,3.7575) | (0.3128,3.7575) | (0.38139,3.7575)
3      | (0.0039,1.8943) | (0.0039,2.3181) | (0.0038,2.5379) | (0.0038,2.6868) | (0.0038,2.7194)
4      | (0.2637,0.7097) | (0.2543,0.9034) | (0.2522,0.9774) | (0.2522,1.0361) | (0.2522,1.0643)
5      | (0.8537,0.5911) | (1.1473,1.0536) | (1.3531,1.4662) | (1.3504,1.8536) | (1.3506,2.0343)
6      | (*,0.1100)      | (*,0.1212)      | (*,0.1283)      | (*,0.1346)      | (*,0.1377)
7      | (0.0408,0.1984) | (0.0338,0.7268) | (0.0334,0.8092) | (0.0375,0.8132) | (0.0427,0.8205)
8      | (0.0301,0.0532) | (0.0297,0.0467) | (0.0293,0.0380) | (0.0335,0.0452) | (0.0352,0.1045)
9      | (0.0439,0.2501) | (0.0191,0.5673) | (0.0176,0.7684) | (0.0168,0.9030) | (0.0171,0.9524)

Figure 4. Results for miss-probability metric
Number | w=10            | w=20            | w=30            | w=40            | w=45
1      | (*,3.6392)      | (*,3.9473)      | (*,4.0402)      | (*,4.0402)      | (*,4.0402)
2      | (0.0310,3.5318) | (0.0771,3.7364) | (0.1808,3.7575) | (0.3087,3.7575) | (0.3766,3.7575)
3      | (0.0016,1.8978) | (0.0016,2.3148) | (0.0016,2.5364) | (0.0016,2.6852) | (0.0016,2.7178)
4      | (0.0073,0.5510) | (0.0006,0.7468) | (0.0006,0.8218) | (0.0006,0.8806) | (0.0006,0.9087)
5      | (0.8475,0.5765) | (1.1411,1.0412) | (1.3457,1.4553) | (1.3439,1.8453) | (1.3439,2.0260)
6      | (*,0.1034)      | (*,0.1146)      | (*,0.1218)      | (*,0.1281)      | (*,0.1312)
7      | (0.0273,0.1600) | (0.0257,0.7102) | (0.0269,0.8017) | (0.0312,0.8067) | (0.036,0.8076)
8      | (0.0215,0.0485) | (0.0212,0.0432) | (0.0232,0.0333) | (0.0270,0.0381) | (0.0293,0.0971)
9      | (0.0248,0.2434) | (0.0120,0.5637) | (0.0112,0.7657) | (0.0110,0.9004) | (0.0114,0.9499)

Figure 5. Results for miss-rate metric
Number | w=10            | w=20            | w=30            | w=40            | w=45
1      | (*,3.6392)      | (*,3.9473)      | (*,4.0402)      | (*,4.0402)      | (*,4.0402)
2      | (0.0572,3.5311) | (0.0907,3.7360) | (0.1920,3.7554) | (0.3191,3.7554) | (0.3872,3.7554)
3      | (0.0127,1.8978) | (0.0127,2.3145) | (0.0125,2.5386) | (0.0125,2.6881) | (0.0122,2.7209)
4      | (0.1891,0.6668) | (0.1796,0.8605) | (0.1780,0.9348) | (0.1779,0.9935) | (0.1779,1.0216)
5      | (0.8524,0.6036) | (1.1442,1.0627) | (1.3470,1.4730) | (1.3436,1.8582) | (1.3439,2.0385)
6      | (*,0.1226)      | (*,0.1336)      | (*,0.1406)      | (*,0.1466)      | (*,0.1496)
7      | (0.0384,0.1811) | (0.0332,0.7177) | (0.0324,0.8012) | (0.038,0.8041)  | (0.0435,0.8124)
8      | (0.0317,0.0351) | (0.0298,0.0287) | (0.0292,0.0252) | (0.0339,0.0360) | (0.0362,0.0988)
9      | (0.0646,0.2565) | (0.0323,0.5747) | (0.0274,0.7726) | (0.0264,0.9069) | (0.0263,0.9561)

Figure 6. Results for local-entropy-reduction metric
Number | 0.8r_0: FA% | 0.8r_0: MTFA% | r_0: FA% | r_0: MTFA% | 1.2r_0: FA% | 1.2r_0: MTFA%
1      | *     | 39.46 | *     | 53.38 | *     | 82.92
2      | 0.0   | 39.35 | 0.0   | 42.09 | 0.0   | 48.84
3      | 0.0   | 6.68  | 0.0   | 7.21  | 0.0   | 7.75
4      | 0.0   | 15.77 | 0.0   | 8.40  | 0.0   | 8.54
5      | 10.67 | 16.53 | 9.11  | 18.06 | 7.03  | 19.81
6      | *     | 0.45  | *     | 0.45  | *     | 0.45
7      | 0.09  | 90.38 | 0.08  | 91.56 | 0.07  | 93.33
8      | 3.87  | 50.04 | 10.59 | 84.63 | 6.13  | 84.63
9      | 0.07  | 14.29 | 0.04  | 14.49 | 0.025 | 14.87

Figure 7. Results for various thresholds
corporate other statistical models, such as Hidden Markov
Models [8], into this library. Our goal is to incorporate a
wide variety of classification techniques and test their ef-
fectiveness in the context of intrusion detection. We also
plan to implement our algorithms in existing intrusion de-
tection systems, such as Snort [22] and Bro [21]. We are
also interested in investigating specialized applications of
our statistical anomaly detection algorithm, such as detect-
ing stealthy port scans [23].
8 Conclusion
In this paper, we presented an anomaly detection algo-
rithm based on Markov chains. We presented a general
framework for constructing classifiers from Markov chains
and presented three specific classifiers based on this frame-
work. Performance metrics to test these classifiers were also
defined. Experimental results clearly demonstrated the ef-
fectiveness of our approach. Moreover, these classifiers can
be easily incorporated into an intrusion detection system.
This paper opens several avenues for future work, as described in the previous section. We are currently pursuing these directions.
Acknowledgement
We thank the referees of CSFW 14 for helpful comments.
We are particularly thankful to referee three, who pointed out the connection to hypothesis testing.
References
[1] J. Allen, A. Christie, W. Fithen, J. McHugh, J. Pickel, and
E. Stoner. State of the practice of intrusion detection tech-
nologies. Technical Report CMU/SEI-99-TR-028, Software
Engineering Institute, Carnegie Mellon, January 2000.
[2] R. Anderson and A. Khattak. The use of information re-
trieval techniques for intrusion detection. In Proceedings
of First International Workshop on the Recent Advances in
Intrusion Detection (RAID), September 1998.
[3] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Al-
gorithms. The MIT Press, 1991.
[4] T. Cover and J. Thomas. Elements of Information Theory.
John Wiley and Sons, 1991.
[5] D. Denning and P. Neumann. Requirements and model for IDES - a real-time intrusion detection expert system. Technical report, CSL, SRI International, August 1985.
[6] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic The-
ory of Pattern Recognition. Springer Verlag, 1996.
[7] R. Durrett. Probability: Theory and Examples. Duxbury
Press, 2nd edition, 1995.
[8] R. Elliott, L. Aggoun, and J. Moore. Hidden Markov Mod-
els: Estimation and Control. Springer Verlag, 1995.
[9] S. Forrest, S. Hofmeyr, A. Somayaji, and T. Longstaff. A sense of self for UNIX processes. In IEEE Symposium on Security and Privacy, pages 120–128, 1996.
[10] R. Hogg and E. Tanis. Probability and Statistical Inference.
Prentice Hall, 2001.
[11] K. Ilgun, R. Kemmerer, and P. Porras. State transition analy-
sis: A rule-based intrusion detection approach. IEEE Trans-
actions on Software Engineering, 21(3):181–199, March
1995.
[12] T. Lane and C. Brodley. Temporal sequence learning and
data reduction for anomaly detection. ACM Transactions
on Information and System Security, 2(3):295–331, August
1999.
[13] W. Lee, S. Stolfo, and K. Mok. A data mining framework
for building intrusion detection models. In IEEE Symposium
on Security and Privacy, 1999.
[14] T. Lunt. Automated audit trail analysis and intrusion detection: A survey. In Proceedings of the 11th National Computer Security Conference, Baltimore, MD, pages 65–73, October 1988.
[15] T. Lunt, A. Tamaru, F. Gilham, R. Jagannathan, P. Neumann, H. Javitz, A. Valdes, and T. Garvey. A real-time intrusion detection expert system (IDES) - final technical report. Technical report, Computer Science Laboratory, SRI International, Menlo Park, California, February 1992.
[16] N. McAuliffe, L. Schaefer, D. Wolcott, T. Haley, N. Kalem,
and B. Hubbard. Is your computer being misused? In Pro-
ceedings of the Sixth Computer Security Applications Con-
ference, pages 260–272, December 1990.
[17] B. Mukherjee, L. T. Heberlein, and K. Levitt. Network in-
trusion detection. IEEE Network, May/June 1994.
[18] M. Nassehi. Anomaly detection for Markov models. Tech-
nical Report Tech report RZ 3011 (#93057), IBM Research
Division, Zurich Research Laboratory, March 1998.
[19] S. Northcutt. Network Intrusion Detection: An Analyst’s
Handbook. New Riders, 1999.
[20] P. Neumann. A comparative anatomy of computer system/network anomaly detection. Technical report, CSL, SRI BN-168, Menlo Park, CA, May 1990.
[21] V. Paxson. Bro: A system for detecting network intruders in real-time. In Proceedings of the 7th USENIX Security Symposium, San Antonio, Texas, 1998.
[22] M. Roesch. Snort - lightweight intrusion detection for networks. In Proceedings of the 1999 USENIX LISA Conference, November 1999.
[23] S. Staniford, J. Hoagland, and J. McAlerney. Practical auto-
mated detection of stealthy portscans. In Proceedings of the
ACM CCS IDS Workshop, November 2000.
[24] H. Teng, K. Chen, and S. C.-Y. Lu. Adaptive real-time anomaly detection using inductively generated sequential patterns. In IEEE Symposium on Security and Privacy, pages 278–284, 1990.
[25] A. Wald. Sequential Analysis. John Wiley and Sons, 1947.
[26] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting in-
trusions using system calls: Alternative data models. In
IEEE Symposium on Security and Privacy, pages 133–145,
1999.