Information Security
Malware Data Science
Mainly for educational purposes
1. Introduction
If you’re working in security, chances are
you’re using data science more than ever
before, even if you may not realize it. For
example, your antivirus product uses data
science algorithms to detect malware. Your firewall
vendor may have data science algorithms detecting
suspicious network activity. Your security information
and event management (SIEM) software probably uses data science to identify suspicious trends in your data. Whether conspicuously or not, the entire
security industry is moving toward incorporating more data science into security products.
Advanced IT security professionals are incorporating their own custom
machine learning algorithms into their workflows. For example, in recent
conference presentations and news articles, security analysts at Target,
Mastercard, and Wells Fargo all described developing custom data science
technologies that they use as part of their security workflows.¹ If you’re not
already on the data science bandwagon, there’s no better time to upgrade
your skills and bring data science into your security practice.
What Is Data Science?
Data science is a growing set of algorithmic tools that allow us to understand
and make predictions about data using statistics, mathematics, and artful statistical data visualizations. More specific definitions exist, but generally, data
science has three subcomponents: machine learning, data mining, and data
visualization.
In the security context, machine learning algorithms learn from training data to detect new threats. These methods have been proven to detect
malware that flies under the radar of traditional detection techniques like
signatures. Data mining algorithms search security data for interesting
patterns (such as relationships between threat actors) that might help us
discern attack campaigns targeting our organizations. Finally, data visualization renders sterile, tabular data into graphical format to make it easier
for people to spot interesting and suspicious trends. I cover all three areas
in depth in this book and show you how to apply them.
Why Data Science Matters for Security
Data science is critically important for the future of cybersecurity for three
reasons. First, security is all about data. When we seek to detect cyber threats,
we’re analyzing data in the form of files, logs, network packets, and other
artifacts. Traditionally, security professionals didn’t use data science techniques to make detections based on these data sources. Instead, they used
file hashes, custom-written rules like signatures, and manually defined heuristics. Although these techniques have their merits, they required handcrafted techniques for each type of attack, necessitating too much manual
work to keep up with the changing cyber threat landscape. In recent years,
data science techniques have become crucial in bolstering our ability to
detect threats.
Second, data science is important to cybersecurity because the number
of cyberattacks on the internet has grown dramatically. Take the growth of
the malware underworld as an example. In 2008, there were about 1 million unique malware executables known to the security community. By
2012, there were 100 million. As this book goes to press in 2018, there are
more than 700 million malicious executables known to the security community (https://www.av-test.org/en/statistics/malware/), and this number is likely
to grow.
1. Target (https://www.rsaconference.com/events/us17/agenda/sessions/6662-applied-machinelearning-defeating-modern-malicious), Mastercard (https://blogs.wsj.com/cio/2017/11/15/artificialintelligence-transforms-hacker-arsenal/), and Wells Fargo (https://blogs.wsj.com/cio/2017/11/16/the-morning-download-first-ai-powered-cyberattacks-are-detected/).
Due to the sheer volume of malware, manual detection techniques
such as signatures are no longer a reasonable method for detecting all
cyberattacks. Because data science techniques automate much of the
work that goes into detecting cyberattacks, and vastly decrease the memory usage needed to detect such attacks, they hold tremendous promise
in defending networks and users as cyber threats grow.
Finally, data science matters for security because data science is the technical trend of the decade, both inside and outside of the security industry,
and it will likely remain so through the next decade. Indeed, you’ve probably
seen applications of data science everywhere—in personal voice assistants
(Amazon Echo, Siri, and Google Home), self-driving cars, ad recommendation systems, web search engines, medical image analysis systems, and fitness
tracking apps.
We can expect data science–driven systems to have major impacts in
legal services, education, and other areas. Because data science has become
a key enabler across the technical landscape, universities, major companies
(Google, Facebook, Microsoft, and IBM), and governments are investing
billions of dollars to improve data science tools. Thanks to these investments, data science tools will become even more adept at solving hard
attack-detection problems.
Applying Data Science to Malware
This book focuses on data science as it applies to malware, which we define
as executable programs written with malicious intent, because malware
continues to be the primary means by which threat actors gain a foothold
on networks and subsequently achieve their goals. For example, in the ransomware scourge that has emerged in recent years, attackers typically send
users malicious email attachments that download ransomware executables
(malware) to users’ machines, which then encrypt users’ data and ask them
for a ransom to decrypt the data. Although skilled attackers working for
governments sometimes avoid using malware altogether to fly under the
radar of detection systems, malware continues to be the major enabling
technology in cyberattacks today.
By homing in on a specific application of security data science rather
than attempting to cover security data science broadly, this book aims to
show more thoroughly how data science techniques can be applied to a
major security problem. By understanding malware data science, you’ll
be better equipped to apply data science to other areas of security, like
detecting network attacks, phishing emails, or suspicious user behavior.
Indeed, almost all the techniques you’ll learn in this book apply to building data science detection and intelligence systems in general, not just for
malware.
Who Should Read This Book?
This book is aimed toward security professionals who are interested in
learning more about how to apply data science to computer security problems. If computer security and data science are new to you, you might find
yourself having to look up terms to give yourself a little bit of context, but
you can still read this book successfully. If you’re only interested in data
science, but not security, this book is probably not for you.
About This Book
The first part of the book consists of three chapters that cover basic reverse
engineering concepts necessary for understanding the malware data science techniques discussed later in the book. If you’re new to malware, read
the first three chapters first. If you’re an old hand at malware reverse engineering, you can skip these chapters.
• Chapter 1: Basic Static Malware Analysis covers static analysis techniques for picking apart malware files and discovering how they achieve
malicious ends on our computers.
• Chapter 2: Beyond Basic Static Analysis: x86 Disassembly gives you a
brief overview of x86 assembly language and how to disassemble and
reverse engineer malware.
• Chapter 3: A Brief Introduction to Dynamic Analysis concludes the
reverse engineering section of the book by discussing dynamic analysis,
which involves running malware in controlled environments to learn
about its behavior.
The next two chapters of the book, Chapters 4 and 5, focus on malware relationship analysis, which involves looking at similarities and differences between collections of malware to identify malware campaigns
against your organization, such as a ransomware campaign controlled by
a group of cybercriminals, or a concerted, targeted attack on your organization. These stand-alone chapters are for readers who are interested
not only in detecting malware, but also in extracting valuable threat intelligence to learn who is attacking their network. If you’re less interested in
threat intelligence and more interested in data science–driven malware
detection, you can safely skip these chapters.
• Chapter 4: Identifying Attack Campaigns Using Malware Networks
shows you how to analyze and visualize malware based on shared attributes, such as the hostnames that malware programs call out to.
• Chapter 5: Shared Code Analysis explains how to identify and visualize shared code relationships between malware samples, which can help
you identify whether groups of malware samples came from one or multiple criminal groups.
The next four chapters cover everything you need to know to understand, apply, and implement machine learning–based malware detection
systems. These chapters also provide a foundation for applying machine
learning to other security contexts.
• Chapter 6: Understanding Machine Learning–Based Malware
Detectors is an accessible, intuitive, and non-mathematical introduction to basic machine learning concepts. If you have a history with
machine learning, this chapter will provide a convenient refresher.
• Chapter 7: Evaluating Malware Detection Systems shows you how to
evaluate the accuracy of your machine learning systems using basic
statistical methods so that you can select the best possible approach.
• Chapter 8: Building Machine Learning Detectors introduces open
source machine learning tools you can use to build your own machine
learning systems and explains how to use them.
• Chapter 9: Visualizing Malware Trends covers how to visualize malware
threat data to reveal attack campaigns and trends using Python, and
how to integrate data visualization into your day-to-day workflow when
analyzing security data.
The last three chapters introduce deep learning, an advanced area
of machine learning that involves a bit more math. Deep learning is a
hot growth area within security data science, and these chapters provide
enough to get you started.
• Chapter 10: Deep Learning Basics covers the basic concepts that
underlie deep learning.
• Chapter 11: Building a Neural Network Malware Detector with Keras
explains how to implement deep learning–based malware detection systems in Python using open source tools.
• Chapter 12: Becoming a Data Scientist concludes the book by sharing
different pathways to becoming a data scientist and qualities that can
help you succeed in the field.
• Appendix: An Overview of Datasets and Tools describes the data and
example tool implementations accompanying the book.
How to Use the Sample Code and Data
No good programming book is complete without sample code to play with
and extend on your own. Sample code and data accompany each chapter
of this book and are described exhaustively in the appendix. All the code
targets Python 2.7 in Linux environments. To access the code and data,
you can download a VirtualBox Linux virtual machine, which has the
code, data, and supporting open source tools all set up and ready to go,
and run it within your own VirtualBox environment. You can download
the book’s accompanying data at http://www.malwaredatascience.com/, and
you can download the VirtualBox for free at https://www.virtualbox.org/wiki/
Downloads. The code has been tested on Linux, but if you prefer to work
outside of the Linux VirtualBox, the same code should work almost as well
on MacOS, and to a lesser extent on Windows machines.
If you’d rather install the code and data in your own Linux environment, you can download them here: http://www.malwaredatascience.com/.
You’ll find a directory for each chapter in the downloadable archive,
and within each chapter’s directory there are code/ and data/ directories
that contain the corresponding code and data. Code files correspond to
chapter listings or sections, whichever makes more sense for the application at hand. Some code files are exactly like the listings, whereas others
have been changed slightly to make it easier for you to play with parameters and other options. Code directories come with pip requirements.txt files,
which give the open source libraries that the code in that chapter depends
on to run. To install these libraries on your machine, simply type pip install -r
requirements.txt in each chapter’s code/ directory.
Now that you have access to the code and data for this book, let’s get
started.
2. Basic Static Malware Analysis
In this chapter we look at the basics of
static malware analysis. Static analysis is
performed by analyzing a program file’s
disassembled code, graphical images, printable strings, and other on-disk resources. It refers to
reverse engineering without actually running the program. Although static analysis techniques have their
shortcomings, they can help us understand a wide variety of malware.
Through careful reverse engineering, you’ll be able to better understand
the benefits that malware binaries provide attackers after they’ve taken
possession of a target, as well as the ways attackers can hide and continue
their attacks on an infected machine. As you’ll see, this chapter combines
descriptions and examples. Each section introduces a static analysis technique and then illustrates its application in real-world analysis.
I begin this chapter by describing the Portable Executable (PE) file
format used by most Windows programs, and then examine how to use the
popular Python library pefile to dissect a real-world malware binary. I then
describe techniques such as imports analysis, graphical image analysis,
and strings analysis. In all cases, I show you how to use open source tools
to apply the analysis technique to real-world malware. Finally, at the end of
the chapter, I introduce ways malware can make life difficult for malware
analysts and discuss some ways to mitigate these issues.
You’ll find the malware sample used in the examples in this chapter in
this book’s data under the directory /ch1. To demonstrate the techniques
discussed in this chapter, we use ircbot.exe, an Internet Relay Chat (IRC)
bot created for experimental use, as an example of the kinds of malware
commonly observed in the wild. As such, the program is designed to stay
resident on a target computer while connected to an IRC server. After
ircbot.exe gets hold of a target, attackers can control the target computer via IRC,
allowing them to take actions such as turning on a webcam to capture and
surreptitiously extract video feeds of the target’s physical location, taking
screenshots of the desktop, extracting files from the target machine, and so
on. Throughout this chapter, I demonstrate how static analysis techniques
can reveal the capabilities of this malware.
The Microsoft Windows Portable Executable Format
To perform static malware analysis, you need to understand the Windows
PE format, which describes the structure of modern Windows program files
such as .exe, .dll, and .sys files and defines the way they store data. PE files
contain x86 instructions, data such as images and text, and metadata that a
program needs in order to run.
The PE format was originally designed to do the following:
• Tell Windows how to load a program into memory. The PE format
describes which chunks of a file should be loaded into memory, and
where. It also tells Windows where in the program code it should
start a program’s execution and which dynamically linked code
libraries should be loaded into memory.
• Supply media (or resources) a running program may use in the course of its execution. These resources can include strings of characters like
the ones in GUI dialogs or console output, as well as images or videos.
• Supply security data such as digital code signatures. Windows uses
such security data to ensure that code comes from a trusted source.
The PE format accomplishes all of this by leveraging the series of constructs shown in Figure 1-1.
Figure 1-1: The PE file format. In order of increasing file offset: DOS header, PE header, optional header, section headers, .text section (program code), .idata section (imported libraries), .rsrc section (strings, images, . . . ), .reloc section (memory translations).
As the figure shows, the PE format includes a series of headers telling
the operating system how to load the program into memory. It also includes
a series of sections that contain the actual program data. Windows loads
the sections into memory such that their memory offsets correspond to
where they appear on disk. Let’s explore this file structure in more detail,
starting with the PE header. We’ll skip over a discussion of the DOS header,
which is a relic of the 1980s-era Microsoft DOS operating system and only
present for compatibility reasons.
The PE Header
Following the DOS header in Figure 1-1 is the PE
header, which defines a program’s general attributes such as binary
code, images, compressed data, and other program attributes. It also tells
us whether a program is designed for 32- or 64-bit systems. The PE header
provides basic but useful contextual information to the malware analyst. For
example, the header includes a timestamp field that can give away the time
at which the malware author compiled the file. Malware authors can overwrite
this field with a bogus value, but they often forget to, leaving the real compile time exposed.
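To make the PE header fields concrete, here is a minimal sketch that reads the target machine type and the compile timestamp straight from a file's bytes. It assumes Python 3 and uses only the standard library, whereas the book's own code targets Python 2.7 and the pefile library. The make_fake_pe helper and the values passed to it are hypothetical test data, not a real program.

```python
import struct
from datetime import datetime, timezone

# Machine values defined by the PE/COFF specification.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}


def read_pe_header(data):
    """Return (machine, compile_time) parsed from raw PE file bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # The DOS header stores the file offset of the PE header at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # After the signature: Machine (2 bytes), NumberOfSections (2),
    # TimeDateStamp (4, seconds since 1970-01-01 UTC).
    machine, _num_sections, timestamp = struct.unpack_from(
        "<HHI", data, pe_offset + 4)
    compiled = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return MACHINE_NAMES.get(machine, hex(machine)), compiled


def make_fake_pe(machine, timestamp):
    """Build tiny synthetic header bytes (hypothetical test data)."""
    dos_header = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64)
    coff = struct.pack("<HHI", machine, 0, timestamp) + b"\x00" * 12
    return dos_header + b"PE\x00\x00" + coff
```

For instance, read_pe_header(make_fake_pe(0x014C, 1500000000)) reports a 32-bit x86 target with a compile time in July 2017. On a real sample, remember that the timestamp can be forged.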
The Optional Header
The optional header is actually ubiquitous in today’s PE executable
programs, contrary to what its name suggests. It defines the location of
the program’s entry point in the PE file, which refers to the first instruction the program runs once loaded. It also defines the size of the data
that Windows loads into memory as it loads the PE file, the Windows subsystem the program targets (such as the Windows GUI or the Windows
command line), and other high-level details about the program. The
information in this header can prove invaluable to reverse engineers,
because a program’s entry point tells them where to begin reverse
engineering.
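As a rough illustration of the optional header fields just described, the following sketch pulls out the format variant, the entry point address, and the target subsystem. It assumes Python 3 and the standard library only; the field offsets come from the PE/COFF specification, and make_fake_pe with its arguments is hypothetical test data rather than anything from the book's code.

```python
import struct

MAGICS = {0x10B: "PE32", 0x20B: "PE32+"}
SUBSYSTEMS = {2: "Windows GUI", 3: "Windows console"}


def read_optional_header(data):
    """Return (format, entry_point_rva, subsystem) from raw PE bytes."""
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    # AddressOfEntryPoint sits 16 bytes in; Subsystem sits 68 bytes in.
    (entry_rva,) = struct.unpack_from("<I", data, opt + 16)
    (subsystem,) = struct.unpack_from("<H", data, opt + 68)
    return (MAGICS.get(magic, hex(magic)), entry_rva,
            SUBSYSTEMS.get(subsystem, str(subsystem)))


def make_fake_pe(magic, entry_rva, subsystem):
    """Build synthetic header bytes (hypothetical test data)."""
    dos_header = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64)
    optional = bytearray(72)
    struct.pack_into("<H", optional, 0, magic)
    struct.pack_into("<I", optional, 16, entry_rva)
    struct.pack_into("<H", optional, 68, subsystem)
    return dos_header + b"PE\x00\x00" + b"\x00" * 20 + bytes(optional)
```

The entry point is stored as an address relative to where the image is loaded, so a reverse engineer would add it to the image's load address to find the first instruction in memory.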
Section Headers
Section headers describe the data sections contained within a PE file. A
section in a PE file is a chunk of data that either will be mapped into memory
when the operating system loads a program or will contain instructions about
how the program should be loaded into memory. In other words, a section
is a sequence of bytes on disk that will either become a contiguous string of
bytes in memory or inform the operating system about some aspect of the
loading process.
Section headers also tell Windows what permissions it should grant to
sections, such as whether they should be readable, writable, or executable
by the program when it’s executing. For example, the .text section containing x86 code will typically be marked readable and executable but not
writable to prevent program code from accidentally modifying itself in the
course of execution.
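The permissions described above are stored as bit flags in each section header's Characteristics field. The sketch below, assuming Python 3, decodes those bits into a compact string; the three flag values are taken from the PE/COFF specification, while the function name and the sample values in the usage note are illustrative choices of mine.

```python
# Section permission flags defined by the PE/COFF specification.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000


def section_permissions(characteristics):
    """Render a section's Characteristics bits as an 'rwx'-style string."""
    flags = (
        ("r", IMAGE_SCN_MEM_READ),
        ("w", IMAGE_SCN_MEM_WRITE),
        ("x", IMAGE_SCN_MEM_EXECUTE),
    )
    return "".join(ch if (characteristics & bit) else "-"
                   for ch, bit in flags)
```

A code section with Characteristics 0x60000020 decodes to r-x (readable and executable, not writable), and a data section with 0xC0000040 decodes to rw-. With pefile, the value to pass in is available as section.Characteristics.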
A number of sections, such as .text and .rsrc, are depicted in Figure 1-1.
These get mapped into memory when the PE file is executed. Other special
sections, such as the .reloc section, aren’t mapped into memory. We’ll discuss these sections as well. Let’s go over the sections shown in Figure 1-1.
The .text Section
Each PE program contains at least one section of x86 code marked executable in its section header; these sections are almost always named .text.
We’ll disassemble the data in the .text section when performing program
disassembly and reverse engineering in Chapter 2.
The .idata Section
The .idata section, also called imports, contains the Import Address Table
(IAT), which lists dynamically linked libraries and their functions. The
IAT is among the most important PE structures to inspect when initially
approaching a PE binary for analysis because it reveals the library calls
a program makes, which in turn can betray the malware’s high-level
functionality.
The Data Sections
The data sections in a PE file can include sections like .rsrc, .data, and
.rdata, which store items such as mouse cursor images, button skins, audio,
and other media used by a program. For example, the .rsrc section
in Figure 1-1 contains printable character strings that a program uses to
render text as strings.
The information in the .rsrc (resources) section can be vital to malware
analysts because by examining the printable character strings, graphical
images, and other assets in a PE file, they can gain vital clues about the
file’s functionality. In “Examining Malware Images,” you’ll
learn how to use the icoutils toolkit (including icotool and wrestool) to
extract graphical images from malware binaries’ resources sections. Then,
in “Examining Malware Strings,” you’ll learn how to extract
printable strings from malware resources sections.
The .reloc Section
A PE binary’s code is not position independent, which means it will not
execute correctly if it’s moved from its intended memory location to a new
memory location. The .reloc section gets around this by allowing code to
be moved without breaking. It tells the Windows operating system to translate memory addresses in a PE file’s code if the code has been moved so
that the code still runs correctly. These translations usually involve adding
or subtracting an offset from a memory address.
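The translation itself is simple arithmetic. As a small worked sketch, assuming Python 3 and made-up addresses chosen only for illustration, the offset is the difference between where the code actually landed and where it expected to be:

```python
def relocate(address, preferred_base, actual_base):
    """Translate an absolute address after the image has been moved."""
    # The same offset is applied to every address listed for relocation.
    delta = actual_base - preferred_base
    return address + delta


# Hypothetical example: the code expected to load at 0x400000 but was
# placed at 0x500000, so an address of 0x401000 must become 0x501000.
new_address = relocate(0x401000, 0x400000, 0x500000)
print(hex(new_address))  # prints 0x501000
```

If the image is loaded below its preferred address, the offset is negative and the same function subtracts instead.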
Although a PE file’s .reloc section may well contain information you’ll
want to use in your malware analysis, we won’t discuss it further in this book
because our focus is on applying machine learning and data analysis to
malware, not doing the kind of hardcore reverse engineering that involves
looking at relocations.
Dissecting the PE Format Using pefile
The pefile Python module, written and maintained by Ero Carrera, has
become an industry-standard malware analysis library for dissecting PE
files. In this section, I show you how to use pefile to dissect ircbot.exe. The
ircbot.exe file can be found on the virtual machine accompanying this book
in the directory ~/malware_data_science/ch1/data. Listing 1-1 assumes that
ircbot.exe is in your current working directory.
Enter the following to install the pefile library so that we can import it
within Python:
$ pip install pefile
Now, use the commands in Listing 1-1 to start Python, import the pefile
module, and open and parse the PE file ircbot.exe using pefile.
$ python
>>> import pefile
>>> pe = pefile.PE("ircbot.exe")
Listing 1-1: Loading the pefile module and parsing a PE file (ircbot.exe)
We instantiate pefile.PE, which is the core class implemented by the pefile
module. It parses PE files so that we can examine their attributes. By calling
the PE constructor, we load and parse the specified PE file, which is ircbot.exe
in this example. Now that we’ve loaded and parsed our file, run the code in
Listing 1-2 to pull information from ircbot.exe’s PE fields.
# based on Ero Carrera's example code (pefile library author)
# Print each section's name, virtual address, virtual size, and raw (on-disk) size
for section in pe.sections:
    print (section.Name, hex(section.VirtualAddress),
           hex(section.Misc_VirtualSize), section.SizeOfRawData)
Listing 1-2: Iterating through the PE file’s sections and printing information about them
Listing 1-3 shows the output.
('.text\x00\x00\x00', '0x1000', '0x32830', 207360)
('.rdata\x00\x00', '0x34000', '0x427a', 17408)
('.data\x00\x00\x00', '0x39000', '0x5cff8', 10752)
('.idata\x00\x00', '0x96000', '0xbb0', 3072)
('.reloc\x00\x00', '0x97000', '0x211d', 8704)
Listing 1-3: Pulling section data from ircbot.exe using Python’s pefile module
As you can see in Listing 1-3, we’ve pulled data from five different sections of the PE file: .text, .rdata, .data, .idata, and .reloc. The output is
given as five tuples, one for each PE section pulled. The first entry on each
line identifies the PE section. (You can ignore the series of \x00 null bytes,
which are simply C-style null string terminators.) The remaining fields tell
us what each section’s memory utilization will be once it’s loaded into memory and where in memory it will be found once loaded.
For example, 0x1000 is the base virtual memory address where the .text section will be loaded. Think of this as the section’s base memory address.
The 0x32830 in the virtual size field specifies the amount of memory required
by the section once loaded. The 207360 in the last field (SizeOfRawData) represents the
amount of data the section will take up within that chunk of memory.
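To show how these fields fit together, here is a small sketch that tidies one row of Listing 1-3 and computes where the section ends in memory. It assumes Python 3, and describe_section is a hypothetical helper of my own, not part of pefile; the numbers are copied from the .text row of the listing.

```python
def describe_section(name, virtual_address, virtual_size, raw_size):
    """Summarize one section row like those printed in Listing 1-3."""
    # Section names are padded to 8 bytes with nulls; strip the padding.
    clean_name = name.rstrip(b"\x00").decode("ascii", errors="replace")
    return {
        "name": clean_name,
        "start": virtual_address,
        # First address past the section once it is loaded.
        "end": virtual_address + virtual_size,
        "raw_size": raw_size,
    }


text = describe_section(b".text\x00\x00\x00", 0x1000, 0x32830, 207360)
print(text["name"], hex(text["start"]), hex(text["end"]))
# prints: .text 0x1000 0x33830
```

The computed end of 0x33830 falls just below 0x34000, which is where the next row in Listing 1-3 (.rdata) begins.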
In addition to using pefile to parse a program’s sections, we can also
use it to list the DLLs a binary will load, as well as the function calls it will
request within those DLLs. We can do this by dumping a PE file’s IAT.
Listing 1-4 shows how to use pefile to dump the IAT for ircbot.exe.
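$ python
>>> import pefile
>>> pe = pefile.PE("ircbot.exe")
>>> # Walk each imported DLL, then each function imported from it
>>> for entry in pe.DIRECTORY_ENTRY_IMPORT:
...     print entry.dll
...     for function in entry.imports:
...         print '\t', function.name
...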
Listing 1-4: Extracting imports from ircbot.exe
Listing 1-4 should produce the output shown in Listing 1-5 (truncated
for brevity).
KERNEL32.DLL
    GetLocalTime
    ExitThread
    CloseHandle
    WriteFile
    CreateFileA
    ExitProcess
    CreateProcessA
    GetTickCount
    GetModuleFileNameA
--snip--
Listing 1-5: Contents of the IAT of ircbot.exe, showing library functions used by this malware
As you can see in Listing 1-5, this output is valuable for malware analysis because it lists a rich array of functions that the malware declares and
will reference. For example, the first few lines of the output tell us that the
malware will write to files using WriteFile, open files using the CreateFileA
call, and create new processes using CreateProcessA. Although this is
fairly basic information about the malware, it’s a start in understanding the
malware’s behavior in more detail.
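One way to turn an import list into a quick summary is to map function names to the behavior they hint at. The sketch below assumes Python 3; the CAPABILITIES table is a hypothetical, deliberately tiny mapping built from the three functions discussed above, and a practical version would need many more entries. The input list is copied from Listing 1-5.

```python
# Hypothetical mapping from API names to coarse capabilities.
CAPABILITIES = {
    "WriteFile": "writes files",
    "CreateFileA": "opens or creates files",
    "CreateProcessA": "creates processes",
}


def summarize_imports(function_names):
    """Group imported function names by the capability they hint at."""
    found = {}
    for name in function_names:
        if name in CAPABILITIES:
            found.setdefault(CAPABILITIES[name], []).append(name)
    return found


imports = ["GetLocalTime", "ExitThread", "CloseHandle", "WriteFile",
           "CreateFileA", "ExitProcess", "CreateProcessA", "GetTickCount",
           "GetModuleFileNameA"]
summary = summarize_imports(imports)
for capability, names in summary.items():
    print(capability + ": " + ", ".join(names))
```

An import only shows that the program can call a function, not that it does, so a summary like this is a starting point for analysis rather than a verdict.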
Examining Malware Images
To understand how malware may be designed to game a target, let’s look at
the icons contained in its .rsrc section. For example, malware binaries are
often designed to trick users into clicking them by masquerading as Word
documents, game installers, PDF files, and so on. You also find images in
the malware suggesting programs of interest to the attackers themselves,
such as network attack tools and programs run by attackers for the remote
control of compromised machines. I have even seen binaries containing
desktop icons of jihadists, images of evil-looking cyberpunk cartoon characters, and images of Kalashnikov rifles. For our sample image analysis, let’s
consider a malware sample the security company Mandiant identified as
having been crafted by a Chinese state-sponsored hacking group. You can
find this sample malware in this chapter’s data directory under the name
fakepdfmalware.exe. This sample uses an Adobe Acrobat icon to trick users
into thinking it is an Adobe Acrobat document, when in fact it’s a malicious
PE executable.
Before we can extract the images from the fakepdfmalware.exe binary
using the Linux command line tool wrestool, we first need to create a directory to hold the images we’ll extract. Listing 1-6 shows how to do all this.
$ mkdir images
$ wrestool -x fakepdfmalware.exe --output=images
$ icotool -x -o images images/*.ico
Listing 1-6: Shell commands that extract images from a malware sample
We first use mkdir images to create a directory to hold the extracted
images. Next, we use wrestool to extract image resources (-x) from
fakepdfmalware.exe to the images directory and then use icotool to extract (-x)
any resources in the .ico icon format, converting them into .png graphics
and writing them to the images directory (-o)
so that we can view them using standard image viewer tools. If you don’t
have wrestool installed on your system, you can download it at http://www.nongnu.org/icoutils/.
Once you’ve used wrestool and icotool to extract the images in the target executable and convert them to the PNG format, you should be able to open them in your favorite
image viewer and see the Adobe Acrobat icon at various resolutions. As
my example here demonstrates, extracting images and icons from PE files
is relatively straightforward and can quickly reveal interesting and useful
information about malware binaries. Similarly, we can easily extract printable strings from malware for more information, which we’ll do next.
Examining Malware Strings
Strings are sequences of printable characters within a program binary.
Malware analysts often rely on strings in a malicious sample to get a quick
sense of what may be going on inside it. These strings often contain things
like HTTP and FTP commands that download web pages and files, IP
addresses and hostnames that tell you what addresses the malware connects to, and the like. Sometimes even the language used to write the
strings can hint at a malware binary’s country of origin, though this can
be faked. You may even find text in a string that explains in leetspeak the
purpose of a malicious binary.
Strings can also reveal more technical information about a binary. For
example, you may find information about the compiler used to create it,
the programming language the binary was written in, embedded scripts or
HTML, and so on. Although malware authors can obfuscate, encrypt, and
compress all of these traces, even advanced malware authors often leave
at least some traces exposed, making it particularly important to examine
strings dumps when analyzing malware.
Using the strings Program
The standard way to view all strings in a file is to use the command line tool
strings, which uses the following syntax:
$ strings filepath | less
This command prints all strings in a file to the terminal, line by line.
Adding | less at the end prevents the strings from just scrolling across the
terminal. By default, the strings command finds all printable strings with
a minimum length of 4 bytes, but you can set a different minimum length
and change various other parameters, as listed in the command’s manual
page. I recommend simply using the default minimum string length of 4,
but you can change the minimum string length using the -n option. For
example, strings -n 10 filepath would extract only strings with a minimum
length of 10 bytes.
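To see what the tool is doing under the hood, here is a rough Python 3 approximation: it scans raw bytes for runs of printable ASCII characters of at least a minimum length. This is a sketch of the idea, not a reimplementation of strings, which handles more encodings and options; the function name and the sample bytes in the usage note are my own.

```python
import re


def extract_strings(data, min_length=4):
    """Return runs of printable ASCII at least min_length bytes long."""
    # Bytes 0x20-0x7E are printable ASCII; tab is accepted as well.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]
```

Calling extract_strings(b"\x00GET\x00HTTP/1.0 200 OK\x00") returns only the longer run, because GET is 3 bytes and falls under the default minimum of 4; passing min_length=3 picks it up as well.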
Analyzing Your strings Dump
Now that we’ve dumped a malware program’s printable strings, the challenge
is to understand what the strings mean. For example, let’s say we dump the
strings to the ircbotstring.txt file for ircbot.exe, which we explored earlier in
this chapter using the pefile library, like this:
$ strings ircbot.exe > ircbotstring.txt
The ircbotstring.txt file contains thousands of lines of text, but
some of these lines should stick out. For example, Listing 1-7 shows a bunch
of lines extracted from the string dump that begin with the tag [DOWNLOAD].
[DOWNLOAD]: Bad URL, or DNS Error: %s.
[DOWNLOAD]: Update failed: Error executing file: %s.
[DOWNLOAD]: Downloaded %.1fKB to %s @ %.1fKB/sec. Updating.
[DOWNLOAD]: Opened: %s.
--snip--
[DOWNLOAD]: Downloaded %.1f KB to %s @ %.1f KB/sec.
[DOWNLOAD]: CRC Failed (%d != %d).
[DOWNLOAD]: Filesize is incorrect: (%d != %d).
[DOWNLOAD]: Update: %s (%dKB transferred).
[DOWNLOAD]: File download: %s (%dKB transferred).
[DOWNLOAD]: Couldn't open file: %s.
Listing 1-7: The strings output showing evidence that the malware can download files
specified by the attacker onto a target machine
These lines indicate that ircbot.exe will attempt to download files specified by an attacker onto the target machine.
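With thousands of lines in a dump, it can help to count how often each bracketed tag appears before reading anything closely. The following Python 3 sketch does that; it assumes tags look like [DOWNLOAD]: at the start of a line, as in Listing 1-7, and the sample list is a short excerpt rather than the full dump.

```python
import re
from collections import Counter

# Matches an uppercase tag in square brackets at the start of a line.
TAG_PATTERN = re.compile(r"^\[([A-Z]+)\]:")


def count_tags(lines):
    """Count how many strings carry each bracketed tag."""
    tags = Counter()
    for line in lines:
        match = TAG_PATTERN.match(line)
        if match:
            tags[match.group(1)] += 1
    return tags


sample = [
    "[DOWNLOAD]: Bad URL, or DNS Error: %s.",
    "[DOWNLOAD]: Opened: %s.",
    "HTTP/1.0 200 OK",
]
tag_counts = count_tags(sample)
print(tag_counts["DOWNLOAD"])  # prints 2
```

To run it over a real dump, you could pass in the lines of ircbotstring.txt, for example with open("ircbotstring.txt").read().splitlines().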
Let’s try analyzing another one. The string dump shown in Listing 1-8
indicates that ircbot.exe can act as a web server that listens on the target
machine for connections from the attacker.
GET
HTTP/1.0 200 OK
Server: myBot
Cache-Control: no-cache,no-store,max-age=0
pragma: no-cache
Content-Type: %s
Content-Length: %i
Accept-Ranges: bytes
Date: %s %s GMT
Last-Modified: %s %s GMT
Expires: %s %s GMT
Connection: close
HTTP/1.0 200 OK
Server: myBot
Cache-Control: no-cache,no-store,max-age=0
pragma: no-cache
Content-Type: %s
Accept-Ranges: bytes
Date: %s %s GMT
Last-Modified: %s %s GMT
Expires: %s %s GMT
Connection: close
HH:mm:ss
ddd, dd MMM yyyy
application/octet-stream
text/html
Listing 1-8: The strings output showing that the malware has an HTTP server to which the
attacker can connect
Listing 1-8 shows a variety of HTTP boilerplate strings used by ircbot.exe
to implement an HTTP server. It’s likely that this HTTP server allows the
attacker to connect to a target machine via HTTP to issue commands, such
as the command to take a screenshot of the victim’s desktop and send it back
to the attacker. We see evidence of HTTP functionality throughout the listing. For example, the GET method requests data from an internet resource.
The line HTTP/1.0 200 OK is an HTTP string that returns the status code 200,
indicating that all went well with an HTTP network transaction, and Server:
myBot indicates that the name of the HTTP server is myBot, a giveaway that
ircbot.exe has a built-in HTTP server.
All of this information is useful in understanding and stopping a particular malware sample or malicious campaign. For example, knowing that
a malware sample has an HTTP server that outputs certain strings when
you connect to it allows you to scan your network to identify infected hosts.
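As a sketch of how such a check might look, the function below inspects raw HTTP response bytes for the Server: myBot header from Listing 1-8. It deliberately leaves out the networking, since how you collect responses depends on your environment. The function name has_mybot_banner and both sample responses are assumptions made for illustration.

```python
def has_mybot_banner(response):
    """Return True if raw HTTP response bytes carry a 'Server: myBot' header."""
    header_block = response.split(b"\r\n\r\n", 1)[0]
    # Skip the status line, then inspect each "Name: value" header.
    for line in header_block.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"server" and value.strip() == b"myBot":
            return True
    return False

# Hypothetical responses: one mimicking the malware, one from a benign server.
infected = b"HTTP/1.0 200 OK\r\nServer: myBot\r\nConnection: close\r\n\r\n"
clean = b"HTTP/1.1 200 OK\r\nServer: nginx\r\n\r\n<html>myBot</html>"

print(has_mybot_banner(infected), has_mybot_banner(clean))
```

Checking only the header block, rather than searching the whole response, avoids flagging a benign page that merely mentions the string in its body.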
Summary
In this chapter, you got a high-level overview of static malware analysis,
which involves inspecting a malware program without actually running it.
You learned about the PE file format that defines Windows .exe and .dll files,
and you learned how to use the Python library pefile to dissect a real-world
malware ircbot.exe binary. You also used static analysis techniques such as
image analysis and strings analysis to extract more information from malware samples. Chapter 2 continues our discussion of static malware analysis
with a focus on analyzing the assembly code that can be recovered from
malware.
3. Beyond Basic Static Analysis: x86 Disassembly
To thoroughly understand a malicious
program, we often need to go beyond
basic static analysis of its sections, strings,
imports, and images. This involves reverse
engineering a program’s assembly code. Indeed,
disassembly and reverse engineering lie at the heart
of deep static analysis of malware samples.
Because reverse engineering is an art, technical craft, and science, a
thorough exploration is beyond the scope of this chapter. My goal here is
to introduce you to reverse engineering so that you can apply it to malware
data science. Understanding this methodology is essential for successfully
applying machine learning and data analysis to malware.
In this chapter I start with the concepts you’ll need to understand x86
disassembly. Later in the chapter I show how malware authors attempt to
bypass disassembly and discuss ways to mitigate these anti-analysis and
anti-detection maneuvers. But first, let’s review some common disassembly
methods as well as the basics of x86 assembly language.
Disassembly Methods
Disassembly is the process by which we translate malware’s binary code into
valid x86 assembly language. Malware authors generally write malware
programs in a high-level language like C or C++ and then use a compiler
to compile the source code into x86 binary code. Assembly language is
the human-readable representation of this binary code. Therefore, disassembling a malware program into assembly language is necessary to
understand how it behaves at its core.
Unfortunately, disassembly is no easy feat because malware authors regularly employ tricks to thwart would-be reverse engineers. In fact, perfect
disassembly in the face of deliberate obfuscation is an unsolved problem in
computer science. Currently, only approximate, error-prone methods exist
for disassembling such programs.
For example, consider the case of self-modifying code, or binary code that
modifies itself as it executes. The only way to disassemble this code properly
is to understand the program logic by which the code modifies itself, but
that can be exceedingly complex.
Because perfect disassembly is currently impossible, we must use
imperfect methods to accomplish this task. The method we’ll use is linear
disassembly, which involves identifying the contiguous sequence of bytes in
the Portable Executable (PE) file that corresponds to its x86 program code
and then decoding these bytes. The key limitation of this approach is that
it ignores subtleties about how instructions are decoded by the CPU in the
course of program execution. Also, it doesn’t account for the various obfuscations malware authors sometimes use to make their programs harder to
analyze.
The other methods of reverse engineering, which we won’t cover here,
are the more complex disassembly methods used by industrial-grade disassemblers such as IDA Pro. These more advanced methods actually simulate
or reason about program execution to discover which assembly instructions
a program might reach as a result of a series of conditional branches.
Although this type of disassembly can be more accurate than linear
disassembly, it’s far more CPU intensive than linear disassembly methods,
making it less suitable for data science purposes where the focus is on disassembling thousands or even millions of programs.
Before you can begin analysis using linear disassembly, however, you’ll
need to review the basic components of assembly language.
Basics of x86 Assembly Language
Assembly language is the lowest-level human-readable programming language for a given architecture, and it maps closely to the binary instruction format of a particular CPU architecture. A line of assembly language
is almost always equivalent to a single CPU instruction. Because assembly is
so low level, you can often retrieve it easily from a malware binary by using
the right tools.
Gaining basic proficiency in reading disassembled malware x86 code
is easier than you might think. This is because most malware assembly
code spends most of its time calling into the operating system by way of
the Windows operating system’s dynamic-link libraries (DLLs), which are
loaded into program memory at runtime. Malware programs use DLLs
to do most of the real work, such as modifying the system registry, moving and copying files, making network connections and communicating
via network protocols, and so on. Therefore, following malware assembly
code often involves understanding the ways in which function calls are
made from assembly and understanding what various DLL calls do. Of
course, things can get much more complicated, but knowing this much
can reveal a lot about the malware.
In the following sections I introduce some important assembly language
concepts. I also explain some abstract concepts like control flow and control
flow graphs. Finally, we disassemble the ircbot.exe program and explore how
its assembly and control flow can give us insight into its purpose.
There are two major dialects of x86 assembly: Intel and AT&T. In this
book I use Intel syntax, which can be obtained from all major disassemblers
and is the syntax used in the official Intel documentation of the x86 CPU.
Let’s start by taking a look at CPU registers.
CPU Registers
Registers are small data storage units on which x86 CPUs perform computations. Because registers are located on the CPU itself, register access is
orders of magnitude faster than memory access. This is why core computational operations, such as arithmetic and condition testing instructions,
all target registers. It’s also why the CPU uses registers to store information
about the status of running programs. Although many registers are available to experienced x86 assembly programmers, we’ll just focus on a few
important ones here.
General-Purpose Registers
General-purpose registers are like scratch space for assembly programmers.
On a 32-bit system, each of these registers contains 32, 16, or 8 bits of space
against which we can perform arithmetic operations, bitwise operations,
byte order–swapping operations, and more.
In common computational workflows, programs move data into registers from memory or from external hardware devices, perform some operations on this data, and then move the data back out to memory for storage.
For example, to sort a long list, a program typically pulls list items in from
an array in memory, compares them in the registers, and then writes the
comparison results back out to memory.
To understand some of the nuances of the general-purpose register
model in the Intel 32-bit architecture, take a look at Figure 2-1.
Figure 2-1: Registers in the x86 architecture
The vertical axis shows the layout of the general-purpose registers, and
the horizontal axis shows how EAX, EBX, ECX, and EDX are subdivided.
EAX, EBX, ECX, and EDX are 32-bit registers that have smaller, 16-bit
registers inside them: AX, BX, CX, and DX. As you can see in the figure,
these 16-bit registers can be subdivided into upper and lower 8-bit registers:
AH, AL, BH, BL, CH, CL, DH, and DL. Although it’s sometimes useful to
address the subdivisions in EAX, EBX, ECX, and EDX, you’ll mostly see
direct references to EAX, EBX, ECX, and EDX.
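The subdivision can be expressed with a few bit masks. This is only a sketch in Python, where the function name subregisters is my own: it takes a 32-bit value as it might sit in EAX and returns the parts that AX, AH, and AL would expose.

```python
def subregisters(eax):
    """Split a 32-bit EAX value into the AX, AH, and AL views of it."""
    eax &= 0xFFFFFFFF          # keep only 32 bits
    ax = eax & 0xFFFF          # AX is the low 16 bits of EAX
    ah = (ax >> 8) & 0xFF      # AH is the upper 8 bits of AX
    al = ax & 0xFF             # AL is the lower 8 bits of AX
    return ax, ah, al

ax, ah, al = subregisters(0x12345678)
print(hex(ax), hex(ah), hex(al))   # 0x5678 0x56 0x78
```

The same masks apply to EBX, ECX, and EDX and their subregisters.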
Stack and Control Flow Registers
The stack management registers store critical information about the program stack, which is responsible for storing local variables for functions,
arguments passed into functions, and control information relating to the
program control flow. Let’s go over some of these registers.
In simple terms, the ESP register points to the top of the stack for
the currently executing function, whereas the EBP register points to the
bottom of the stack for the currently executing function. This is crucial
information for modern programs, because it means that by referencing
data relative to the stack rather than using its absolute address, procedural
and object-oriented code can access local variables more gracefully and
efficiently.
Although you won’t see direct references to the EIP register in x86
assembly code, it’s important in security analysis, particularly in the context of vulnerability research and buffer-overflow exploit development.
This is because EIP contains the memory address of the next instruction the CPU will execute. Attackers can use buffer-overflow exploits to corrupt the
value of the EIP register indirectly and take control of program execution.
In addition to its role in exploitation, EIP is also important in the analysis of malicious code deployed by malware. Using a debugger we can inspect
EIP’s value at any moment, which helps us understand what code malware is
executing at any particular time.
EFLAGS is a status register that contains CPU flags, which are bits
that store status information about the state of the currently executing
program. The EFLAGS register is central to the process of making conditional branches, or changes in execution flow resulting from the outcome of
if/then-style program logic, within x86 programs. Specifically, whenever
an x86 assembly program checks whether some value is greater or less
than zero and then jumps to a function based on the outcome of this test,
the EFLAGS register plays an enabling role, as described in more detail in
“Basic Blocks and Control Flow Graphs” later in this chapter.
Arithmetic Instructions
Instructions operate on general-purpose registers. You can perform simple
computations with the general-purpose registers using arithmetic instructions. For example, add, sub, inc, dec, and mul are examples of arithmetic
instructions you’ll encounter frequently in malware reverse engineering.
Table 2-1 lists some examples of basic instructions and their syntax.
Table 2-1: Arithmetic Instructions
Instructions    Description
add ebx, 100    Adds 100 to the value in EBX and then stores the result in EBX
sub ebx, 100    Subtracts 100 from the value in EBX and then stores the result in EBX
inc ah          Increments the value in AH by 1
dec al          Decrements the value in AL by 1
The add instruction adds two integers and stores the result in the first
operand, which can be either a register or a memory location. Keep in mind that only one operand can be a
memory location. The sub instruction is similar to add, except it subtracts
integers. The inc instruction increments a register or memory location’s
integer value, whereas dec decrements a register or memory location’s integer value.
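One detail that is easy to miss is that these instructions work on fixed-width values, so results wrap around instead of growing. The sketch below models the four instructions from Table 2-1 in Python under that assumption. The helper names are hypothetical, and CPU flags are ignored.

```python
MASK32 = 0xFFFFFFFF

def add32(value, amount):
    """Model add on a 32-bit register: the result wraps at 2**32."""
    return (value + amount) & MASK32

def sub32(value, amount):
    """Model sub on a 32-bit register: the result wraps below zero."""
    return (value - amount) & MASK32

def inc_ah(eax):
    """Model inc ah: only bits 8-15 of EAX change."""
    ah = ((eax >> 8) + 1) & 0xFF
    return (eax & 0xFFFF00FF) | (ah << 8)

def dec_al(eax):
    """Model dec al: only bits 0-7 of EAX change."""
    al = (eax - 1) & 0xFF
    return (eax & 0xFFFFFF00) | al

print(hex(add32(0xFFFFFFFF, 100)))   # 0x63
print(hex(sub32(50, 100)))           # 0xffffffce
print(hex(dec_al(0x00001200)))       # 0x12ff
```

Note that dec al on a value whose low byte is 0 wraps AL to 0xFF without borrowing from AH, because the instruction touches only the 8-bit subregister.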
Data Movement Instructions
The x86 processor provides a robust set of instructions for moving data
between registers and memory. These instructions provide the underlying
mechanisms that allow us to manipulate data. The staple memory movement instruction is the mov instruction. Table 2-2 shows how you can use the
mov instruction to move data around.
Table 2-2: Data Movement Instructions
Instructions             Description
mov ebx, eax             Moves the value in register EAX into register EBX
mov eax, [0x12345678]    Moves the data at memory address 0x12345678 into the EAX register
mov edx, 1               Moves the value 1 into the register EDX
mov [0x12345678], eax    Moves the value in EAX into the memory location 0x12345678
Related to the mov instruction, the lea instruction computes the memory
address described by its second operand and loads that address, rather than
the data stored there, into the destination register. It is commonly used to
get a pointer to a memory location. For example, lea edx, [esp-4] subtracts 4 from the
value in ESP and loads the resulting value into EDX.
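The difference between mov and lea with a bracketed operand is easiest to see side by side. In this sketch, memory is a plain dictionary and registers are entries in another dictionary. All names and addresses are made up for illustration.

```python
# Hypothetical machine state: one value in memory, ESP just above it.
memory = {0x1000: 0xDEADBEEF}
registers = {"esp": 0x1004, "edx": 0}

def mov_load(regs, mem, dest, base, offset):
    """Model mov dest, [base+offset]: load the data stored at the address."""
    regs[dest] = mem[(regs[base] + offset) & 0xFFFFFFFF]

def lea(regs, dest, base, offset):
    """Model lea dest, [base+offset]: load the address itself."""
    regs[dest] = (regs[base] + offset) & 0xFFFFFFFF

lea(registers, "edx", "esp", -4)
print(hex(registers["edx"]))   # 0x1000, the address

mov_load(registers, memory, "edx", "esp", -4)
print(hex(registers["edx"]))   # 0xdeadbeef, the data at that address
```

In other words, lea never reads memory. It only performs the address arithmetic.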
Stack Instructions
The stack in x86 assembly is a data structure that allows you to push and
pop values onto and off of it. This is similar to how you would add and
remove plates on and off the top of a stack of plates.
Because control flow is often expressed through C-style function calls
in x86 assembly and because these function calls use the stack to pass arguments, allocate local variables, and remember what part of the program
to return to after a function finishes executing, the stack and control flow
need to be understood together.
The push instruction pushes values onto the program stack when the programmer wants to save a register value onto the stack, and the pop instruction
removes values from the stack and places them into a designated register.
The push instruction uses the following syntax to perform its operations:
push 1
In this example, the program points the stack pointer (the register
ESP) to a new memory address, thereby making room for the value (1),
which is now stored at the top location on the stack. Then it copies the
value from the argument to the memory location the CPU has just made
room for on the top of the stack.
Let’s contrast this with pop:
pop eax
The program uses pop to pop the top value off the stack and move it
into a specified register. In this example, pop eax pops the top value off the
stack and moves it into eax.
An unintuitive but important detail to understand about the x86 program stack is that it grows downward in memory, so that the most recently pushed value
on the stack is actually stored at the lowest address in stack memory. This
becomes very important to remember when you analyze assembly code that
references data stored on the stack, as it can quickly get confusing unless
you know the stack’s memory layout.
Because the x86 stack grows downward in memory, when the push instruction allocates space on the program stack for a new value, it decrements the
value of ESP so that it points to a lower location in memory and then copies
its operand into that memory location, so that the new value occupies the
lowest address in use on the stack. Conversely, the pop instruction
copies the top value off of the stack and then increments the value of ESP so
it points to a higher memory location.
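The following sketch simulates this behavior so you can watch ESP move. It assumes 32-bit (4-byte) stack slots and an arbitrary starting address of 0x1000. The class name ToyStack is hypothetical.

```python
class ToyStack:
    """A toy model of the x86 stack with 4-byte slots."""

    def __init__(self, esp=0x1000):
        self.esp = esp       # stack pointer
        self.memory = {}     # address -> 32-bit value

    def push(self, value):
        self.esp -= 4                              # grow downward first
        self.memory[self.esp] = value & 0xFFFFFFFF # then store at the new top

    def pop(self):
        value = self.memory[self.esp]  # copy the value at the top
        self.esp += 4                  # then shrink the stack upward
        return value

stack = ToyStack()
stack.push(1)
stack.push(2)
print(hex(stack.esp))            # 0xff8: two pushes lowered ESP by 8
print(stack.pop(), hex(stack.esp))   # 2 0xffc
```

The value pushed last sits at the lowest address, which is why offsets from ESP into the stack are positive.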
Control Flow Instructions
An x86 program’s control flow defines the network of possible instruction
execution sequences a program may execute, depending on the data,
device interactions, and other inputs the program might receive. Control
flow instructions define a program’s control flow. They are more complicated than stack instructions but still quite intuitive. Because control flow
is often expressed through C-style function calls in x86 assembly, the stack
and control flow are closely related. They’re also related because these
function calls use the stack to pass arguments, allocate local variables, and
remember what part of the program to return to after a function finishes
executing.
The call and ret control flow instructions are the most important in
terms of how programs call functions in x86 assembly and how programs
return from functions after these functions are done executing.
The call instruction calls a function. Think of the called code as a function you
might write in a higher-level language like C: once the function has
finished executing, the program returns to the instruction that follows the call. You can invoke the call instruction using the following
syntax, where address denotes the memory location where the function’s
code begins:
call address
The call instruction does two things. First, it pushes the address of
the instruction that will execute after the function call returns onto the
top of the stack so that the program knows what address to return to after
the called function finishes executing. Second, call replaces the current
value of EIP with the value specified by the address operand. Then, the CPU
begins execution at the new memory location pointed to by EIP.
Just as call initiates a function call, the ret instruction completes it.
You can use the ret instruction on its own and without any parameter, as
shown here:
ret
When invoked, ret pops the top value off the stack, which we expect to
be the saved program counter value (EIP) that the call instruction pushed
onto the stack when the call instruction was invoked. Then it places the
popped program counter value back into EIP and resumes execution.
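To see how call and ret cooperate through the stack, consider this small simulation. It is a sketch under simplifying assumptions: eip holds the address of the call instruction being executed, the call is taken to be 5 bytes long (the size of a near call with a 32-bit relative operand), and the class name ToyCpu and all addresses are made up.

```python
class ToyCpu:
    """A toy model of EIP, ESP, and the stack during call and ret."""

    def __init__(self, eip, esp=0x1000):
        self.eip = eip
        self.esp = esp
        self.memory = {}

    def push(self, value):
        self.esp -= 4
        self.memory[self.esp] = value

    def pop(self):
        value = self.memory[self.esp]
        self.esp += 4
        return value

    def call(self, target, call_length=5):
        # Save the address of the instruction after the call, then jump.
        self.push(self.eip + call_length)
        self.eip = target

    def ret(self):
        # Restore the saved return address into EIP.
        self.eip = self.pop()

cpu = ToyCpu(eip=0x401000)
cpu.call(0x402000)
print(hex(cpu.eip), hex(cpu.memory[cpu.esp]))   # 0x402000 0x401005
cpu.ret()
print(hex(cpu.eip), hex(cpu.esp))               # 0x401005 0x1000
```

Because ret trusts whatever value is on top of the stack, overwriting that saved address is exactly how a buffer-overflow exploit redirects execution.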
The jmp instruction is another important control flow construction,
which operates more simply than call. Instead of worrying about saving
EIP, jmp simply tells the CPU to move to the memory address specified as
its parameter and begin execution there. For example, jmp 0x12345678 tells
the CPU to start executing the program code stored at memory location
0x12345678 on the next instruction.
You may be wondering how you can make jmp and call instructions
execute in a conditional way, such as “if the program has received a network packet, execute the following function.” The answer is that x86
assembly doesn’t have high-level constructs like if, then, else, else if, and
so on. Instead, branching to an address within a program’s code typically
requires two instructions: a cmp instruction, which checks the value in
some register against some test value and stores the result of that test in
the EFLAGS register, and a conditional branch instruction.
Most conditional branch instructions start with a j, which allows the
program to jump to a memory address, and are post-fixed with letters that
stand for the condition being tested. For example, jge tells the program to
jump if greater than or equal to. This means that the value in the register
being tested must be greater than or equal to the test value.
The cmp instruction uses the following syntax:
cmp <register or memory location>, <register, memory location, or literal>
As stated earlier, cmp compares the value in the specified general-purpose
register with the test value and then stores the result of that comparison in the
EFLAGS register.
The various conditional jmp instructions are then invoked as follows:
j* address
As you can see, j can be followed by any of a number of condition suffixes. For example, to jump only if the value in the register is greater than or equal
to the test value, use the following instruction:
jge address
Note that unlike the case of the call and ret instructions, the jmp family of instructions never touches the program stack. In fact, in the case of
the jmp family of instructions, the x86 program is responsible for tracking
its own execution flow and potentially saving or deleting information about
what addresses it has visited and where it should return to after a particular
sequence of instructions has executed.
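The interplay between cmp and a conditional jump can be sketched in Python. This model assumes Intel operand order, in which cmp left, right subtracts right from left, discards the result, and records the zero (ZF), sign (SF), and overflow (OF) flags. A jge is then taken when SF equals OF, and a jne when ZF is clear. The function names are my own, and the carry flag is omitted for brevity.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value

def cmp32(left, right):
    """Model cmp left, right: compute left - right and return the flags."""
    result = (left - right) & MASK32
    signed_difference = to_signed32(left) - to_signed32(right)
    return {
        "ZF": result == 0,                    # operands were equal
        "SF": bool(result & 0x80000000),      # result looks negative
        "OF": not (-0x80000000 <= signed_difference <= 0x7FFFFFFF),
    }

def jge_taken(flags):
    return flags["SF"] == flags["OF"]

def jne_taken(flags):
    return not flags["ZF"]

print(jge_taken(cmp32(10, 5)))   # True: 10 >= 5
print(jge_taken(cmp32(5, 10)))   # False: 5 < 10
print(jne_taken(cmp32(7, 7)))    # False: the values are equal
```

The jump instruction itself never sees the operands. It only reads the flags that the preceding cmp left behind in EFLAGS.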
Basic Blocks and Control Flow Graphs
Although x86 programs look sequential when we scroll through their code
in a text editor, they actually have loops, conditional branches, and unconditional branches (control flow). All of these give each x86 program a network structure. Let’s use the simple toy assembly program in Listing 2-1 to
see how this works.
setup: # symbol standing in for address of the instruction on the next line
mov eax, 10
loopstart: # symbol standing in for address of the instruction on the next line
sub eax, 1
cmp eax, 0
jne $loopstart
loopend: # symbol standing in for address of the instruction on the next line
mov eax, 1
# more code would go here
Listing 2-1: Assembly program for understanding control flow graph
As you can see, this program initializes a counter to the value 10, stored
in register EAX. Next, it does a loop in which the value in EAX is decremented by 1 on each iteration. Finally, once EAX has reached a value
of 0, the program breaks out of the loop.
In the language of control flow graph analysis, we can think of these
instructions as comprising three basic blocks. A basic block is a sequence
of instructions that we know will always execute contiguously. In other
words, a basic block always ends with either a branching instruction or the
instruction just before a branch target, and it always begins with
the first instruction of the program, called the program’s entry point, a
branch target, or the instruction that follows a branch.
In Listing 2-1, you can see where the basic blocks of our simple program begin and end. The first basic block is composed of the instruction mov eax, 10 under setup:. The second basic block is composed of lines
beginning with sub eax, 1 through jne $loopstart under loopstart:, and
the third starts at mov eax, 1 under loopend:. We can visualize the relationships between the basic blocks using the graph in Figure 2-2. (We use the
term graph synonymously with the term network; in computer science, these
terms are interchangeable.)
setup:
mov eax, 10
   |
   v
loopstart:  <----+
sub eax, 1       |
cmp eax, 0       |
jne $loopstart --+
   |
   v
loopend:
mov eax, 1
Figure 2-2: A visualization of the control flow graph of our simple assembly
program
If one basic block can ever flow into another basic block, we connect them,
as shown in Figure 2-2. The figure shows that the setup basic block leads to
the loopstart basic block, which repeats 10 times before it transitions to the
loopend basic block. Real-world programs have control flow graphs such as
these, but they’re much more complicated, with thousands of basic blocks
and thousands of interconnections.
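To show how basic blocks and their connections can be computed rather than drawn by hand, here is a minimal sketch that builds the control flow graph of the toy program. It assumes the program is already parsed into (label, mnemonic, operands) tuples, that jump operands are label names without the $ prefix, and that only jmp and a handful of conditional jumps change control flow. The names PROGRAM, CONDITIONAL_JUMPS, and build_cfg are hypothetical.

```python
# The toy program from Listing 2-1 as (label, mnemonic, operands) tuples.
PROGRAM = [
    ("setup", "mov", "eax, 10"),
    ("loopstart", "sub", "eax, 1"),
    (None, "cmp", "eax, 0"),
    (None, "jne", "loopstart"),
    ("loopend", "mov", "eax, 1"),
]
CONDITIONAL_JUMPS = {"je", "jne", "jg", "jge", "jl", "jle"}

def build_cfg(program):
    """Return (blocks, edges) for a list of instruction tuples."""
    label_to_index = {label: i for i, (label, _, _) in enumerate(program) if label}
    # A leader starts a basic block: the entry point, a branch target,
    # or the instruction right after a branch.
    leaders = {0}
    for i, (_, mnemonic, operands) in enumerate(program):
        if mnemonic == "jmp" or mnemonic in CONDITIONAL_JUMPS:
            leaders.add(label_to_index[operands])
            if i + 1 < len(program):
                leaders.add(i + 1)
    bounds = sorted(leaders) + [len(program)]

    def name(index):
        return program[index][0] or "block_%d" % index

    blocks, edges = {}, set()
    for start, end in zip(bounds, bounds[1:]):
        blocks[name(start)] = program[start:end]
        _, mnemonic, operands = program[end - 1]
        if mnemonic == "jmp" or mnemonic in CONDITIONAL_JUMPS:
            edges.add((name(start), operands))     # edge to the branch target
        if mnemonic != "jmp" and end < len(program):
            edges.add((name(start), name(end)))    # fall-through edge
    return blocks, edges

blocks, edges = build_cfg(PROGRAM)
print(list(blocks))
print(sorted(edges))
```

The sketch ignores call, ret, and indirect jumps, which a real disassembler must handle. It is meant only to mirror the three blocks and three edges of the toy example.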
Disassembling ircbot.exe Using pefile and capstone
Now that you have a good understanding of the basics of assembly language,
let’s disassemble the first 100 bytes of ircbot.exe’s assembly code using linear
disassembly. To do this, we’ll use the open source Python libraries pefile
(introduced in Chapter 1) and capstone, which is an open source disassembly library that can disassemble 32-bit x86 binary code. You can install both
of these libraries with pip using the following commands:
pip install pefile
pip install capstone
Once these two libraries are installed, we can leverage them to disassemble ircbot.exe using the code in Listing 2-2.
#!/usr/bin/env python3
import pefile
from capstone import Cs, CS_ARCH_X86, CS_MODE_32
# load the target PE file
pe = pefile.PE("ircbot.exe")
# get the address of the program entry point from the program header
entrypoint = pe.OPTIONAL_HEADER.AddressOfEntryPoint
# compute memory address where the entry code will be loaded into memory
entrypoint_address = entrypoint + pe.OPTIONAL_HEADER.ImageBase
# get the first 100 bytes of binary code at the entry point from the PE file object
binary_code = pe.get_memory_mapped_image()[entrypoint:entrypoint + 100]
# initialize disassembler to disassemble 32 bit x86 binary code
disassembler = Cs(CS_ARCH_X86, CS_MODE_32)
# disassemble the code and print each instruction's mnemonic and operands
for instruction in disassembler.disasm(binary_code, entrypoint_address):
    print("%s\t%s" % (instruction.mnemonic, instruction.op_str))
Listing 2-2: Disassembling ircbot.exe
This should produce the following output:
push ebp
mov ebp, esp
push -1
push 0x437588
push 0x41982c
mov eax, dword ptr fs:[0]
push eax
mov dword ptr fs:[0], esp
add esp, -0x5c
push ebx
push esi
push edi
mov dword ptr [ebp - 0x18], esp
call dword ptr [0x496308]
--snip--
Don’t worry about understanding all of the instructions in the disassembly output: that would involve an understanding of assembly that
goes beyond the scope of this book. However, you should feel comfortable
with many of the instructions in the output and have some sense of what
they do. For example, the malware pushes the value in register EBP onto
the stack, saving its value. Then it proceeds to move the value in ESP
into EBP and pushes some numerical values onto the stack. The program
moves some data in memory into the EAX register, and it adds the value
-0x5c to the value in the ESP register. Finally, the program uses the call
instruction to call a function whose address is stored at memory location 0x496308.
Because this is not a book on reverse engineering, I won’t go into any more
depth here about what the code means. What I’ve presented is a start to understanding how assembly language works. For more information on assembly language, I recommend the Intel programmer’s manual at
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.
Factors That Limit Static Analysis
In this chapter and Chapter 1, you learned about a variety of ways in which
static analysis techniques can be used to elucidate the purpose and methods
of a newly discovered malicious binary. Unfortunately, static analysis has
limitations that render it less useful in some circumstances. For example,
malware authors can employ certain offensive tactics that are far easier to
implement than to defend against. Let’s take a look at some of these offensive
tactics and see how to defend against them.
Packing
Malware packing is the process by which malware authors compress, encrypt,
or otherwise mangle the bulk of their malicious program so that it appears
inscrutable to malware analysts. When the malware is run, it unpacks itself
and then begins execution. The obvious way around malware packing is to
actually run the malware in a safe environment, a dynamic analysis technique I’ll cover in Chapter 3.
Note: Software packing is also used by benign software installers for legitimate reasons.
Benign software authors use packing to deliver their code because it allows them to
compress program resources to reduce software installer download sizes. It also helps
them thwart reverse engineering attempts by business competitors, and it provides a
convenient way to bundle many program resources within a single installer file.
Resource Obfuscation
Another anti-detection, anti-analysis technique malware authors use is
resource obfuscation. They obfuscate the way program resources, such as
strings and graphical images, are stored on disk, and then deobfuscate
them at runtime so they can be used by the malicious program. For example, a simple obfuscation would be to add a value of 1 to all bytes in images
and strings stored in the PE resources section and then subtract 1 from all
of this data at runtime. Of course, any number of obfuscations are possible
here, all of which make life difficult for malware analysts attempting to
make sense of a malware binary using static analysis.
As with packing, one way around resource obfuscation is to just run the
malware in a safe environment. When this is not an option, the only mitigation for resource obfuscation is to actually figure out the ways in which malware has obfuscated its resources and to manually deobfuscate them, which
is what professional malware analysts often do.
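The add-1 scheme described above takes only a few lines to express, which is part of why such obfuscations are so cheap for malware authors. The sketch below is illustrative only: the function names are my own, and the sample string is borrowed from Listing 1-8.

```python
def obfuscate(data):
    """Add 1 to every byte, wrapping 0xFF around to 0x00."""
    return bytes((byte + 1) % 256 for byte in data)

def deobfuscate(data):
    """Subtract 1 from every byte, reversing obfuscate()."""
    return bytes((byte - 1) % 256 for byte in data)

resource = b"Server: myBot"
hidden = obfuscate(resource)

print(hidden)                 # no longer readable as the original string
print(b"myBot" in hidden)     # False: a strings search would miss it
print(deobfuscate(hidden))    # b'Server: myBot'
```

Once an analyst has worked out the transformation, reversing it statically is as simple as running the inverse function over the resource bytes.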
Anti-disassembly Techniques
A third group of anti-detection, anti-analysis techniques used by malware
authors are anti-disassembly techniques. These techniques are designed to
exploit the inherent limitations of state-of-the-art disassembly techniques
to hide code from malware analysts or make malware analysts think that
a block of code stored on disk contains different instructions than it actually does.
An example of an anti-disassembly technique involves branching to a
memory location that the malware analyst’s disassembler will interpret as
a different instruction, essentially hiding the malware’s true instructions
from reverse engineers. Anti-disassembly techniques have huge potential
and there’s no perfect way to defend against them. In practice, the two
main defenses against these techniques are to run malware samples in a
dynamic environment and to manually figure out where anti-disassembly
strategies manifest within a malware sample and how to bypass them.
Dynamically Downloaded Data
A final class of anti-analysis techniques malware authors use involves externally sourcing data and code. For example, a malware sample may load
code dynamically from an external server at malware startup time. If this is
the case, static analysis will be useless against such code. Similarly, malware
may source decryption keys from external servers at startup time and then
use these keys to decrypt data or code that will be used in the malware’s
execution.
Obviously, if the malware is using an industrial-strength encryption
algorithm, static analysis will not be sufficient to recover the encrypted data
and code. Such anti-analysis and anti-detection techniques are quite powerful, and the only way around them is to acquire the code, data, or private
keys on the external servers by some means and then use them in one’s
analysis of the malware in question.
Summary
This chapter introduced x86 assembly code analysis and demonstrated how
we can perform disassembly-based static analysis on ircbot.exe using open
source Python tools. Although this is not meant to be a complete primer
on x86 assembly, you should now feel comfortable enough that you have a
starting place for figuring out what’s going on in a given malware assembly dump. Finally, you learned ways in which malware authors can defend
against disassembly and other static analysis techniques, and how you can
mitigate these anti-analysis and anti-detection strategies. In Chapter 3,
you’ll learn to conduct dynamic malware analysis that makes up for many
of the weaknesses of static malware analysis.