information security

Malware Data Science

Mainly for educational purposes


1. Introduction

If you’re working in security, chances are you’re using data science more than ever before, even if you may not realize it. For example, your antivirus product uses data science algorithms to detect malware. Your firewall vendor may have data science algorithms detecting suspicious network activity. Your security information and event management (SIEM) software probably uses data science to identify suspicious trends in your data. Whether conspicuously or not, the entire security industry is moving toward incorporating more data science into security products.

Advanced IT security professionals are incorporating their own custom machine learning algorithms into their workflows. For example, in recent conference presentations and news articles, security analysts at Target, Mastercard, and Wells Fargo all described developing custom data science technologies that they use as part of their security workflows [1]. If you’re not already on the data science bandwagon, there’s no better time to upgrade your skills to include data science in your security practice.

What Is Data Science?

Data science is a growing set of algorithmic tools that allow us to understand and make predictions about data using statistics, mathematics, and artful statistical data visualizations. More specific definitions exist, but generally, data science has three subcomponents: machine learning, data mining, and data visualization. In the security context, machine learning algorithms learn from training data to detect new threats. These methods have been proven to detect malware that flies under the radar of traditional detection techniques like signatures. Data mining algorithms search security data for interesting patterns (such as relationships between threat actors) that might help us discern attack campaigns targeting our organizations. Finally, data visualization renders sterile, tabular data into graphical format to make it easier for people to spot interesting and suspicious trends. I cover all three areas in depth in this book and show you how to apply them.

Why Data Science Matters for Security

Data science is critically important for the future of cybersecurity for three reasons: first, security is all about data. When we seek to detect cyber threats, we’re analyzing data in the form of files, logs, network packets, and other artifacts. Traditionally, security professionals didn’t use data science techniques to make detections based on these data sources. Instead, they used file hashes, custom-written rules like signatures, and manually defined heuristics. Although these techniques have their merits, they required handcrafted techniques for each type of attack, necessitating too much manual work to keep up with the changing cyber threat landscape. In recent years, data science techniques have become crucial in bolstering our ability to detect threats. Second, data science is important to cybersecurity because the number of cyberattacks on the internet has grown dramatically. Take the growth of the malware underworld as an example. In 2008, there were about 1 million unique malware executables known to the security community. By 2012, there were 100 million. As this book goes to press in 2018, there are more than 700 million malicious executables known to the security community (https://www.av-test.org/en/statistics/malware/), and this number is likely to grow.

1. Target (https://www.rsaconference.com/events/us17/agenda/sessions/6662-applied-machinelearning-defeating-modern-malicious), Mastercard (https://blogs.wsj.com/cio/2017/11/15/artificialintelligence-transforms-hacker-arsenal/), and Wells Fargo (https://blogs.wsj.com/cio/2017/11/16/the-morning-download-first-ai-powered-cyberattacks-are-detected/).

Due to the sheer volume of malware, manual detection techniques such as signatures are no longer a reasonable method for detecting all cyberattacks. Because data science techniques automate much of the work that goes into detecting cyberattacks, and vastly decrease the memory usage needed to detect such attacks, they hold tremendous promise in defending networks and users as cyber threats grow.

Finally, data science matters for security because data science is the technical trend of the decade, both inside and outside of the security industry, and it will likely remain so through the next decade. Indeed, you’ve probably seen applications of data science everywhere: in personal voice assistants (Amazon Echo, Siri, and Google Home), self-driving cars, ad recommendation systems, web search engines, medical image analysis systems, and fitness tracking apps. We can expect data science–driven systems to have major impacts in legal services, education, and other areas. Because data science has become a key enabler across the technical landscape, universities, major companies (Google, Facebook, Microsoft, and IBM), and governments are investing billions of dollars to improve data science tools. Thanks to these investments, data science tools will become even more adept at solving hard attack-detection problems.

Applying Data Science to Malware

This book focuses on data science as it applies to malware, which we define as executable programs written with malicious intent, because malware continues to be the primary means by which threat actors gain a foothold on networks and subsequently achieve their goals. For example, in the ransomware scourge that has emerged in recent years, attackers typically send users malicious email attachments that download ransomware executables (malware) to users’ machines, which then encrypt users’ data and ask them for a ransom to decrypt the data. Although skilled attackers working for governments sometimes avoid using malware altogether to fly under the radar of detection systems, malware continues to be the major enabling technology in cyberattacks today.

By homing in on a specific application of security data science rather than attempting to cover security data science broadly, this book aims to show more thoroughly how data science techniques can be applied to a major security problem. By understanding malware data science, you’ll be better equipped to apply data science to other areas of security, like detecting network attacks, phishing emails, or suspicious user behavior. Indeed, almost all the techniques you’ll learn in this book apply to building data science detection and intelligence systems in general, not just for malware.

Who Should Read This Book?

This book is aimed toward security professionals who are interested in learning more about how to apply data science to computer security problems. If computer security and data science are new to you, you might find yourself having to look up terms to give yourself a little bit of context, but you can still read this book successfully. If you’re only interested in data science, but not security, this book is probably not for you.

About This Book

The first part of the book consists of three chapters that cover basic reverse engineering concepts necessary for understanding the malware data science techniques discussed later in the book. If you’re new to malware, read the first three chapters first. If you’re an old hand at malware reverse engineering, you can skip these chapters.

• Chapter 1: Basic Static Malware Analysis covers static analysis techniques for picking apart malware files and discovering how they achieve malicious ends on our computers.

• Chapter 2: Beyond Basic Static Analysis: x86 Disassembly gives you a brief overview of x86 assembly language and how to disassemble and reverse engineer malware.

• Chapter 3: A Brief Introduction to Dynamic Analysis concludes the reverse engineering section of the book by discussing dynamic analysis, which involves running malware in controlled environments to learn about its behavior.

The next two chapters of the book, Chapters 4 and 5, focus on malware relationship analysis, which involves looking at similarities and differences between collections of malware to identify malware campaigns against your organization, such as a ransomware campaign controlled by a group of cybercriminals, or a concerted, targeted attack on your organization. These stand-alone chapters are for readers who are interested not only in detecting malware, but also in extracting valuable threat intelligence to learn who is attacking their network. If you’re less interested in threat intelligence and more interested in data science–driven malware detection, you can safely skip these chapters.

• Chapter 4: Identifying Attack Campaigns Using Malware Networks shows you how to analyze and visualize malware based on shared attributes, such as the hostnames that malware programs call out to.

• Chapter 5: Shared Code Analysis explains how to identify and visualize shared code relationships between malware samples, which can help you identify whether groups of malware samples came from one or multiple criminal groups.

The next four chapters cover everything you need to know to understand, apply, and implement machine learning–based malware detection systems. These chapters also provide a foundation for applying machine learning to other security contexts.

• Chapter 6: Understanding Machine Learning–Based Malware Detectors is an accessible, intuitive, and non-mathematical introduction to basic machine learning concepts. If you have a history with machine learning, this chapter will provide a convenient refresher.

• Chapter 7: Evaluating Malware Detection Systems shows you how to evaluate the accuracy of your machine learning systems using basic statistical methods so that you can select the best possible approach.

• Chapter 8: Building Machine Learning Detectors introduces open source machine learning tools you can use to build your own machine learning systems and explains how to use them.

• Chapter 9: Visualizing Malware Trends covers how to visualize malware threat data to reveal attack campaigns and trends using Python, and how to integrate data visualization into your day-to-day workflow when analyzing security data.

The last three chapters introduce deep learning, an advanced area of machine learning that involves a bit more math. Deep learning is a hot growth area within security data science, and these chapters provide enough to get you started.

• Chapter 10: Deep Learning Basics covers the basic concepts that underlie deep learning.

• Chapter 11: Building a Neural Network Malware Detector with Keras explains how to implement deep learning–based malware detection systems in Python using open source tools.

• Chapter 12: Becoming a Data Scientist concludes the book by sharing different pathways to becoming a data scientist and qualities that can help you succeed in the field.

• Appendix: An Overview of Datasets and Tools describes the data and example tool implementations accompanying the book.

How to Use the Sample Code and Data

No good programming book is complete without sample code to play with and extend on your own. Sample code and data accompany each chapter of this book and are described exhaustively in the appendix. All the code targets Python 2.7 in Linux environments. To access the code and data, you can download a VirtualBox Linux virtual machine, which has the code, data, and supporting open source tools all set up and ready to go, and run it within your own VirtualBox environment. You can download the book’s accompanying data at http://www.malwaredatascience.com/, and you can download VirtualBox for free at https://www.virtualbox.org/wiki/Downloads. The code has been tested on Linux, but if you prefer to work outside of the Linux VirtualBox, the same code should work almost as well on macOS, and to a lesser extent on Windows machines.

If you’d rather install the code and data in your own Linux environment, you can download them here: http://www.malwaredatascience.com/. You’ll find a directory for each chapter in the downloadable archive, and within each chapter’s directory there are code/ and data/ directories that contain the corresponding code and data. Code files correspond to chapter listings or sections, whichever makes more sense for the application at hand. Some code files are exactly like the listings, whereas others have been changed slightly to make it easier for you to play with parameters and other options. Code directories come with pip requirements.txt files, which list the open source libraries that the code in that chapter depends on to run. To install these libraries on your machine, simply type pip install -r requirements.txt in each chapter’s code/ directory.

Now that you have access to the code and data for this book, let’s get started.

2. Basic Static Malware Analysis

In this chapter we look at the basics of static malware analysis. Static analysis is performed by analyzing a program file’s disassembled code, graphical images, printable strings, and other on-disk resources. It refers to reverse engineering without actually running the program. Although static analysis techniques have their shortcomings, they can help us understand a wide variety of malware. Through careful reverse engineering, you’ll be able to better understand the benefits that malware binaries provide attackers after they’ve taken possession of a target, as well as the ways attackers can hide and continue their attacks on an infected machine. As you’ll see, this chapter combines descriptions and examples. Each section introduces a static analysis technique and then illustrates its application in real-world analysis.

I begin this chapter by describing the Portable Executable (PE) file format used by most Windows programs, and then examine how to use the popular Python library pefile to dissect a real-world malware binary. I then describe techniques such as imports analysis, graphical image analysis, and strings analysis. In all cases, I show you how to use open source tools to apply the analysis technique to real-world malware. Finally, at the end of the chapter, I introduce ways malware can make life difficult for malware analysts and discuss some ways to mitigate these issues.

You’ll find the malware sample used in the examples in this chapter in this book’s data under the directory /ch1. To demonstrate the techniques discussed in this chapter, we use ircbot.exe, an Internet Relay Chat (IRC) bot created for experimental use, as an example of the kinds of malware commonly observed in the wild. As such, the program is designed to stay resident on a target computer while connected to an IRC server. After ircbot.exe gets hold of a target, attackers can control the target computer via IRC, allowing them to take actions such as turning on a webcam to capture and surreptitiously extract video feeds of the target’s physical location, taking screenshots of the desktop, extracting files from the target machine, and so on. Throughout this chapter, I demonstrate how static analysis techniques can reveal the capabilities of this malware.

The Microsoft Windows Portable Executable Format

To perform static malware analysis, you need to understand the Windows PE format, which describes the structure of modern Windows program files such as .exe, .dll, and .sys files and defines the way they store data. PE files contain x86 instructions, data such as images and text, and metadata that a program needs in order to run.

The PE format was originally designed to do the following:

Tell Windows how to load a program into memory

The PE format describes which chunks of a file should be loaded into memory, and where. It also tells you where in the program code Windows should start a program’s execution and which dynamically linked code libraries should be loaded into memory.

Supply media (or resources) a running program may use in the course of its execution

These resources can include strings of characters like the ones in GUI dialogs or console output, as well as images or videos.

Supply security data such as digital code signatures

Windows uses such security data to ensure that code comes from a trusted source.

The PE format accomplishes all of this by leveraging the series of constructs shown in Figure 1-1.

Figure 1-1: The PE file format (from lowest to highest file offset: DOS header, PE header, optional header, section headers, .text section (program code), .idata section (imported libraries), .rsrc section (strings, images, ...), .reloc section (memory translations))

As the figure shows, the PE format includes a series of headers telling the operating system how to load the program into memory. It also includes a series of sections that contain the actual program data. Windows loads the sections into memory such that their memory offsets correspond to where they appear on disk. Let’s explore this file structure in more detail, starting with the PE header. We’ll skip over a discussion of the DOS header, which is a relic of the 1980s-era Microsoft DOS operating system and only present for compatibility reasons.

The PE Header

Shown in Figure 1-1 immediately after the DOS header is the PE header, which defines a program’s general attributes such as binary code, images, compressed data, and other program attributes. It also tells us whether a program is designed for 32- or 64-bit systems. The PE header provides basic but useful contextual information to the malware analyst. For example, the header includes a timestamp field that can give away the time at which the malware author compiled the file. The field gives this away whenever malware authors forget to replace it with a bogus value, which happens often.

The Optional Header

The optional header is actually ubiquitous in today’s PE executable programs, contrary to what its name suggests. It defines the location of the program’s entry point in the PE file, which refers to the first instruction the program runs once loaded.
It also defines the size of the data that Windows loads into memory as it loads the PE file, the Windows subsystem that the program targets (such as the Windows GUI or the Windows command line), and other high-level details about the program. The information in this header can prove invaluable to reverse engineers, because a program’s entry point tells them where to begin reverse engineering.

Section Headers

Section headers describe the data sections contained within a PE file. A section in a PE file is a chunk of data that either will be mapped into memory when the operating system loads a program or will contain instructions about how the program should be loaded into memory. In other words, a section is a sequence of bytes on disk that will either become a contiguous string of bytes in memory or inform the operating system about some aspect of the loading process. Section headers also tell Windows what permissions it should grant to sections, such as whether they should be readable, writable, or executable by the program when it’s executing. For example, the .text section containing x86 code will typically be marked readable and executable but not writable to prevent program code from accidentally modifying itself in the course of execution.

A number of sections, such as .text and .rsrc, are depicted in Figure 1-1. These get mapped into memory when the PE file is executed. Other special sections, such as the .reloc section, aren’t mapped into memory. We’ll discuss these sections as well. Let’s go over the sections shown in Figure 1-1.

The .text Section

Each PE program contains at least one section of x86 code marked executable in its section header; these sections are almost always named .text. We’ll disassemble the data in the .text section when performing program disassembly and reverse engineering in Chapter 2.
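To make the header layout described so far more concrete, here is a minimal sketch that walks the DOS header, PE header, optional header, and section headers using only Python’s standard struct module. It is an illustration rather than a replacement for a real parser: it assumes Python 3, assumes a well-formed file, and the function name parse_pe_headers and the shape of its return value are choices made for this example only.

```python
import struct
from datetime import datetime, timezone

# Section permission bits from the section header's Characteristics field.
MEM_EXECUTE = 0x20000000
MEM_READ = 0x40000000
MEM_WRITE = 0x80000000


def parse_pe_headers(data):
    """Return basic header fields and section permissions from raw PE bytes.

    Hypothetical helper for illustration; it does no bounds checking.
    """
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # The DOS header stores the file offset of the PE header at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # 20-byte file header that follows the 4-byte PE signature.
    (machine, num_sections, timestamp, _symtab, _nsyms,
     opt_size, _flags) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)

    # Optional header: magic at +0, AddressOfEntryPoint at +16.
    opt_offset = pe_offset + 24
    (magic,) = struct.unpack_from("<H", data, opt_offset)
    (entry_point,) = struct.unpack_from("<I", data, opt_offset + 16)

    # The section table (40 bytes per entry) follows the optional header.
    sections = []
    table = opt_offset + opt_size
    for i in range(num_sections):
        (name, vsize, vaddr, raw_size, _raw_ptr, _rel, _lines, _nrel,
         _nlines, chars) = struct.unpack_from("<8sIIIIIIHHI", data,
                                              table + 40 * i)
        perms = "".join(flag if chars & bit else "-"
                        for flag, bit in (("r", MEM_READ),
                                          ("w", MEM_WRITE),
                                          ("x", MEM_EXECUTE)))
        sections.append((name.rstrip(b"\x00").decode("ascii", "replace"),
                         hex(vaddr), hex(vsize), raw_size, perms))

    return {
        "bits": 64 if magic == 0x20B else 32,
        "machine": hex(machine),
        "compiled": datetime.fromtimestamp(timestamp, tz=timezone.utc),
        "entry_point": hex(entry_point),
        "sections": sections,
    }
```

To try it on a real file, read the file’s bytes in binary mode and pass them to parse_pe_headers. A typical code section should come back with the permission string r-x, matching the readable and executable but not writable convention described above.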
The .idata Section

The .idata section, also called imports, contains the Import Address Table (IAT), which lists dynamically linked libraries and their functions. The IAT is among the most important PE structures to inspect when initially approaching a PE binary for analysis because it reveals the library calls a program makes, which in turn can betray the malware’s high-level functionality.

The Data Sections

The data sections in a PE file can include sections like .rsrc, .data, and .rdata, which store items such as mouse cursor images, button skins, audio, and other media used by a program. For example, the .rsrc section in Figure 1-1 contains printable character strings that a program uses to render text as strings.

The information in the .rsrc (resources) section can be vital to malware analysts because by examining the printable character strings, graphical images, and other assets in a PE file, they can gain vital clues about the file’s functionality. In “Examining Malware Images,” you’ll learn how to use the icoutils toolkit (including icotool and wrestool) to extract graphical images from malware binaries’ resources sections. Then, in “Examining Malware Strings,” you’ll learn how to extract printable strings from malware resources sections.

The .reloc Section

A PE binary’s code is not position independent, which means it will not execute correctly if it’s moved from its intended memory location to a new memory location. The .reloc section gets around this by allowing code to be moved without breaking. It tells the Windows operating system to translate memory addresses in a PE file’s code if the code has been moved so that the code still runs correctly. These translations usually involve adding or subtracting an offset from a memory address. Although a PE file’s .reloc section may well contain information you’ll want to use in your malware analysis, we won’t discuss it further in this book because our focus is on applying machine learning and data analysis to malware, not doing the kind of hardcore reverse engineering that involves looking at relocations.
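For a concrete picture of the add-or-subtract translation mentioned above, the following sketch shows only the arithmetic. It does not parse the .reloc section, and the function name and all addresses are hypothetical values chosen for illustration.

```python
def rebase_address(address, preferred_base, actual_base):
    """Translate an absolute address after the image moved in memory."""
    # The same offset is applied to every address that needs fixing up.
    delta = actual_base - preferred_base
    return address + delta


# Hypothetical values chosen only for illustration.
preferred_base = 0x400000    # base address the code was built for
actual_base = 0x10000000     # base address the loader actually used
pointer = 0x401000           # absolute address embedded in the code

print(hex(rebase_address(pointer, preferred_base, actual_base)))
```

If the image lands below its intended base, the offset is negative and the same function subtracts instead of adding.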

Dissecting the PE Format Using pefile

The pefile Python module, written and maintained by Ero Carrera, has become an industry-standard malware analysis library for dissecting PE files. In this section, I show you how to use pefile to dissect ircbot.exe. The ircbot.exe file can be found on the virtual machine accompanying this book in the directory ~/malware_data_science/ch1/data.

Listing 1-1 assumes that ircbot.exe is in your current working directory. Enter the following to install the pefile library so that we can import it within Python:

$ pip install pefile

Now, use the commands in Listing 1-1 to start Python, import the pefile module, and open and parse the PE file ircbot.exe using pefile.

$ python

>>> import pefile

>>> pe = pefile.PE("ircbot.exe")

Listing 1-1: Loading the pefile module and parsing a PE file (ircbot.exe)

We instantiate pefile.PE, which is the core class implemented by the pefile module. It parses PE files so that we can examine their attributes. By calling the PE constructor, we load and parse the specified PE file, which is ircbot.exe in this example. Now that we’ve loaded and parsed our file, run the code in Listing 1-2 to pull information from ircbot.exe’s PE fields.

# based on Ero Carrera's example code (pefile library author)
for section in pe.sections:
    print(section.Name, hex(section.VirtualAddress),
          hex(section.Misc_VirtualSize), section.SizeOfRawData)

Listing 1-2: Iterating through the PE file’s sections and printing information about them

Listing 1-3 shows the output.

('.text\x00\x00\x00', '0x1000', '0x32830', 207360)
('.rdata\x00\x00', '0x34000', '0x427a', 17408)
('.data\x00\x00\x00', '0x39000', '0x5cff8', 10752)
('.idata\x00\x00', '0x96000', '0xbb0', 3072)
('.reloc\x00\x00', '0x97000', '0x211d', 8704)

Listing 1-3: Pulling section data from ircbot.exe using Python’s pefile module

As you can see in Listing 1-3, we’ve pulled data from five different sections of the PE file: .text, .rdata, .data, .idata, and .reloc. The output is given as five tuples, one for each PE section pulled. The first entry on each line identifies the PE section. (You can ignore the series of \x00 null bytes, which are simply C-style null string terminators.) The remaining fields tell us where in memory each section will be found once loaded and how much space it will occupy. For example, in the .text tuple, 0x1000 is the virtual address, relative to the base of the loaded image, at which the section will be loaded. Think of this as the section’s base memory address. The 0x32830 in the virtual size field specifies the amount of memory required by the section once loaded. The 207360 in the final field is the size of the section’s raw data on disk, which is the data that fills that chunk of memory.

In addition to using pefile to parse a program’s sections, we can also use it to list the DLLs a binary will load, as well as the function calls it will request within those DLLs. We can do this by dumping a PE file’s IAT. Listing 1-4 shows how to use pefile to dump the IAT for ircbot.exe.

$ python
>>> import pefile
>>> pe = pefile.PE("ircbot.exe")
>>> for entry in pe.DIRECTORY_ENTRY_IMPORT:
...     print(entry.dll)
...     for function in entry.imports:
...         print('\t%s' % function.name)

Listing 1-4: Extracting imports from ircbot.exe

Listing 1-4 should produce the output shown in Listing 1-5 (truncated for brevity).

KERNEL32.DLL
    GetLocalTime
    ExitThread
    CloseHandle
    WriteFile
    CreateFileA
    ExitProcess
    CreateProcessA
    GetTickCount
    GetModuleFileNameA
--snip--

Listing 1-5: Contents of the IAT of ircbot.exe, showing library functions used by this malware

As you can see in Listing 1-5, this output is valuable for malware analysis because it lists a rich array of functions that the malware declares and will reference. For example, the first few lines of the output tell us that the malware will write to files using WriteFile, open files using the CreateFileA call, and create new processes using CreateProcessA. Although this is fairly basic information about the malware, it’s a start in understanding the malware’s behavior in more detail.
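The kind of reading applied to Listing 1-5 can be sketched as a small lookup from imported function names to the behavior they suggest. The mapping below is a hand-picked, hypothetical table covering only the three functions discussed above, and summarize_imports is a name invented for this example; a real analysis would need a far larger table.

```python
# Hypothetical, hand-picked mapping from Windows API names to behaviors.
CAPABILITY_HINTS = {
    "WriteFile": "writes to files",
    "CreateFileA": "opens or creates files",
    "CreateProcessA": "creates new processes",
}


def summarize_imports(function_names):
    """Return {behavior: [functions]} for names found in the hint table."""
    summary = {}
    for name in function_names:
        hint = CAPABILITY_HINTS.get(name)
        if hint is not None:
            summary.setdefault(hint, []).append(name)
    return summary


# Function names copied from Listing 1-5.
imports = ["GetLocalTime", "ExitThread", "CloseHandle", "WriteFile",
           "CreateFileA", "ExitProcess", "CreateProcessA", "GetTickCount",
           "GetModuleFileNameA"]

for behavior, names in sorted(summarize_imports(imports).items()):
    print("%s: %s" % (behavior, ", ".join(names)))
```

Functions that are not in the table are simply ignored, so the summary understates what a binary can do rather than overstating it.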

Examining Malware Images

To understand how malware may be designed to game a target, let’s look at the icons contained in its .rsrc section. For example, malware binaries are often designed to trick users into clicking them by masquerading as Word documents, game installers, PDF files, and so on. You also find images in the malware suggesting programs of interest to the attackers themselves, such as network attack tools and programs run by attackers for the remote control of compromised machines. I have even seen binaries containing desktop icons of jihadists, images of evil-looking cyberpunk cartoon characters, and images of Kalashnikov rifles. For our sample image analysis, let’s consider a malware sample the security company Mandiant identified as having been crafted by a Chinese state-sponsored hacking group. You can find this sample malware in this chapter’s data directory under the name fakepdfmalware.exe. This sample uses an Adobe Acrobat icon to trick users into thinking it is an Adobe Acrobat document, when in fact it’s a malicious PE executable. Before we can extract the images from the fakepdfmalware.exe binary using the Linux command line tool wrestool, we first need to create a directory to hold the images we’ll extract. Listing 1-6 shows how to do all this.

$ mkdir images
$ wrestool -x fakepdfmalware.exe --output=images
$ icotool -x -o images images/*.ico

Listing 1-6: Shell commands that extract images from a malware sample

We first use mkdir images to create a directory to hold the extracted images. Next, we use wrestool to extract image resources (-x) from fakepdfmalware.exe into the images directory, and then use icotool to extract (-x) any resources in the .ico icon format and convert them into .png graphics in that same directory (-o images) so that we can view them using standard image viewer tools. If you don’t have wrestool installed on your system, you can download it at http://www.nongnu.org/icoutils/.

Once you’ve used icotool to convert the images in the target executable to the PNG format, you should be able to open them in your favorite image viewer and see the Adobe Acrobat icon at various resolutions. As my example here demonstrates, extracting images and icons from PE files is relatively straightforward and can quickly reveal interesting and useful information about malware binaries. Similarly, we can easily extract printable strings from malware for more information, which we’ll do next.

Examining Malware Strings

Strings are sequences of printable characters within a program binary. Malware analysts often rely on strings in a malicious sample to get a quick sense of what may be going on inside it. These strings often contain things like HTTP and FTP commands that download web pages and files, IP addresses and hostnames that tell you what addresses the malware connects to, and the like. Sometimes even the language used to write the strings can hint at a malware binary’s country of origin, though this can be faked. You may even find text in a string that explains in leetspeak the purpose of a malicious binary. Strings can also reveal more technical information about a binary. For example, you may find information about the compiler used to create it, the programming language the binary was written in, embedded scripts or HTML, and so on. Although malware authors can obfuscate, encrypt, and compress all of these traces, even advanced malware authors often leave at least some traces exposed, making it particularly important to examine strings dumps when analyzing malware.
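As a rough illustration of pulling the network-related hints mentioned above out of a list of strings, here is a minimal sketch using regular expressions. The patterns are deliberately simple assumptions (dotted-quad IPv4 addresses and http, https, or ftp URLs), the function name is invented for this example, and the sample strings are made up rather than taken from any real malware.

```python
import re

# Simplified patterns, assumed for illustration only.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL = re.compile(r"\b(?:https?|ftp)://[^\s\"'<>]+", re.IGNORECASE)


def network_indicators(strings):
    """Collect IPv4-looking and URL-looking substrings from dumped strings."""
    ips, urls = set(), set()
    for s in strings:
        urls.update(URL.findall(s))
        for candidate in IPV4.findall(s):
            # Keep only dotted quads whose octets are in the 0-255 range.
            if all(int(octet) <= 255 for octet in candidate.split(".")):
                ips.add(candidate)
    return sorted(ips), sorted(urls)


# Made-up sample strings, not taken from a real binary.
sample = ["connect 192.0.2.15 port 6667",
          "GET http://example.com/update.exe",
          "version 1.2.3",
          "999.1.1.1"]

print(network_indicators(sample))
```

Simple patterns like these will miss obfuscated addresses and can flag version numbers that happen to look like IP addresses, so the results are leads to check by hand rather than conclusions.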

Using the strings Program

The standard way to view all strings in a file is to use the command line tool strings, which uses the following syntax:

$ strings filepath | less

This command prints all strings in a file to the terminal, line by line. Adding | less at the end pages the output so that the strings don’t just scroll past in the terminal. By default, the strings command finds all printable strings with a minimum length of 4 bytes, but you can set a different minimum length and change various other parameters, as listed in the command’s manual page. I recommend simply using the default minimum string length of 4, but you can change the minimum string length using the -n option. For example, strings -n 10 filepath would extract only strings with a minimum length of 10 bytes.
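To show what the strings tool is doing under the hood, here is a minimal Python 3 sketch that approximates its default behavior: it scans raw bytes for runs of printable ASCII characters of at least a minimum length. It is an approximation written for illustration, not a reimplementation of the real tool, and extract_strings is a name chosen for this example.

```python
import re


def extract_strings(data, min_length=4):
    """Return runs of printable ASCII bytes at least min_length long."""
    # Printable ASCII (space through tilde) plus the tab character.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up bytes mixing binary junk with two readable strings.
blob = b"\x00\x01MZ\x90\x00hello world\x00abc\x00[DOWNLOAD]: Opened: %s.\x00"

for s in extract_strings(blob):
    print(s)
```

Passing a larger min_length plays the same role as the -n option: short runs such as the three-character abc above are dropped at the default of 4, and raising the threshold filters out more noise at the cost of missing short but meaningful strings.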

Analyzing Your strings Dump

Now that we’ve dumped a malware program’s printable strings, the challenge is to understand what the strings mean. For example, let’s say we dump the strings for ircbot.exe, which we explored earlier in this chapter using the pefile library, to the file ircbotstring.txt, like this:

$ strings ircbot.exe > ircbotstring.txt

The file ircbotstring.txt contains thousands of lines of text, but some of these lines should stick out. For example, Listing 1-7 shows a group of lines extracted from the string dump that begin with the word DOWNLOAD.

[DOWNLOAD]: Bad URL, or DNS Error: %s.
[DOWNLOAD]: Update failed: Error executing file: %s.
[DOWNLOAD]: Downloaded %.1fKB to %s @ %.1fKB/sec. Updating.
[DOWNLOAD]: Opened: %s.
--snip--
[DOWNLOAD]: Downloaded %.1f KB to %s @ %.1f KB/sec.
[DOWNLOAD]: CRC Failed (%d != %d).
[DOWNLOAD]: Filesize is incorrect: (%d != %d).
[DOWNLOAD]: Update: %s (%dKB transferred).
[DOWNLOAD]: File download: %s (%dKB transferred).
[DOWNLOAD]: Couldn't open file: %s.

Listing 1-7: The strings output showing evidence that the malware can download files specified by the attacker onto a target machine

These lines indicate that ircbot.exe will attempt to download files specified by an attacker onto the target machine. Let’s try analyzing another one. The string dump shown in Listing 1-8 indicates that ircbot.exe can act as a web server that listens on the target machine for connections from the attacker.

GET
HTTP/1.0 200 OK
Server: myBot
Cache-Control: no-cache,no-store,max-age=0
pragma: no-cache
Content-Type: %s
Content-Length: %i
Accept-Ranges: bytes
Date: %s %s GMT
Last-Modified: %s %s GMT
Expires: %s %s GMT
Connection: close
HTTP/1.0 200 OK
Server: myBot
Cache-Control: no-cache,no-store,max-age=0
pragma: no-cache
Content-Type: %s
Accept-Ranges: bytes
Date: %s %s GMT
Last-Modified: %s %s GMT
Expires: %s %s GMT
Connection: close
HH:mm:ss
ddd, dd MMM yyyy
application/octet-stream
text/html

Listing 1-8: The strings output showing that the malware has an HTTP server to which the attacker can connect

Listing 1-8 shows a wide variety of HTTP boilerplate used by ircbot.exe to implement an HTTP server. It’s likely that this HTTP server allows the attacker to connect to a target machine via HTTP to issue commands, such as the command to take a screenshot of the victim’s desktop and send it back to the attacker. We see evidence of HTTP functionality throughout the listing. For example, the GET method requests data from an internet resource. The line HTTP/1.0 200 OK is an HTTP string that returns the status code 200, indicating that all went well with an HTTP network transaction, and Server: myBot indicates that the name of the HTTP server is myBot, a giveaway that ircbot.exe has a built-in HTTP server.

All of this information is useful in understanding and stopping a particular malware sample or malicious campaign. For example, knowing that a malware sample has an HTTP server that outputs certain strings when you connect to it allows you to scan your network to identify infected hosts.
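Reading thousands of lines by eye gets tedious, so one way to triage a string dump is to tag lines that match patterns of interest. The sketch below is only an illustration: the INDICATORS patterns and the tag_lines helper are hypothetical names invented here, not a vetted rule set, and the few dump lines stand in for the real ircbotstring.txt.

```python
import re

# A few lines standing in for the contents of ircbotstring.txt.
dump = """\
[DOWNLOAD]: Bad URL, or DNS Error: %s.
kernel32.dll
HTTP/1.0 200 OK
Server: myBot
[DOWNLOAD]: CRC Failed (%d != %d).
"""

# Hypothetical indicator patterns, chosen to match the listings discussed above.
INDICATORS = {
    "download": re.compile(r"^\[DOWNLOAD\]"),
    "http_server": re.compile(r"^(HTTP/1\.[01] \d{3}|Server:)"),
}

def tag_lines(text):
    """Group the lines of a string dump by the indicator pattern they match."""
    hits = {name: [] for name in INDICATORS}
    for line in text.splitlines():
        for name, pattern in INDICATORS.items():
            if pattern.search(line):
                hits[name].append(line)
    return hits

hits = tag_lines(dump)
for name, lines in hits.items():
    print(name, len(lines))
```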

Summary

In this chapter, you got a high-level overview of static malware analysis, which involves inspecting a malware program without actually running it. You learned about the PE file format that defines Windows .exe and .dll files, and you learned how to use the Python library pefile to dissect a real-world malware binary, ircbot.exe. You also used static analysis techniques such as image analysis and strings analysis to extract more information from malware samples. Chapter 2 continues our discussion of static malware analysis with a focus on analyzing the assembly code that can be recovered from malware.

3. Beyond Basic Static Analysis: x86 Disassembly

To thoroughly understand a malicious program, we often need to go beyond basic static analysis of its sections, strings, imports, and images. This involves reverse engineering a program’s assembly code. Indeed, disassembly and reverse engineering lie at the heart of deep static analysis of malware samples. Because reverse engineering is an art, technical craft, and science, a thorough exploration is beyond the scope of this chapter. My goal here is to introduce you to reverse engineering so that you can apply it to malware data science. Understanding this methodology is essential for successfully applying machine learning and data analysis to malware. In this chapter I start with the concepts you’ll need to understand x86 disassembly. Later in the chapter I show how malware authors attempt to bypass disassembly and discuss ways to mitigate these anti-analysis and anti-detection maneuvers. But first, let’s review some common disassembly methods as well as the basics of x86 assembly language.

Disassembly Methods

Disassembly is the process by which we translate malware’s binary code into valid x86 assembly language. Malware authors generally write malware programs in a high-level language like C or C++ and then use a compiler to compile the source code into x86 binary code. Assembly language is the human-readable representation of this binary code. Therefore, disassembling a malware program into assembly language is necessary to understand how it behaves at its core.

Unfortunately, disassembly is no easy feat because malware authors regularly employ tricks to thwart would-be reverse engineers. In fact, perfect disassembly in the face of deliberate obfuscation is an unsolved problem in computer science. Currently, only approximate, error-prone methods exist for disassembling such programs. For example, consider the case of self-modifying code, or binary code that modifies itself as it executes. The only way to disassemble this code properly is to understand the program logic by which the code modifies itself, but that can be exceedingly complex.

Because perfect disassembly is currently impossible, we must use imperfect methods to accomplish this task. The method we’ll use is linear disassembly, which involves identifying the contiguous sequence of bytes in the Portable Executable (PE) file that corresponds to its x86 program code and then decoding these bytes in order, one instruction after another. The key limitation of this approach is that it ignores subtleties about how instructions are decoded by the CPU in the course of program execution. Also, it doesn’t account for the various obfuscations malware authors sometimes use to make their programs harder to analyze.

The other disassembly methods, which we won’t cover here, are the more complex ones used by industrial-grade disassemblers such as IDA Pro. These more advanced methods actually simulate or reason about program execution to discover which assembly instructions a program might reach as a result of a series of conditional branches. Although this type of disassembly can be more accurate than linear disassembly, it’s far more CPU intensive, making it less suitable for data science purposes where the focus is on disassembling thousands or even millions of programs.

Before you can begin analysis using linear disassembly, however, you’ll need to review the basic components of assembly language.
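The core loop of linear disassembly can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real decoder: the OPCODES table holds only five genuine one-byte x86 opcodes with their instruction lengths, the mnemonic text is simplified, and linear_sweep is a name made up for this example.

```python
# Tiny table of real one-byte x86 opcodes: opcode -> (label, total length in bytes).
# It covers only what this example needs; a real decoder handles far more.
OPCODES = {
    0x55: ("push ebp", 1),
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xB8: ("mov eax, imm32", 5),   # opcode byte + 4-byte immediate
    0xEB: ("jmp short rel8", 2),   # opcode byte + 1-byte relative offset
}

def linear_sweep(code, base=0):
    """Decode bytes front to back, never following branches."""
    offset, listing = 0, []
    while offset < len(code):
        label, length = OPCODES.get(code[offset], ("(unknown byte)", 1))
        listing.append((base + offset, label))
        offset += length   # move straight on to the next undecoded byte
    return listing

# Hypothetical code bytes and load address.
code = bytes([0x55, 0xB8, 0x0A, 0x00, 0x00, 0x00, 0x90, 0xC3])
for address, label in linear_sweep(code, base=0x401000):
    print(hex(address), label)
```

Because the loop only ever advances by the length of the instruction it just decoded, it is cheap to run over many files, which is the property that makes this approach attractive at scale.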

Basics of x86 Assembly Language

Assembly language is the lowest-level human-readable programming language for a given architecture, and it maps closely to the binary instruction format of a particular CPU architecture. A line of assembly language is almost always equivalent to a single CPU instruction. Because assembly is so low level, you can often retrieve it easily from a malware binary by using the right tools.

Gaining basic proficiency in reading disassembled malware x86 code is easier than you might think. This is because most malware assembly code spends most of its time calling into the operating system by way of the Windows operating system’s dynamic-link libraries (DLLs), which are loaded into program memory at runtime. Malware programs use DLLs to do most of the real work, such as modifying the system registry, moving and copying files, making network connections and communicating via network protocols, and so on. Therefore, following malware assembly code often involves understanding the ways in which function calls are made from assembly and understanding what various DLL calls do. Of course, things can get much more complicated, but knowing this much can reveal a lot about the malware.

In the following sections I introduce some important assembly language concepts. I also explain some abstract concepts like control flow and control flow graphs. Finally, we disassemble the ircbot.exe program and explore how its assembly and control flow can give us insight into its purpose.

There are two major dialects of x86 assembly: Intel and AT&T. In this book I use Intel syntax, which can be obtained from all major disassemblers and is the syntax used in the official Intel documentation of the x86 CPU. Let’s start by taking a look at CPU registers.

CPU Registers

Registers are small data storage units on which x86 CPUs perform computations. Because registers are located on the CPU itself, register access is orders of magnitude faster than memory access. This is why core computational operations, such as arithmetic and condition testing instructions, all target registers. It’s also why the CPU uses registers to store information about the status of running programs. Although many registers are available to experienced x86 assembly programmers, we’ll just focus on a few important ones here.

General-Purpose Registers

General-purpose registers are like scratch space for assembly programmers. On a 32-bit system, each of these registers contains 32, 16, or 8 bits of space against which we can perform arithmetic operations, bitwise operations, byte order–swapping operations, and more. In common computational workflows, programs move data into registers from memory or from external hardware devices, perform some operations on this data, and then move the data back out to memory for storage. For example, to sort a long list, a program typically pulls list items in from an array in memory, compares them in the registers, and then writes the comparison results back out to memory. To understand some of the nuances of the general-purpose register model in the Intel 32-bit architecture, see Figure 2-1.

Figure 2-1: Registers in the x86 architecture

The vertical axis shows the layout of the general-purpose registers, and the horizontal axis shows how EAX, EBX, ECX, and EDX are subdivided. EAX, EBX, ECX, and EDX are 32-bit registers that have smaller, 16-bit registers inside them: AX, BX, CX, and DX. As you can see in the figure, these 16-bit registers can be subdivided into upper and lower 8-bit registers: AH, AL, BH, BL, CH, CL, DH, and DL. Although it’s sometimes useful to address the subdivisions in EAX, EBX, ECX, and EDX, you’ll mostly see direct references to EAX, EBX, ECX, and EDX.
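The way the smaller registers overlay the larger ones can be mimicked with bit masks. The following Python sketch assumes a made-up value in EAX and simply shows which bits AX, AH, and AL refer to.

```python
eax = 0x12345678          # hypothetical 32-bit value held in EAX

ax = eax & 0xFFFF         # AX: the low 16 bits of EAX
ah = (eax >> 8) & 0xFF    # AH: the high byte of AX
al = eax & 0xFF           # AL: the low byte of AX

print(hex(ax), hex(ah), hex(al))

# Writing to AL replaces only the low 8 bits of EAX; the rest is untouched.
eax = (eax & 0xFFFFFF00) | 0xFF
print(hex(eax))
```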

Stack and Control Flow Registers

The stack management registers store critical information about the program stack, which is responsible for storing local variables for functions, arguments passed into functions, and control information relating to the program control flow. Let’s go over some of these registers.

In simple terms, the ESP register points to the top of the stack for the currently executing function, whereas the EBP register points to the bottom of the stack for the currently executing function. This is crucial information for modern programs, because it means that by referencing data relative to the stack rather than using its absolute address, procedural and object-oriented code can access local variables more gracefully and efficiently.

Although you won’t see direct references to the EIP register in x86 assembly code, it’s important in security analysis, particularly in the context of vulnerability research and buffer-overflow exploit development. This is because EIP contains the memory address of the instruction the CPU will execute next. Attackers can use buffer-overflow exploits to corrupt the value of the EIP register indirectly and take control of program execution.

In addition to its role in exploitation, EIP is also important in the analysis of malicious code deployed by malware. Using a debugger we can inspect EIP’s value at any moment, which helps us understand what code malware is executing at any particular time.

EFLAGS is a status register that contains CPU flags, which are bits that store status information about the state of the currently executing program. The EFLAGS register is central to the process of making conditional branches, or changes in execution flow resulting from the outcome of if/then-style program logic, within x86 programs. Specifically, whenever an x86 assembly program checks whether some value is greater or less than zero and then jumps to another location based on the outcome of this test, the EFLAGS register plays an enabling role, as described in more detail in “Basic Blocks and Control Flow Graphs.”

Arithmetic Instructions

Instructions operate on general-purpose registers. You can perform simple computations with the general-purpose registers using arithmetic instructions. For example, add, sub, inc, dec, and mul are arithmetic instructions you’ll encounter frequently in malware reverse engineering. Table 2-1 lists some examples of basic instructions and their syntax.

Table 2-1: Arithmetic Instructions

Instruction      Description
add ebx, 100     Adds 100 to the value in EBX and then stores the result in EBX
sub ebx, 100     Subtracts 100 from the value in EBX and then stores the result in EBX
inc ah           Increments the value in AH by 1
dec al           Decrements the value in AL by 1

The add instruction adds two integers and stores the result in the first operand specified, whether this is a memory location or a register. Keep in mind that only one of the two operands can be a memory location. The sub instruction is similar to add, except it subtracts integers. The inc instruction increments a register or memory location’s integer value, whereas dec decrements a register or memory location’s integer value.
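The effect of these instructions on a register can be modeled in a few lines of Python. This is a simplified sketch: the regs dictionary and the helper functions are invented for illustration, only full 32-bit registers are modeled, and the CPU flags that real arithmetic instructions update are ignored.

```python
MASK32 = 0xFFFFFFFF  # registers modeled here are 32 bits wide

# Hypothetical starting values.
regs = {"eax": 0, "ebx": 16}

def add(dst, value):
    """add dst, value: dst = dst + value, truncated to 32 bits."""
    regs[dst] = (regs[dst] + value) & MASK32

def sub(dst, value):
    """sub dst, value: dst = dst - value, truncated to 32 bits."""
    regs[dst] = (regs[dst] - value) & MASK32

def inc(dst):
    add(dst, 1)

def dec(dst):
    sub(dst, 1)

add("ebx", 100)   # ebx: 16 -> 116
inc("ebx")        # ebx: 116 -> 117
sub("ebx", 100)   # ebx: 117 -> 17
dec("eax")        # eax: 0 -> 0xFFFFFFFF (the value wraps around)

print(regs)
```

The wraparound in the last step is worth noticing: registers have a fixed width, so arithmetic on them is modular rather than unbounded as it is for Python integers.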

Data Movement Instructions

The x86 processor provides a robust set of instructions for moving data between registers and memory. These instructions provide the underlying mechanisms that allow us to manipulate data. The staple memory movement instruction is the mov instruction. Table 2-2 shows how you can use the mov instruction to move data around.

Table 2-2: Data Movement Instructions

Instruction               Description
mov ebx, eax              Moves the value in register EAX into register EBX
mov eax, [0x12345678]     Moves the data at memory address 0x12345678 into the EAX register
mov edx, 1                Moves the value 1 into the register EDX
mov [0x12345678], eax     Moves the value in EAX into the memory location 0x12345678

Related to the mov instruction, the lea (load effective address) instruction computes the memory address described by its second operand and loads that address, rather than the data stored there, into the destination register. It is typically used to get a pointer to a memory location. For example, lea edx, [esp-4] subtracts 4 from the value in ESP and loads the resulting value into EDX.
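The difference between mov and lea with a bracketed operand is easy to see side by side. In this minimal sketch, the memory dictionary, the addresses, and the register values are all hypothetical stand-ins.

```python
# Hypothetical memory (address -> 32-bit value) and register state.
memory = {0x1000: 0xDEADBEEF}
regs = {"esp": 0x1004, "eax": 0, "edx": 0}

# mov eax, [esp-4]: compute the address, then read the data stored there.
regs["eax"] = memory[regs["esp"] - 4]

# lea edx, [esp-4]: compute the same address, but keep the address itself.
regs["edx"] = regs["esp"] - 4

print(hex(regs["eax"]), hex(regs["edx"]))
```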

Stack Instructions

The stack in x86 assembly is a data structure that allows you to push values onto it and pop values off of it. This is similar to how you would add and remove plates on and off the top of a stack of plates. Because control flow is often expressed through C-style function calls in x86 assembly and because these function calls use the stack to pass arguments, allocate local variables, and remember what part of the program to return to after a function finishes executing, the stack and control flow need to be understood together.

The push instruction pushes values onto the program stack, for example when the programmer wants to save a register value, and the pop instruction removes values from the stack and places them into a designated register. The push instruction uses the following syntax to perform its operations:

push 1

In this example, the program points the stack pointer (the register ESP) to a new memory address, thereby making room for the value (1). Then it copies the value from the argument to the memory location the CPU has just made room for, which is now the top location on the stack. Let’s contrast this with pop:

pop eax

The program uses pop to pop the top value off the stack and move it into a specified register. In this example, pop eax pops the top value off the stack and moves it into EAX.

An unintuitive but important detail to understand about the x86 program stack is that it grows downward in memory, so that the top value on the stack is actually stored at the lowest address in stack memory. This becomes very important to remember when you analyze assembly code that references data stored on the stack, as it can quickly get confusing unless you know the stack’s memory layout.

Because the x86 stack grows downward in memory, when the push instruction allocates space on the program stack for a new value, it decrements the value of ESP so that it points to a lower location in memory and then copies its operand into that memory location. Conversely, the pop instruction copies the top value off of the stack and then increments the value of ESP so it points to a higher memory location.
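The downward growth of the stack is easier to follow with concrete addresses. The sketch below models push and pop for 32-bit values; the starting ESP value and the memory dictionary are assumptions chosen for illustration.

```python
# Hypothetical stack memory (address -> value) and register state.
memory = {}
regs = {"esp": 0x2000, "eax": 0}

def push(value):
    regs["esp"] -= 4                          # make room: ESP moves to a lower address
    memory[regs["esp"]] = value & 0xFFFFFFFF  # store the value at the new top

def pop(reg):
    regs[reg] = memory[regs["esp"]]           # copy the top value into the register
    regs["esp"] += 4                          # release the slot: ESP moves back up

push(1)
esp_after_push = regs["esp"]
push(2)
pop("eax")

print(hex(esp_after_push), regs["eax"], hex(regs["esp"]))
```

After the first push, ESP has dropped from 0x2000 to 0x1ffc, and after the matching push and pop it is back at 0x1ffc with the most recently pushed value in EAX.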

Control Flow Instructions

An x86 program’s control flow defines the network of possible instruction execution sequences a program may execute, depending on the data, device interactions, and other inputs the program might receive. Control flow instructions define a program’s control flow. They are more complicated than stack instructions but still quite intuitive. Because control flow is often expressed through C-style function calls in x86 assembly, the stack and control flow are closely related. They’re also related because these function calls use the stack to pass arguments, allocate local variables, and remember what part of the program to return to after a function finishes executing.

The call and ret control flow instructions are the most important in terms of how programs call functions in x86 assembly and how programs return from functions after these functions are done executing. The call instruction calls a function. Think of this as a function you might write in a higher-level language like C: once the function has finished executing, the program returns to the instruction that follows the call instruction. You can invoke the call instruction using the following syntax, where address denotes the memory location where the function’s code begins:

call address

The call instruction does two things. First, it pushes the address of the instruction that will execute after the function call returns onto the top of the stack so that the program knows what address to return to after the called function finishes executing. Second, call replaces the current value of EIP with the value specified by the address operand. Then, the CPU begins execution at the new memory location pointed to by EIP.

Just as call initiates a function call, the ret instruction completes it. You can use the ret instruction on its own and without any parameter, as shown here:

ret

When invoked, ret pops the top value off the stack, which we expect to be the saved program counter value (EIP) that the call instruction pushed onto the stack. Then it places the popped program counter value back into EIP and resumes execution.

The jmp instruction is another important control flow construct, which operates more simply than call. Instead of worrying about saving EIP, jmp simply tells the CPU to move to the memory address specified as its parameter and begin execution there. For example, jmp 0x12345678 tells the CPU to start executing the program code stored at memory location 0x12345678 on the next instruction.

You may be wondering how you can make jmp and call instructions execute in a conditional way, such as “if the program has received a network packet, execute the following function.” The answer is that x86 assembly doesn’t have high-level constructs like if, then, else, else if, and so on. Instead, branching to an address within a program’s code typically requires two instructions: a cmp instruction, which checks the value in some register against some test value and stores the result of that test in the EFLAGS register, and a conditional branch instruction.

Most conditional branch instructions start with a j, which allows the program to jump to a memory address, and are post-fixed with letters that stand for the condition being tested. For example, jge tells the program to jump if greater than or equal to. This means that the value in the register being tested must be greater than or equal to the test value. The cmp instruction uses the following syntax, where the first operand is a register or memory location and the second is a register, memory location, or literal (at most one of the two can be a memory location):

cmp operand1, operand2

As stated earlier, cmp compares the first operand with the second and then stores the result of that comparison in the EFLAGS register. The various conditional jump instructions are then invoked as follows:

j* address

As you can see, the j prefix can be combined with any number of condition suffixes. For example, to jump only if the first value compared is greater than or equal to the second, use the following instruction:

jge address

Note that unlike the case of the call and ret instructions, the jmp family of instructions never touches the program stack. In fact, in the case of the jmp family of instructions, the x86 program is responsible for tracking its own execution flow and potentially saving or deleting information about what addresses it has visited and where it should return to after a particular sequence of instructions has executed.
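To see how cmp and a conditional jump cooperate through EFLAGS, the following Python sketch models three of the flags that cmp sets (zero, sign, and overflow) and the documented conditions for a handful of jumps. The function names cmp_flags and taken are made up for this example, and only 32-bit operands and four jump types are covered.

```python
MASK32 = 0xFFFFFFFF

def to_signed(value):
    """Interpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def cmp_flags(a, b):
    """Model `cmp a, b`: compute a - b, discard the result, keep the flags."""
    result = (a - b) & MASK32
    signed_result = to_signed(a) - to_signed(b)
    return {
        "ZF": result == 0,                                      # operands were equal
        "SF": bool(result & 0x80000000),                        # top bit of the result
        "OF": not (-(1 << 31) <= signed_result < (1 << 31)),    # signed overflow
    }

# Conditions under which each jump is taken.
CONDITIONS = {
    "je":  lambda f: f["ZF"],
    "jne": lambda f: not f["ZF"],
    "jge": lambda f: f["SF"] == f["OF"],
    "jl":  lambda f: f["SF"] != f["OF"],
}

def taken(jump, flags):
    return CONDITIONS[jump](flags)

print(taken("jge", cmp_flags(5, 3)))   # 5 >= 3
print(taken("jge", cmp_flags(3, 5)))   # 3 >= 5
print(taken("jne", cmp_flags(0, 0)))   # 0 != 0
```

The jump instruction itself never looks at the operands; it only reads the flags that the preceding cmp left behind, which is why the two instructions almost always appear as a pair.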

Basic Blocks and Control Flow Graphs

Although x86 programs look sequential when we scroll through their code in a text editor, they actually have loops, conditional branches, and unconditional branches (control flow). All of these give each x86 program a network structure. Let’s use the simple toy assembly program in Listing 2-1 to see how this works.

setup: # symbol standing in for the address of the instruction on the next line
mov eax, 10
loopstart: # symbol standing in for the address of the instruction on the next line
sub eax, 1
cmp eax, 0
jne $loopstart
loopend: # symbol standing in for the address of the instruction on the next line
mov eax, 1
# more code would go here

Listing 2-1: Assembly program for understanding control flow graphs

As you can see, this program initializes a counter to the value 10, stored in register EAX. Next, it does a loop in which the value in EAX is decremented by 1 on each iteration. Finally, once EAX has reached a value of 0, the program breaks out of the loop.

In the language of control flow graph analysis, we can think of these instructions as comprising three basic blocks. A basic block is a sequence of instructions that we know will always execute contiguously. In other words, a basic block always ends with either a branching instruction or the instruction immediately before a branch target, and it always begins with either the first instruction of the program, called the program’s entry point, or a branch target.

In Listing 2-1, you can see where the basic blocks of our simple program begin and end. The first basic block is composed of the instruction mov eax, 10 under setup:. The second basic block is composed of the lines from sub eax, 1 through jne $loopstart under loopstart:, and the third starts at mov eax, 1 under loopend:. We can visualize the relationships between the basic blocks using the graph in Figure 2-2. (We use the term graph synonymously with the term network; in computer science, these terms are interchangeable.)

Figure 2-2: A visualization of the control flow graph of our simple assembly program (three nodes: setup, loopstart, and loopend)

If one basic block can ever flow into another basic block, we connect them, as shown in Figure 2-2. The figure shows that the setup basic block leads to the loopstart basic block, which repeats 10 times before it transitions to the loopend basic block. Real-world programs have control flow graphs such as these, but they’re much more complicated, with thousands of basic blocks and thousands of interconnections.
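The block-splitting rule described above can be turned into a short Python sketch. Everything here is an illustrative assumption rather than a production algorithm: the program list is a hand-transcribed version of the toy program (labels written without the $ prefix), BRANCHES covers only a few jump mnemonics, and build_cfg is a name made up for this example.

```python
BRANCHES = {"jmp", "jne", "je", "jge", "jl"}   # small illustrative subset
UNCONDITIONAL = {"jmp"}

# Hand-transcribed toy program: (label or None, mnemonic, operands).
program = [
    ("setup", "mov", "eax, 10"),
    ("loopstart", "sub", "eax, 1"),
    (None, "cmp", "eax, 0"),
    (None, "jne", "loopstart"),
    ("loopend", "mov", "eax, 1"),
]

def build_cfg(program):
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    # Block leaders: the entry point, every branch target, and every
    # instruction that directly follows a branch.
    leaders = {0}
    for i, (_, op, arg) in enumerate(program):
        if op in BRANCHES:
            leaders.add(labels[arg])
            if i + 1 < len(program):
                leaders.add(i + 1)
    starts = sorted(leaders)
    ends = starts[1:] + [len(program)]
    # Each block runs from its leader up to, but not including, the next leader.
    blocks = {start: list(range(start, end)) for start, end in zip(starts, ends)}
    edges = set()
    for start, body in blocks.items():
        _, op, arg = program[body[-1]]
        fallthrough = body[-1] + 1
        if op in BRANCHES:
            edges.add((start, labels[arg]))        # edge to the branch target
        if op not in UNCONDITIONAL and fallthrough < len(program):
            edges.add((start, fallthrough))        # edge to the next block
    return blocks, edges

blocks, edges = build_cfg(program)
named_edges = sorted((program[a][0], program[b][0]) for a, b in edges)
print(named_edges)
```

The resulting edges match the picture in the text: setup flows into loopstart, loopstart flows back into itself, and loopstart eventually flows into loopend.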

Disassembling ircbot.exe Using pefile and capstone

Now that you have a good understanding of the basics of assembly language, let’s disassemble the first 100 bytes of ircbot.exe’s assembly code using linear disassembly. To do this, we’ll use the open source Python libraries pefile (introduced in Chapter 1) and capstone, which is an open source disassembly library that can disassemble 32-bit x86 binary code. You can install both of these libraries with pip using the following commands:

pip install pefile
pip install capstone

Once these two libraries are installed, we can leverage them to disassemble ircbot.exe using the code in Listing 2-2.

#!/usr/bin/python
import pefile
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# load the target PE file
pe = pefile.PE("ircbot.exe")

# get the address of the program entry point from the program header
entrypoint = pe.OPTIONAL_HEADER.AddressOfEntryPoint

# compute memory address where the entry code will be loaded into memory
entrypoint_address = entrypoint + pe.OPTIONAL_HEADER.ImageBase

# get the binary code from the PE file object
binary_code = pe.get_memory_mapped_image()[entrypoint:entrypoint+100]

# initialize disassembler to disassemble 32 bit x86 binary code
disassembler = Cs(CS_ARCH_X86, CS_MODE_32)

# disassemble the code
for instruction in disassembler.disasm(binary_code, entrypoint_address):
    print("%s\t%s" % (instruction.mnemonic, instruction.op_str))

Listing 2-2: Disassembling ircbot.exe

This should produce the following output:

push ebp
mov ebp, esp
push -1
push 0x437588
push 0x41982c
mov eax, dword ptr fs:[0]
push eax
mov dword ptr fs:[0], esp
add esp, -0x5c
push ebx
push esi
push edi
mov dword ptr [ebp - 0x18], esp
call dword ptr [0x496308]
--snip--

Don’t worry about understanding all of the instructions in the disassembly output: that would involve an understanding of assembly that goes beyond the scope of this book. However, you should feel comfortable with many of the instructions in the output and have some sense of what they do. For example, the malware pushes the value in register EBP onto the stack, saving its value. Then it proceeds to move the value in ESP into EBP and pushes some numerical values onto the stack. The program moves some data in memory into the EAX register, and it adds the value -0x5c to the value in the ESP register. Finally, the program uses the call instruction to call the function whose address is stored at the memory location 0x496308.

Because this is not a book on reverse engineering, I won’t go into any more depth here about what the code means. What I’ve presented is a start to understanding how assembly language works. For more information on assembly language, I recommend the Intel programmer’s manual at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.

Factors That Limit Static Analysis

In this chapter and Chapter 1, you learned about a variety of ways in which static analysis techniques can be used to elucidate the purpose and methods of a newly discovered malicious binary. Unfortunately, static analysis has limitations that render it less useful in some circumstances. For example, malware authors can employ certain offensive tactics that are far easier to implement than to defend against. Let’s take a look at some of these offensive tactics and see how to defend against them.

Packing

Malware packing is the process by which malware authors compress, encrypt, or otherwise mangle the bulk of their malicious program so that it appears inscrutable to malware analysts. When the malware is run, it unpacks itself and then begins execution. The obvious way around malware packing is to actually run the malware in a safe environment, a dynamic analysis technique I’ll cover in Chapter 3.

Note: Software packing is also used by benign software installers for legitimate reasons. Benign software authors use packing to deliver their code because it allows them to compress program resources to reduce software installer download sizes. It also helps them thwart reverse engineering attempts by business competitors, and it provides a convenient way to bundle many program resources within a single installer file.

Resource Obfuscation

Another anti-detection, anti-analysis technique malware authors use is resource obfuscation. They obfuscate the way program resources, such as strings and graphical images, are stored on disk, and then deobfuscate them at runtime so they can be used by the malicious program. For example, a simple obfuscation would be to add a value of 1 to all bytes in images and strings stored in the PE resources section and then subtract 1 from all of this data at runtime. Of course, any number of obfuscations are possible here, all of which make life difficult for malware analysts attempting to make sense of a malware binary using static analysis. As with packing, one way around resource obfuscation is to just run the malware in a safe environment. When this is not an option, the only mitigation for resource obfuscation is to actually figure out the ways in which malware has obfuscated its resources and to manually deobfuscate them, which is what professional malware analysts often do.
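The add-1 scheme mentioned above takes only a few lines to express, which helps explain why this kind of obfuscation is so cheap for malware authors. In this sketch the function names and the sample string are assumptions chosen for illustration.

```python
def obfuscate(data):
    """Add 1 to every byte, wrapping 0xFF around to 0x00."""
    return bytes((b + 1) % 256 for b in data)

def deobfuscate(data):
    """Subtract 1 from every byte, undoing obfuscate()."""
    return bytes((b - 1) % 256 for b in data)

secret = b"Server: myBot"       # hypothetical string resource
stored = obfuscate(secret)      # what would sit on disk
print(stored)
print(deobfuscate(stored))      # what the program would recover at runtime
```

The stored bytes are still printable here, so a strings dump would show them, but they no longer match anything an analyst would think to search for.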

Anti-disassembly Techniques

A third group of anti-detection, anti-analysis techniques used by malware authors is anti-disassembly techniques. These techniques are designed to exploit the inherent limitations of state-of-the-art disassembly techniques to hide code from malware analysts or make malware analysts think that a block of code stored on disk contains different instructions than it actually does. An example of an anti-disassembly technique involves branching to a memory location that the analyst’s disassembler will interpret as part of a different instruction, essentially hiding the malware’s true instructions from reverse engineers. Anti-disassembly techniques have huge potential and there’s no perfect way to defend against them. In practice, the two main defenses against these techniques are to run malware samples in a dynamic environment and to manually figure out where anti-disassembly strategies manifest within a malware sample and how to bypass them.
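A toy version of the branching trick shows why a front-to-back decoder can be fooled. This is a simplified sketch under stated assumptions: OPCODES holds only five real one-byte x86 opcodes with simplified labels, the jump offset is treated as unsigned (forward jumps only), and the byte sequence is hand-built for the demonstration rather than taken from any real sample.

```python
# Tiny table of real one-byte x86 opcodes: opcode -> (label, total length in bytes).
OPCODES = {
    0x55: ("push ebp", 1),
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xB8: ("mov eax, imm32", 5),   # opcode byte + 4-byte immediate
    0xEB: ("jmp short rel8", 2),   # opcode byte + 1-byte relative offset
}

def decode(code, follow_jumps):
    """Decode front to back; optionally continue at the target of each short jump."""
    offset, listing = 0, []
    while offset < len(code):
        label, length = OPCODES.get(code[offset], ("(unknown byte)", 1))
        listing.append(label)
        if follow_jumps and code[offset] == 0xEB:
            # rel8 counts from the end of the jmp; treated as unsigned for brevity.
            offset += length + code[offset + 1]
        else:
            offset += length
    return listing

# jmp +1 skips a junk 0xB8 byte that a linear decoder mistakes for a 5-byte mov.
code = bytes([0xEB, 0x01, 0xB8, 0x55, 0x90, 0x90, 0x90, 0xC3])

print(decode(code, follow_jumps=False))   # what linear disassembly reports
print(decode(code, follow_jumps=True))    # what the CPU would actually execute
```

The linear pass swallows the four real instructions after the junk byte as the immediate operand of a mov that never executes, so they never appear in its listing.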

Dynamically Downloaded Data

A final class of anti-analysis techniques malware authors use involves externally sourcing data and code. For example, a malware sample may load code dynamically from an external server at malware startup time. If this is the case, static analysis will be useless against such code. Similarly, malware may source decryption keys from external servers at startup time and then use these keys to decrypt data or code that will be used in the malware’s execution. Obviously, if the malware is using an industrial-strength encryption algorithm, static analysis will not be sufficient to recover the encrypted data and code. Such anti-analysis and anti-detection techniques are quite powerful, and the only way around them is to acquire the code, data, or private keys on the external servers by some means and then use them in one’s analysis of the malware in question.

Summary

This chapter introduced x86 assembly code analysis and demonstrated how we can perform disassembly-based static analysis on ircbot.exe using open source Python tools. Although this is not meant to be a complete primer on x86 assembly, you should now feel comfortable enough that you have a starting place for figuring out what’s going on in a given malware assembly dump. Finally, you learned ways in which malware authors can defend against disassembly and other static analysis techniques, and how you can mitigate these anti-analysis and anti-detection strategies. In Chapter 3, you’ll learn to conduct dynamic malware analysis that makes up for many of the weaknesses of static malware analysis.
