Our newly launched Sysadmin Series focuses on revealing the work that goes into administering IT infrastructures from all points of view. Whether it is building secure environments, or troubleshooting deep issues or tracing cyberattacks – it’s all part of the daily life of a sysadmin.
Before We Started Looking for the Bitcoin Miner
What first looked like some CPU oddity in a monitoring graph became a full-fledged hunt for a bitcoin miner. Figuring out what happened and how they were operating is part of the investigation. Collecting evidence to make a case soon became akin to a detective story.
Read on to find out how we approached this from the sysadmin point-of-view and what was learned from the experience.
Several years ago, we started educating students in a database lab about the (then) new concept of NoSQL databases. The lecture around it had run only a couple of times, and since these database systems were quite new, the labs were organized less strictly than they are today. This meant that the students were allowed and even encouraged to freely experiment with these newfangled database systems, rather than only following a set of objectives.
The students worked in groups of two, and each group had been given a server with three NoSQL databases preinstalled. The students could log into these systems at any time during the semester via VPN and SSH to prepare some of the exercises at home.
They were not given any administrative access on the system (this seldom goes well and is also rarely needed). They could start and stop each database, as well as load data into it from their home directories. Their tasks were essentially to compare these NoSQL databases regarding their load performance, how they structure the data and what the query interface was like compared to traditional SQL databases like MySQL or Postgres.
At the beginning of the semester, there is always a bit of fluctuation in the number of students taking a course. Some drop out early because they find that the courses they picked are too demanding. Other students may or may not then take the freshly-opened seats. In this particular class, these fluctuations left an odd number of students—so one of the late-comers was assigned to work on a system alone.
The Need for System Monitoring and Auditing
As the sysadmin who had set up these systems before the semester started, I would typically visit the first two sessions of each group that used the database lab. During these first labs, we assigned students to systems, checked that they could log in, and verified that they had all of the permissions and tools they needed to solve their assignments.
After the initial user setup, there is rarely a need to revisit such a lab, because most problems are resolved by either emailing individuals or centrally applying last-minute changes to all systems over the network.
As sysadmin responsible for setting up these systems, one of the earliest things I did when I got the job was to set up a central monitoring system. Systems monitoring typically involves three things:
- Uptime checks – is the system actually running and reachable on the network?
- Metrics – what is the CPU load, how much disk space is available, how many users are logged in? All of these questions are answered by collecting system metrics on a regular basis (e.g. every 5 minutes) and storing them in a time-series database. The time-series database helps to see long-term trends, and allows me to go back in time to see what happened with the system on Sunday at 3 a.m. when I was asleep.
- Logs – system logs store a great number of events (informational messages, warnings, errors, and even critical debug messages) which can tell you a lot about what applications are doing. Having access to these logs in a central place is both convenient (reducing the need to log into individual servers) and somewhat tamper-safe (making it more difficult for an attacker to alter a compromised system’s logs).
Together, these components give a system administrator a comprehensive overview of what each individual machine is doing. Some of these systems even allow a grand overview about the state of all monitored systems at once, and highlight those that show irregularities.
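As a rough illustration of the metrics component described above, the following Python sketch takes one timestamped sample of load and free disk space and keeps a bounded history. It is not how Munin collects its data; the function names, the chosen metrics, and the 288-sample window (one day at 5-minute intervals) are assumptions made for this example.

```python
import os
import shutil
import time


def collect_sample(path="/"):
    """Return one timestamped sample of basic host metrics (Unix only)."""
    load1, load5, _load15 = os.getloadavg()
    disk = shutil.disk_usage(path)
    free_fraction = disk.free / disk.total if disk.total else 0.0
    return {
        "timestamp": time.time(),
        "load_1min": load1,
        "load_5min": load5,
        "disk_free_fraction": free_fraction,
    }


def record(series, sample, max_samples=288):
    """Append a sample and keep only the newest max_samples entries.

    288 samples at a 5-minute interval cover one day (an assumed window).
    """
    series.append(sample)
    if len(series) > max_samples:
        del series[:len(series) - max_samples]
    return series
```

A real deployment would call something like `collect_sample()` from a scheduler every few minutes and ship the result to a central time-series store rather than keeping it in memory.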
I had used a couple of these systems in the past, experimenting with their capabilities. In this case, the Munin graphs—generated based on the collected system metrics of each system—were crucial to ultimately figuring out what was going on.
An Early Morning Discovery
Several weeks into the semester, the labs were running without any major incidents. I had made a habit of looking at system metrics in the morning. It’s fascinating for me to observe live systems that people are working with: CPUs spinning up and down, disk space getting used up over time, and memory getting both allocated and freed up again. There is a lot to learn about how systems behave.
As I was scrolling through these graphs on various systems, I suddenly noticed something odd. One system had a strange orange coloring in the CPU graph that similar systems did not exhibit. The graph’s legend showed that this was CPU nice time (in orange on cpu-day.png). This had been going on for a number of days, and only recently dropped back down to the expected amount of CPU idle time (i.e. no workload on the system).
Curious about this strange behavior, I looked at the weekly CPU graphs, which had the same long periods of heavy CPU nice time. Going back further, the monthly graph confirmed that this happened somewhere around calendar week 19. Note that these graphs become more aggregated the further one looks back into history. The aggregation saves disk space for older metrics, while still preserving the average values from those periods.
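The aggregation described above can be sketched in a few lines of Python. This is only an illustration of averaging fine-grained samples into coarser buckets; the function name and the sample values are invented for the example, and real monitoring tools may consolidate data differently.

```python
def downsample(samples, factor):
    """Average consecutive groups of `factor` samples into one value each.

    A trailing group with fewer than `factor` samples is averaged as-is.
    """
    if factor < 1:
        raise ValueError("factor must be at least 1")
    chunks = (samples[i:i + factor] for i in range(0, len(samples), factor))
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Hypothetical CPU nice percentages, one value per 5 minutes.
five_minute_nice = [0.0, 0.0, 90.0, 0.0, 0.0, 0.0]

# Six 5-minute samples collapse into a single 30-minute average.
half_hour_nice = downsample(five_minute_nice, 6)
print(half_hour_nice)  # [15.0]
```

The single 90% sample turns into a 15% average, which shows why a short burst can look much less dramatic in an older, more heavily aggregated graph.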
At this point, I decided to log into the system and look around. One of the first system utilities that I ran was top(1) as it provides a good general overview of system processes, memory, and CPU (which I was after). At this point, I was mostly suspecting something innocent, like an errant process that did not exit properly (although this would not explain why so much of the non-idle CPU time was in the nice state).
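To make the idea of "CPU nice time" more concrete, here is a minimal Python sketch that computes the share of CPU time spent in the nice state between two snapshots of kernel counters. It assumes the Linux `/proc/stat` layout of the aggregate `cpu` line; the article does not say which operating system the lab machines ran, and other systems expose similar counters through different interfaces. The counter values below are made up.

```python
# Field order of the aggregate "cpu" line in Linux /proc/stat (assumed platform).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu ...' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def nice_share(before, after):
    """Fraction of elapsed CPU ticks spent on niced processes."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    return deltas["nice"] / total if total else 0.0


# Two hypothetical snapshots taken some time apart.
earlier = parse_cpu_line("cpu 100 0 50 1000 0 0 0 0 0 0")
later = parse_cpu_line("cpu 110 800 60 1030 0 0 0 0 0 0")
print(round(nice_share(earlier, later), 2))  # 0.94
```

A value this close to 1.0 on a machine that should be idle is the numeric equivalent of the orange area in the graph.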
Glancing at the top(1) output for a few seconds, I recognized the NoSQL database processes that I had set up for the lab, but I also found an unexpected process called cpuminer. Any morning sleepiness quickly gave way to a surge of adrenaline as I realized what was going on. Someone was mining bitcoins on that system!
A Few Words on Bitcoin Mining
This was during a time when cryptocurrency was still in its infancy, and had not received anything like the hype in the press we’ve seen in recent years. I had heard about the concept, and was aware of the practice of mining the hashes in exchange for virtual coins.
Mining for coins works very similarly to well-known distributed computing research projects like SETI@Home and protein folding: it harnesses the normally idle CPU time of a system when no one is working on it. A client connects to a server, downloads a small unit of work, and starts grinding on it in the background. In order to preserve the usefulness of the system for interactive users, this work is done niced, which means anything the user wants to do takes priority over what the distributed computing client does.
Once a work unit is finished, the client automatically uploads the result and gets the next unit, scoring some points in a global leaderboard in the process.
Bitcoin gave distributed computing a new hype, despite the concept remaining basically unchanged. The big difference is that rather than scoring points on a leaderboard, users earn bitcoin—which could be traded for more directly valuable things like hardware, pizza, or cash. Unfortunately, mining bitcoin also drives power consumption and heat generation—both of which cost the operator real money—through the roof.
While the activity of mining is fine to do on systems that one owns (check your local regulations to be sure), it is certainly neither appropriate nor allowed on systems that were intended for teaching and research, especially when you’re not the one paying their power bills. Note that we’re not debating the validity of cryptocurrency as a whole. We are mostly covering this from a system administration point of view.
Collecting Information
After the initial shock, the next step was to find out who was actually doing this. Was this someone who hacked into the system from the outside without my knowledge?
I knew how long the mining had been going on (calendar week 19 from the monitoring graph), but had no idea yet who was running it. Fortunately, I did not have to search long—the USER column in top (anonymized here) and a subsequent ps(1) command each listed both the command and the user ID running it.
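The same lookup can be scripted so that the result is saved as text alongside any screenshots. The Python sketch below parses the output of `ps -A -o user,pid,time,args` (options defined by POSIX) and picks out processes by executable name. The sample output, the user name `student42`, and the pool address are invented for illustration and are not taken from the incident.

```python
import os
import subprocess


def snapshot():
    """Capture the current process list as text."""
    result = subprocess.run(
        ["ps", "-A", "-o", "user,pid,time,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(text):
    """Parse `ps -A -o user,pid,time,args` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        user, pid, cputime, args = parts
        rows.append({"user": user, "pid": int(pid),
                     "cputime": cputime, "args": args})
    return rows


def find_command(rows, name):
    """Return rows whose executable base name equals `name`."""
    return [row for row in rows
            if os.path.basename(row["args"].split()[0]) == name]


# Hypothetical captured output.
SAMPLE = """\
USER        PID     TIME COMMAND
root          1 00:00:03 /sbin/init
student42  4711 9-04:12:30 ./cpuminer -o stratum+tcp://pool.example.org:3333 -u worker1
"""

hits = find_command(parse_ps(SAMPLE), "cpuminer")
print(hits[0]["user"], hits[0]["pid"], hits[0]["cputime"])
```

On a live system, `parse_ps(snapshot())` would replace the sample text.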
All users in the university are uniquely identified by their login ID that they get during initial onboarding as employees or students. This ID does not change, and stays with these users for as long as they have any kind of association with the university.
This was important evidence to have, so I took screenshots of both the top and ps outputs. Why? Because someone could blame this on me as the all-powerful sysadmin and I would need evidence to protect me from such accusations. Also, the CPU time on that process confirmed what the graphs had been showing as total runtime of the miner.
That’s an important lesson to learn: Trust, but verify. The graphs could easily have been skewed by the aggregation and may have dropped certain timeframes. Having two sources of evidence in a case like this gives you more confidence to back up your claims.
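One way to put "trust, but verify" into practice is to compare the CPU time reported for the process with the busy time estimated from the graphs. The Python sketch below assumes the `[dd-]hh:mm:ss` CPU time format that POSIX specifies for `ps -o time`; the 10% tolerance and all numbers are arbitrary values chosen for the example, and a multi-threaded process can accumulate more CPU time than wall-clock time.

```python
def cputime_to_seconds(value):
    """Convert a '[dd-]hh:mm:ss' CPU time string into seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in value.split(":")]
    while len(parts) < 3:          # allow shorter forms such as 'mm:ss'
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def sources_agree(ps_seconds, graph_seconds, tolerance=0.10):
    """True if the two estimates differ by at most `tolerance` (relative)."""
    if graph_seconds <= 0:
        return False
    return abs(ps_seconds - graph_seconds) / graph_seconds <= tolerance


# Hypothetical values: CPU time from ps versus busy time read off a graph.
from_ps = cputime_to_seconds("9-04:12:30")
from_graph = 800_000
print(from_ps, sources_agree(from_ps, from_graph))  # 792750 True
```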
Also, as contrary as it may run to your territorial pride, you should resist the urge to kill(1) the process right away. The damage has been done already—adding a couple more seconds of CPU time to collect valuable information is usually a wise trade to make. I waited to kill the process until I’d gathered more data from the ps(1) output.
ps didn’t just show me the process’s runtime; it also showed me the complete command line, including any parameters. In this case, those parameters included a user name and a remote website address. That user name differed from the user ID on the system, which could mean that an existing mining account was now using this extra system in addition to one already running at home or somewhere else.
Armed with the URL of the website the process sent data to, I checked it out in a browser. Sure enough, it was a central location for submitting bitcoin mining results and getting credits for mining certain coins. The coin in question was not a popular one, and not one you would ever have heard about in the press. Was this a serious attempt to make money, or simple child’s play?
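Pulling the remote endpoint out of a captured command line can also be done programmatically, which is handy when the result is handed to a firewall team. The following Python sketch looks for URL-like arguments without relying on specific option names. The command line shown is entirely hypothetical; the actual parameters from the incident are not reproduced here.

```python
import shlex
from urllib.parse import urlsplit


def remote_endpoints(command_line):
    """Return (scheme, host, port) for every URL-like argument."""
    endpoints = []
    for token in shlex.split(command_line):
        # Options may be written as --name=value; keep only the value part.
        if token.startswith("-") and "=" in token:
            token = token.split("=", 1)[1]
        try:
            parts = urlsplit(token)
            host, port = parts.hostname, parts.port
        except ValueError:          # malformed URL or port
            continue
        if parts.scheme and host:
            endpoints.append((parts.scheme, host, port))
    return endpoints


# Hypothetical command line as it might appear in ps output.
captured = "./cpuminer -o stratum+tcp://pool.example.org:3333 -u worker1 -p x"
print(remote_endpoints(captured))
# [('stratum+tcp', 'pool.example.org', 3333)]
```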
I couldn’t be sure, so I took screenshots of both the website itself and the current exchange rates for the coin that the process was mining. In the case of a lawsuit, this information can be important to determine what damages to sue for, and in what amount.
There was another important clue to be gleaned from the ps results: our rogue process had ./ in front of it, which normal processes (like system daemons) do not have. This typically means that the binary in question was run from an interactive session, from a directory outside the system PATH. Executing a find(1) to look for a binary called “cpuminer” on the whole system quickly turned up a result in that particular user’s home directory.
Note that we did not employ any kind of system freezing to avoid destroying or altering evidence, as is done in a forensic analysis of a compromised system. Forcibly quiescing a system can prevent an attacker from either doing more damage or destroying evidence, but it also alerts them that they’ve likely been discovered.
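For readers who prefer a script over find(1), the search can be approximated in Python as follows. This is a sketch, not the command that was actually run; the function name is made up, and on a real system you would point it at the root or home directory tree and expect permission errors in places the calling user cannot read.

```python
import os
import stat


def find_executables(root, name):
    """Return paths of regular, executable files called `name` below `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name not in filenames:
            continue
        path = os.path.join(dirpath, name)
        try:
            mode = os.lstat(path).st_mode
        except OSError:             # file vanished or is not accessible
            continue
        # Keep regular files with at least one execute bit set.
        if stat.S_ISREG(mode) and mode & 0o111:
            matches.append(path)
    return matches


if __name__ == "__main__":
    # Example: search all home directories (path is an assumption).
    for hit in find_executables("/home", "cpuminer"):
        print(hit)
```

Directories that `os.walk` cannot read are silently skipped by default, so running such a search as an unprivileged user may miss results.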
This particular miner was built to use idle CPU time without requiring any kind of installation, which allowed the user to run it from their own home directory without any special privileges. A cursory web search for “cpuminer” turned up a GitHub repository for the project complete with instructions on how to run it. Another screenshot later, I had already collected a good amount of evidence to make a case against the person behind that particular user ID.
But the job wasn’t quite done yet. I knew which user ID the process was running under—but I didn’t know whether the process was started by the legitimate user, or by an attacker who gained illegitimate access to the user’s credentials. The next step was to check the logs.
The most detailed logs were to be found on the system itself, and they were probably intact, since there was no evidence of a root compromise. But a particularly clever attacker might hide their ability to become root for exactly this reason, so I also checked in with the IT department, which managed the VPN and firewall. Both my logs and the IT department’s logs confirmed that this particular user had really logged in during the times in question, and of course also during their regular lab times. The login times were consistent with the student’s attendance in the lab.
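The cross-check between independent log sources boils down to a simple question: does every source show a session for this user that covers the moment the process was started? A minimal Python sketch of that check follows. The record layout, the user name, and all timestamps are hypothetical placeholders; real login and VPN logs would first need to be parsed into this form.

```python
from datetime import datetime


def sessions_covering(sessions, user, moment):
    """Sessions of `user` that were open at `moment`."""
    return [s for s in sessions
            if s["user"] == user and s["login"] <= moment <= s["logout"]]


def confirmed_by_all(sources, user, moment):
    """True only if every log source has a matching session."""
    return bool(sources) and all(
        sessions_covering(sessions, user, moment)
        for sessions in sources.values()
    )


# Placeholder data standing in for parsed host and VPN logs.
sources = {
    "host": [{"user": "student42",
              "login": datetime(2000, 5, 8, 21, 58),
              "logout": datetime(2000, 5, 8, 22, 40)}],
    "vpn": [{"user": "student42",
             "login": datetime(2000, 5, 8, 21, 55),
             "logout": datetime(2000, 5, 8, 22, 45)}],
}

process_start = datetime(2000, 5, 8, 22, 5)
print(confirmed_by_all(sources, "student42", process_start))  # True
```

If only the host's own logs matched, that would be a reason to look harder at the possibility of tampering.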
This was enough to convince me that the student was the true culprit, not just a patsy for an unknown attacker. As part of the paperwork every applicant has to fill out when becoming a student, one piece of paper had them sign that they will not use the IT systems they are given access to in a manner other than for research and study purposes. Clearly, this was not part of the class assignment, nor was bitcoin mining any part of the NoSQL databases they were covering.
All’s Well That Ends Well?
First, I sent an email to our IT department to block that particular mining URL on the central firewall, which was implemented quickly enough. Then I killed the running cpuminer process and confirmed that CPU nice time returned to normal afterward. Next, I blocked the student from logging in via SSH (they had no other means of access) so that they could not cover their tracks.
Finally, I prepared an email to the class professor and the dean of the CS department. I attached some of the collected evidence, and described the steps that I had proactively taken in blocking the user from doing further “damage”.
Detailing both the collected evidence and the steps taken in response was important: it showed that the student was the actual attacker, and it let the recipients take their time to read and reply without concern that harm was still being done.
The professor replied swiftly and justly, copying both myself and the student. This reply made it clear that I took the correct action in detecting and stopping the activity. Furthermore, the student immediately lost access to the system as well as to their seat in the class and lab. This forced them to wait to retake the class the next year, in turn delaying their ability to complete a degree. This was deemed sufficient punishment, and the university decided not to press legal charges.
But talking to both the central IT department (who had had to deal with similar cases in the past) and the professor calmed my fury. Removing the student from this class was certainly justified, but going beyond that would have seemed more vengeful than just. Instead, after thanking me for discovering the problem, the professor counseled me to turn my anger into a lesson learned. You can’t always prevent an attack from happening, but you can learn from it to prevent (or at least mitigate) similar attacks in the future. This is why doing post-mortems after such events is important: not as finger-pointing sessions, but as valuable learning exercises which improve our skills as system administrators.
Epilogue
I took the professor’s advice and secured my systems even further by increasing the precision and scope of my monitoring efforts. I’d discovered my attacker through manual inspection alone—but now, I set up automatic warning thresholds and notification triggers that could help me find similar issues more quickly and reliably in the future.
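A warning threshold of the kind mentioned above works best when it requires the condition to persist, so that a brief spike does not trigger a notification. The Python sketch below shows one simple way to express that. The threshold of 80% and the requirement of three consecutive samples are arbitrary example values, not settings from my monitoring setup.

```python
def sustained_breach(values, threshold, min_consecutive):
    """Index where a run of `min_consecutive` values above `threshold` starts.

    Returns None if no such run exists.
    """
    run = 0
    for index, value in enumerate(values):
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return index - min_consecutive + 1
    return None


# Hypothetical CPU nice percentages, one per collection interval.
nice_percent = [5, 95, 4, 90, 92, 97, 3]

# The lone spike at index 1 is ignored; the run starting at index 3 is not.
print(sustained_breach(nice_percent, threshold=80, min_consecutive=3))  # 3
```

Tuning `threshold` and `min_consecutive` per system is where knowledge of the expected workload comes in.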
Unlike many security measures, resource monitoring is transparent to the user—so my increased security didn’t make the systems more difficult for students to use. I also chose to avoid automatic attempts to impose hard lockdowns on student accounts suspected of unauthorized “experiments”. We want students to both learn and experiment—so locking them down tightly when working on these systems would defeat the purpose of the lab itself. Why punish other students for actions that previous students did?
The professor running the class also learned from this incident and changed the labs accordingly. When this incident happened, the machines were assigned individually to groups of students. Now, the entire group of machines forms a cluster that all the students in the class can work with. That both demonstrates the power of distributed NoSQL databases, and prevents any students from having a system “all to themselves”—which might otherwise tempt them into the sort of “extracurricular activity” we’d just dealt with.
This incident taught me social lessons as well as technical ones. Luckily, successful attacks are rare—but I’m calmer now when they happen, which helps me to react properly and collect the evidence without taking any rash decisions. I’ve also discovered and learned to use additional monitoring systems which both provide me with better insights, and do a better job of not overloading me with notifications about the slightest CPU spikes. The more data you collect, the more important smart filtering becomes! Individual tuning of warning thresholds for a particular system is often necessary, since that system’s sysadmin typically knows its workload and capabilities better than anyone else can.
A year later, our rogue student came back and took the class again. Once again, I was present in the first labs to help with any trouble students may have—so I caught a glimpse of the person associated with that user ID, which I will probably never forget.
Thankfully, this was a year later, and my initial anger had mostly subsided. They’d gotten their punishment and I had other work needing my attention, so why bring this incident up again? I kept a watchful eye on the systems they were using (at least those under my control), but nothing happened during the whole semester. The student graduated a few years later, and left the university shortly after.
Benedict Reuschling
Big Data Cluster Admin, Educator and Learner, FreeBSD Committer, Co-Host @bsdnow, but most of all: human. Things I see, do, and think about.