is the premise true? #5

AnharMiah · 2021-04-30T14:55:54Z

First off, thank you; this seems like very interesting research!

I hope these questions don't come off as rude:

> The research behind Jaws aims to build awareness that unknown interpreters can be dangerous.

But that would require said VM to actually be installed on the target machine in the first place?

> Since Jaws code is composed entirely of whitespace characters, it can easily coexist with other programming languages to create polyglot code.

Since most languages have code formatters and linters, and some even auto-format on save, can it really survive?

Then you have whitespace-sensitive languages such as Python; doesn't that cast doubt on this premise?

Also note: in Emacs one can use M-x fixup-whitespace.
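
To make the formatter concern concrete, here is a minimal sketch of what a "strip trailing whitespace on save" pass does to spaces and tabs hidden at the ends of lines. The function name and the sample input are hypothetical, and real formatters differ in which whitespace they touch.

```python
def strip_trailing_whitespace(source: str) -> str:
    """Remove spaces and tabs at the end of every line, as many editors do on save."""
    return "\n".join(line.rstrip(" \t") for line in source.split("\n"))


# Hypothetical input: ordinary code with extra spaces and tabs hidden at line ends.
original = "int x = 1; \t \t\n\t  \t \nreturn x;\t\t \n"
cleaned = strip_trailing_whitespace(original)

# Count the spaces and tabs before and after the pass.
before = sum(ch in " \t" for ch in original)
after = sum(ch in " \t" for ch in cleaned)
print(f"spaces/tabs before: {before}, after: {after}")
```

Only the whitespace that separates visible tokens survives such a pass, so anything encoded purely in trailing or blank-line whitespace would be lost if the file were reformatted after injection.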

lawndoc · 2021-05-03T02:57:09Z

Maintainer

Great questions! Some of this can be answered by reading my whitepaper, but well-explained threats/examples are definitely lacking in my GitHub repo's documentation.

> But that would require said VM to actually be installed on the target machine in the first place?

The VM would need to be installed or (more likely) be a component of a larger malware program. If you consider an interpreter such as the Jaws VM as part of a trojan or a C2 agent, it would seem benign from a static analysis standpoint. After all, an interpreter isn't inherently malicious. Initial Jaws code could be written to fetch more instructions from an arbitrary network location. That initial code could be stored just about anywhere.

> Then you have whitespace-sensitive languages such as Python; doesn't that cast doubt on this premise?

Even Python ignores arbitrary whitespace when it's on its own line(s). Jaws' headers and footers make it possible to stop and start interpretation of whitespace as often as needed. Languages that aren't whitespace-sensitive are trivial to inject into, as long as you replace all the existing whitespace between Jaws headers and footers with valid Jaws instructions. Consider the following example I just whipped up and added to the tests directory of this project:

#include <stdio.h>
	      		    
int     		  	 
main()     		 	  
{
	 printf("What does this do?");
 	 return 0;
	 }

In this example, the whitespace in the #include <stdio.h> line and in the two statements inside the braces is ignored because of the two sets of header and footer instructions. What does this program do? Check it out and try it yourself if you trust me 😉

I kid, it prints 420. It's also valid C code that prints "What does this do?" when compiled. Adding a lot of complexity to either program would probably require further tool development to automate injection, but you get the gist of what's possible.
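
One way to see what a whitespace-only interpreter would read from a polyglot file is to discard every visible character and print what remains with visible labels. This is a minimal sketch that assumes nothing about the actual Jaws instruction encoding or its header and footer sequences: the labels S, T and L are arbitrary names for space, tab and line feed, and `visualize_whitespace` is a hypothetical helper, not part of this project.

```python
def visualize_whitespace(source: str) -> str:
    """Drop every non-whitespace character and render the rest with visible labels."""
    symbols = {" ": "S", "\t": "T", "\n": "L\n"}
    return "".join(symbols[ch] for ch in source if ch in symbols)


# Hypothetical snippet, not the test file from this project.
sample = "int\t \tmain()  \t\n{\n\treturn 0;\n}\n"
print(visualize_whitespace(sample))
```

Running the same helper over a real source file makes it easy to spot runs of tabs and spaces that a C compiler would silently ignore.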

> Since most languages have code formatters and linters, and some even auto-format on save, can it really survive?

This is actually part of the beauty of Jaws. You can inject Jaws code into a file after it's gone through a code-formatter and distribute it in that state if you want. Let's say Jaws was injected into a Python script. To a malware analyst, it would initially look like it's the Python code that is doing the "bad stuff." When it comes time to do manual analysis on a Python script with a bunch of blank lines added, what would be the first thing you would do? Assume it's a pathetic attempt at obfuscation and delete it?

Obviously with enough time, it would become clear what's going on. But the point of Jaws isn't to try to stump malware analysts, it's to discredit automated static analysis. As I mention in the root directory's README, behavioral analysis would be the only way to detect something like this unless you had pre-existing signatures for this particular interpreter. But there are plenty of ways to create an interpreter, and infinite ways to design a "hidden" programming language. What are we going to do, flag anything that has some kind of interpreter functionality?

I propose that we could move on from static analysis and focus on improving behavior-based detection tools. I'm talking about cutting-edge developments like finite state machines embedded in a blocking EDR, or some hybrid of static and dynamic analysis (identification + validation). At some point, bad stuff will do something bad, whether or not you could tell beforehand that it was going to.

lawndoc · 2021-05-03T03:23:48Z

Maintainer

I plan to update the main README to elaborate on some of this. I may even copy some of my explanation here. When you have your work broken up between GitHub, a blog, and a whitepaper, I guess some important parts get lost in translation 😅

Thanks again for your questions, @AnharMiah, I love seeing interest in my work!

AnharMiah · 2021-05-04T09:02:10Z

Author

@doctormay6 thanks for the detailed response!

That makes more sense now. I can see that visual inspection would certainly fail; that much is true.

On the point regarding Python and adding extra whitespace on the same line: I actually have "render whitespace" enabled in my text editor and, out of habit, will remove any extra whitespace at the end of lines! Of course, that doesn't mean other developers are as pedantic!

Since the VM is mostly a mapping from opcodes to machine I/O, wouldn't virus detection software simply add that as a signature?

Even if the interpreter is "streamed" over a network, without the full VM you have a non-functional VM, and once you have the full VM it's back to the above: detectable via some kind of signature?

EDIT:

Actually, I think I've answered my own question: as you mentioned, one could create multiple interpreters, so it would be a "cat and mouse" game.

The only issue here would be that each piece of "Jaws" code would only work with a certain interpreter and would be incompatible with any other one.

I guess what you would need is a "meta Jaws VM" generator; you could probably use it in conjunction with some kind of seed/cipher that then generates a unique VM for each original piece of Jaws code?

lawndoc · 2021-05-05T03:02:20Z

Maintainer

> I guess what you would need is a "meta Jaws VM" generator; you could probably use it in conjunction with some kind of seed/cipher that then generates a unique VM for each original piece of Jaws code?

Yep, there are a few ways to go about making it polymorphic. That's the first one that came to my mind too. Static interpreters are usually kind of a "cat and mouse" game by nature. AI-based static analysis tries to overcome this weakness to detect things that look bad, but there are two problems with that approach for a Jaws-type scenario:

1. You can't train your model to detect virtual machines, because they aren't inherently malicious.
2. You can't train your model to detect arbitrary languages, because they are infinitely variable in structure.

AnharMiah · 2021-05-05T08:25:28Z

Author

thanks @doctormay6 !

With regard to training models to detect VMs, I think there is a possibility that in this case it may not be needed. Given that a Jaws VM will always have to parse whitespace characters, that constraint alone could be used to detect potential Jaws VM instances. This might work well, since there aren't too many legitimate things that parse whitespace, and the ones that do could be whitelisted?
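
A related check on the data side (rather than on the VM itself) would be to flag files whose lines carry an unusual amount of invisible padding. The following is only a rough sketch of that idea: the function names, the returned fields and the 0.3 threshold are all assumptions for illustration, not tuned values or part of any existing tool.

```python
def whitespace_profile(source: str) -> dict:
    """Count lines that end in spaces/tabs and lines made only of spaces/tabs."""
    lines = source.splitlines()
    trailing = 0      # visible code followed by spaces or tabs
    padded_blank = 0  # lines containing nothing but spaces or tabs
    for line in lines:
        stripped = line.rstrip(" \t")
        if stripped == line:
            continue  # nothing hidden at the end of this line
        if stripped:
            trailing += 1
        else:
            padded_blank += 1
    return {"lines": len(lines), "trailing": trailing, "padded_blank": padded_blank}


def looks_suspicious(source: str, threshold: float = 0.3) -> bool:
    """Flag a file when the share of padded lines reaches the (hypothetical) threshold."""
    profile = whitespace_profile(source)
    if profile["lines"] == 0:
        return False
    flagged = profile["trailing"] + profile["padded_blank"]
    return flagged / profile["lines"] >= threshold


print(looks_suspicious("def f():\n    return 1\n"))
print(looks_suspicious("def f(): \t \n\t  \t\n    return 1\t\n"))
```

A heuristic like this would still produce false positives on sloppy but harmless files, which is where the whitelisting idea would come in.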

lawndoc · 2021-05-06T17:47:40Z

Maintainer

You make a good point, but that only works for languages similar to Jaws. The overall weakness of static analysis is interpreted code, not necessarily polyglot code. Let's think about unknown interpreted languages in general. There are multiple ways to hide the code of an interpreted language other than polyglot code. To name a few: it could be stored as a string in the calling executable, it could be appended after the footer inside an image file, or it could be streamed over a network connection. Creativity is the biggest limiter here.

Let's compare an arbitrary interpreted language to shellcode injection. An executable could be written that accepts a string containing shellcode, and the processor will execute it. Similarly, an executable containing an interpreter would accept a string containing an unknown language, and it would also be executed. The difference is that the string containing the shellcode could be identified and its functionality inferred beforehand, whereas nothing could be inferred about the unknown language's string of code. Unless a static analysis tool included some form of the interpreter in its functionality, it wouldn't learn what the code does until it starts executing instructions.
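
The point can be illustrated with a deliberately harmless toy. In the sketch below, the "unknown language" is a made-up stack notation invented for this example (it is not Jaws), and the host code is nothing more than a generic dispatch loop; what the program string does only becomes apparent by running it or by re-implementing the same rules.

```python
def run(program: str) -> list:
    """Interpret a made-up stack language: integers push, '+' and '*' combine, '.' outputs."""
    stack, output = [], []
    for token in program.split():
        if token.isdigit():
            stack.append(int(token))
        elif token == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif token == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif token == ".":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown token: {token!r}")
    return output


# To a scanner that does not know the rules above, this is just a short string.
print(run("400 20 + ."))
```

A tool inspecting only the host would see an ordinary loop over tokens, while a tool inspecting only the string would see data with no known semantics.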

To sum up what I'm trying to say, the weakness remains without the requirement of whitespace characters because interpreted code doesn't need to look anything like Jaws. To be honest, the whitespace part of the design was partly to make my project more interesting or attention-grabbing. The variation of interpretable context-free languages is nearly infinite, so you really would have to be able to detect any instance of an interpreter.

Even without considering interpreted malware, static analysis is already a game of cat and mouse as you said. But bad programs will always do bad things, and in my opinion we should be focusing on detecting/preventing behaviors, not traits.

AnharMiah · 2021-05-07T08:29:06Z

Author

hi @doctormay6 thanks for the detailed expansion!

I agree; it makes sense that we could have any arbitrary language with a custom VM for it, and that it would be basically impossible to detect because we simply don't have enough information to use as a detection trigger per se.

Having said that, in terms of "hiding" an instance of such a language:

> The variation of interpretable context-free languages is nearly infinite, so you really would have to be able to detect any instance of an interpreter

Hiding would restrict that set to languages that are (a) not the original language and (b) not easily seen by humans. I think those constraints do add limits to that nearly infinite set.

Of course, as you mentioned, hiding languages is but one creative way, and there are so many other places it could be hidden; you're absolutely right.

But here is another thought: given an arbitrary string with a known length and encoding, we can compute its information entropy.

Based on Shannon's information theory, and given that whatever language instance is devised will have some complexity, I believe we can statistically distinguish absolute randomness from a language instance?

What are your thoughts?
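
For reference, the quantity in question is the Shannon entropy H = -Σ p(x) log2 p(x) over the symbol frequencies of the string. A minimal sketch, assuming the string is treated as raw bytes and each byte is one symbol (the function name is just for illustration):

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte frequencies, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy(b"aaaaaaaa"))        # a single repeated symbol
print(shannon_entropy(b"abababab"))        # two equally likely symbols
print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Note that this measure only looks at symbol counts, not at ordering or grammar. A program written with just three distinct characters can never exceed log2(3) ≈ 1.585 bits per character by this measure, regardless of what it computes.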

lawndoc · 2021-05-11T13:47:12Z

Maintainer

I don't have much experience with entropy to be honest, but I would think that there wouldn't be much variation between the entropy of a string containing an unknown programming language versus a string containing a natural language or strings containing structured information used by an application. I see lots of room for false positives with that approach as well. What do you think?

1 reply

AnharMiah May 17, 2021
Author

So I think that's the key: we can use what is known as "specified complexity" to determine whether some string contains actual structure/language rather than random noise. Once the Jaws code is embedded in some string (network, media, binary), we can sort those strings into ones that seem to contain structure and ones that are pure noise, even if we don't know what the language is.

False positives would come from anything else that is also a language. What we could do is add a "pre-parse step" that checks against a list of whitelisted languages. The whitelist removes strings that are "whitelisted", and then you run your detection over the remaining strings.

The only way around this would be to apply encryption, such as an XOR-based one-time pad. The only problem now is that you would need to somehow pass your one-time pad (OTP) as well as a way to decrypt it. This would be almost pointless, because there would be no really secure way to hide the OTP!
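
As a small illustration of why the pad is the sticking point, here is a sketch of XOR with a one-time pad on a harmless sample string. The helper name and the message are made up for this example; the pad comes from Python's standard `secrets` module.

```python
import secrets


def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """XOR each byte of data with the matching byte of an equally long pad."""
    if len(pad) != len(data):
        raise ValueError("a one-time pad must be exactly as long as the message")
    return bytes(d ^ p for d, p in zip(data, pad))


message = b"structured text with obvious patterns"
pad = secrets.token_bytes(len(message))  # random pad, to be used only once

ciphertext = xor_bytes(message, pad)     # depends entirely on the random pad
recovered = xor_bytes(ciphertext, pad)   # the same pad reverses the operation
print(recovered == message)
```

The ciphertext is useless without `pad`, so the pad has to travel with it or be fetched from somewhere, and that delivery is exactly what a defender could look for.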

lawndoc · 2021-05-11T13:52:05Z

Maintainer

One quick note, I opened a discussions tab for the repo and will convert this issue to a discussion. We can continue over there if you'd like. Thanks!
Edit: tagging you to make sure you see that I moved it @AnharMiah

1 reply

AnharMiah May 17, 2021
Author

hi @doctormay6 sorry for the late response, I was on holiday!
