Gathering information with specialized search engines is cool; you can extract useful data to build your own blocklist, for example. In this post, I am going to try something different. I’m attempting to create a specific proof of concept (POC): the goal is to gather information about possibly malicious infrastructure and then have an LLM (Large Language Model) analyze it. I’m not planning to use an agent-based architecture—just a Python script that queries search engines and then asks an LLM to analyze the different exposed services.
Infrastructures elements
In order to achieve my goal, I’m going to use an API. For this, the FOA API has been used. The API allows me to track possibly malicious servers with a specific request and then creates an array containing the IPs and ports. The second element is the LLM: I managed to use a fine-tuned LLM: AlicanKiraz0/SenecaLLM_x_Qwen2.5-7B-CyberSecurity-Q2_K-GGUF.
Let’s code a little PoC
Well, this idea is part of a bigger project, I’m currently designing a bigger tool for Threat Hunting purpose.
Firt step: gathering data
The query
In order to gather some data, we need a request, let’s consider the following request:
This request will provide us a result composed of hosts running a web server based on python and containing an open-dir page with an bash_history. this’ll be enought for the POC.
import fofa if __name__ == "__main__": tab = [['IP','Port']] key = '{REDACTED}' # Input fofa key client = fofa.Client(key = key, email = "{REDACTED}") # I query_str = 'product="SimpleHTTP" && title="for /" && ".bash_history"' data = client.search(query_str, size=5, page=1, fields="ip,port") for ip, port in data["results"]: #print("%s,%s, %s" % (ip,port)) tab.append([ip,port]) print(tab)
Here some basic code in order to request FOFA. Here, for this example, I retrieve 2 specifics fields from 5 hosts and then append them into my array. Since I can’t retrieve HTML body with FOFA I’m going to do it with “requests”:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
for pair in tab[1:]: ip, port = pair url = f'http://{ip}:{port}'
try: response = requests.get(url, timeout=5)
if response.status_code == 200: body_content = response.text print(f"Body content for {ip}:{port}:\n{body_content}...") else: print(f"Failed to retrieve content for {ip}:{port}, status code: {response.status_code}")
except requests.exceptions.RequestException as e: print(f"Error retrieving {ip}:{port}: {e}")
The code above is a test, I made a request with the elements in my array and then print the body content, I’ve got this kind of result:
The results are starting to get interesting. Let’s try manually asking an AI for some analysis. I have Ollama installed with the specific model mentioned in the introduction. Let’s run it in the terminal and check its response. Here is the first prompt:
Hello, act as a cybersecurity analyst, I am going to send you some HTML BODY from suspicious open-directory websites. I need you to analyze this html body. You need to return to me if there is hacker's tools in this directory, if there is specifics files that seems to be related to hacking. Dress an array please.
The HTML body you provided displays the contents of an open directory for a suspicious website. From this listing, several directories and files that are commonly associated with offensive hacking activities stand out:
### Key Directories: - **`.bash_history`, `.bashrc`, `.profile`:** Configuration files for the Bash shell. - **`.cargo`, `.rustup`:** Files related to Rust and Cargo package management. - **`.config`, `.cache`:** User-specific configuration and cache directories. - **`.docker`, `rengine`:** Docker-related tools or configurations. - **`nuclei-templates`, `.jaeles`:** Directories for storing hacking templates and tools. - **`.local`, `.pki`, `.semgrep`, `.google-cookie`:** Miscellaneous personal and security-related data. - **`osmedeus-base`, `reconftw`:** Reconnaissance tools, possibly part of automated hacking workflows. - **`Tools`:** A directory likely containing various offensive hacking tools.
### Key Files: - **`.bash_history`, `.bashrc`, `.profile`:** User-specific configuration files for the Bash shell. - **`log.txt`:** Log file that may contain information on previous hacking activities or results. - **`rengine/rengine.sh`:** An executable script named "rengine," which could be a custom hacking tool. - **`update.txt`, `release-3.3.6.zip`:** Files indicating potential updates, possibly for security tools like Hadoop releases. - **`site1.txt`, `site2.txt`, `site3.txt`:** Lists of target websites or domains, likely used in mass scanning campaigns.
### Suspicious Activity: The presence of directories and files such as `.ghauri`, `.osmedeus`, `.paramspider`, etc., suggests that the site is hosting resources for various automation tools commonly employed by offensive cybersecurity professionals. The `Tools` directory alone contains numerous potentially malicious or otherwise sensitive items, raising serious red flags.
### Possible Threats: - **Unsecured DevOps Tools:** The presence of `.cargo`, `.rustup`, and other package managers implies the use of development or testing environments that could have been left unsecured. - **Automated Scanning Scripts:** Files like `paramspider` and directories such as `rengine` suggest the use of automated scanning tools, which are often associated with malicious probing for vulnerabilities. - **Sensitive Data:** The presence of personal files (`.google-cookie`) or logging mechanisms (log.txt) could indicate that this directory was used to record sensitive data or store results from previous hacking activities.
Given the contents of this directory, it appears highly likely that this website is hosting assets intended for offensive cybersecurity operations. This type of infrastructure can lead to widespread exploitation if not properly secured. Immediate steps should be taken by security researchers and organizations to identify the perpetrators and neutralize the threat.
As you can see, this is quite precise. Here is a second example, of a directory containing a CVE PoC file:
Thank you for providing the HTML body. Based on the contents listed, there are indications that this directory might be used by an individual or group for offensive hacking activities. Here is a breakdown of what I observed:
### Hacker Tools and Files: 1. **.bash_history** and **.bashrc**: Configuration files for Bash shell. 2. **336.txt**, **4.txt**, **site1.txt**, **site2.txt**, **site3.txt**: Potential lists of targets or specific information. 3. **CVE-2023-4966/**: A directory named after a known CVE (Common Vulnerabilities and Exposures), which suggests that this individual might be targeting or investigating this vulnerability. 4. **hadoop-rel-release-3.3.6/**: Hadoop release files, potentially for exploiting or scanning. 5. **log.txt**: Log file containing information on previous hacking activities or results. 6. **MinKhoy/**: A directory named after a hacker who has been associated with vulnerabilities and exploits. 7. **nuclei-templates/**: Directory likely used to store vulnerability templates for the Nuclei tool, which is an open-source project that allows users to perform mass scanning on websites. 8. **osmedeus-base/**, **reconftw/**, **paramspider/**: Reconnaissance tools commonly found in hacker workspaces. 9. **release-3.3.6.zip**: A zip file containing a release version of Hadoop, likely for exploitation purposes. 10. **rengine/** and **rengine/rengine.sh**: A custom hacking tool named "rengine" along with its script. 11. **Tools/**: Directory likely to contain various offensive hacking tools such as Nuclei, ReEngine, etc.
### Miscellaneous Files: - **.google-cookie**: Google cookie file, which might be used for automated web browsing or data collection.
### Recommendations: Given the presence of these directories and files, it is highly probable that this directory is associated with malicious activities. The individual or group responsible appears to be using various open-source tools and scripts in their hacking efforts. Immediate steps should be taken, such as reporting the suspicious activity to the website owner and potentially reaching out to cybersecurity organizations if necessary.
Let’s complete this puzzle!
So, what remains to be done ? - Customize the system prompt and the temperature of the AI model - Making the HTML body of each request as an input for the LLM - exploit the LLM output and put all the information in a CSV
Langchain interaction
First let’s initialize the model: - We need to specify the base_url for langchain - We need t ospecify the model with is local name - We need to specify the temperature -> I choose 0.6 not to low, not to high.
1 2 3 4 5 6 7 8 9
base_url = "http://localhost:11434" model = "llama3.2:1b"
Then I did some prompting engineering with system message and message template:
1 2 3 4 5 6 7 8 9 10 11
system = SystemMessagePromptTemplate.from_template('You are a cyber security specialist, your main goal is to analyze HTTP HEADERS looking for hacking tools in open directory webpages, if there is specifics files that seems to be related to hacking (like software,CVE,exploit files, hacking tools). Dress an array please.') question = HumanMessagePromptTemplate.from_template('Analyze this HTML Body please: {body}, give me all informaiton about what is contained in this one') question question.format(body="html body")
At this point here is the kind of response provided by the model:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
his HTML body represents a directory listing for the root URL of a web server. It contains various files and directories that could be used by hackers to exploit vulnerabilities or conduct malicious activities on the target system. Here are some notable entries:
1. **Hacking Tools and Exploits:** - `bluekeep/`: Directory likely containing information or tools related to exploiting the BlueKeep vulnerability. - `brute_roundcube/`: Directory for brute-forcing RoundCube login credentials. - `exploit1337.py`: Python script used for an unknown exploit. - `exploits/:` A directory housing multiple exploits, possibly categorized by CVEs or types.
2. **Sensitive Files:** - `.bash_history`, `.bashrc`, `.profile`, `.python_history`: Configuration and history files that might contain commands executed on the server. - `.ssh/`: Directory for SSH keys and configurations. - `hosts_all_sort`: A sorted list of hosts, which could be used in scanning activities.
3. **VPN Configurations:** - `123.ovpn`, `jir.ovpn`: OpenVPN configuration files that might contain credentials or server details.
As you can see, this is an acceptable output and quite accurate. I think that with an agentic AI framework, it would be possible to achieve greater information granularity. Exporting the results to a human-exploitable format would also be a nice addition! However, this is just a proof of concept (PoC). I might release a complete tool for this kind of information gathering and intelligence production in a few months.