Port Scraper
I created this post to explain the steps I took to set up a simple tool that scans a huge range of IPs and stores the results in a MySQL database, using Go.
Fun fact: the door in this image is inspired by the Portuguese word “porta”, which means door (very similar to port).
Project structure
portscraper
├── config
│ ├── config.toml
│ └── test.csv
├── internal
│ ├── config
│ │ └── ...
│ ├── scan
│ │ └── ...
│ └── sqldb
│ ├── storage
│ │ └── ...
│ └── ...
├── docker-compose.yml
├── Dockerfile
└── main.go
Config
This folder contains the configuration file config.toml and a sample file test.csv.
You can check the test.csv file here.
The structure of this input file is compatible with the one provided by MaxMind. You can download it here: geolite2-city-ipv4.csv.gz.
The config.toml is generated at startup (if it doesn’t exist):
[database]
clear_db_table = false
db_host = 'mariadb'
db_name = 'port_scraper'
db_password = 'root'
db_port = '3306'
db_user = 'root'
[general]
log_disable_colors = false
log_disable_timestamp = false
log_level = 2
n_routines = 50
version = 'v.1.0.2'
[scraper]
file_path = './config/test.csv'
port_range = ['22', '80', '8080', '443', '8443', '1883', '8883', '9092', '1880', '3000', '8123', '32400', '10011', '3306', '27017', '5432', '6379', '8086', '1521', '9200', '25565', '27015']
user_agent = 'Mozilla/5.0 (compatible; PortScraper/1.0; +https://YOURDOMAIN.COM)'
I think most of the parameters are self-explanatory. However, two parameters require further explanation: n_routines and user_agent.
The parameter n_routines is the number of goroutines used to scan one line of test.csv (i.e., one range of IPs). For example, if n_routines is set to 50 and the program is scanning the range 127.0.0.1 to 127.0.0.255, it will perform 50 verifications simultaneously until it reaches the final IP in the range (127.0.0.255), then wait for all unfinished verifications to complete before moving on to the next line in the input file.
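To illustrate the idea, here is a minimal sketch (not the project’s actual code) of how a buffered channel can be used as a semaphore to keep at most n_routines verifications in flight; scanOne and the IP list are placeholders:

```go
package main

import (
	"fmt"
	"sync"
)

// scanOne stands in for the real per-IP/port verification.
func scanOne(ip, port string) {
	fmt.Printf("scanning %s:%s\n", ip, port)
}

// scanRange scans one range of IPs with at most nRoutines concurrent checks.
func scanRange(ips []string, port string, nRoutines int) {
	sem := make(chan struct{}, nRoutines) // at most nRoutines in flight
	var wg sync.WaitGroup

	for _, ip := range ips {
		wg.Add(1)
		sem <- struct{}{} // blocks while nRoutines scans are already running
		go func(ip string) {
			defer wg.Done()
			defer func() { <-sem }()
			scanOne(ip, port)
		}(ip)
	}

	// wait for the unfinished verifications before moving to the next line
	wg.Wait()
}

func main() {
	scanRange([]string{"127.0.0.1", "127.0.0.2", "127.0.0.3"}, "80", 50)
}
```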
The user_agent parameter specifies a header that is included in the HTTP request, identifying the application or device that is making the request. This allows the server to identify the source of the request and potentially take appropriate action (e.g. to block certain types of user agents).
Internal
config
To load the configurations, I chose the Viper library (one of the most used in Golang projects). All the configurations are stored in a global object and are accessible through the following three functions:
- GetScraperConfig
- GetDBEnv
- GetGeneralConfig
In this section, the logger configuration is also loaded. For the logger, I chose the Logrus library, which makes it easy to select different log levels (e.g., Info, Debug, etc.).
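As a rough illustration only (the struct names and keys below are my own assumptions, not necessarily the project’s), loading config.toml with Viper and configuring Logrus could look like this:

```go
package config

import (
	"github.com/sirupsen/logrus"
	"github.com/spf13/viper"
)

// GeneralConfig mirrors the [general] section of config.toml (field names assumed).
type GeneralConfig struct {
	LogLevel  int    `mapstructure:"log_level"`
	NRoutines int    `mapstructure:"n_routines"`
	Version   string `mapstructure:"version"`
}

var general GeneralConfig

// Load reads config.toml from ./config and sets up the logger.
func Load() error {
	viper.SetConfigName("config")
	viper.SetConfigType("toml")
	viper.AddConfigPath("./config")

	if err := viper.ReadInConfig(); err != nil {
		return err
	}
	if err := viper.UnmarshalKey("general", &general); err != nil {
		return err
	}

	// configure Logrus from the loaded values
	logrus.SetLevel(logrus.Level(general.LogLevel))
	logrus.SetFormatter(&logrus.TextFormatter{
		DisableColors:    viper.GetBool("general.log_disable_colors"),
		DisableTimestamp: viper.GetBool("general.log_disable_timestamp"),
	})
	return nil
}

// GetGeneralConfig exposes the loaded [general] section.
func GetGeneralConfig() GeneralConfig {
	return general
}
```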
scan
This is my favorite section 😎
I will enumerate each step to make it easier to explain:
- Read the range of IPs from the CSV file. To save RAM, the program reads the rows one at a time (a sketch of this streaming read follows the list).
- Start a loop through all IPs in the read range. For each IP and port, launch a routine in a concurrent group limited by n_routines.
- For each IP and port, the scan process will follow the diagram below:
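Here is a minimal sketch of that streaming read, assuming (as in the MaxMind file) that the first two CSV columns are the first and last IP of the range; the project’s real code may differ:

```go
package main

import (
	"encoding/binary"
	"encoding/csv"
	"fmt"
	"io"
	"net"
	"os"
)

// ipRange expands an inclusive IPv4 range into individual addresses
// (valid IPv4 input is assumed for this sketch).
func ipRange(start, end string) []string {
	s := binary.BigEndian.Uint32(net.ParseIP(start).To4())
	e := binary.BigEndian.Uint32(net.ParseIP(end).To4())
	var ips []string
	for i := s; i <= e; i++ {
		b := make([]byte, 4)
		binary.BigEndian.PutUint32(b, i)
		ips = append(ips, net.IP(b).String())
	}
	return ips
}

func main() {
	f, err := os.Open("./config/test.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	for {
		row, err := r.Read() // one row at a time keeps memory usage low
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		// assumed layout: row[0] = first IP of the range, row[1] = last IP
		for _, ip := range ipRange(row[0], row[1]) {
			fmt.Println("would scan", ip)
		}
	}
}
```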
Parse the TCP response to collect more information
In this step, the data collection is tailored to the port being scanned. I will give the two examples I have implemented at this moment:
Port 22 (SSH):
Take a look at this Wireshark capture below:
The goal is to capture the protocol version exchange (“2.0-OpenSSH…” marked in the image).
RFC 4253 specifies that:
When the connection has been established, both sides MUST send an identification string
This identification string MUST be:
SSH-protoversion-softwareversion SP comments CR LF
A simple way to parse this is to use a regular expression to obtain the string between “SSH-” (hex: 53 53 48 2D) and CR (hex: 0D).
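A sketch of that approach in Go (the project’s actual implementation may differ slightly):

```go
package main

import (
	"fmt"
	"net"
	"regexp"
	"time"
)

// sshBanner captures everything between "SSH-" (0x53 0x53 0x48 0x2D) and CR (0x0D).
var sshBanner = regexp.MustCompile(`SSH-([^\r]+)\r`)

func grabSSHVersion(addr string) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	conn.SetReadDeadline(time.Now().Add(5 * time.Second))
	buf := make([]byte, 256)
	n, err := conn.Read(buf) // the server sends its identification string first
	if err != nil {
		return "", err
	}

	m := sshBanner.FindSubmatch(buf[:n])
	if m == nil {
		return "", fmt.Errorf("no SSH identification string found")
	}
	return string(m[1]), nil // e.g. "2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.1"
}

func main() {
	v, err := grabSSHVersion("127.0.0.1:22")
	fmt.Println(v, err)
}
```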
Port 3306 (MySQL):
For MySQL:
The goal is to collect the protocol version, and the approach is almost the same. The MySQL HandshakeV10 packet starts with hex 0A (dec: 10), followed by the server version string terminated by a null character. So it is possible to use a regular expression like this to obtain the version: \x0A([^\x00].*?)\x00
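Again as a sketch only, the same banner-grab pattern with the regex above:

```go
package main

import (
	"fmt"
	"net"
	"regexp"
	"time"
)

// mysqlVersion captures the null-terminated server version string that
// follows the 0x0A protocol byte in the HandshakeV10 packet.
var mysqlVersion = regexp.MustCompile(`\x0A([^\x00].*?)\x00`)

func grabMySQLVersion(addr string) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return "", err
	}
	defer conn.Close()

	conn.SetReadDeadline(time.Now().Add(5 * time.Second))
	buf := make([]byte, 512)
	n, err := conn.Read(buf) // the server greets first with HandshakeV10
	if err != nil {
		return "", err
	}

	m := mysqlVersion.FindSubmatch(buf[:n])
	if m == nil {
		return "", fmt.Errorf("no handshake version found")
	}
	return string(m[1]), nil // e.g. "8.0.33"
}

func main() {
	v, err := grabMySQLVersion("127.0.0.1:3306")
	fmt.Println(v, err)
}
```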
HTTP request to collect the webpage title
To collect the page titles, I am using goquery, where I look for the title element.
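For illustration, a trimmed-down version of that idea (URL, timeouts, and error handling are simplified), which also sets the user_agent header described earlier:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)

func fetchTitle(url, userAgent string) (string, error) {
	client := &http.Client{Timeout: 10 * time.Second}

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	// identify ourselves, as configured by the user_agent parameter
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "", err
	}
	// look for the <title> element
	return strings.TrimSpace(doc.Find("title").First().Text()), nil
}

func main() {
	title, err := fetchTitle("http://127.0.0.1:8080", "Mozilla/5.0 (compatible; PortScraper/1.0)")
	fmt.Println(title, err)
}
```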
HTTP request to collect the favicon
Some applications have dynamic titles, but their favicons usually remain the same. Therefore, it is possible to identify the type of application based on its favicon. To facilitate my work, I found a list of Shodan favicon hashes online. After some research, I discovered that the hash keys are computed using the MurmurHash algorithm. To integrate this idea into the project, the process is: if a valid favicon is found, I compute its MurmurHash and store it.
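As a sketch of that step (the hashing library and the base64 wrapping details below are my assumptions; Shodan-style hashes are commonly computed over the base64 text wrapped at 76 characters):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"net/http"

	"github.com/spaolacci/murmur3"
)

// faviconHash computes a MurmurHash3 (32-bit) over the base64-encoded favicon,
// wrapping the base64 text into 76-character lines (assumed convention).
func faviconHash(raw []byte) int32 {
	b64 := base64.StdEncoding.EncodeToString(raw)
	var wrapped []byte
	for i := 0; i < len(b64); i += 76 {
		end := i + 76
		if end > len(b64) {
			end = len(b64)
		}
		wrapped = append(wrapped, b64[i:end]...)
		wrapped = append(wrapped, '\n')
	}
	return int32(murmur3.Sum32(wrapped))
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8080/favicon.ico")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("favicon hash:", faviconHash(raw))
}
```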
HTTPS request to collect TLS: DNSNames & Organizations
In case you didn’t notice, I am performing this scan based on IP addresses, not domain names. However, if I want to determine the organization behind an IP address, I can analyze the TLS certificates and record the common name (CN) and organization (O) fields. Other parameters can be collected from TLS certificates, but I chose to implement only these two.
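To illustrate (again, a sketch rather than the project’s exact code), crypto/tls exposes everything needed:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func grabTLSInfo(addr string) error {
	conn, err := tls.DialWithDialer(
		&net.Dialer{Timeout: 5 * time.Second},
		"tcp", addr,
		// we only want to read the certificate, not establish a trusted session
		&tls.Config{InsecureSkipVerify: true},
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	for _, cert := range conn.ConnectionState().PeerCertificates {
		fmt.Println("DNS names:   ", cert.DNSNames)
		fmt.Println("Common name: ", cert.Subject.CommonName)
		fmt.Println("Organization:", cert.Subject.Organization)
	}
	return nil
}

func main() {
	if err := grabTLSInfo("127.0.0.1:443"); err != nil {
		fmt.Println("error:", err)
	}
}
```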
sqldb
This is where all the database interactions live.
In the storage folder, you will find the data models and functions used to interact with the database. The sqldb.go file includes the database initialization, connection, and statement wrappers. The tables.go file handles the creation of the scan table (if it hasn’t been created previously) and the static tables such as the favicon hash list and port service list (if they haven’t been modified).
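For a feel of what lives there, a minimal sketch using database/sql and the MySQL driver (the table and column names are my own placeholders, not necessarily the project’s):

```go
package storage

import (
	"database/sql"

	_ "github.com/go-sql-driver/mysql"
)

// ScanResult is an illustrative model; the real columns may differ.
type ScanResult struct {
	IP        string
	Port      int
	Banner    string
	HTTPTitle string
}

// Connect opens the MariaDB/MySQL connection using the [database] settings.
func Connect(user, password, host, port, name string) (*sql.DB, error) {
	dsn := user + ":" + password + "@tcp(" + host + ":" + port + ")/" + name
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return nil, err
	}
	return db, db.Ping()
}

// InsertScan writes one result using a prepared statement.
func InsertScan(db *sql.DB, r ScanResult) error {
	// table and column names are assumptions for this sketch
	stmt, err := db.Prepare(
		"INSERT INTO scan (ip, port, banner, http_title) VALUES (?, ?, ?, ?)")
	if err != nil {
		return err
	}
	defer stmt.Close()

	_, err = stmt.Exec(r.IP, r.Port, r.Banner, r.HTTPTitle)
	return err
}
```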
Dockerfile & docker-compose
I will document this section in another Post (stay tuned).
Get started ➡ Post Docker + Port Scraper