What does a crawler mean in a file sharing network? To me it is very simple: something that collects all the files shared by clients on that DC network. But collecting all the shared files themselves is a huge task, so I reduced the scope to collecting metadata and organising that information. Metadata about shared files in a DC network is nothing but the file lists that the client software creates. The main point of crawling is to gather stats.
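To give an idea of what that metadata looks like: DC++ style clients publish their share as a bzip2-compressed XML file list (files.xml.bz2), and reading one is straightforward. The element and attribute names below follow the common DC++ convention; a particular client may add extra fields.

```python
# Minimal sketch of walking a DC++ style file list (files.xml.bz2).
# Assumes the usual <FileListing><Directory Name=".."><File Name=".." Size=".." TTH=".."/> layout.
import bz2
import xml.etree.ElementTree as ET

def iter_files(path):
    """Yield (name, size, tth) for every <File> entry in the file list."""
    with bz2.open(path, "rb") as fh:
        tree = ET.parse(fh)
    for node in tree.getroot().iter("File"):
        yield node.get("Name"), int(node.get("Size", 0)), node.get("TTH")

for name, size, tth in iter_files("files.xml.bz2"):
    print(name, size, tth)
```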
Since a db is needed to store all the data, I decided to use PostgreSQL, mainly because I already had some experience with it from OWTF, where we use the same. Writing raw SQL queries is a pain for a pet project, so I went with the SQLAlchemy ORM.
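Hooking SQLAlchemy up to PostgreSQL is only a couple of lines. The connection string below is a sketch with placeholder credentials; the real settings live in the crawler script, and you need a PostgreSQL driver such as psycopg2 installed.

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder role/password/db name -- use whatever you configure for your setup.
engine = create_engine("postgresql://crawler:secret@localhost/crawler_db")
Session = sessionmaker(bind=engine)
session = Session()
```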
There are only two models in this crawler.
They are joined by a many-to-many relationship, since the same file can be shared by multiple users and a user shares multiple files.
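Roughly, the models look like the sketch below. The names (User, File) and columns are my guesses at what such a crawler needs; the real definitions are in the crawler script.

```python
from sqlalchemy import Column, Integer, BigInteger, String, Table, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Association table: one file can be shared by many users,
# and one user shares many files.
user_files = Table(
    "user_files", Base.metadata,
    Column("user_id", Integer, ForeignKey("users.id"), primary_key=True),
    Column("file_id", Integer, ForeignKey("files.id"), primary_key=True),
)

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    nick = Column(String, unique=True)
    files = relationship("File", secondary=user_files, back_populates="users")

class File(Base):
    __tablename__ = "files"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    size = Column(BigInteger)
    users = relationship("User", secondary=user_files, back_populates="files")
```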
The crawler is here. The db settings are present towards the end of the script.
Get the crawler script; better yet, get the whole repo.
Configure the database settings, i.e. create the db role and the db if necessary. There is a helper script called db_setup.sh
in the repo.
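If you prefer doing it by hand, the setup boils down to something like this; the role and database names here are placeholders, not necessarily what db_setup.sh uses.

```sh
# Create a role and a database owned by it (names are placeholders).
sudo -u postgres createuser --pwprompt crawler
sudo -u postgres createdb --owner=crawler crawler_db
```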
Since this was an experimental project, if you wish to listen to/crawl a particular hub, you have to edit the crawler. Find the line
that sets the hub address and change the IP and port to the desired values.
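It will look something along these lines; the exact variable names in the script may differ.

```python
HUB_HOST = "127.0.0.1"   # IP or hostname of the hub you want to crawl
HUB_PORT = 411           # 411 is the conventional NMDC hub port
```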
Now you are ready. Use one of the following commands:
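Something like the commands below should do; note that the entry-point name crawler.py is my assumption here, so check the repo for the actual script name.

```sh
# Run the crawler in the foreground:
python crawler.py

# Or keep it running after logging out:
nohup python crawler.py &
```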