Ideas for a security tool in python?
Well, right now I'm coding a web spider in Python that grabs all the links on a page and stores them in a text file. It works in a limited way at the moment: it still needs to loop back through all the links it finds, go to each of those links, get their links, and so on. But right now I can give it a URL manually and it will grab all the links on that page. So if you could help me with that, that would be cool. Thanks :)
digitalchameleon wrote: What type of security/pen testing tools and scripts would people here like to see written? I'll most likely be writing it in python, and it can be something that connects some common tools, or a small application you think would be useful.
Virtual machines are usually cool. If, however, you don't mind your tool not being multi-platform or written in Python, you could always make an NDIS intermediate (IM) driver (for Windows) that allows traffic manipulation: sort of like a packet filter, but instead of just filtering, it lets you modify the traffic on the way through. I've wanted to make one for a while; the main use (for me) would be to cheat at online games.
chislam wrote: Well, right now I'm coding a web spider in Python that grabs all the links on a page and stores them in a text file. It works in a limited way at the moment: it still needs to loop back through all the links it finds, go to each of those links, get their links, and so on. But right now I can give it a URL manually and it will grab all the links on that page. So if you could help me with that, that would be cool. Thanks :)
http://en.wikipedia.org/wiki/Queue_%28data_structure%29
Make a FIFO queue and, while the queue is not empty, take the next element off the queue, process it, and repeat. It may be better to make it a priority queue sorted by host; that way you could keep the connection to a host alive and just fetch the next relative path for that host.
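A minimal sketch of that queue discipline in Python, standard library only; get_links is a placeholder for whatever link extraction you end up with and max_pages is just a safety cap, both assumptions rather than anything posted here. Pages you've already seen aren't handled yet (that comes up further down the thread), and sorting by host would just mean keying a dict of deques by hostname so each host's URLs come out together.
import collections

def crawl(start_url, get_links, max_pages=50):
    # FIFO queue of URLs still to fetch; popleft() gives first-in, first-out
    queue = collections.deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()            # take the next element off the queue
        fetched += 1
        for link in get_links(url):      # process it: fetch the page, pull out its links
            queue.append(link)           # and put each one at the back of the queue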
Yeah, I see how that could speed up the spidering; however, I'm still a newb in Python. I just started learning it last week, using the built-in documentation and tutorials. Based on that, and then googling for the functions I need, I've been able to come up with what I have so far. So I'll try to get it working the way I first thought of, and once it's working, I'll try to optimize it. Thanks though :)
chislam wrote: Well, right now I'm coding a web spider in Python that grabs all the links on a page and stores them in a text file. It works in a limited way at the moment: it still needs to loop back through all the links it finds, go to each of those links, get their links, and so on. But right now I can give it a URL manually and it will grab all the links on that page. So if you could help me with that, that would be cool. Thanks :)
Awesome idea.
Sounds like what you really need here is a recursive function. A recursive function is one that calls itself again, and possibly again and again. For example, consider:
def CountDown(intNumber=5):
    if intNumber == 0:
        return
    print intNumber
    CountDown(intNumber - 1)
This code will print the number 5 and then call itself with 5-1. The copy it calls will print 4 (5-1) and call itself with 4-1. The pattern continues until the number reaches 0 and the function returns without calling itself again.
Here's an example of how your code could work:
def CrawlPage(strAddress):
    # get the links on the page at strAddress
    # put them in a global variable (we'll call the list of links lstLinks),
    # and do not allow double entries
    for strLink in lstLinks:
        CrawlPage(strLink)
The hard part with recursive functions is making them stop correctly, or at all. The first example has a stop condition: if the number is 0, it stops. You have to decide where you want your spider to stop. Should it stay within a domain, stick to a single address, or even treat its starting page as the root path? I would recommend against testing this on anybody else's server without their permission, because if you fall into an infinite loop, it might look like DoS traffic. My old IP is banned from Photobucket for a somewhat more aggressive, though similar, Python script. Got what I was looking for though. :)
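To make those stop conditions concrete, here is one way CrawlPage could be fleshed out; only a sketch under assumptions, not chislam's actual code. It assumes a get_links(strAddress) helper that returns absolute URLs (one possible version appears later in the thread), keeps a global set of visited pages so nothing is crawled twice, stays on the starting domain, and adds a depth cap as extra insurance.
from urlparse import urlparse

setVisited = set()

def CrawlPage(strAddress, strDomain, intDepth=3):
    # stop conditions: depth used up, page already seen, or link leaves the domain
    if intDepth == 0 or strAddress in setVisited:
        return
    if urlparse(strAddress).netloc != strDomain:
        return
    setVisited.add(strAddress)
    for strLink in get_links(strAddress):    # get_links is the assumed helper
        CrawlPage(strLink, strDomain, intDepth - 1)

# e.g. CrawlPage("http://www.example.com/", "www.example.com")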
Anyway, before I get off topic, I'd totally be up for taking a look at your code, or just answering any other questions you might have.
Chinchilla3k, nice idea. I've been thinking about (one day) working on a toolkit for reverse engineering new protocols, and I think that what you're talking about could be based on code from this project. Right now I've been trying to get into raw socket programming in order to write netcat-like functions into my remote shell, so I have some learning to do first.
As for the first-in, first-out approach, couldn't that end up going in circles? If a page has a 'home' link on it that leads to the root of the site, wouldn't that page get added to the queue? And when the main page was crawled, wouldn't the script just start over from there? Perhaps I don't understand; I've only used Python's Queue module once or twice.
digitalchameleon wrote: Sounds like what you really need here is a recursive function. […] I've been thinking about (one day) working on a toolkit for reverse engineering new protocols, and I think that what you're talking about could be based on code from this project. […] As for the first-in, first-out approach, couldn't that end up going in circles? If a page has a 'home' link on it that leads to the root of the site, wouldn't that page get added to the queue? And when the main page was crawled, wouldn't the script just start over from there?
To avoid this problem I would keep a hash table of visited links and check every new link against it before adding it to the queue. In Python a set (or dict) gives you exactly that, with average O(1) lookup and insert; a sorted structure with binary search would also work at O(log n) per lookup, but the hash table is simpler.
I do not see how my suggestion implied a toolkit for reverse engineering new protocols. In fact, I do not see why such a thing would be needed, because a lot of the most commonly used protocols are already well documented.
EDIT: Also, why recurse when you could loop? Recursion would just cause unnecessary function calls.
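In Python, the "hash table of visited links" is just a set (hash-based, with average O(1) membership tests), and the loop version follows the same shape as the queue sketch earlier in the thread. A tiny illustration with an example URL, not anyone's posted code:
import collections

visited = set()                                   # hash-based record of crawled pages
queue = collections.deque(["http://www.example.com/"])

while queue:                                      # plain loop, no recursion needed
    url = queue.popleft()
    if url in visited:                            # average O(1) lookup
        continue
    visited.add(url)
    # fetch url, extract its links, and append any unseen ones to the queue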
Chinchilla3k you are not a newbie, whatever your 'rank' might say.
I guess I went with a recursive solution because the problem seemed similar to a problem I had, involving finding duplicate files, which I solved recursively. Also, a recursive solution seems like it would allow one to easily produce a visual map of the site, by showing how pages link together. It seemed simple, and I didn't immediately see a solution using loops. Could you explain yours further, perhaps with a bit of pseudocode?
As for your suggestion, I don't think I'd even know where to start.
Chislam, if you have any questions about the python library, feel free to pm me. I'm usually online too much. I'd love to see the code for the working beta you have.
cheers.
richohealey wrote: hmmmm…. goes to code webspider… a simple link trawler would be quite easy i think. Yup, that's the plan. Anything to contribute?
Edit: I've got a few frames that I could modify for this project; adapting them wouldn't be too hard. What I need is the actual core of the code: something to take one page and parse it into a list of links. I assume that urllib is being used to fetch the pages, and you could use HTMLParser to find the links, or perhaps even a regular expression. I think a big constraint here is going to be bandwidth; keeping connections alive is a good idea.
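A hedged sketch of that core piece, using only the Python 2 standard library mentioned above (urllib2 to fetch, HTMLParser to find href attributes, urlparse.urljoin to make the results absolute). It also doubles as the get_links helper assumed in the earlier sketches, but it isn't anyone's posted code.
import urllib2
from HTMLParser import HTMLParser
from urlparse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # collect the href of every anchor tag we see
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links(url):
    # fetch the page and return the absolute form of every link on it
    html = urllib2.urlopen(url).read()
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]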
digitalchameleon wrote: I guess I went with a recursive solution because the problem seemed similar to a problem I had, involving finding duplicate files, which I solved recursively. Also, a recursive solution seems like it would allow one to easily produce a visual map of the site, by showing how pages link together. It seemed simple, and I didn't immediately see a solution using loops. Could you explain yours further, perhaps with a bit of pseudocode?
I can do better than pseudocode. I made a web spider for a small project I had a while ago; I had to look for the CD that had it. The timestamps on the files say July 2005, but I think I did this a year earlier than that, because I remember starting it a couple of months after the Mydoom outbreak. The code is in C, there are no executable files, and it compiles just fine. The web spider is in scan.c, in the function Spider(); it uses a conditional loop. I coded this when I was 15, and naive, so a lot of the code may look lame.
A few notes: the website where it starts spidering is hardcoded in main()… if you compile it and run it, you'll see it start spidering at http://www.hackthissite.org/ and it'll spit out email addresses when it finds them.
digitalchameleon wrote: Do you have another link for the source? That one is… difficult. Also, how hard would it be to let the spider wander off-site, to see what other pages a site links to? I honestly would've preferred pseudocode; I've never done anything beyond tutorial stuff in C.
Thanks
The new link is in the previous post.
In short, the pseudocode would look like this:
while(link = dequeue(LinkQueue)) {
    socket = connect(hostname(link))
    send(socket, buildhttprequest(link))
    data = receive(socket)
    while(newlink = getlink(data)) {
        if(!checkforduplicate(newlink)) {
            enqueue(LinkQueue, newlink)
        }
    }
    close(socket)
}
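For the fetch part of that loop in Python (connect, send a hand-built request, read until the server closes), something like the sketch below would do, assuming plain HTTP on port 80 with no explicit port in the URL, HTTP/1.0 so the server closes the connection after one response, and no error handling. The dedup-and-enqueue part is the same as in the earlier sketches.
import socket
from urlparse import urlparse

def fetch(link):
    # connect(hostname(link)), send(buildhttprequest(link)), receive, close
    parts = urlparse(link)
    path = parts.path or "/"
    if parts.query:
        path = path + "?" + parts.query
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((parts.netloc, 80))
    sock.sendall("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, parts.netloc))
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:                  # empty read means the server closed the connection
            break
        chunks.append(chunk)
    sock.close()
    return "".join(chunks)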
It shouldn't be hard to let it wander off the site, nor should it be hard to keep it from wandering off. If your spider only handles links relative to the current host, then it's incomplete: some webmasters use full links to pages on their own server.
For example, if HBH used full links and you wanted to spider it, but your spider only handled relative links by prepending the host, the link it would build for the next page would be
http://www.hellboundhackers.orghttp://www.hellboundhackers.org/search.php
So, it should be easy if you designed your spider properly.
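The usual way to handle both cases in Python is urlparse.urljoin, which leaves absolute links alone and resolves relative ones against the page they came from, so the broken concatenation above can't happen. A quick illustration with example URLs:
from urlparse import urljoin, urlparse

base = "http://www.hellboundhackers.org/index.php"

print urljoin(base, "search.php")                                   # http://www.hellboundhackers.org/search.php
print urljoin(base, "http://www.hellboundhackers.org/search.php")   # unchanged, already absolute

# and if you want to stay on the site, compare hosts before queueing:
link = urljoin(base, "http://www.example.com/elsewhere")
print urlparse(link).netloc == urlparse(base).netloc                # False, so skip it when staying on-site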
I don't know why I didn't see that immediately.
I've never assigned a variable as the condition for a while loop. Dunno why, it just never occurred to me. And getlink() would have to modify variables other than the one returned (it would have to modify data to remove the link it found, no?). I try to avoid this.
Still, I don't know why I couldn't see this solution immediately, or why a recursive solution seemed most obvious to me. I feel somewhat blind.
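One way around that in Python is to make getlink a generator: it remembers its own position between calls, so data never has to be modified and the caller just iterates over it. A rough regex-based sketch (the HTMLParser approach mentioned earlier would be more robust); the pattern and helper name are illustrative only.
import re

HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def getlinks(data):
    # yields each href in turn without touching 'data' at all
    for match in HREF.finditer(data):
        yield match.group(1)

# usage: for link in getlinks(data): ...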
Do you have experience working in Python? You seem more likely to have experience working in asm.