Manipulating Websites with VB.Net
By ghost | 10987 Reads |
(Corrected the formatting in the article so that it is easier to read.)
This article will discuss how to manipulate the DOM objects on a website using VB.Net 2005. In no way will it be a definitive exploration of that concept. Rather, it will provide you with the theory and a working example (with explanations) that should encourage you to learn and discover more; the links at the end of this article make that easy.
The first object you will be using will be the Microsoft WebBrowser 2.0 object; this object comes ready-to-use with the standard installation of Visual Studio .Net 2005 (should also be the case with VS .Net Express 2005). This object has quite a bit more functionality and more ease of use than the old AxWebBrowser object. When manipulating the DOM, the WebBrowser object will primarily be used to navigate and provide the browser document to the next object for manipulation.
The next object, which ends up doing the VAST majority of the work, is the MSHTML object. To use this object, you will need to add a reference to it by clicking the following: Project (in the menu bar), Add Reference, the COM tab, Microsoft HTML Object Library. You can also add the reference, using the Add Reference interface and the Browse tab, by browsing to %SystemRoot%\system32\mshtml.tlb (%SystemRoot% already points at the Windows directory). By using the HTMLDocument type from the mshtml library, you are able to take the browser document from a WebBrowser (or Internet Explorer) object and manipulate it as if it were the actual browser document itself; that is, when you affect the mshtml.HTMLDocument, it affects the WebBrowser.
Other than that, you will need a decent knowledge of how the DOM is structured. Google will be your best friend with this. There are two main methods that deserve attention when intending to manipulate the DOM: manipulating by object index, or manipulating by object properties. If you manipulate by object index, you are basically referencing a static DOM object on the page according to the order in which it occurs on the page (much like an array). Alternatively, if you manipulate by object properties, you will be looking for objects that have a particular name, id, class, value, etc.
Alright, well, now that we have the theory, we can get to the code!
Imports mshtml
Public Class Form1
    Dim page As mshtml.HTMLDocument
    Dim Elements As mshtml.IHTMLElementCollection
Since we added the reference to MSHTML, we want to use the Imports mshtml statement so that we do not have to preface our mshtml statements with the namespace each time. Below Public Class Form1, we have our global (class-level) variables. While it is not good practice to use globals in code, it simplifies learning a new concept. Our first global variable, page, is declared as an mshtml.HTMLDocument object. This will be our container for the document loaded in the WebBrowser object (which we will call Browser1, and will assume is placed on the main form using the GUI in VS2005). The second variable, Elements, is declared as an HTML element collection. That is, we will be able to reference every element of a given type on our page variable using this variable.
More code:
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Browser1.Navigate("http://www.hellboundhackers.org")
        Application.DoEvents()
        ' Keep pumping messages until the page has finished loading
        While Browser1.ReadyState <> WebBrowserReadyState.Complete
            Application.DoEvents()
        End While
        page = Browser1.Document.DomDocument
        Elements = page.getElementsByTagName("input")
        ' IHTMLInputElement (unlike IHTMLElement) exposes the name and value properties
        For Each obj As mshtml.IHTMLInputElement In Elements
            If obj.name = "user_name" Then
                obj.value = "Zephyr_Pure"
            ElseIf obj.name = "user_pass" Then
                obj.value = "mypassword"
            End If
        Next
        ' Submit the first form on the page (late-bound call)
        page.forms.item(0).submit()
    End Sub
Okay, this subroutine is set to execute when the form loads. We tell the WebBrowser object (Browser1) to navigate to HBH. Next, we execute Application.DoEvents() to get the WebBrowser wheels turning. We wait for Browser1 to complete loading the page (which is when it will have the ReadyState = WebBrowserReadyState.Complete), then we make our mshtml.HTMLdocument object equal to the WebBrowser Domdocument. Next, we set our HTML element collection equal to all of the input tags in our Domdocument.
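As an aside, the DoEvents loop keeps the example short, but it spins while it waits. One alternative, which is not what this article's code uses, is the WebBrowser control's DocumentCompleted event. Here is a minimal sketch, assuming a form that creates its own Browser1 in code rather than through the designer. The class name EventDrivenForm is made up for illustration, and instead of logging in it only copies the page title into the form caption.

```vbnet
Imports System
Imports System.Windows.Forms
Imports mshtml

' Hypothetical form used only for this sketch
Public Class EventDrivenForm
    Inherits Form

    ' WithEvents lets the Handles clause below subscribe to the control's events
    Private WithEvents Browser1 As New WebBrowser()
    Private page As mshtml.HTMLDocument

    Public Sub New()
        Browser1.Dock = DockStyle.Fill
        Controls.Add(Browser1)
    End Sub

    Protected Overrides Sub OnLoad(ByVal e As EventArgs)
        MyBase.OnLoad(e)
        Browser1.Navigate("http://www.hellboundhackers.org")
    End Sub

    Private Sub Browser1_DocumentCompleted(ByVal sender As Object, _
            ByVal e As WebBrowserDocumentCompletedEventArgs) _
            Handles Browser1.DocumentCompleted
        ' The event also fires for each frame; skip everything but the top-level page
        If Not e.Url.Equals(Browser1.Url) Then Return
        page = DirectCast(Browser1.Document.DomDocument, mshtml.HTMLDocument)
        Text = "Loaded: " & page.title
    End Sub
End Class
```

If you move the login code into a handler like this, keep in mind that the event fires again once the submitted form loads the next page, so you would want a flag to stop it from submitting twice.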
Next, we check all of the input tags for the user_name and user_pass objects, and fill in the correct information accordingly. Finally, we submit the login form (which is the first form on the page, so it is form 0 since it counts from 0). We could also reference that login form by calling it by name, like so:
page.forms.item("loginform").submit()
Although that is a simple and short demonstration of both index-based and property-based reference, it serves its purpose in providing a general understanding of the concepts. As always, more research is needed to better understand the concepts.
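To put the two approaches side by side, here is a small sketch that wraps each one in a helper function. It assumes the mshtml reference has been added as described earlier; the module name DomLookupSketch and the function names FormByIndex and InputByName are hypothetical and only exist for this illustration.

```vbnet
Imports mshtml

' Hypothetical helpers contrasting index-based and property-based lookup
Module DomLookupSketch
    ' Index-based: return the form at the given position, or Nothing if out of range
    Function FormByIndex(ByVal page As HTMLDocument, ByVal index As Integer) As IHTMLFormElement
        If index < 0 OrElse index >= page.forms.length Then Return Nothing
        Return DirectCast(page.forms.item(index), IHTMLFormElement)
    End Function

    ' Property-based: return the first <input> whose name matches, or Nothing
    Function InputByName(ByVal page As HTMLDocument, ByVal wanted As String) As IHTMLInputElement
        For Each candidate As IHTMLInputElement In page.getElementsByTagName("input")
            If candidate.name = wanted Then Return candidate
        Next
        Return Nothing
    End Function
End Module
```

With these in place, InputByName(page, "user_name") would hand back the user name field (check the result against Nothing before setting its value), and FormByIndex(page, 0) would hand back the first form. The index-based version breaks if the site adds a form above the login form, while the property-based version breaks if the site renames the field, so pick whichever is less likely to change on the page you are working with.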
As a final demonstration, the code below will enumerate the forms on a page, telling you which index the form is, as well as the form name. Obviously, it can be extended to include form classes, form actions, etc. Here is the code:
    Sub EnumForms()
        Browser1.Navigate("http://anysitewithforms.com")
        Application.DoEvents()
        While Browser1.ReadyState <> WebBrowserReadyState.Complete
            Application.DoEvents()
        End While
        page = Browser1.Document.DomDocument
        Dim formlist As String = ""
        ' forms.length is the number of forms; valid indexes run from 0 to length - 1
        For x As Integer = 0 To page.forms.length - 1
            formlist &= "Form #" & CStr(x) & ": " & page.forms.item(x).name & Chr(13)
        Next
        MsgBox(formlist)
    End Sub
As before, we tell the Browser to navigate and load the requested webpage. When the Browser document has loaded, we set our mshtml.HTMLDocument object equal to the Browser DomDocument. We need somewhere to store our form list, so we will use a string variable to hold our carriage-return-separated list of form items. Next, we iterate through all of the forms on the page by looping from 0 to page.forms.length - 1; the length property returns how many forms are in the page and, since they start from 0, we have to subtract 1 from the total. For each form object, we will construct a string saying which index it is (after Form #) and the name of the form, followed by a carriage return (ASCII character 13 is a carriage return, while a line feed is character 10; you can also use vbCrLf, which is both together, but I grew up using ASCII chars). Finally, we display a message box with the carriage-return-separated list so we can see all the forms on the page.
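If the character codes are hard to keep straight, the following console sketch prints them and then builds the same kind of list using vbCrLf. It does not touch the browser at all: the names array is a hypothetical stand-in for page.forms.item(x).name, and the module name LineBreakSketch is made up for this example.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module LineBreakSketch
    Sub Main()
        Console.WriteLine(Asc(vbCr))                    ' 13: carriage return
        Console.WriteLine(Asc(vbLf))                    ' 10: line feed
        Console.WriteLine(vbCrLf = Chr(13) & Chr(10))   ' True: vbCrLf is both together

        ' Hypothetical form names standing in for page.forms.item(x).name
        Dim names() As String = {"loginform", "searchform"}
        Dim formlist As String = ""
        For x As Integer = 0 To names.Length - 1
            formlist &= "Form #" & CStr(x) & ": " & names(x) & vbCrLf
        Next
        Console.Write(formlist)
    End Sub
End Module
```

A lone Chr(13) is enough for MsgBox, but vbCrLf tends to be the safer choice if the list ends up in a text box or a file.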
As stated before, this is only a broad introduction to MSHTML and VB.Net manipulating the DOM. There are many ways that this could be extended and, hopefully, I will find the time to write more articles about those very things. For now, however, I will leave any further research and expansion up to you, the readers. Below are the two most important references I used for this article. I hope you enjoyed it.
Renegade Minds Web Form Submitter: http://renegademinds.com/Default.aspx?tabid=47
MSDN MSHTML Reference: http://msdn.microsoft.com/workshop/browser/mshtml/reference/reference.asp
Comments
ghost 17 years ago
Well, I appreciate the positive feedback. However, if you say "I could do it better", some creative criticism would help so that, in the future, I can. :)