As far as I can tell for now, HTTP automation involves two main pieces: a basic tool to communicate with the HTTP server, and an HTML parser to understand (or retrieve information from) what you get back from the server.
==============
I. HTTP CLIENT
==============
The basic tool is an HTTP client implementation. So far I have been using the http package of the Tcl language. There is also libwww, written in C, which may be better and more comprehensive, but I guess it is more difficult to use.
—————
1. GET and POST
—————
Communicating with the HTTP server basically means sending queries to it. Clicking a link, submitting a form, etc., are all queries to specific URLs, with or without body contents. There are two main kinds of queries, GET and POST. GET is normally used to retrieve a URL from the server. In our case it simulates mouse clicks on links. It can also handle simple submissions, such as passing parameters to a CGI script in the URL. POST is more comprehensive. It is used to simulate submitting contents to a server, such as authentication (logging in).
The server sends the requested contents back in direct response to each query. The call blocks until the response arrives, so a slow server can stall your program for a while.
Take the Tcl http package as an example. To query a URL using GET:
set token [::http::geturl $url]
Using POST:
set token [::http::geturl $url -query $query]
where $query holds the body contents. Note that the namespace is lowercase (::http), since Tcl names are case-sensitive.
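Both calls return a token rather than the page itself, so the result still has to be checked and read. The following is a minimal sketch of how that might look, assuming Tcl 8.6 (for try/finally). The helper name fetch and the 10-second timeout are my own choices for illustration, not part of the http package.

```tcl
package require http

# Hypothetical helper: fetch a URL and return the response body,
# or raise an error if the transfer or the HTTP status looks wrong.
# Extra geturl options (e.g. -query, -headers) are passed through.
proc fetch {url args} {
    # -timeout is in milliseconds; 10000 is an arbitrary choice
    set token [::http::geturl $url -timeout 10000 {*}$args]
    try {
        # status is "ok" when a complete response was received
        if {[::http::status $token] ne "ok"} {
            error "request failed: [::http::status $token]"
        }
        # ncode is the numeric HTTP status code (200, 404, ...)
        set code [::http::ncode $token]
        if {$code < 200 || $code >= 300} {
            error "unexpected HTTP status $code"
        }
        return [::http::data $token]
    } finally {
        # Always release the token's state array
        ::http::cleanup $token
    }
}
```

With this in place, a GET becomes fetch $url and a POST becomes fetch $url -query $query.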
To URL-encode the contents, use:
set query [::http::formatQuery $key1 $value1 $key2 $value2 …]
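To make the encoding step concrete, here is a small sketch that needs no network access. The field names and values are made up for illustration. As far as I know, the exact escape sequences used for special characters can differ between versions of the http package, so the comment on the second call only says what must not appear in the output.

```tcl
package require http

# Plain alphanumeric keys and values pass through unchanged,
# joined as key=value pairs separated by "&".
set query [::http::formatQuery user alice lang tcl]
puts $query   ;# prints: user=alice&lang=tcl

# Characters that have a special meaning in a query string are
# percent-encoded, so a value cannot break the key=value structure.
set tricky [::http::formatQuery q "a&b=c"]
puts $tricky  ;# the value's "&" and "=" no longer appear literally
```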
———–
2. Cookies
———–
Cookies are often needed. The server sends them in its HTTP response headers (like Set-Cookie: XXXX), and the client has to send them back in the request headers of later queries (like Cookie: XXXX; XXXX; XXXX). The Tcl way to retrieve cookies from the server's response headers:
————————
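package require http
# Log in with a POST; the form fields are URL-encoded first
set login [::http::formatQuery email spammer@hotmail.com password fooFoo!]
set tok [::http::geturl http://mysite.net/register/user-login.tcl -query $login]
# The token names a global array; state(meta) holds the response headers
upvar #0 $tok state
set cookies [list]
foreach {name value} $state(meta) {
    # Header names are case-insensitive, so compare without case
    if { [string equal -nocase $name "Set-Cookie"] } {
        # Keep only the name=value part, dropping attributes such as Path
        lappend cookies [lindex [split $value {;}] 0]
    }
}
::http::cleanup $tok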
————————-
To send the cookies back with subsequent queries:
————————-
set tok2 [::http::geturl http://mysite.net/some/restricted_page.html -headers [list Cookie [join $cookies {; }]]]
… your code
::http::cleanup $tok2
————————-
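The two steps above (collecting cookies, then sending them back) can also be wrapped in small procs and tried without any server. This is only a sketch: the proc names cookies_from_meta and cookie_header are hypothetical, and the header list is sample data I made up in the same flat name/value form as state(meta).

```tcl
# Hypothetical helper: pick the name=value part out of every
# Set-Cookie header in a flat {name value name value ...} list.
proc cookies_from_meta {meta} {
    set cookies [list]
    foreach {name value} $meta {
        # Header names are case-insensitive
        if {[string equal -nocase $name "Set-Cookie"]} {
            # Drop attributes such as Path or HttpOnly
            lappend cookies [string trim [lindex [split $value ";"] 0]]
        }
    }
    return $cookies
}

# Hypothetical helper: build the value for geturl's -headers option.
proc cookie_header {cookies} {
    return [list Cookie [join $cookies "; "]]
}

# Sample response headers (made up for illustration)
set meta [list \
    Set-Cookie "sid=abc123; Path=/; HttpOnly" \
    Content-Type text/html \
    set-cookie "lang=en; Path=/"]

set cookies [cookies_from_meta $meta]
puts $cookies                   ;# prints: sid=abc123 lang=en
puts [cookie_header $cookies]   ;# prints: Cookie {sid=abc123; lang=en}
```

The result of cookie_header can be passed directly as the argument of -headers in a later ::http::geturl call.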
Here's a complete example showing how to communicate with HTTP servers programmatically:
————————-
package require http

# Extract the redirect target (URL=...') from the response body.
proc get_url {token} {
    set contents [::http::data $token]
    # Capture everything between "URL=" and the next single quote
    if { ![regexp {URL=([^']+)'} $contents dummy url] } {
        error "get_url: no URL found in the response"
    }
    return $url
}

set domain {http://uni14.ogame.org}
# Format the form fields to submit
set login [::http::formatQuery v 2 universe {uni14.ogame.org} login {your-username} pass {your-pass}]
# POST to the server
set token [::http::geturl $domain/game/reg/login2.php -query $login]
# Retrieve the cookies from the response headers
upvar #0 $token state
set cookies [list]
foreach {name value} $state(meta) {
    if { [string equal -nocase $name "Set-Cookie"] } {
        lappend cookies [lindex [split $value {;}] 0]
    }
}
set url [get_url $token]
::http::cleanup $token
puts $url
# Send the cookies with the follow-up query
set token [::http::geturl ${domain}${url} -headers [list Cookie [join $cookies {; }]]]
puts [::http::data $token]
::http::cleanup $token
————————-
===============
II. HTML PARSER
===============
Once we can simulate what a web browser does, the remaining question is how to retrieve useful information from what the server returns. I thought about this quite a lot and at first planned to use regular expressions. However, regexps do not seem safe or stable enough for this task, since small changes in the markup can break a pattern. After some googling, I found that an HTML parser is probably the right choice. I still know little about this topic and need to study it more; that is my next task.
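To illustrate why regexps feel unstable here, consider this small sketch. The two HTML snippets and the pattern are made up for illustration. Both snippets describe the same link as far as a browser is concerned, but the pattern only recognizes one of them.

```tcl
# A pattern that expects exactly: <a href="...">
set pattern {<a href="([^"]*)">}

# Two ways of writing the same link
set html_a {<a href="/game/index.php">Overview</a>}
set html_b {<a class="menu" href='/game/index.php'>Overview</a>}

# regexp returns 1 on a match and 0 otherwise
set found_a [regexp $pattern $html_a -> url_a]
set found_b [regexp $pattern $html_b -> url_b]
puts $found_a   ;# prints: 1
puts $found_b   ;# prints: 0 (extra attribute and single quotes)
```

A parser that understands tags and attributes would not care about the attribute order or the quoting style, which is why it looks like the more robust choice.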