Commit

Merge pull request scrapinghub#811 from ArturGaspar/request_body_in_har
Provide request body in request info.
kmike committed Oct 23, 2018
2 parents c4bf9eb + 6786b1f commit fe8c67d
Showing 19 changed files with 403 additions and 41 deletions.
33 changes: 27 additions & 6 deletions docs/api.rst
@@ -355,14 +355,26 @@ You can use online `HAR viewer`_ to visualize information returned from
this endpoint; it will be very similar to "Network" tabs in Firefox and Chrome
developer tools.

Currently this endpoint doesn't expose raw request contents;
only meta-information like headers and timings is available.
Response contents is included when
:ref:`'response_body' <arg-response-body-0>` option is set to 1.
Request and response contents are included when
:ref:`'request_body' <arg-request-body-0>` and
:ref:`'response_body' <arg-response-body-0>` options are set to 1,
respectively.

Due to the HAR_ format specification lacking a method of encoding binary
request data, a non-standard ``encoding`` field is included in ``postData``,
which, similarly to the field of same name in HAR responses, has the value
``base64`` when the request body has been encoded as such.
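
To make the non-standard ``encoding`` field concrete, here is a minimal consumer-side sketch. It assumes only the behaviour described above: ``postData.text`` is plain ASCII unless ``encoding`` is ``base64``. The helper name ``decode_post_data`` and the sample dicts are hypothetical and are not part of Splash.

```python
import base64


def decode_post_data(post_data):
    """Return the raw request body bytes from a HAR ``postData`` dict.

    Hypothetical helper: it only relies on the non-standard ``encoding``
    field described in the documentation above.
    """
    text = post_data.get("text", "")
    if post_data.get("encoding") == "base64":
        # Binary (or non-form) bodies are stored base64-encoded.
        return base64.b64decode(text)
    # Without the field, the text was decoded from ASCII form data.
    return text.encode("ascii")


# Illustrative records, shaped like the postData values described above.
form = {"mimeType": "application/x-www-form-urlencoded", "text": "a=1&b=2"}
binary = {
    "mimeType": "application/octet-stream",
    "encoding": "base64",
    "text": base64.b64encode(b"\x00\xff").decode("ascii"),
}
```

Checking ``encoding`` rather than guessing from ``mimeType`` keeps the reader in step with whatever the producer actually did.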

Arguments for this endpoint are the same as for `render.html`_, plus the
following:

.. _arg-request-body-0:

request_body : int : optional
Possible values are ``1`` and ``0``. When ``request_body=1``,
request content is included in HAR records. Default is ``request_body=0``.
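
As an illustration of the argument described above, the following sketch builds a ``render.har`` URL with ``request_body=1``. The base address ``http://localhost:8050`` and the helper name ``render_har_url`` are assumptions made for this example; adjust them to your deployment.

```python
from urllib.parse import urlencode

# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = "http://localhost:8050"


def render_har_url(page_url, request_body=True, response_body=False):
    """Build a render.har URL; both body options are passed as 0 or 1."""
    params = {
        "url": page_url,
        "request_body": int(request_body),
        "response_body": int(response_body),
    }
    return SPLASH_URL + "/render.har?" + urlencode(params)
```

The resulting URL can be fetched with any HTTP client; no request is sent by the sketch itself.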


.. _arg-response-body-0:

response_body : int : optional
@@ -446,8 +458,17 @@ har : integer : optional
If this option is ON the result will contain the same data
as `render.har`_ provides under 'har' key.

By default, response content is not included. To enable it use
:ref:`'response_body' <arg-response-body>` option.
By default, request and response contents are not included. To enable each,
use :ref:`'request_body' <arg-request-body>` and
:ref:`'response_body' <arg-response-body>` options respectively.

.. _arg-request-body:

request_body : int : optional
Possible values are ``1`` and ``0``. When ``request_body=1``,
request content is included in HAR records. Default is
``request_body=0``. This option has no effect when
both :ref:`'har' <arg-har>` and :ref:`'history' <arg-history>` are 0.
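
The interaction between ``request_body``, ``har`` and ``history`` can be written as a small check. This is a sketch of the rule stated above, not code from Splash; the function name is hypothetical, and treating a missing argument as 0 is an assumption of this example.

```python
def request_body_has_effect(params):
    """Return True when request_body=1 can change the render.json output.

    Per the documentation above, the option is ignored when both
    ``har`` and ``history`` are 0. Missing keys are treated as 0 here,
    which is an assumption of this sketch.
    """
    wants_body = int(params.get("request_body", 0)) == 1
    har_on = int(params.get("har", 0)) == 1
    history_on = int(params.get("history", 0)) == 1
    return wants_body and (har_on or history_on)
```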

.. _arg-response-body:

26 changes: 23 additions & 3 deletions docs/scripting-ref.rst
@@ -149,6 +149,24 @@ Enable or disable browser plugins (e.g. Flash).
Plugins are disabled by default.


.. _splash-request-body-enabled:

splash.request_body_enabled
----------------------------

Enable or disable storage of request content.

**Signature:** ``splash.request_body_enabled = true/false``

By default Splash doesn't keep bodies of each request in memory. It means that
request content is not available in :ref:`splash-request-info` and in HAR_
exports. To make request content available to a Lua script set
``splash.request_body_enabled = true``.

Note that request body in :ref:`splash-request-info` is not available in the
callback :ref:`splash-on-response-headers` or in the request of the response
returned by :ref:`splash-http-get` and :ref:`splash-http-post`.
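
A possible way to try the option is to send a Lua script to Splash from Python. The sketch below only builds the JSON payload; the Lua script follows the ``splash:har()`` examples in the scripting reference, while the helper name ``execute_payload`` is hypothetical and posting the payload to an ``execute`` endpoint is left to the reader.

```python
import json

# Lua script that turns on request body storage before loading the page.
LUA_SOURCE = """
function main(splash, args)
    splash.request_body_enabled = true
    assert(splash:go(args.url))
    return {har=splash:har()}
end
"""


def execute_payload(page_url):
    """Return a JSON body carrying the script and its ``url`` argument."""
    return json.dumps({"lua_source": LUA_SOURCE, "url": page_url})
```

The option is set before ``splash:go`` so that it is already active when the page's requests are recorded.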

.. _splash-response-body-enabled:

splash.response_body_enabled
@@ -1453,12 +1471,14 @@ all existing logs and start recording from scratch:
return {har1=har1, har2=har2}
end
By default, response content is not returned in HAR data. To enable it, use
:ref:`splash-response-body-enabled` option or
By default, request and response contents are not included in HAR data. To
enable request contents, use :ref:`splash-request-body-enabled` option. To
enable response contents, use :ref:`splash-response-body-enabled` option or
:ref:`splash-request-enable-response-body` method.

See also: :ref:`splash-har-reset`, :ref:`splash-on-response`,
:ref:`splash-response-body-enabled`, :ref:`splash-request-enable-response-body`.
:ref:`splash-request-body-enabled`, :ref:`splash-response-body-enabled`,
:ref:`splash-request-enable-response-body`.

.. _HAR: http://www.softwareishard.com/blog/har-12-spec/

7 changes: 7 additions & 0 deletions splash/browser_tab.py
@@ -212,6 +212,7 @@ def _set_default_webpage_options(self, web_page):

self.set_js_enabled(True)
self.set_plugins_enabled(defaults.PLUGINS_ENABLED)
self.set_request_body_enabled(defaults.REQUEST_BODY_ENABLED)
self.set_response_body_enabled(defaults.RESPONSE_BODY_ENABLED)
self.set_indexeddb_enabled(defaults.INDEXEDDB_ENABLED)
self.set_webgl_enabled(defaults.WEBGL_ENABLED)
@@ -254,6 +255,12 @@ def set_custom_headers(self, headers):
"""
self.web_page.custom_headers = headers

def get_request_body_enabled(self):
return self.web_page.request_body_enabled

def set_request_body_enabled(self, val):
self.web_page.request_body_enabled = val

def get_response_body_enabled(self):
return self.web_page.response_body_enabled

3 changes: 3 additions & 0 deletions splash/defaults.py
@@ -85,6 +85,9 @@
# plugins (e.g. flash)
PLUGINS_ENABLED = False

# request content
REQUEST_BODY_ENABLED = False

# response content
RESPONSE_BODY_ENABLED = False

43 changes: 39 additions & 4 deletions splash/har/qt.py
@@ -134,19 +134,54 @@ def reply2har(reply, content=None):
res["content"]["size"] = len(content)
res["content"]["text"] = base64.b64encode(content).decode('latin1')
res["content"]["encoding"] = 'base64'

return res


def request2har(request, operation, outgoing_data=None):
def _har_postdata(body, content_type):
"""
Build the postData value for HAR, from a binary body and a content type.
"""

postdata = {"mimeType": content_type or "?"}

if content_type == "application/x-www-form-urlencoded":
# application/x-www-form-urlencoded is valid ASCII, see
# <https://url.spec.whatwg.org/#concept-urlencoded-serializer>.
try:
postdata["text"] = body.decode('ascii')
except UnicodeDecodeError:
pass

# This is non-standard. The HAR format does not specify how to handle
# binary request data.
if "text" not in postdata:
postdata["encoding"] = "base64"
postdata["text"] = base64.b64encode(body).decode('ascii')

return postdata


def request2har(request, operation, content=None):
""" Serialize QNetworkRequest to HAR. """
return {
har = {
"method": OPERATION_NAMES.get(operation, '?'),
"url": str(request.url().toString()),
"httpVersion": "HTTP/1.1",
"cookies": request_cookies2har(request),
"queryString": querystring2har(request.url()),
"headers": headers2har(request),
"headersSize": headers_size(request),
"bodySize": outgoing_data.size() if outgoing_data is not None else -1,
"bodySize": -1
}
if content is not None:
har["bodySize"] = len(content)
content_type = request.header(QNetworkRequest.ContentTypeHeader)
har["postData"] = _har_postdata(content, content_type)
else:
content_length = request.header(QNetworkRequest.ContentLengthHeader)
if content_length is not None:
har["bodySize"] = content_length
return har
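
The ``bodySize`` fallback in the new ``request2har`` can be isolated as a Qt-free sketch. The function name ``body_size`` and its plain-Python arguments are hypothetical; in the diff, the values come from the captured content and from the request's ``Content-Length`` header.

```python
def body_size(content=None, content_length=None):
    """Mirror the bodySize selection order used in the diff above.

    1. If the body was captured, use its actual length.
    2. Otherwise fall back to the Content-Length header, if present.
    3. Otherwise report -1, the HAR convention for "unknown".
    """
    if content is not None:
        return len(content)
    if content_length is not None:
        return content_length
    return -1
```

An empty captured body yields 0 rather than -1, because the check is ``is not None`` and not a truthiness test.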
12 changes: 6 additions & 6 deletions splash/har_builder.py
@@ -52,7 +52,7 @@ def get_entry(self, req_id):
entry = self.log.get_mutable_entry(req_id)
return entry

def _initial_entry_data(self, start_time, operation, request, outgoingData):
def _initial_entry_data(self, start_time, operation, request, content):
"""
Return initial values for a new HAR entry.
"""
@@ -62,14 +62,13 @@ def _initial_entry_data(self, start_time, operation, request, outgoingData):
'start_time': start_time,
'request_start_sending_time': start_time,
'request_sent_time': start_time,
'response_start_time': start_time,
# 'outgoingData': outgoingData,
'response_start_time': start_time
},
'_splash_processing_state': self.REQUEST_CREATED,

# standard fields
"startedDateTime": format_datetime(start_time),
"request": request2har(request, operation, outgoingData),
"request": request2har(request, operation, content),
"response": {
"bodySize": -1,
},
@@ -98,7 +97,8 @@ def store_url(self, url):
def store_timing(self, name):
self.log.store_timing(name)

def store_new_request(self, req_id, start_time, operation, request, outgoingData):
def store_new_request(self, req_id, start_time, operation, request,
content):
"""
Store information about a new QNetworkRequest.
"""
@@ -107,7 +107,7 @@ def store_new_request(self, req_id, start_time, operation, request, outgoingData
start_time=start_time,
operation=operation,
request=request,
outgoingData=outgoingData
content=content
))

def store_new_reply(self, req_id, reply):
15 changes: 13 additions & 2 deletions splash/kernel/inspections/splash-auto.json
@@ -65,6 +65,17 @@
"details": "Plugins are disabled by default.",
"params": null
},
"splash.request_body_enabled": {
"name": null,
"header": "splash.request_body_enabled",
"content": "Enable or disable storage of request content.\n\n**Signature:** ``splash.request_body_enabled = true/false``\n\nBy default Splash doesn't keep bodies of each request in memory. It means that\nrequest content is not available in :ref:`splash-request-info` and in HAR_\nexports. To make request content available to a Lua script set\n``splash.request_body_enabled = true``.\n\nNote that request data in :ref:`splash-request-info` is not available in the\ncallback :ref:`splash-on-response-headers` or in the request of the response\nreturned by :ref:`splash-http-get` and :ref:`splash-http-post`.",
"short": "Enable or disable storage of request content.",
"signature": "splash.request_body_enabled = true/false",
"returns": null,
"async": null,
"details": "By default Splash doesn't keep bodies of each request in memory. It means that\nrequest content is not available in :ref:`splash-request-info` and in HAR_\nexports. To make request content available to a Lua script set\n``splash.request_body_enabled = true``.\n\nNote that request data in :ref:`splash-request-info` is not available in the\ncallback :ref:`splash-on-response-headers` or in the request of the response\nreturned by :ref:`splash-http-get` and :ref:`splash-http-post`.",
"params": null
},
"splash.response_body_enabled": {
"name": null,
"header": "splash.response_body_enabled",
@@ -299,12 +310,12 @@
"splash:har": {
"name": "har",
"header": "splash:har",
"content": "**Signature:** ``har = splash:har{reset=false}``\n\n**Parameters:**\n\n* reset - optional; when ``true``, reset HAR records after taking a snapshot.\n\n**Returns:** information about pages loaded, events happened,\nnetwork requests sent and responses received in HAR_ format.\n\n**Async:** no.\n\nUse :ref:`splash-har` to get information about network requests and\nother Splash activity.\n\nIf your script returns the result of ``splash:har()`` in a top-level\n``\"har\"`` key then Splash UI will give you a nice diagram with network\ninformation (similar to \"Network\" tabs in Firefox or Chrome developer tools):\n\n.. code-block:: lua\n\n function main(splash)\n assert(splash:go(splash.args.url))\n return {har=splash:har()}\n end\n\nBy default, when several requests are made (e.g. :ref:`splash-go` is called\nmultiple times), HAR data is accumulated and combined into a single object\n(logs are still grouped by page).\n\nIf you want only updated information use ``reset`` parameter: it drops\nall existing logs and start recording from scratch:\n\n.. code-block:: lua\n\n function main(splash, args)\n assert(splash:go(args.url1))\n local har1 = splash:har{reset=true}\n assert(splash:go(args.url2))\n local har2 = splash:har()\n return {har1=har1, har2=har2}\n end\n\nBy default, response content is not returned in HAR data. To enable it, use\n:ref:`splash-response-body-enabled` option or\n:ref:`splash-request-enable-response-body` method.\n\nSee also: :ref:`splash-har-reset`, :ref:`splash-on-response`,\n:ref:`splash-response-body-enabled`, :ref:`splash-request-enable-response-body`.\n\n.. _HAR: http://www.softwareishard.com/blog/har-12-spec/",
"content": "**Signature:** ``har = splash:har{reset=false}``\n\n**Parameters:**\n\n* reset - optional; when ``true``, reset HAR records after taking a snapshot.\n\n**Returns:** information about pages loaded, events happened,\nnetwork requests sent and responses received in HAR_ format.\n\n**Async:** no.\n\nUse :ref:`splash-har` to get information about network requests and\nother Splash activity.\n\nIf your script returns the result of ``splash:har()`` in a top-level\n``\"har\"`` key then Splash UI will give you a nice diagram with network\ninformation (similar to \"Network\" tabs in Firefox or Chrome developer tools):\n\n.. code-block:: lua\n\n function main(splash)\n assert(splash:go(splash.args.url))\n return {har=splash:har()}\n end\n\nBy default, when several requests are made (e.g. :ref:`splash-go` is called\nmultiple times), HAR data is accumulated and combined into a single object\n(logs are still grouped by page).\n\nIf you want only updated information use ``reset`` parameter: it drops\nall existing logs and start recording from scratch:\n\n.. code-block:: lua\n\n function main(splash, args)\n assert(splash:go(args.url1))\n local har1 = splash:har{reset=true}\n assert(splash:go(args.url2))\n local har2 = splash:har()\n return {har1=har1, har2=har2}\n end\n\nBy default, request and response contents are not included in HAR data. To\nenable request contents, use :ref:`splash-request-body-enabled` option. To\nenable response contents, use :ref:`splash-response-body-enabled` option or\n:ref:`splash-request-enable-response-body` method.\n\nSee also: :ref:`splash-har-reset`, :ref:`splash-on-response`,\n:ref:`splash-request-body-enabled`, :ref:`splash-response-body-enabled`,\n:ref:`splash-request-enable-response-body`.\n\n.. _HAR: http://www.softwareishard.com/blog/har-12-spec/",
"short": "",
"signature": "har = splash:har{reset=false}",
"returns": "information about pages loaded, events happened,\nnetwork requests sent and responses received in HAR_ format.",
"async": "no.",
"details": "Use :ref:`splash-har` to get information about network requests and\nother Splash activity.\n\nIf your script returns the result of ``splash:har()`` in a top-level\n``\"har\"`` key then Splash UI will give you a nice diagram with network\ninformation (similar to \"Network\" tabs in Firefox or Chrome developer tools):\n\n.. code-block:: lua\n\n function main(splash)\n assert(splash:go(splash.args.url))\n return {har=splash:har()}\n end\n\nBy default, when several requests are made (e.g. :ref:`splash-go` is called\nmultiple times), HAR data is accumulated and combined into a single object\n(logs are still grouped by page).\n\nIf you want only updated information use ``reset`` parameter: it drops\nall existing logs and start recording from scratch:\n\n.. code-block:: lua\n\n function main(splash, args)\n assert(splash:go(args.url1))\n local har1 = splash:har{reset=true}\n assert(splash:go(args.url2))\n local har2 = splash:har()\n return {har1=har1, har2=har2}\n end\n\nBy default, response content is not returned in HAR data. To enable it, use\n:ref:`splash-response-body-enabled` option or\n:ref:`splash-request-enable-response-body` method.\n\nSee also: :ref:`splash-har-reset`, :ref:`splash-on-response`,\n:ref:`splash-response-body-enabled`, :ref:`splash-request-enable-response-body`.\n\n.. _HAR: http://www.softwareishard.com/blog/har-12-spec/",
"details": "Use :ref:`splash-har` to get information about network requests and\nother Splash activity.\n\nIf your script returns the result of ``splash:har()`` in a top-level\n``\"har\"`` key then Splash UI will give you a nice diagram with network\ninformation (similar to \"Network\" tabs in Firefox or Chrome developer tools):\n\n.. code-block:: lua\n\n function main(splash)\n assert(splash:go(splash.args.url))\n return {har=splash:har()}\n end\n\nBy default, when several requests are made (e.g. :ref:`splash-go` is called\nmultiple times), HAR data is accumulated and combined into a single object\n(logs are still grouped by page).\n\nIf you want only updated information use ``reset`` parameter: it drops\nall existing logs and start recording from scratch:\n\n.. code-block:: lua\n\n function main(splash, args)\n assert(splash:go(args.url1))\n local har1 = splash:har{reset=true}\n assert(splash:go(args.url2))\n local har2 = splash:har()\n return {har1=har1, har2=har2}\n end\n\nBy default, request and response contents are not included in HAR data. To\nenable request contents, use :ref:`splash-request-body-enabled` option. To\nenable response contents, use :ref:`splash-response-body-enabled` option or\n:ref:`splash-request-enable-response-body` method.\n\nSee also: :ref:`splash-har-reset`, :ref:`splash-on-response`,\n:ref:`splash-request-body-enabled`, :ref:`splash-response-body-enabled`,\n:ref:`splash-request-enable-response-body`.\n\n.. _HAR: http://www.softwareishard.com/blog/har-12-spec/",
"params": "* reset - optional; when ``true``, reset HAR records after taking a snapshot."
},
"splash:har_reset": {