Extraction Rules
Customize your response by adding extraction rules.
WebScrapingAPI allows you to extract specific sections of a webpage by using the extract_rules parameter.
This parameter's value can be a string (the CSS selector or XPath) or a stringified object. In the second case, the parameter accepts the following options:
selector
Required
string
The CSS selector or the XPath.
selector_type
string
The type of the selector option. Accepted values are css and xpath. The default value is xpath if the selector option starts with /, and css otherwise.
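The default described above can be mirrored on the client side when rules are built programmatically. The following is a minimal Python sketch; the helper name infer_selector_type is hypothetical and is not part of WebScrapingAPI.

```python
def infer_selector_type(selector: str) -> str:
    """Mirror the documented default: xpath if the selector starts with '/', css otherwise."""
    # Hypothetical helper for illustration only; the API applies this rule itself.
    return "xpath" if selector.startswith("/") else "css"


print(infer_selector_type("//h1"))
print(infer_selector_type(".title"))
```

Under this rule, an XPath expression that does not start with / (for example one wrapped in parentheses) would be treated as css, so setting selector_type explicitly is the safer choice in such cases.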
output
string
The output format of the selected element. Accepted values are:
- html - returns HTML format
- text - (default) returns text format
- @[attr] - returns the attribute of the element
- table_json - returns the JSON format of a table
- table_array - returns the array format of a table
- another extract_rules object - used to parse nested elements.
all
int
Returns all matching elements when set to 1, or only the first matching element when set to 0. The default value for this parameter is "1".
clean
int
Removes leading and trailing white spaces, line terminator characters, and newlines from the result. The default value for this parameter is "1".
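The options and defaults listed above can be made explicit before a rule set is serialized. The following Python sketch expands the string shorthand into the object form and fills in the documented defaults for output, all, and clean. The names DEFAULTS, normalize_rule, and stringify_rules are hypothetical, and the sketch assumes that sending a default value explicitly behaves the same as omitting it.

```python
import json

# Defaults as documented above; selector_type is left for the API to infer.
DEFAULTS = {"output": "text", "all": 1, "clean": 1}


def normalize_rule(rule):
    """Expand a shorthand string rule into the object form with documented defaults."""
    if isinstance(rule, str):
        rule = {"selector": rule}
    if "selector" not in rule:
        raise ValueError("'selector' is required")
    full = {**DEFAULTS, **rule}
    # A nested extract_rules object used as 'output' is normalized recursively.
    if isinstance(full["output"], dict):
        full["output"] = {name: normalize_rule(r) for name, r in full["output"].items()}
    return full


def stringify_rules(rules):
    """Serialize a {name: rule} mapping into the string passed as extract_rules."""
    return json.dumps({name: normalize_rule(r) for name, r in rules.items()})


print(stringify_rules({"title": ".title"}))
```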
A full example of how this parameter would look in production is:
or:
Extraction Rules Integration Examples
Extract Content Based on CSS Rules
GET
https://api.webscrapingapi.com/v2
The following example shows how the extract_rules parameter is used to extract specific elements from the targeted website.
Query Parameters
api_key*
String
<YOUR_API_KEY>
url*
String
https://webscrapingapi.com
extract_rules
Object
{
"title": {
"selector": "h1",
"output": "html"
},
"subtitle": {
"selector": ".font-light.max-w-6x",
"output": "text"
}
}
The full GET request for the extract_rules should be:
Important! The url & extract_rules parameters have to be encoded (e.g. &url=https%3A%2F%2Fwww.webscrapingapi.com%2F&extract_rules=%7B%22title%22%3A%20%7B%22selector... ).
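The encoding requirement above can be handled with the Python standard library. The sketch below builds an encoded GET URL from the endpoint and query parameters shown in this section. The function name build_request_url is hypothetical, <YOUR_API_KEY> is a placeholder, and the request itself is not sent.

```python
import json
from urllib.parse import quote, urlencode

API_ENDPOINT = "https://api.webscrapingapi.com/v2"


def build_request_url(api_key, target_url, extract_rules):
    """Return the GET URL with url and extract_rules percent-encoded."""
    params = {
        "api_key": api_key,
        "url": target_url,
        # extract_rules is sent as a stringified JSON object.
        "extract_rules": json.dumps(extract_rules),
    }
    # quote_via=quote encodes spaces as %20; urlencode's default safe='' also encodes '/'.
    return f"{API_ENDPOINT}?{urlencode(params, quote_via=quote)}"


rules = {
    "title": {"selector": "h1", "output": "html"},
    "subtitle": {"selector": ".font-light.max-w-6x", "output": "text"},
}
request_url = build_request_url("<YOUR_API_KEY>", "https://webscrapingapi.com", rules)
print(request_url)
```

The resulting string can be passed to any HTTP client; a valid API key is needed for the request to succeed.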
More extract_rules object examples
Here are more examples that should help you better understand what the object passed to the extract_rules parameter should look like:
<div class="title">
This is my title
</div>
{"title": ".title"}
Return the text content of the elements having the CSS class .title
{
"title": [
"This is my title"
]
}
<div>
<a href="https://www.webscrapingapi.com/product/">
Product
</a>
<a href="https://www.webscrapingapi.com/pricing/">
Pricing
</a>
</div>
{
"links": {
"selector": "a",
"output": "@href",
"all": "1"
}
}
Return the href attribute of all links on the page
{
"links": [
"https://www.webscrapingapi.com/product/","https://www.webscrapingapi.com/pricing/"
]
}
<div>
<img src="https://www.webscrapingapi.com/assets/images/icons/full.svg?v=41d081a6f0"
>
</div>
{
"image": {
"selector": "img",
"output": "@src",
"all": 0
}
}
Return the src attribute of the first image available on the page
{
"image": [
"https://www.webscrapingapi.com/assets/images/icons/full.svg?v=41d081a6f0"
]
}
<table class="ants">
<thead>
<tr>
<th>Region</th>
<th>No. species</th>
</tr>
</thead>
<tbody>
<tr>
<td>Europe</td>
<td>180</td>
</tr>
</tbody>
</table>
{
"table": {
"selector": ".ants",
"output": "table_json",
"all": 0
}
}
Return the JSON format of the first table having the CSS class .ants
{
"table": [
{
"Region": "Europe",
"No. species": "180"
}
]
}
<table class="ants">
<thead>
<tr>
<th>Region</th>
<th>No. species</th>
</tr>
</thead>
<tbody>
<tr>
<td>Europe</td>
<td>180</td>
</tr>
</tbody>
</table>
{
"table": {
"selector": ".ants",
"output": "table_array",
"all": 0
}
}
Return the array format of the first table having the CSS class .ants
{
"table": [
["Europe", "180"]
]
}
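The difference between the two table outputs can be illustrated locally. The Python sketch below reproduces the shapes shown in the two examples above from a header row and data rows. It is inferred only from those examples; the function names are hypothetical, and how the API treats tables without a header row or with merged cells is not covered here.

```python
def to_table_json(headers, rows):
    """One object per row, keyed by the header cells (shape of table_json)."""
    return [dict(zip(headers, row)) for row in rows]


def to_table_array(rows):
    """One list per row, without header keys (shape of table_array)."""
    return [list(row) for row in rows]


headers = ["Region", "No. species"]
rows = [["Europe", "180"]]

print(to_table_json(headers, rows))
print(to_table_array(rows))
```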
<ul>
<li>
<p class="name">Item1</p>
<p class="price">100</p>
</li>
<li>
<p class="name">Item2</p>
<p class="price">1000</p>
</li>
</ul>
{
"items": {
"selector": "li",
"output": {
"name": {
"selector": ".name",
"all": 0
},
"price": {
"selector": ".price",
"all": 0
}
},
"all": 1
}
}
Return the name and the price of each list item.
{
"items": [
{
"name": "Item1",
"price": "100"
},
{
"name": "Item2",
"price": "1000"
}
]
}
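A response shaped like the nested example above can be consumed as ordinary JSON. In the Python sketch below, the response body is hard-coded from that example rather than fetched from the API, and the variable names are illustrative. Extracted values arrive as strings, so numeric fields are converted explicitly.

```python
import json

# Response body copied from the nested example above (not fetched from the API).
response_body = """
{
  "items": [
    {"name": "Item1", "price": "100"},
    {"name": "Item2", "price": "1000"}
  ]
}
"""

data = json.loads(response_body)

# Map each item name to its price as an integer.
prices = {item["name"]: int(item["price"]) for item in data["items"]}
print(prices)
```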