Getting HTML inside JSON in Scrapy
Scraping a page that is either HTML or JSON with Scrapy (or Requests and/or BeautifulSoup4, for that matter) is rather straightforward.
## JSON
response_json = response.json()

## HTML
## Subsequent manipulation is done with XPath
response_html = response

Then, how about HTML inside a JSON document?
Selector() to the rescue
Let's say a scraped JSON document has a key named html whose value is a string of HTML markup. How can we parse it so that the HTML tags can be queried with XPath?
Scrapy's Selector() can be applied to that value of the JSON object.
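import scrapy

response_json = response.json()
## Replace the raw HTML string with a Selector that supports XPath
response_json['html'] = scrapy.Selector(text=response_json['html'], type='html')

Here, when using Selector(), be sure to pass the HTML string with the text= keyword. Otherwise an error might occur, because the first positional parameter of Selector() is a response object, not a string. The type argument, on the other hand, is optional.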