Read HTML content

Question

1

asked 2016-12-18 19:03:35 +0200

alko89
131 ●5 ●7 ●10

updated 2016-12-18 19:06:08 +0200

I want to develop an app for Sailfish and I need to read some data from a website. What should I use to write a simple crawler?

I'm reading the website using QNetworkRequest, but I don't know what to use to parse the HTML. This is what I have now:

void CCrawler::replyFinished(QNetworkReply* pReply) {
    QByteArray data = pReply->readAll();
    QString str(data);

    //QWebPage page;
    QWebFrame *frame = new QWebFrame();

    frame->mainFrame()->setHtml(str);

    QWebElement document = frame->documentElement();
    QWebElementCollection elements = document.findAll("a");
    foreach (QWebElement element, elements)
        qDebug() << element.toInnerXml();
}

But I get an error:

invalid use of incomplete type class QWebFrame


Comments

I made an app for reading my country's currency information. It's written in Python; this is how I read the HTML:

I could not post this as code. How do I do that?

import urllib2  # Python 2
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

url = "https://www.nbg.gov.ge/index.php?m=582"
html_page = urllib2.urlopen(url)
soup = BeautifulSoup(html_page)
table = soup.find("table", border="0", style="width:100%;")

cur = []
usd = []
pas = []
aiw = []
nishani = ""

# Collect the first cell of every row except the header row.
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    cur.append(col[0])

# Keep both the row elements and their text.
for line in cur[2].findAll('tr'):
    usd.append(line)
    pas.append(line.text)

for line in cur[2].findAll('img'):
    aiw.append(line)

notify = pas[-3]
raodenoba = str(notify[21:])

AnonUser10082 ( 2016-12-19 10:23:56 +0200 )

Thanks, but I already have a crawler in Python for this. I just want to do the same in C++.

alko89 ( 2016-12-19 11:31:56 +0200 )

Hey, can this kind of crawler crawl how-to pages like those at knowhownonprofit.org?

hivy ( 2017-07-28 12:33:07 +0200 )


Answer 1

1

answered 2016-12-19 15:43:57 +0200

Aldrog

196 ●5 ●9 ●13

https://github.com/Aldrog

Judging by the error, I assume you forgot to include the QWebFrame header: #include <QWebFrame>. You also have to add QT += webkitwidgets to your .pro file.

However, I'm not sure whether it's allowed in Harbour, or whether it's a good solution in general (QWebFrame seems to be tied to visual representation). My suggestion is to use QXmlStreamReader instead:

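// Requires #include <QXmlStreamReader> and #include <QDebug>.
QXmlStreamReader xml(pReply);
while (!xml.atEnd()) {
    xml.readNext();
    // Match start tags only; name() is also set on end tags.
    if (xml.isStartElement() && xml.name() == QLatin1String("a")) {
        // Reads the element's text and advances past its end tag.
        qDebug() << xml.readElementText(QXmlStreamReader::IncludeChildElements);
    }
}
// Input that is not well-formed XML ends the loop with an error.
if (xml.hasError())
    qDebug() << xml.errorString();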


Comments

QXmlStreamReader does not return any <a> elements for the website I want to parse. You can try it out yourself: http://capoeiralyrics.info/

alko89 ( 2016-12-19 18:42:39 +0200 )

Well, I forgot that HTML is not well-formed XML and does not necessarily have all tags closed. I'm not sure how this can be resolved.

Aldrog ( 2016-12-19 22:10:18 +0200 )
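
To illustrate the point about unclosed tags, here is a minimal sketch, using only the C++ standard library, of the kind of balance check a strict XML parser effectively enforces. The function name tagsBalanced is hypothetical, and the regex-based tokenizer is a deliberate simplification (it ignores comments, CDATA sections and attribute values containing '>'), so treat it as an illustration rather than a validator.

```cpp
#include <regex>
#include <stack>
#include <string>

// Hypothetical helper: returns true only if every opening tag is closed
// in the right order, which is the minimum a strict XML parser expects.
bool tagsBalanced(const std::string& markup) {
    // Captures: 1 = "/" for a closing tag, 2 = tag name,
    //           3 = "/" for a self-closing tag such as <br/>.
    static const std::regex tag(R"(<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*?(/?)>)");
    std::stack<std::string> open;
    for (std::sregex_iterator it(markup.begin(), markup.end(), tag), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const std::string name = m[2].str();
        if (m[1].length() != 0) {
            // Closing tag: it must match the most recently opened tag.
            if (open.empty() || open.top() != name) return false;
            open.pop();
        } else if (m[3].length() == 0) {
            // Opening tag that is not self-closing.
            open.push(name);
        }
    }
    return open.empty();
}
```

Under these assumptions, <ul><li>one</li><li>two</li></ul> passes, while <ul><li>one<li>two</ul> and <p>line<br>next</p> are rejected, even though browsers accept both. That is the same class of input that makes a strict XML reader stop with an error.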

You can try QXmlQuery, it's said to also work with "non-XML data that has been modeled to look like XML" which I assume covers HTML.

Aldrog ( 2016-12-19 22:34:15 +0200 )

2

I think using HTML Tidy/libtidy (C library) to convert from HTML to XML, and then treating it as XHTML could be an option?

bruce_one ( 2016-12-21 04:30:48 +0200 )

Sorry for the late reply, I didn't have the time. Anyway, I ended up using QRegularExpression. For such a simple site, I didn't find it worth including an additional library.

alko89 ( 2016-12-27 00:12:17 +0200 )
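
The thread does not show the regular expression that was actually used, so the following is only a sketch of the general idea. It uses std::regex from the C++ standard library as a stand-in for QRegularExpression so that it builds without Qt. The function name extractLinks and the pattern itself are assumptions. It only handles double-quoted href attributes and will be fooled by nested or malformed anchors, which is the usual trade-off of parsing HTML with a regex.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (href, inner HTML) pairs for every
// <a ... href="..."> ... </a> found in the page source.
std::vector<std::pair<std::string, std::string>>
extractLinks(const std::string& html) {
    // [\s\S] is used instead of '.' because '.' does not match newlines
    // in the default ECMAScript grammar; icase also accepts <A HREF=...>.
    static const std::regex anchor(
        R"rx(<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)</a>)rx",
        std::regex::icase);

    std::vector<std::pair<std::string, std::string>> links;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end;
         it != end; ++it) {
        // Group 1 is the href value, group 2 the content between the tags.
        links.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return links;
}
```

In a Qt app, the string returned by QNetworkReply::readAll() would be the input. A similar pattern could presumably be given to QRegularExpression, but that port is not shown or tested here.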
