{"id":34878,"date":"2024-08-15T11:45:25","date_gmt":"2024-08-15T04:45:25","guid":{"rendered":"http:\/\/jupitek.maudemo.vip\/index.php\/2024\/08\/15\/scrape-a-website-with-beautiful-soup\/"},"modified":"2024-08-15T11:45:25","modified_gmt":"2024-08-15T04:45:25","slug":"scrape-a-website-with-beautiful-soup","status":"publish","type":"post","link":"https:\/\/jupitek.maudemo.vip\/index.php\/2024\/08\/15\/scrape-a-website-with-beautiful-soup\/","title":{"rendered":"Scrape a Website with Beautiful Soup"},"content":{"rendered":"<h2 id=\"what-is-beautiful-soup\">What is Beautiful Soup?<\/h2>\n<p><a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/\" target=\"_blank\" rel=\"noreferrer noopener\">Beautiful Soup<\/a> is a Python library that parses HTML or XML documents into a tree structure that makes it easy to find and extract data. It is often used for scraping data from websites.<\/p>\n<p>Beautiful Soup features a simple, Pythonic interface and automatic encoding conversion to make it easy to work with website data.<\/p>\n<p>Web pages are structured documents, and Beautiful Soup gives you the tools to walk through that complex structure and extract the pieces of information you need. 
In this guide, you will write a Python script that scrapes Craigslist for motorcycle prices. The script will be set up to run at regular intervals using a cron job, and the resulting data will be exported to an Excel spreadsheet for trend analysis. You can easily adapt these steps to other websites or search queries by substituting different URLs and adjusting the script accordingly.<\/p>\n<h2 id=\"install-beautiful-soup\">Install Beautiful Soup<\/h2>\n<h3 id=\"install-python\">Install Python<\/h3>\n<p>1. Download and install Miniconda:<\/p>\n<pre class=\"wp-block-code\"><code>curl -OL https:\/\/repo.continuum.io\/miniconda\/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh<\/code><\/pre>\n<p>2. You will be prompted several times during the installation process. 
Review the terms and conditions and select &#8220;yes&#8221; for each prompt.<\/p>\n<p>3. Restart your shell session for the changes to your PATH to take effect.<\/p>\n<p>4. Check your Python version:<\/p>\n<pre class=\"wp-block-code\"><code>python --version\n<\/code><\/pre>\n<h3 id=\"install-beautiful-soup-and-dependencies\">Install Beautiful Soup and Dependencies<\/h3>\n<p>1. Update your system:<\/p>\n<pre class=\"wp-block-code\"><code>sudo apt update &amp;&amp; sudo apt upgrade<\/code><\/pre>\n<p>2. Install the latest version of Beautiful Soup using pip:<\/p>\n<pre class=\"wp-block-code\"><code>pip install beautifulsoup4<\/code><\/pre>\n<p>3. Install the dependencies:<\/p>\n<pre class=\"wp-block-code\"><code>pip install tinydb urllib3 xlsxwriter lxml<\/code><\/pre>\n<h2 id=\"build-a-web-scraper\">Build a Web Scraper<\/h2>\n<h3 id=\"required-modules\">Required Modules<\/h3>\n<p>The <code>BeautifulSoup<\/code> class from <code>bs4<\/code> handles the parsing of the web pages. The <code>datetime<\/code> module provides for the manipulation of dates. <code>tinydb<\/code> provides 
an API for a NoSQL database, and the <code>urllib3<\/code> module is used for making HTTP requests. Finally, the <code>xlsxwriter<\/code> API is used to create an Excel spreadsheet.<\/p>\n<p>Open <code>craigslist.py<\/code> in a text editor and add the necessary import statements:<\/p>\n<pre class=\"wp-block-code\"><code>from bs4 import BeautifulSoup\nimport datetime\nfrom tinydb import TinyDB, Query\nimport urllib3\nimport xlsxwriter<\/code><\/pre>\n<h3 id=\"add-global-variables\">Add Global Variables<\/h3>\n<p>After the import statements, add global variables and configuration options:<\/p>\n<pre class=\"wp-block-code\"><code>urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n\nurl = 'https:\/\/elpaso.craigslist.org\/search\/mcy?sort=date'\ntotal_added = 0<\/code><\/pre>\n<p><code>url<\/code> stores the URL of the webpage to be scraped, and <code>total_added<\/code> keeps track of the total number of results added to the database. The call to <code>urllib3.disable_warnings()<\/code> suppresses the urllib3 warning about unverified SSL certificates.<\/p>\n<h3 id=\"retrieve-the-webpage\">Retrieve the Webpage<\/h3>\n<p>The <code>make_soup<\/code> function makes a GET request to the target URL 
and converts the resulting HTML into a BeautifulSoup object:<\/p>\n<pre class=\"wp-block-code\"><code>def make_soup(url):\n    http = urllib3.PoolManager()\n    r = http.request(\"GET\", url)\n    return BeautifulSoup(r.data,'lxml')<\/code><\/pre>\n<p>The <code>urllib3<\/code> library has excellent exception handling; if <code>make_soup<\/code> throws any errors, check the <a href=\"https:\/\/urllib3.readthedocs.io\/en\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">urllib3 docs<\/a> for detailed information.<\/p>\n<p>Beautiful Soup has several parsers available, which are more or less strict about how the webpage is structured. The <em>lxml<\/em> parser is sufficient for the example script in this guide, but depending on your needs you may want to check the other options described in the <a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\" target=\"_blank\" rel=\"noreferrer noopener\">official documentation<\/a>.<\/p>\n<h3 id=\"process-the-soup-object\">Process the Soup Object<\/h3>\n<p>An object of class <code>BeautifulSoup<\/code> is organized in a tree structure. 
To access the data you are interested in, you need to be familiar with how the data is organized in the original HTML document. Go to the initial website in a browser, right-click, and select <strong>View page source<\/strong> (or <strong>Inspect<\/strong>, depending on your browser) to review the structure of the data that you would like to scrape:<\/p>\n<pre class=\"wp-block-code\"><code>&lt;li class=\"result-row\" data-pid=\"6370204467\"&gt;\n  &lt;a href=\"https:\/\/elpaso.craigslist.org\/mcy\/d\/ducati-diavel-dark\/6370204467.html\" class=\"result-image gallery\" data-ids=\"1:01010_8u6vKIPXEsM,1:00y0y_4pg3Rxry2Lj,1:00F0F_2mAXBoBiuTS\"&gt;\n    &lt;span class=\"result-price\"&gt;$12791&lt;\/span&gt;\n  &lt;\/a&gt;\n  &lt;p class=\"result-info\"&gt;\n    &lt;span class=\"icon icon-star\" role=\"button\"&gt;\n    &lt;span class=\"screen-reader-text\"&gt;favorite this post&lt;\/span&gt;\n    &lt;\/span&gt;\n    &lt;time class=\"result-date\" datetime=\"2017-11-01 19:38\" title=\"Wed 01 Nov 07:38:13 PM\"&gt;Nov  1&lt;\/time&gt;\n    &lt;a href=\"https:\/\/elpaso.craigslist.org\/mcy\/d\/ducati-diavel-dark\/6370204467.html\" data-id=\"6370204467\" class=\"result-title hdrlnk\"&gt;Ducati Diavel | Dark&lt;\/a&gt;\n    &lt;span class=\"result-meta\"&gt;\n            &lt;span class=\"result-price\"&gt;$12791&lt;\/span&gt;\n            &lt;span class=\"result-tags\"&gt;\n            pic\n            &lt;span class=\"maptag\" data-pid=\"6370204467\"&gt;map&lt;\/span&gt;\n            &lt;\/span&gt;\n            &lt;span class=\"banish icon icon-trash\" role=\"button\"&gt;\n            &lt;span class=\"screen-reader-text\"&gt;hide this posting&lt;\/span&gt;\n            &lt;\/span&gt;\n    &lt;span 
class=\"unbanish icon icon-trash red\" role=\"button\" aria-hidden=\"true\"&gt;&lt;\/span&gt;\n    &lt;a href=\"#\" class=\"restore-link\"&gt;\n            &lt;span class=\"restore-narrow-text\"&gt;restore&lt;\/span&gt;\n            &lt;span class=\"restore-wide-text\"&gt;restore this posting&lt;\/span&gt;\n    &lt;\/a&gt;\n    &lt;\/span&gt;\n  &lt;\/p&gt;\n&lt;\/li&gt;<\/code><\/pre>\n<p>3. Select the web page snippets by selecting just the <strong>li<\/strong> HTML tags, and narrow the choices further by selecting only those <strong>li<\/strong> tags that have a class of <strong>result-row<\/strong>. The <strong>results<\/strong> variable contains all the web page snippets that match these criteria:<\/p>\n<pre class=\"wp-block-code\"><code>results = soup.find_all(\"li\", class_=\"result-row\")\n<\/code><\/pre>\n<p>Attributes of an element are read with array notation; for example, the <strong>pid<\/strong> field of each record is set with <code>'pid': result&#91;'data-pid']<\/code>.<\/p>\n<p>4. Other data attributes may be nested deeper in the HTML structure and can be accessed using a combination of dot and array notation. For example, the date a result was posted is stored in <code>datetime<\/code>, which is a data attribute of the <code>time<\/code> element, which is a child of a <code>p<\/code> tag that is a child of <code>result<\/code>. 
To access this value, use the following format:<\/p>\n<pre class=\"wp-block-code\"><code>'date': result.p.time&#91;'datetime']\n<\/code><\/pre>\n<p>5. Sometimes the information needed is the tag content (in between the start and end tags). To access the tag content, Beautiful Soup provides the <code>string<\/code> attribute:<\/p>\n<pre class=\"wp-block-code\"><code>&lt;span class=\"result-price\"&gt;$12791&lt;\/span&gt;<\/code><\/pre>\n<p>can be accessed with:<\/p>\n<pre class=\"wp-block-code\"><code>'cost': clean_money(result.a.span.string.strip())<\/code><\/pre>\n<p>The value is further processed here with Python's <code>strip()<\/code> string method, as well as a custom function, <code>clean_money<\/code>, that removes the dollar sign.<\/p>\n<p>6. Most items for sale on Craigslist include pictures of the item. The custom function <code>clean_pic<\/code> is used to assign the first picture's URL to <strong>pic<\/strong>:<\/p>\n<pre class=\"wp-block-code\"><code>'pic': clean_pic(result.a&#91;'data-ids'])<\/code><\/pre>\n<p>7. Metadata can be added to the record. 
For example, you can add a field to track when a particular record was created:<\/p>\n<pre class=\"wp-block-code\"><code>'createdt': datetime.datetime.now().isoformat()<\/code><\/pre>\n<p>8. Use the Query object to check whether a record already exists in the database before inserting it. This avoids creating duplicate records.<\/p>\n<pre class=\"wp-block-code\"><code>Result = Query()\ns1 = db.search(Result.pid == rec&#91;\"pid\"])\n\nif not s1:\n    total_added += 1\n    print (\"Adding ... \", total_added)\n    db.insert(rec)<\/code><\/pre>\n<h3 id=\"error-handling\">Error Handling<\/h3>\n<p>Two types of errors are important to handle. These are not errors in the script, but errors in the structure of the snippet that cause Beautiful Soup's API to throw an error.<\/p>\n<p>An <code>AttributeError<\/code> is thrown when dot notation does not find the requested child tag under the current HTML tag: the lookup returns <code>None<\/code>, and the next attribute access on it fails. For example, if a particular snippet does not have the anchor tag, then the <strong>cost<\/strong> key will throw an error, because it traverses through, and therefore requires, the anchor tag.<\/p>\n<p>The other error is a <code>KeyError<\/code>. 
It is thrown if a required HTML tag attribute is missing. For example, if there is no <strong>data-pid<\/strong> attribute in a snippet, the <strong>pid<\/strong> key will throw an error.<\/p>\n<p>If either of these errors occurs when parsing a result, that result is skipped to ensure that a malformed snippet isn't inserted into the database:<\/p>\n<pre class=\"wp-block-code\"><code>except (AttributeError, KeyError) as ex:\n    pass<\/code><\/pre>\n<h3 id=\"cleaning-functions\">Cleaning Functions<\/h3>\n<p>These are two short custom functions to clean up the snippet data. The <code>clean_money<\/code> function strips any dollar signs from its input and converts the result to an integer:<\/p>\n<pre class=\"wp-block-code\"><code>def clean_money(amt):\n    return int(amt.replace(\"$\",\"\"))<\/code><\/pre>\n<p>The <code>clean_pic<\/code> function generates a URL for accessing the first image in each search result:<\/p>\n<pre class=\"wp-block-code\"><code>def clean_pic(ids):\n    idlist = ids.split(\",\")\n    first = idlist&#91;0]\n    code = first.replace(\"1:\",\"\")\n    return \"https:\/\/images.craigslist.org\/%s_300x300.jpg\" % code<\/code><\/pre>\n<p>The function extracts and cleans the id of the first image, then adds it 
to the base URL.<\/p>\n<h3 id=\"write-data-to-an-excel-spreadsheet\">Write Data to an Excel Spreadsheet<\/h3>\n<p>The <code>make_excel<\/code> function takes the data in the database and writes it to an Excel spreadsheet.<\/p>\n<p>1. Add spreadsheet variables:<\/p>\n<pre class=\"wp-block-code\"><code>Headlines = &#91;\"Pid\", \"Date\", \"Cost\", \"Webpage\", \"Pic\", \"Desc\", \"Created Date\"]\nrow = 0<\/code><\/pre>\n<p>The <strong>Headlines<\/strong> variable is a list of titles for the columns in the spreadsheet. The <strong>row<\/strong> variable tracks the current spreadsheet row.<\/p>\n<p>2. Use <code>xlsxwriter<\/code> to open a workbook and add a worksheet to receive the data.<\/p>\n<pre class=\"wp-block-code\"><code>workbook = xlsxwriter.Workbook('motorcycle.xlsx')\nworksheet = workbook.add_worksheet()<\/code><\/pre>\n<p>3. Prepare the worksheet:<\/p>\n<pre class=\"wp-block-code\"><code>worksheet.set_column(0,0, 15) # pid\nworksheet.set_column(1,1, 20) # date\nworksheet.set_column(2,2, 7)  # cost\nworksheet.set_column(3,3, 10)  # webpage\nworksheet.set_column(4,4, 7)  # picture\nworksheet.set_column(5,5, 60)  # Description\nworksheet.set_column(6,6, 30)  # created date<\/code><\/pre>\n<p>The first two arguments are always the same in these <code>set_column<\/code> calls. 
That is because the method sets the attributes of a range of columns, from the first indicated column to the last, and each call here covers a single column. The last value is the width of the column in characters.<\/p>\n<p>4. Write the column headers to the worksheet:<\/p>\n<pre class=\"wp-block-code\"><code>for col, title in enumerate(Headlines):\n    worksheet.write(row, col, title)<\/code><\/pre>\n<p>5. Write the records from the database to the worksheet:<\/p>\n<pre class=\"wp-block-code\"><code>for item in db.all():\n    row += 1\n    worksheet.write(row, 0, item&#91;'pid'] )\n    worksheet.write(row, 1, item&#91;'date'] )\n    worksheet.write(row, 2, item&#91;'cost'] )\n    worksheet.write_url(row, 3, item&#91;'webpage'], string='Web Page')\n    worksheet.write_url(row, 4, item&#91;'pic'], string=\"Picture\" )\n    worksheet.write(row, 5, item&#91;'descr'] )\n    worksheet.write(row, 6, item&#91;'createdt'] )<\/code><\/pre>\n<p>Most of the fields in each row can be written using <code>worksheet.write<\/code>; <code>worksheet.write_url<\/code> is used for the listing and image URLs. 
This makes the resulting links clickable in the final spreadsheet.<\/p>\n<p>6. Close the Excel workbook:<\/p>\n<pre class=\"wp-block-code\"><code>workbook.close()<\/code><\/pre>\n<h3 id=\"main-routine\">Main Routine<\/h3>\n<p>The main routine iterates through every page of search results and runs the <strong>soup_process<\/strong> function on each page. It also keeps track of the total number of database entries added in the global variable <strong>total_added<\/strong>, which is updated in the <strong>soup_process<\/strong> function and displayed once the scrape is complete. 
The routine also creates a TinyDB database, <code>db.json<\/code>, to store the parsed data; when the scrape is complete, the database is passed to the <strong>make_excel<\/strong> function to be written to a spreadsheet.<\/p>\n<pre class=\"wp-block-code\"><code>def main(url):\n    global total_added\n    db = TinyDB(\"db.json\")\n\n    while url:\n        print (\"Web Page: \", url)\n        soup = soup_process(url, db)\n        nextlink = soup.find(\"link\", rel=\"next\")\n\n        url = False\n        if (nextlink):\n            url = nextlink&#91;'href']\n\n    print (\"Added \",total_added)\n\n    make_excel(db)<\/code><\/pre>\n<p>A sample run might look like the following. Notice that each page has an index embedded in the URL. This is how Craigslist knows where the next page of data starts:<\/p>\n<pre class=\"wp-block-code\"><code>$ python3 craigslist.py\nWeb Page:  https:\/\/elpaso.craigslist.org\/search\/mcy?sort=date\nAdding ...  1\nAdding ...  2\nAdding ...  
3\nWeb Page:  https:\/\/elpaso.craigslist.org\/search\/mcy?s=120&amp;sort=date\nWeb Page:  https:\/\/elpaso.craigslist.org\/search\/mcy?s=240&amp;sort=date\nWeb Page:  https:\/\/elpaso.craigslist.org\/search\/mcy?s=360&amp;sort=date\nWeb Page:  https:\/\/elpaso.craigslist.org\/search\/mcy?s=480&amp;sort=date\nWeb Page:  https:\/\/elpaso.craigslist.org\/search\/mcy?s=600&amp;sort=date\nAdded  3<\/code><\/pre>\n<h2 id=\"set-up-cron-to-scrape-automatically\">Set up Cron to Scrape Automatically<\/h2>\n<p>This section sets up a cron task to run the scraping script automatically at regular intervals.<\/p>\n<p>1. Log in to your machine as a normal user:<\/p>\n<pre class=\"wp-block-code\"><code>ssh normaluser@&lt;Linode Public IP&gt;\n<\/code><\/pre>\n<p>2. Make sure the complete <code>craigslist.py<\/code> script is in the home directory:<\/p>\n<pre class=\"wp-block-code\"><code>from bs4 import BeautifulSoup\nimport datetime\nfrom tinydb import TinyDB, Query\nimport urllib3\nimport xlsxwriter\n\nurllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n\nurl = 'https:\/\/elpaso.craigslist.org\/search\/mcy?sort=date'\ntotal_added = 0\n\ndef make_soup(url):\n    http = urllib3.PoolManager()\n    r = http.request(\"GET\", url)\n    return BeautifulSoup(r.data,'lxml')\n\ndef main(url):\n    global total_added\n    db = TinyDB(\"db.json\")\n\n    while url:\n        print (\"Web Page: \", url)\n        soup = soup_process(url, db)\n        nextlink = soup.find(\"link\", 
rel=\"next\")\n\n        url = False\n        if (nextlink):\n            url = nextlink&#91;'href']\n\n    print (\"Added \",total_added)\n\n    make_excel(db)\n\ndef soup_process(url, db):\n    global total_added\n\n    soup = make_soup(url)\n    results = soup.find_all(\"li\", class_=\"result-row\")\n\n    for result in results:\n        try:\n            rec = {\n                'pid': result&#91;'data-pid'],\n                'date': result.p.time&#91;'datetime'],\n                'cost': clean_money(result.a.span.string.strip()),\n                'webpage': result.a&#91;'href'],\n                'pic': clean_pic(result.a&#91;'data-ids']),\n                'descr': result.p.a.string.strip(),\n                'createdt': datetime.datetime.now().isoformat()\n            }\n\n            Result = Query()\n            s1 = db.search(Result.pid == rec&#91;\"pid\"])\n\n            if not s1:\n                total_added += 1\n                print (\"Adding ... \", total_added)\n                db.insert(rec)\n\n        except (AttributeError, KeyError) as ex:\n            pass\n\n    return soup\n\ndef clean_money(amt):\n    return int(amt.replace(\"$\",\"\"))\n\ndef clean_pic(ids):\n    idlist = ids.split(\",\")\n    first = idlist&#91;0]\n    code = first.replace(\"1:\",\"\")\n    return \"https:\/\/images.craigslist.org\/%s_300x300.jpg\" % code\n\ndef make_excel(db):\n    Headlines = &#91;\"Pid\", \"Date\", \"Cost\", \"Webpage\", \"Pic\", \"Desc\", \"Created Date\"]\n    row = 0\n\n    workbook = xlsxwriter.Workbook('motorcycle.xlsx')\n    worksheet = workbook.add_worksheet()\n\n    worksheet.set_column(0,0, 15) # pid\n    worksheet.set_column(1,1, 20) # date\n    worksheet.set_column(2,2, 7)  # cost\n    worksheet.set_column(3,3, 10)  # webpage\n    worksheet.set_column(4,4, 7)  # picture\n    worksheet.set_column(5,5, 60)  # Description\n    worksheet.set_column(6,6, 30)  # created date\n\n    for col, title in enumerate(Headlines):\n        
worksheet.write(row, col, title)\n\n    for item in db.all():\n        row += 1\n        worksheet.write(row, 0, item&#91;'pid'] )\n        worksheet.write(row, 1, item&#91;'date'] )\n        worksheet.write(row, 2, item&#91;'cost'] )\n        worksheet.write_url(row, 3, item&#91;'webpage'], string='Web Page')\n        worksheet.write_url(row, 4, item&#91;'pic'], string=\"Picture\" )\n        worksheet.write(row, 5, item&#91;'descr'] )\n        worksheet.write(row, 6, item&#91;'createdt'] )\n\n    workbook.close()\n\nmain(url)<\/code><\/pre>\n<p>This sample crontab entry, which can be added with <code>crontab -e<\/code>, runs the Python program every day at 6:30 am.<\/p>\n<pre class=\"wp-block-code\"><code>30 6 * * * \/usr\/bin\/python3 \/home\/normaluser\/craigslist.py\n<\/code><\/pre>\n<p>The Python program will write the <code>motorcycle.xlsx<\/code> spreadsheet in <code>\/home\/normaluser\/<\/code>.<\/p>\n<h2 id=\"retrieve-the-excel-report\">Retrieve the Excel Report<\/h2>\n<p><strong>On Linux<\/strong><\/p>\n<p>Use scp to copy <code>motorcycle.xlsx<\/code> from the remote machine that is running your Python program to this machine:<\/p>\n<pre class=\"wp-block-code\"><code>scp normaluser@&lt;Linode Public IP&gt;:\/home\/normaluser\/motorcycle.xlsx .\n<\/code><\/pre>\n<p><strong>On Windows<\/strong><\/p>\n<p>Use Firefox's built-in sftp capabilities. Type the following URL into the address bar and it will request a password. 
Choose the spreadsheet from the directory listing that appears.<\/p>\n<pre class=\"wp-block-code\"><code>sftp:\/\/normaluser@&lt;Linode Public IP&gt;\/home\/normaluser<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>What is Beautiful Soup? Beautiful Soup is a Python library that parses HTML or XML documents into a tree structure that makes it easy to find and extract data. It is often used for scraping data from websites. Beautiful Soup features a simple, Pythonic interface<\/p>\n","protected":false},"author":1,"featured_media":35510,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[101],"tags":[],"class_list":["post-34878","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data"],"_links":{"self":[{"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/posts\/34878","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/comments?post=34878"}],"version-history":[{"count":0,"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/posts\/34878\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/media\/35510"}],"wp:attachment":[{"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/media?parent=34878"}],"wp:ter
m":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/categories?post=34878"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jupitek.maudemo.vip\/index.php\/wp-json\/wp\/v2\/tags?post=34878"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}