Google Image Search Results Scraping

Asked 2 months ago, Updated 2 months ago, 3 views

I can't scrape Google's image search results properly.
Is it because the images are rendered dynamically?
I'm using simple_html_dom.
Here's the code:

include 'simple_html_dom.php';

$arrContextOptions = array(
    "ssl" => array(
        "verify_peer" => false,
        "verify_peer_name" => false,
    ),
);

$query = "Rumba";
$html2 = file_get_html("https://www.google.co.jp/search?q=".$query."&tbm=isch", false, stream_context_create($arrContextOptions));
$html2 = mb_convert_encoding($html2, 'utf8', 'auto');
$dom2 = str_get_html($html2);
// get category
$dataSrc = 'data-src';
$img = $dom2->find('img.rg_i', 0);
var_dump($img);

Please let me know if you know anything more about this. Thank you for your help.

php

2022-09-30 11:02

1 Answer

  • file_get_html() does not return a string

    mb_convert_encoding() converts the character encoding of a string, but the file_get_html() provided by simple_html_dom.php returns the library's own object, not a string. To retrieve the content of a URL as a string, use the standard PHP function file_get_contents().

  • Google's search page returns different HTML depending on the User-Agent

    I'll skip the details, but the HTML appears to change depending on whether a User-Agent is sent. That is probably why it differs from the DOM tree you saw in your browser. You may need to send a User-Agent with the request, or write your parsing to match the HTML that is returned without one.
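
As a rough sketch of both points, the page could be fetched as a plain string with file_get_contents() while sending a User-Agent through the HTTP stream context. The helper names buildImageSearchUrl(), buildContextOptions(), and fetchHtml() are made up for illustration and are not part of simple_html_dom, and the SSL options from the question are left out.

```php
<?php
// Hypothetical helper: builds the image search URL with the query URL-encoded.
function buildImageSearchUrl(string $query): string
{
    $params = http_build_query(['q' => $query, 'tbm' => 'isch'], '', '&');
    return 'https://www.google.co.jp/search?' . $params;
}

// Hypothetical helper: HTTP context options that send the given User-Agent.
function buildContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
        ],
    ];
}

// Hypothetical helper: fetches the URL and returns the body as a string.
// file_get_contents() returns false on failure, so that case is turned
// into an exception instead of being passed on to the parser.
function fetchHtml(string $url, string $userAgent): string
{
    $context = stream_context_create(buildContextOptions($userAgent));
    $html = file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Request failed: {$url}");
    }
    return $html;
}
```

With these, something like `$html = fetchHtml(buildImageSearchUrl($query), 'Mozilla/5.0 (compatible; ExampleScraper/1.0)');` would give a string that can be passed to mb_convert_encoding() if needed and then to str_get_html(). The User-Agent value here is only a placeholder, and which markup Google returns for it is not guaranteed, so the img.rg_i selector may still need adjusting.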

Try displaying PHP's errors and warnings, and check whether the HTML you actually received is what you expect.
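
One way to do that check, as a minimal sketch: turn on error display, then count how many img elements in the fetched string carry the class you are searching for. It uses PHP's built-in DOMDocument instead of simple_html_dom so that it runs on its own, and countImagesWithClass() is a hypothetical helper name. It assumes the class name contains no quote characters.

```php
<?php
// Show every error and warning while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: counts <img> elements whose class list contains $class.
function countImagesWithClass(string $html, string $class): int
{
    if ($html === '') {
        return 0; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml's parse
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word within the class attribute.
    $query = sprintf(
        "//img[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    return $xpath->query($query)->length;
}
```

If `countImagesWithClass($html, 'rg_i')` comes back as 0, the response probably does not contain the markup you saw in the browser. Saving the string with file_put_contents() and opening it in an editor makes the difference easier to see.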


2022-09-30 11:02


© 2022 OneMinuteCode. All rights reserved.