Handling multiple requests using cURL in parallel

Without threads, parallelism is incomplete. PHP offers multithreading through pthreads.

From the PHP documentation:

pthreads is an Object Orientated API that allows user-land multi-threading in PHP. It includes all the tools you need to create multi-threaded applications targeted at the Web or the Console. PHP applications can create, read, write, execute and synchronize with Threads, Workers and Threaded objects.

The catch is that pthreads is a PECL extension; it is not bundled with PHP by default. Again from the documentation:

pthreads releases are hosted by PECL and the source code by GitHub; the easiest route to installation is the normal PECL route.

So the question is: is there a minimal alternative that achieves the same goal?

Some time back, in an interview, I was asked to write a crawler implementation. I did it using the curl multi functions to fetch responses from multiple nodes in parallel, then parsed those responses and crawled their internal links.
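The parsing step of such a crawler could look roughly like the sketch below. It is not the code from that interview: the helper name extractInternalLinks and the example.com URLs are made up for illustration, and it assumes PHP's DOM extension is available. It only handles absolute, protocol-relative and root-relative links, and ignores ports and path-relative links.

```php
<?php
// Hypothetical helper (illustration only): return the absolute URLs of all
// links in $html that point to the same host as $baseUrl.
function extractInternalLinks(string $html, string $baseUrl): array
{
    if ($html === '') {
        return [];
    }

    $base   = parse_url($baseUrl);
    $host   = $base['host'] ?? '';
    $scheme = $base['scheme'] ?? 'http';

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue;                               // no target, or same-page fragment
        }
        if (strpos($href, '//') === 0) {
            $href = $scheme . ':' . $href;          // protocol-relative link
        } elseif ($href[0] === '/') {
            $href = $scheme . '://' . $host . $href; // root-relative link
        }
        // Path-relative links and mailto: links have no host and are skipped.
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $links[$href] = true;                   // array keys de-duplicate
        }
    }

    return array_keys($links);
}
```

Each body returned by the parallel fetch can be passed through a function like this, and the resulting URLs queued as the next batch of nodes.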

If the application layer uses curl to call multiple backend APIs, calling those APIs in parallel improves its performance a lot. When the application layer builds a view by querying several APIs, it cannot afford to send the requests one after another and wait for each in turn. The solution is to handle the requests and responses in parallel.

Assume the application sends requests to APIs on these servers, with these response times:

Google: .33s, Yahoo: .86s, rediff: .01s, bing: .22s and wordpress: 1.1s

Sent sequentially, the total is 2.52s just to get the API responses.

By using curl_multi_exec, these requests can be executed in parallel, and the app is limited only by the slowest request, which is about 1.1s in this scenario. Notice that the time drops by 56%!
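The arithmetic behind those figures can be checked with a few lines of PHP. This is only a back-of-the-envelope model using the response times quoted above; it ignores connection setup overlap and local processing.

```php
<?php
// Response times in seconds, as quoted in the example above.
$times = [
    'Google'    => 0.33,
    'Yahoo'     => 0.86,
    'rediff'    => 0.01,
    'bing'      => 0.22,
    'wordpress' => 1.10,
];

$sequential = array_sum($times);  // one request after another
$parallel   = max($times);        // bounded by the slowest request
$saving     = ($sequential - $parallel) / $sequential * 100;

printf("sequential: %.2fs\n", $sequential);  // sequential: 2.52s
printf("parallel:   %.2fs\n", $parallel);    // parallel:   1.10s
printf("saving:     %.0f%%\n", $saving);     // saving:     56%
```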

Let me give a simple example to explain how the same can be done using curl multi functions in PHP.

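<?php
function curlMulti($nodes)
{
    $master = curl_multi_init();
    $curlArr = array();
    $i = 0;

    $options = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => false,
        CURLOPT_CUSTOMREQUEST => 'GET',
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING => "",
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT => 10,
        CURLOPT_SSL_VERIFYHOST => 0,       // TLS verification is switched off for this demo only
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1
    );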

    foreach ($nodes as $key => $url) {     //set options for each node
        $curlArr[$i] = curl_init();
        curl_setopt($curlArr[$i], CURLOPT_URL, $url);
        curl_setopt_array($curlArr[$i], $options);
        curl_multi_add_handle($master, $curlArr[$i]);
        $i++;
    }

    $running = null;

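    do {                                //call nodes in parallel; wait on socket activity instead of spinning
        $status = curl_multi_exec($master, $running);
        if ($running > 0 && curl_multi_select($master, 1.0) === -1) {
            usleep(1000);               //select failed; back off briefly before retrying
        }
    } while ($running > 0 && $status === CURLM_OK);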

    $i = 0;
    $response = array();
    $results = array();

    foreach ($nodes as $key => $value) {   //iterate nodes to get response detail
        $response['body'] = curl_multi_getcontent($curlArr[$i]);
        $tmpArr = curl_getinfo($curlArr[$i]);
        $response['httpCode'] = $tmpArr['http_code'];

        curl_multi_remove_handle($master, $curlArr[$i]);
        curl_close($curlArr[$i]);
        
        echo "\n" . $tmpArr['url'] . ": time taken=> " . $tmpArr['total_time'] . "s http code=> " . $tmpArr['http_code'] . " size=> " . $tmpArr['size_download'] . "b connect_time=> " . $tmpArr['connect_time'] . "s namelookup_time=> " . $tmpArr['namelookup_time'] . "s";

        $results[$key] = $response;
        $i++;
    }

    curl_multi_close($master);
    
    return $results;
}

$nodes = array('https://www.google.co.in', 'https://in.yahoo.com', 'http://www.rediff.com', 'http://www.bing.com/', 'https://wordpress.com');
$timeStart = microtime(true);
$response = curlMulti($nodes);
$timeEnd = microtime(true);
$time = $timeEnd - $timeStart;
echo "\nTotal $time s taken to get response from " . count($nodes) . " nodes\n";

Output:

https://www.google.co.in: time taken=> 0.332783s http code=> 200 size=> 53065b connect_time=> 0.084406s namelookup_time=> 0.002629s
https://in.yahoo.com: time taken=> 0.861923s http code=> 200 size=> 62184b connect_time=> 0.17265s namelookup_time=> 0.072086s
http://www.rediff.com: time taken=> 0.015089s http code=> 200 size=> 33270b connect_time=> 0.010206s namelookup_time=> 0.002143s
http://www.bing.com/: time taken=> 0.224837s http code=> 200 size=> 63343b connect_time=> 0.008059s namelookup_time=> 0.002229s
https://wordpress.com: time taken=> 1.108579s http code=> 200 size=> 4786b connect_time=> 0.224247s namelookup_time=> 0.005035s
Total 1.1891958713531 s taken to get response from 5 nodes
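All five requests above succeeded, but the example only reports the HTTP code, so a transfer that fails at the transport level (DNS failure, refused connection, timeout) would go unnoticed. One way to surface such failures is curl_multi_info_read, which returns one message per finished handle. The sketch below is a separate, simplified helper, not a change to the function above; the name curlMultiErrors and its parameters are made up for illustration, and it assumes the URLs are unique.

```php
<?php
// Hypothetical helper (illustration only): fetch the given URLs in parallel
// and return, per URL, null if the transfer succeeded or the libcurl error
// message if it failed.
function curlMultiErrors(array $urls, int $timeoutSeconds = 5): array
{
    $master  = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
            CURLOPT_TIMEOUT        => $timeoutSeconds,
        ]);
        curl_multi_add_handle($master, $ch);
        $handles[$url] = $ch;
    }

    do {
        $status = curl_multi_exec($master, $running);
        if ($running > 0) {
            curl_multi_select($master, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Every finished transfer leaves one message on the multi handle's queue.
    $errors = array_fill_keys($urls, null);
    while (($info = curl_multi_info_read($master)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $url = array_search($info['handle'], $handles, true);
            $errors[$url] = curl_strerror($info['result']);
        }
    }

    foreach ($handles as $ch) {
        curl_multi_remove_handle($master, $ch);
    }
    curl_multi_close($master);

    return $errors;
}
```

A caller can then skip or retry any node whose entry is not null before parsing the response bodies.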

Once we have the response set, it can be processed as needed. If required, the processing can be forked across child processes to gain performance there too. Here is a basic example to explain how multiprocessing can be achieved using PHP.
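As a rough idea of what forking the processing could look like, the sketch below hands each response to its own child process with pcntl_fork and collects the exit codes in the parent. It assumes the pcntl extension, which is only available on Unix-like systems and is meant for the CLI. The helper name processInChildren and the worker callback are hypothetical. Children do not share memory with the parent, so a real worker would write its result to a file, a queue or a database.

```php
<?php
// Hypothetical helper (illustration only): run $worker on every response in a
// separate child process and return each child's exit code (0 = success),
// keyed like the input array.
function processInChildren(array $results, callable $worker): array
{
    $pids = [];

    foreach ($results as $key => $response) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('Could not fork a child process');
        }
        if ($pid === 0) {
            // Child: process one response, then exit without returning to the loop.
            exit($worker($key, $response) ? 0 : 1);
        }
        // Parent: remember which child handles which response.
        $pids[$pid] = $key;
    }

    $statuses = [];
    foreach ($pids as $pid => $key) {
        pcntl_waitpid($pid, $status);               // block until this child ends
        $statuses[$key] = pcntl_wexitstatus($status);
    }

    return $statuses;
}
```

With the result array from the parallel fetch, a call such as processInChildren($results, $worker) would run one child per node, where $worker takes the key and the response and returns true on success.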

Update: here, I’ve added an implemented process.
