Handling multiple requests using cURL in parallel

Without threads, parallelism is incomplete. Multithreading in PHP is possible with pthreads:

From the PHP documentation:

pthreads is an Object Orientated API that allows user-land multi-threading in PHP. It includes all the tools you need to create multi-threaded applications targeted at the Web or the Console. PHP applications can create, read, write, execute and synchronize with Threads, Workers and Threaded objects.

The issue with this is that it is a PECL extension; it is not bundled with PHP by default yet.


pthreads releases are hosted by PECL and the source code by github; the easiest route to installation is the normal PECL route.
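For a sense of what the pthreads API looks like, here is a minimal sketch (assuming the PECL extension is installed on a thread-safe/ZTS build of PHP; the ApiCall class and URLs are only illustrative):

```php
<?php
// Minimal pthreads sketch: fetch two URLs in separate threads.
// Assumes the PECL pthreads extension on a ZTS (thread-safe) PHP build.
class ApiCall extends Thread
{
    private $url;
    public $response;

    public function __construct($url)
    {
        $this->url = $url;
    }

    // run() executes in its own thread once start() is called.
    public function run()
    {
        $this->response = file_get_contents($this->url);
    }
}

$threads = [];
foreach (['https://www.google.com', 'https://www.bing.com'] as $url) {
    $thread = new ApiCall($url);
    $thread->start();       // spawn the thread
    $threads[] = $thread;
}

foreach ($threads as $thread) {
    $thread->join();        // wait for the thread to finish
    echo strlen($thread->response), " bytes fetched\n";
}
```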

So now the question is: is there any minimal alternative to achieve the goal?

Some time back, in an interview, I was asked to write a crawler implementation. I did that using curl multi functions to get responses from multiple nodes in parallel and then parse those responses, crawl their internal links, and so on.

If curl is used in the application layer to call multiple APIs from the backend, then calling those APIs in parallel improves application-layer performance a lot. When the application layer generates a view by querying different APIs, it simply can't afford to send all the requests sequentially and wait that long. A solution is to handle the requests and responses in parallel.

Assume the application is sending requests to APIs on these servers, with the following response times:

Google: 0.13s, Yahoo: 0.86s, rediff: 0.18s, bing: 0.22s and wordpress: 1.1s

So the total time, sequentially, will be 2.49s just to get the API responses.

By using curl_multi_exec, these requests can be executed in parallel, and the app will be limited only by the slowest request, which is about 1.1s in this scenario. The combined response time drops from 2.49s to roughly 1.1s, a reduction of about 56% (put another way, the sequential approach takes about 126% longer).
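To make this concrete, here is a sketch of how the curl_multi functions can fire those five requests in parallel (URLs and options are illustrative, not tuned values):

```php
<?php
// Issue several HTTP requests in parallel with the curl_multi API.
$urls = [
    'https://www.google.com',
    'https://www.yahoo.com',
    'https://www.rediff.com',
    'https://www.bing.com',
    'https://wordpress.com',
];

$mh      = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// curl_multi_exec returns while transfers are still in flight,
// so keep driving it until no handle is running.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);   // wait for socket activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    echo $url, ' -> ', strlen(curl_multi_getcontent($ch)), " bytes\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

The wall-clock time of this loop is governed by the slowest transfer, which is what gives the ~1.1s figure above.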

Update: I’ve added a basic implementation of the process here.

Output:

$ ./crawler -u https://news.google.co.in --limit 500

Initiating crawl process with - https://news.google.co.in where maximum crawl limit is 500

259 URL has been added into crawl repository. Will be reset to maximum limit, if reached.

750 URL has been added into crawl repository. Will be reset to maximum limit, if reached.

360 URL has been added into crawl repository. Will be reset to maximum limit, if reached.

Crawl limit reached. No more URL would be added into repo

Once we have the response set, it can be processed as needed. If required, processing of the response set could be forked into different child processes to gain performance there too. Here is a basic example to explain how multiprocessing can be achieved using PHP.
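Something along these lines, using pcntl_fork (a sketch only; it assumes the pcntl extension and a CLI context, and process_response() is a hypothetical handler for a single response):

```php
<?php
// Fork one child process per response so they are processed in parallel.
// process_response() is a placeholder for whatever parsing/crawling is needed.
function process_response($response)
{
    // ... parse the response, extract internal links, etc.
}

$responses = ['resp-1', 'resp-2', 'resp-3'];   // placeholder response set
$childPids = [];

foreach ($responses as $response) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("Could not fork\n");
    } elseif ($pid === 0) {
        // Child: handle exactly one response, then exit.
        process_response($response);
        exit(0);
    }
    // Parent: remember the child and move on to the next response.
    $childPids[] = $pid;
}

// Parent waits for every child to finish before continuing.
foreach ($childPids as $pid) {
    pcntl_waitpid($pid, $status);
}
```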
