没做线程池隔离导致的线上故障分析


一、背景

线上服务监控告警某个服务接口大量失败,先访问出问题的机器,拿到异常堆栈,然后临时紧急重启后正常。

异常堆栈信息如下:[DUBBO] Thread pool is EXHAUSTED! Thread Name: DubboServerHandler-xxxx:20880, Pool Size: 200 (active: 200, core: 200, max: 200, largest: 200) ...

初步分析是Dubbo线程池满了导致的报错。

二、具体分析

Dubbo线程池满了,可能的原因有:

  • 突发流量
  • 阻塞

1、突发流量

首先查看服务数据监控,数据曲线正常,首先排除突发流量的原因。

2、阻塞

再通过jstack工具获取Dubbo线程堆栈信息,可以看到大量线程阻塞在join()方法。

"DubboServerHandler-thread-15" #423 daemon prio=5 os_prio=0 tid=0x00007f0e101d7800 nid=0x4f5f waiting on condition [0x00007f0d30047000]
  java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000730d4fb98> (a java.util.concurrent.ComletableFuture$Siganaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
    at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)
    at com.yy.test.service.impl.TestServiceImpl.executeAfter(TestServiceImpl.java:386)
    at com.yy.test.service.impl.TestServiceImpl.executeFirst(TestServiceImpl.java:104)

查看对应TestServiceImpl类代码,简化如下:

@Service
public class TestServiceImpl {
    // 线程池
    private ExecutorService executorService = Executors.newFixedThreadPool(8);
    
    public void executeFirst() throws Exception {
        // ...
        executorService.execute(this.executeAfter());
        // ...
    }
    
    public void executeAfter() throws Exception{
        // ...
        // 启动两个线程同时执行任务
        CompletableFuture<String> future1 = CompletableFuture.supplyAsync(()->{...}, executorService);
        CompletableFuture<String> future2 = CompletableFuture.supplyAsync(()->{...}, executorService);
        // 同步等待线程执行完成并获取值
        String result1 = future1.get();
        String result2 = future2.get();
        //...
    }
}

当executeBefore()同时收到8个请求时,线程池的8个工作线程都被分配给executeBefore()去异步调用executeAfter(),这时候由于线程池工作线程已满,所以executeAfter()里的future调用只能阻塞,所以就形成了:

  • executeBefore()持有线程,等待executeAfter()执行完释放线程
  • executeAfter()由于没有空间的线程而阻塞,进而导致executeBefore()阻塞

3、结论

业务线程池的线程全都阻塞住,导致上层的Dubbo线程也阻塞住。

三、总结

问题:Dubbo线程池耗尽
排查:业务线程阻塞,导致上层的Dubbo线程阻塞
解决方案:做好线程池隔离,即不同使用场景尽量使用不同的线程池,避免父子任务共用一个线程池而导致互相阻塞的问题。相关优化代码如下

@Service
public class TestServiceImpl {
    // 线程池
    private ExecutorService executorServiceFirst = Executors.newFixedThreadPool(8);
    private ExecutorService executorServiceAfter = Executors.newFixedThreadPool(8);
    
    public void executeFirst() throws Exception {
        // ...
        executorServiceFirst.execute(this.executeAfter());
        // ...
    }
    
    public void executeAfter() throws Exception{
        // ...
        // 启动两个线程同时执行任务
        CompletableFuture<String> future1 = CompletableFuture.supplyAsync(()->{...}, executorServiceAfter);
        CompletableFuture<String> future2 = CompletableFuture.supplyAsync(()->{...}, executorServiceAfter);
        // 同步等待线程执行完成并获取值
        String result1 = future1.get();
        String result2 = future2.get();
        //...
    }
}

文章作者: GaryLee
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 GaryLee !
  目录