From the Name Server to the Proxy Server: How Are You Doing?

At this point, a cache is used so that the IP does not have to be fetched from the MySQL database on every lookup.

😨 Background

<aside>

WatchDucks implements its own name server and proxy server to provide a traffic analysis service that requires no installation.

</aside>

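It was the week-5 peer session, with less than a week left before the project wrapped up. After presenting our project we asked, "We would appreciate any feedback," and one of the questions that came back was hard to answer.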

🙋 "If your proxy server goes down, doesn't our service go down with it?"

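This was clearly a problem we had to solve urgently if anyone was going to use our service with confidence. Embarrassingly, we had overlooked it while busy with other issues. It was also not something we could fix simply by registering a default name server from Gabia, CloudFlare, or a similar provider as a tertiary name server.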

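After a team meeting, we settled on the following solution: the name server runs health checks against the proxy server, and if no response comes back, it returns the client's original IP in the DNS response instead of the proxy server's IP. We also reduced the DNS packet TTL to 60 seconds, a value commonly used for Dynamic DNS.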

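This means we have to rework all of the DAU logic, which had been built on the assumption that DNS packets were sent with a one-day TTL, but it was a necessary decision for the reliability of the service.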
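To make the decision above concrete, here is a minimal sketch of how an answer could be chosen. The `ARecordAnswer` shape, the `buildAnswer` function, and the example IPs are hypothetical names for illustration, not our actual DNS packet code; only the 60-second TTL and the "proxy IP if healthy, client IP otherwise" rule come from the description above.

```typescript
// Hypothetical shape of an A-record answer; field names are illustrative only.
interface ARecordAnswer {
    name: string;
    type: 'A';
    ttl: number; // seconds
    data: string; // IPv4 address handed back to the resolver
}

// 60 seconds, as described above.
const DNS_TTL_SECONDS = 60;

// Returns the proxy IP while the proxy is healthy, otherwise the client's own IP.
function buildAnswer(
    domainName: string,
    proxyHealthy: boolean,
    proxyServerIp: string,
    clientIp: string,
): ARecordAnswer {
    return {
        name: domainName,
        type: 'A',
        ttl: DNS_TTL_SECONDS,
        data: proxyHealthy ? proxyServerIp : clientIp,
    };
}
```

A short TTL matters here because resolvers keep serving the previous answer until it expires, so the fallback only takes effect once the cached record has aged out.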

🏃 How We Introduced It

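First, here is how the health-check result is used when building the DNS response data. Based on the value from the injected HealthCheckService, the proxy server IP is returned if the proxy server is operating normally; otherwise the client IP for the domain is looked up, from the cache first and then from the database.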

private async resolveRoute(domainName: string): Promise<ResolvedRoute> {
    try {
        // Proxy is up: route traffic through the proxy server.
        if (this.healthCheckService.isProxyHealthy()) {
            return { targetIp: this.config.proxyServerIp, isValid: true };
        }

        // Proxy is down: fall back to the client's own IP, trying the cache first.
        let clientIp = await this.cacheQuery.findIpByDomain(domainName);

        if (!clientIp) {
            clientIp = await this.projectQuery.getClientIpByDomain(domainName);
            // Fill the cache without blocking the DNS response.
            void this.cacheQuery
                .cacheIpByDomain(domainName, clientIp)
                .catch((err) =>
                    logger.error(`Failed to cache IP for domain ${domainName}:`, err),
                );
        }

        return { targetIp: clientIp, isValid: true };
    } catch (error) {
        logger.error('Failed to resolve route:', error);
        return { targetIp: '', isValid: false };
    }
}
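The fallback path above follows a cache-aside pattern with a fire-and-forget cache write. The sketch below isolates just that part so it can be run on its own. The in-memory `Map`s stand in for the real cache and the MySQL-backed query, and `lookupClientIp`, `databaseReads`, and the sample data are assumptions made for illustration.

```typescript
// In-memory stand-ins for the real cache and the MySQL-backed query (assumptions).
const cache = new Map<string, string>();
const database = new Map<string, string>([['example.com', '203.0.113.7']]);
let databaseReads = 0;

async function findIpByDomain(domain: string): Promise<string | undefined> {
    return cache.get(domain);
}

async function getClientIpByDomain(domain: string): Promise<string> {
    databaseReads += 1;
    const ip = database.get(domain);
    if (!ip) {
        throw new Error(`No client IP registered for ${domain}`);
    }
    return ip;
}

async function cacheIpByDomain(domain: string, ip: string): Promise<void> {
    cache.set(domain, ip);
}

// Cache-aside lookup: read the cache, fall back to the database on a miss,
// then write the result back to the cache without awaiting the write.
async function lookupClientIp(domain: string): Promise<string> {
    const cached = await findIpByDomain(domain);
    if (cached) {
        return cached;
    }

    const ip = await getClientIpByDomain(domain);
    // A failed cache write is logged but never fails the DNS answer.
    void cacheIpByDomain(domain, ip).catch((err) =>
        console.error(`Failed to cache IP for domain ${domain}:`, err),
    );
    return ip;
}
```

In this sketch the second lookup for the same domain is served from the cache, so the database is read only once.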

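The health check is implemented as shown below. We cannot send a /health-check request and block on it for every DNS request, so the check runs asynchronously in the background and DNS handling only reads the latest result.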

import * as http from 'http';
import * as https from 'https';

export class HealthCheckService {

    private healthCheckInterval: NodeJS.Timeout | null = null;
    private proxyServerHealthy: boolean = true;
    private activeRequest: http.ClientRequest | null = null;

    /** omitted ... */

    // Synchronous read of the latest result; never triggers a network call.
    public isProxyHealthy(): boolean {
        return this.proxyServerHealthy;
    }

    public startHealthCheck(): void {
        if (this.healthCheckInterval) {
            return;
        }
        // Check once right away, then repeat in the background.
        this.checkHealth();
        this.healthCheckInterval = setInterval(this.checkHealth.bind(this), this.healthCheckIntervalMs);
    }

    private handleHealthCheckFailure(error: Error, message: string): void {
        this.proxyServerHealthy = false;
        this.activeRequest = null;
        logger.error(message, error);
    }

    private checkHealth(): void {
        this.cleanActiveRequest();

        try {
            this.activeRequest = https.request(
                this.createRequestOptions(),
                this.handleResponse.bind(this)
            );

            this.activeRequest.on('error', (error: Error) =>
                this.handleHealthCheckFailure(error, 'Proxy server health check failed:')
            );

            this.activeRequest.on('timeout', () => {
                // No response in time: treat the proxy as down.
                this.proxyServerHealthy = false;
                /** omitted ... */
}
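Since parts of the class above are omitted, here is a small self-contained sketch of the same idea: a background loop updates a flag, and callers only ever read that flag. It is not our actual implementation. `BackgroundHealthChecker`, `Probe`, and `checkOnce` are hypothetical names, and the HTTPS request is replaced by an injected `probe` function so the sketch can run without a network.

```typescript
// A probe resolves to true when the target is healthy; it may also reject.
type Probe = () => Promise<boolean>;

class BackgroundHealthChecker {
    private healthy = true; // optimistic default, as in the class above
    private timer: ReturnType<typeof setInterval> | null = null;
    private readonly probe: Probe;
    private readonly intervalMs: number;

    constructor(probe: Probe, intervalMs: number) {
        this.probe = probe;
        this.intervalMs = intervalMs;
    }

    // Cheap synchronous read used on the request path.
    isHealthy(): boolean {
        return this.healthy;
    }

    // Runs a single probe; a rejected probe counts as unhealthy.
    async checkOnce(): Promise<void> {
        try {
            this.healthy = await this.probe();
        } catch {
            this.healthy = false;
        }
    }

    // Starts the background loop; calling it twice has no extra effect.
    start(): void {
        if (this.timer) {
            return;
        }
        void this.checkOnce();
        this.timer = setInterval(() => void this.checkOnce(), this.intervalMs);
    }

    stop(): void {
        if (this.timer) {
            clearInterval(this.timer);
            this.timer = null;
        }
    }
}
```

Injecting the probe keeps the timing logic separate from the transport, which makes the healthy and unhealthy transitions easy to exercise in a test.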

✨ Conclusion & Expected Effects

Not very meaningful, but this is the only screenshot we have left.

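We confirmed that the health check works as intended and that, when it fails, the name server hands back the cached IP for the requested domain. In an unlucky case there can still be a wait of up to 60 seconds, the DNS packet TTL, because resolvers may keep serving the proxy server's IP until the record expires. Even so, a proxy server outage no longer takes down every server behind it, which improves the reliability of the service.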
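As a rough upper bound on that wait, the time it takes the health check to notice the outage adds to the TTL. The sketch below works through the arithmetic. The 5-second interval and 3-second timeout are hypothetical values chosen for illustration, not our actual configuration; only the 60-second TTL comes from this post.

```typescript
// Rough worst case: the proxy fails right after a successful check, the next
// check starts one interval later and times out, and a resolver cached the
// proxy IP just before the name server switched to the fallback answer.
function worstCaseFailoverMs(
    healthCheckIntervalMs: number,
    healthCheckTimeoutMs: number,
    dnsTtlSeconds: number,
): number {
    return healthCheckIntervalMs + healthCheckTimeoutMs + dnsTtlSeconds * 1000;
}

// Hypothetical interval and timeout with the 60-second TTL.
const worstCase = worstCaseFailoverMs(5000, 3000, 60);
console.log(worstCase); // 68000
```

This ignores resolvers that do not honor the TTL, so it should be read as an estimate, not a guarantee.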