Conversation

@dswitzer

This fixes an issue in the AsynchronousSPFExecutor where threads would hang when reading an SPF record that was truncated because too many DNS lookups were required.

This includes a unit test, but I could only figure out how to implement it with a real DNS record, so it could break in the future if the DNS record for the domain gets fixed.

NOTE - I believe other hard DNS exceptions could cause the same behavior, but since I could only trigger the bug with a live DNS record, I could not find a good way to test alternative scenarios.
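For context, the failure mode can be sketched roughly as below. This is a generic illustration in plain CompletableFuture terms, not the actual jSPF code; the class and method names (HangSketch, lookupTxt) are made up. The point is that if the error branch of an async DNS callback never completes the result, whoever is waiting on it stalls forever.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

// Generic illustration of the failure mode (not the jSPF code): if the error
// path of an async DNS callback never completes the future, any thread
// waiting on it hangs forever.
public class HangSketch {

    static CompletableFuture<String> lookupTxt(CompletionStage<byte[]> dnsAnswer) {
        CompletableFuture<String> result = new CompletableFuture<>();
        dnsAnswer.whenComplete((bytes, error) -> {
            if (error != null) {
                // The buggy variant simply returned here, leaving `result`
                // incomplete, so joining on it below would block forever.
                // The fix is to propagate the failure instead.
                result.completeExceptionally(error);
                return;
            }
            result.complete(new String(bytes));
        });
        return result;
    }

    public static void main(String[] args) {
        CompletableFuture<byte[]> answer = new CompletableFuture<>();
        answer.completeExceptionally(new RuntimeException("truncated record"));
        try {
            lookupTxt(answer).join();
        } catch (Exception e) {
            System.out.println("completed exceptionally instead of hanging: " + e);
        }
    }
}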

@chibenwa
Contributor

Thanks for the fix.

However, the test failure on CI seems related:

14:43:45,854 [ERROR] Failures: 
14:43:45,854 [ERROR]   AsynchronousSPFExecutorIntegrationTest.shouldParseWireParseExceptionRecordTruncated:50 expected:<p[ermerror]> but was:<p[ass]>

@dswitzer
Author

Don't you just love "works for me" issues!

The test passes for me because it triggers the DNS exception in my environment, but perhaps the DNS server being used on GitHub behaves differently.

Here's the stacktrace I get with that test:

org.xbill.DNS.WireParseException: truncated record
        at org.xbill.DNS.Record.newRecord(Record.java:117)
        at org.xbill.DNS.Record.fromWire(Record.java:231)
        at org.xbill.DNS.Message.<init>(Message.java:113)
        at org.xbill.DNS.Message.<init>(Message.java:145)
        at org.xbill.DNS.SimpleResolver.parseMessage(SimpleResolver.java:273)
        at org.xbill.DNS.SimpleResolver.lambda$sendAsync$1(SimpleResolver.java:416)
        at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150)
        at java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:483)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

I've tested both using Google (8.8.8.8) and Cloudflare (1.1.1.1) for the DNS resolvers and see the error locally.

Could the DNS resolution on GitHub be the issue?

Do you have a better way to test the use case?

@dswitzer
Author

@chibenwa,

Here's the same test, but manually configured to use Cloudflare's DNS as the resolver:

@Test
public void shouldParseWireParseExceptionRecordTruncated() throws UnknownHostException {
    // Force the lookup through Cloudflare's public resolver rather than the system default.
    ExtendedResolver resolver = new ExtendedResolver(new String[] { "1.1.1.1" });
    DNSServiceXBillImpl dnsProbe = new DNSServiceXBillImpl(resolver);
    WiringServiceTable wiringService = new WiringServiceTable();
    MacroExpand macroExpand = new MacroExpand(dnsProbe);

    wiringService.put(DNSServiceEnabled.class, dnsProbe);
    wiringService.put(MacroExpandEnabled.class, macroExpand);

    RFC4408SPF1Parser parser = new RFC4408SPF1Parser(new DefaultTermsFactory(wiringService));
    SPFExecutor executor = new AsynchronousSPFExecutor(dnsProbe);

    SPF spf = new SPF(dnsProbe, parser, macroExpand, executor);

    wiringService.put(SPFCheckEnabled.class, spf);

    // This domain's SPF record triggers the DNS exception; before the fix the check would hang.
    SPFResult result = spf.checkSPF("170.146.224.15", "HelpDesk@arcofmc.org", "arcofmc.org");
    System.out.println(result.getResult());
    System.out.println(result.getExplanation());
    System.out.println(result.getHeader());
    assertEquals("permerror", result.getResult());
    assertEquals("Received-SPF: permerror (spfCheck: Error in processing SPF Record) client-ip=170.146.224.15; envelope-from=HelpDesk@arcofmc.org; helo=arcofmc.org;",
        result.getHeader());
}

For me this is failing too. Do you want me to update the example to the above to see if that works on GitHub?

@chibenwa
Contributor

Don't you just love "works for me" issues!

Always!

And DNS caching is a great source of them.

For me this is failing too. Do you want me to update the example to the above to see if that works on GitHub?

s/GitHub/Apache CI/g

Yes, let's have a try.

Thanks a lot!

@dswitzer
Author

Ok, I've updated the test.

Hopefully the test replicates what I'm seeing locally.

@dswitzer
Author

So that change did not make a difference.

Any thoughts on why I'm getting the error locally while GitHub is not triggering it? Is it possible there's a JVM setting affecting the maximum record lookups that differs from the default of 10?

@dswitzer
Author

@chibenwa,

Just wanted to follow up on this.

Any thoughts on why I'm getting the error locally while GitHub is not triggering it? Is it possible there's a JVM setting affecting the maximum record lookups that differs from the default of 10?

@chibenwa
Contributor

Hello, I'm pretty busy lately. I deliberately left the GitHub notification unread so that I eventually remember to take a look at this.

Sorry for the delay, and best regards...

@chibenwa
Contributor

(Queuing theory: when you are at 99% utilization, latencies skyrocket ^^)

@epinter
Contributor

epinter commented Jan 22, 2025

@dswitzer
I was able to reproduce the truncated record in a test; it comes from the Record class, line 117.

I think you already know this, but just to document it here: this "truncated record" happens when a DNS packet received by dnsjava has rdata shorter than the length field in the same packet. Even the larger SPF records should be received without this truncation error. To reproduce it, I created a simple class with a small DNS server and truncated the end of the answer section, so the packet sent to the DNS client was effectively truncated.

In the case of a long TXT record, the DNS server sets the TC (truncation) flag and the client should then retry over TCP to get the whole record; no error is raised in that case, it works like a redirection. In my tests reproducing the WireParseException, I received both "truncated record" and "end of input", and neither of them caused an infinite loop for me. But there is one situation that does cause a loop (port unreachable); my PR #28 contains a fix for it.

I'm curious how your environment was producing the "truncated record" for you. Maybe there was another exception being thrown?
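For reference, a reproduction along those lines can be sketched as below. It is a hypothetical stand-in for the small DNS server described above (the class name TruncatingDnsServer, the port, and the record contents are made up, not the actual test class): it builds a TXT answer with dnsjava and then cuts a few bytes off the end of the wire format before replying, so the client's parser fails with a WireParseException.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.Arrays;
import org.xbill.DNS.DClass;
import org.xbill.DNS.Flags;
import org.xbill.DNS.Message;
import org.xbill.DNS.Record;
import org.xbill.DNS.Section;
import org.xbill.DNS.TXTRecord;

// Hypothetical sketch of the reproduction described above: answer a TXT
// query, then truncate the serialized answer before sending it back so the
// client's wire parser fails with "truncated record".
public class TruncatingDnsServer {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(5353)) {
            byte[] buf = new byte[512];
            while (true) {
                DatagramPacket in = new DatagramPacket(buf, buf.length);
                socket.receive(in);
                Message query = new Message(Arrays.copyOf(in.getData(), in.getLength()));
                Record question = query.getQuestion();

                Message response = new Message(query.getHeader().getID());
                response.getHeader().setFlag(Flags.QR);
                response.addRecord(question, Section.QUESTION);
                response.addRecord(new TXTRecord(question.getName(), DClass.IN, 300,
                        "v=spf1 include:a.example include:b.example ~all"), Section.ANSWER);

                // Cut bytes off the end of the answer section: the length fields
                // inside the packet now promise more rdata than is actually
                // present, which triggers the WireParseException on the client.
                byte[] wire = response.toWire();
                byte[] truncated = Arrays.copyOf(wire, wire.length - 4);
                socket.send(new DatagramPacket(truncated, truncated.length,
                        in.getAddress(), in.getPort()));
            }
        }
    }
}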

@dswitzer
Author

@epinter,

It's been a few months since I dealt with this, so my memory is a little foggy on the details.

I just know that the scenario in the unit test case I provided was running into an infinite loop, because the exception was preventing the continue handler from being triggered, so it would just stall.

I believe this specific DNS record was violating the "too many DNS lookups" limit, but I could be wrong. I know I tried several different DNS servers in my tests, and all had the same problem.

Sorry I can't provide more details.

@epinter
Contributor

epinter commented Jan 22, 2025

@epinter,

It's been a few months since I dealt with this, so my memory is a little foggy on the details.

I just know that the scenario in the unit test case I provided was running into an infinite loop, because the exception was preventing the continue handler from being triggered, so it would just stall.

I believe this specific DNS record was violating the "too many DNS lookups" limit, but I could be wrong. I know I tried several different DNS servers in my tests, and all had the same problem.

Sorry I can't provide more details.

One more thing... If you were trying to specify other DNS servers with the async executor, that was not working: the resolver was completely ignored and the default resolver was always used. It is fixed in this commit: d171bdb.
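For anyone driving dnsjava directly, a rough sketch of an async query through an explicitly configured resolver is shown below. It only illustrates the general idea (SimpleResolver.sendAsync appears in the stack trace earlier in this thread); it is not the code from commit d171bdb, and the domain is just the one used in the test above.

import org.xbill.DNS.DClass;
import org.xbill.DNS.Message;
import org.xbill.DNS.Name;
import org.xbill.DNS.Record;
import org.xbill.DNS.SimpleResolver;
import org.xbill.DNS.Type;

// Rough illustration (not the actual fix in d171bdb): send the query through
// the resolver that was explicitly configured instead of the default one.
public class ExplicitResolverLookup {
    public static void main(String[] args) throws Exception {
        SimpleResolver resolver = new SimpleResolver("1.1.1.1");
        Record question = Record.newRecord(
                Name.fromString("arcofmc.org."), Type.TXT, DClass.IN);
        Message query = Message.newQuery(question);

        // sendAsync completes on dnsjava's worker pool; join() surfaces any
        // WireParseException instead of letting callers hang silently.
        Message response = resolver.sendAsync(query).toCompletableFuture().join();
        System.out.println(response);
    }
}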
