Chapter 18 Advanced Topics

18.1. How can I learn more about FreeBSD's internals?
18.2. How can I contribute to FreeBSD?
18.3. What are SNAPs and RELEASEs?
18.4. How do I make my own custom release?
18.5. Why does make world clobber my existing installed binaries?
18.6. Why isn't cvsup.FreeBSD.org a round robin DNS entry to share the load amongst the various CVSup servers?
18.7. Why does my system say «(bus speed defaulted)» when it boots?
18.8. Can I follow -CURRENT with limited Internet access?
18.9. How did you split the distribution into 240k files?
18.10. I have written a kernel extension, who do I send it to?
18.11. How are Plug N Play ISA cards detected and initialized?
18.12. Can you assign a major number for a device driver I have written?
18.13. What about alternative layout policies for directories?
18.14. How can I make the most of the data I see when my kernel panics?
18.15. Why has dlsym() stopped working for ELF executables?
18.16. How can I increase or reduce the kernel address space?

18.1. How can I learn more about FreeBSD's internals?

At this time, there is only one book on FreeBSD-specific OS internals, namely «The Design and Implementation of the FreeBSD Operating System» by Marshall Kirk McKusick and George V. Neville-Neil, ISBN 0-201-70245-2, which focuses on version 5.X of FreeBSD.

Additionally, much general UNIX® knowledge is directly applicable to FreeBSD.

For a list of relevant books, please check the Handbook's Operating System Internals Bibliography.

18.2. How can I contribute to FreeBSD?

Please see the article on Contributing to FreeBSD for specific advice on how to do this. Assistance is more than welcome!

18.3. What are SNAPs and RELEASEs?

There are currently three active/semi-active branches in the FreeBSD CVS Repository (earlier branches are changed only very rarely, which is why there are only three active branches of development):

  • RELENG_5 AKA 5-STABLE

  • RELENG_6 AKA 6-STABLE

  • HEAD AKA -CURRENT AKA 7.X-CURRENT

HEAD is not an actual branch tag, like the other two; it is simply a symbolic constant for «the current, non-branched development stream» which we simply refer to as «-CURRENT».

Right now, «-CURRENT» is the 7.X development stream; the 5-STABLE branch, RELENG_5, forked off from «-CURRENT» in October 2004, and the 6-STABLE branch, RELENG_6, forked off from «-CURRENT» in November 2005.

18.4. How do I make my own custom release?

Please see the Release Engineering article.

18.5. Why does make world clobber my existing installed binaries?

Yes, this is the general idea; as its name might suggest, make world rebuilds every system binary from scratch, so you can be certain of having a clean and consistent environment at the end (which is why it takes so long).

If the environment variable DESTDIR is defined while running make world or make install, the newly-created binaries will be deposited in a directory tree identical to the installed one, rooted at ${DESTDIR}. Some random combination of shared library modifications and program rebuilds can cause this to fail in make world, however.

18.6. Why isn't cvsup.FreeBSD.org a round robin DNS entry to share the load amongst the various CVSup servers?

While CVSup mirrors update from the master CVSup server hourly, this update might happen at any time during the hour. This means that some servers have newer code than others, even though all servers have code that is less than an hour old. If cvsup.FreeBSD.org was a round robin DNS entry that simply redirected users to a random CVSup server, running CVSup twice in a row could download code older than the code already on the system.
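
The effect described above can be illustrated with a small sketch. The mirror names and sync minutes below are hypothetical, and time is modeled simply as minutes since an arbitrary starting point; this is not how CVSup itself tracks anything, only a way to see why a random mirror choice can hand back older code.

```python
def snapshot_time(sync_minute, now):
    """Return the time (in minutes) of a mirror's most recent hourly sync.

    sync_minute is the minute of the hour at which this mirror pulls from
    the master server; now is the current time in minutes.
    """
    hours, minute = divmod(now, 60)
    if minute < sync_minute:
        hours -= 1  # this hour's sync has not happened yet
    return hours * 60 + sync_minute


# Hypothetical mirrors that sync at :05 and :55 past each hour.
mirrors = {"mirror-a": 5, "mirror-b": 55}

# First run at minute 100 happens to land on mirror-a.
first_run = snapshot_time(mirrors["mirror-a"], now=100)

# Second run, ten minutes later, happens to land on mirror-b.
second_run = snapshot_time(mirrors["mirror-b"], now=110)

print(first_run, second_run)
```

In this model the second run receives a snapshot taken before the one already on the system, even though both mirrors are less than an hour behind the master.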

18.7. Why does my system say «(bus speed defaulted)» when it boots?

The Adaptec 1542 SCSI host adapters allow the user to configure their bus access speed in software. Previous versions of the 1542 driver tried to determine the fastest usable speed and set the adapter to that. We found that this breaks some users' systems, so you now have to define the TUNE_1542 kernel configuration option in order to have this take place. Using it on those systems where it works may make your disks run faster, but on those systems where it does not, your data could be corrupted.

18.8. Can I follow -CURRENT with limited Internet access?

Yes, you can do this without downloading the whole source tree by using the CTM facility.

18.9. How did you split the distribution into 240k files?

Newer BSD-based systems have a -b option to split(1) that allows them to split files on arbitrary byte boundaries.

Here is an example from /usr/src/Makefile.

bin-tarball:
(cd ${DISTDIR}; \
tar cf - . | \
gzip --no-name -9 -c | \
split -b 240640 - \
${RELEASEDIR}/tarballs/bindist/bin_tgz.)
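
For readers who want to see what the byte-boundary split amounts to, here is a minimal sketch in Python. It mimics only the fixed-size chunking and the default two-letter suffixes (aa, ab, ...) of split(1); the payload is a made-up stand-in for the compressed tarball, and the function names are illustrative, not part of any FreeBSD tool.

```python
import io

CHUNK_SIZE = 240640  # same byte count as the split -b argument above


def suffixes():
    """Yield aa, ab, ..., zz, the two-letter suffixes split(1) uses by default."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    for first in letters:
        for second in letters:
            yield first + second


def split_stream(stream, prefix, chunk_size=CHUNK_SIZE):
    """Read stream in fixed-size pieces and return {piece_name: bytes}."""
    pieces = {}
    names = suffixes()
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        pieces[prefix + next(names)] = data
    return pieces


def join_pieces(pieces):
    """Concatenate the pieces in name order, as cat bin_tgz.* would."""
    return b"".join(pieces[name] for name in sorted(pieces))


# Hypothetical payload standing in for the gzip-compressed tarball.
payload = bytes(range(256)) * 2000  # 512000 bytes
pieces = split_stream(io.BytesIO(payload), "bin_tgz.")
restored = join_pieces(pieces)

for name in sorted(pieces):
    print(name, len(pieces[name]))
```

Every piece except the last is exactly 240640 bytes, and concatenating the pieces in name order restores the original stream.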

18.10. I have written a kernel extension, who do I send it to?

Please take a look at the article on Contributing to FreeBSD to learn how to submit code.

And thanks for the thought!

18.11. How are Plug N Play ISA cards detected and initialized?

By: Frank Durda IV

In a nutshell, there are a few I/O ports that all of the PnP boards respond to when the host asks if anyone is out there. So when the PnP probe routine starts, it asks if there are any PnP boards present, and all the PnP boards respond with their model # to an I/O read of the same port, so the probe routine gets a wired-OR «yes» to that question. At least one bit will be on in that reply. Then the probe code is able to cause boards with board model IDs (assigned by Microsoft/Intel) lower than X to go «off-line». It then looks to see if any boards are still responding to the query. If the answer was 0, then there are no boards with IDs above X. Now probe asks if there are any boards below X. If so, probe knows there are boards with model numbers below X. Probe then asks for boards greater than X-(limit/4) to go off-line. It repeats the query. By repeating this semi-binary search of IDs-in-range enough times, the probing code will eventually identify all PnP boards present in a given machine with a number of iterations that is much lower than what 2^64 would take.

The IDs are two 32-bit fields (hence 2^64) + 8 bit checksum. The first 32 bits are a vendor identifier. They never come out and say it, but it appears to be assumed that different types of boards from the same vendor could have different 32-bit vendor ids. The idea of needing 32 bits just for unique manufacturers is a bit excessive.

The lower 32 bits are a serial #, Ethernet address, something that makes this one board unique. The vendor must never produce a second board that has the same lower 32 bits unless the upper 32 bits are also different. So you can have multiple boards of the same type in the machine and the full 64 bits will still be unique.

The 32 bit groups can never be all zero. This allows the wired-OR to show non-zero bits during the initial binary search.
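
The search idea can be sketched as a toy model. This is a deliberate simplification and an assumption on my part, not the actual ISA PnP protocol: the Bus class, its any_in_range query, and the board IDs are all hypothetical. The only thing taken from the description above is that the host sees a single wired-OR yes/no answer and narrows the ID range until each board is singled out.

```python
class Bus:
    """Toy bus: the host only learns whether at least one board answered."""

    def __init__(self, board_ids):
        self.board_ids = set(board_ids)
        self.queries = 0

    def any_in_range(self, lo, hi):
        """Wired-OR reply: True if any board with lo <= id < hi responds."""
        self.queries += 1
        return any(lo <= board < hi for board in self.board_ids)


def find_boards(bus, lo=0, hi=1 << 64):
    """Return every board ID in [lo, hi), halving the range on each step."""
    if not bus.any_in_range(lo, hi):
        return []  # nobody answered, so skip this whole range
    if hi - lo == 1:
        return [lo]  # range narrowed down to a single ID
    mid = (lo + hi) // 2
    return find_boards(bus, lo, mid) + find_boards(bus, mid, hi)


def make_id(vendor, serial):
    """Pack a 32-bit vendor field and a 32-bit serial field into a 64-bit ID."""
    return (vendor << 32) | serial


# Hypothetical boards: two of the same type plus one from another vendor.
board_ids = [
    make_id(0x1234ABCD, 7),
    make_id(0x1234ABCD, 8),
    make_id(0x0000BEEF, 1),
]
bus = Bus(board_ids)
found = find_boards(bus)

print([hex(board) for board in found], bus.queries)
```

With three boards the model needs at most a few hundred queries, which is the point of the paragraph above: the cost grows with the number of boards and the ID width, not with the 2^64 possible IDs.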

Once the system has identified all the board IDs present, it will reactivate each board, one at a time (via the same I/O ports), and find out what resources the given board needs, what interrupt choices are available, etc. A scan is made over all the boards to collect this information.

This info is then combined with info from any ECU files on the hard disk or wired into the MLB BIOS. The ECU and BIOS PnP support for hardware on the MLB is usually synthetic, and the peripherals do not really do genuine PnP. However by examining the BIOS info plus the ECU info, the probe routines can cause the devices that are PnP to avoid those devices the probe code cannot relocate.

Then the PnP devices are visited once more and given their I/O, DMA, IRQ and Memory-map address assignments. The devices will then appear at those locations and remain there until the next reboot, although there is nothing that says you cannot move them around whenever you want.

There is a lot of oversimplification above, but you should get the general idea.

Microsoft took over some of the primary printer status ports to do PnP, on the logic that no boards decoded those addresses for the opposing I/O cycles. I found a genuine IBM printer board that did decode writes of the status port during the early PnP proposal review period, but MS said «tough». So they do a write to the printer status port for setting addresses, plus they use that address + 0x800, and a third I/O port for reading that can be located anywhere between 0x200 and 0x3ff.

18.12. Can you assign a major number for a device driver I have written?

FreeBSD-CURRENT after February 2003 has a facility for dynamically and automatically allocating major numbers for device drivers at runtime. This mechanism is highly preferred to the older procedure of statically allocating device numbers. Some comments on this subject can be found in src/sys/conf/majors.

If you are forced for some reason to use a static major number, the procedure for obtaining one depends on whether or not you plan on making the driver publicly available. If you do, then please send us a copy of the driver source code, plus the appropriate modifications to files.i386, a sample configuration file entry, and the appropriate MAKEDEV(8) code to create any special files your device uses. If you do not, or are unable to because of licensing restrictions, then character major number 32 and block major number 8 have been reserved specifically for this purpose; please use them. In any case, we would appreciate hearing about your driver on the FreeBSD technical discussions mailing list.

18.13. What about alternative layout policies for directories?

In answer to the question of alternative layout policies for directories, the scheme that is currently in use is unchanged from what I wrote in 1983. I wrote that policy for the original fast filesystem, and never revisited it. It works well at keeping cylinder groups from filling up. As several of you have noted, it works poorly for find. Most filesystems are created from archives that were created by a depth first search (aka ftw). These directories end up being striped across the cylinder groups thus creating a worst possible scenario for future depth first searches. If one knew the total number of directories to be created, the solution would be to create (total / fs_ncg) per cylinder group before moving on. Obviously, one would have to create some heuristic to guess at this number. Even using a small fixed number like say 10 would make an order of magnitude improvement. To differentiate restores from normal operation (when the current algorithm is probably more sensible), you could use the clustering of up to 10 if they were all done within a ten second window. Anyway, my conclusion is that this is an area ripe for experimentation.

Kirk McKusick, September 1998

18.14. How can I make the most of the data I see when my kernel panics?

[This section was extracted from a mail written by Bill Paul on the freebsd-current mailing list by Dag-Erling C. Smørgrav, who fixed a few typos and added the bracketed comments]

From: Bill Paul <wpaul@skynet.ctr.columbia.edu>
Subject: Re: the fs fun never stops
To: Ben Rosengart
Date: Sun, 20 Sep 1998 15:22:50 -0400 (EDT)
Cc: current@FreeBSD.org

[Ben Rosengart posted the following panic message]

> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0x40
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xf014a7e5
                                ^^^^^^^^^^
> stack pointer           = 0x10:0xf4ed6f24
> frame pointer           = 0x10:0xf4ed6f28
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 80 (mount)
> interrupt mask          =
> trap number             = 12
> panic: page fault

[When] you see a message like this, it is not enough to just reproduce it and send it in. The instruction pointer value that I highlighted up there is important; unfortunately, it is also configuration dependent. In other words, the value varies depending on the exact kernel image that you are using. If you are using a GENERIC kernel image from one of the snapshots, then it is possible for somebody else to track down the offending function, but if you are running a custom kernel then only you can tell us where the fault occurred.

What you should do is this:

  1. Write down the instruction pointer value. Note that the 0x8: part at the beginning is not significant in this case: it is the 0xf0xxxxxx part that we want.

  2. When the system reboots, do the following:

    % nm -n /kernel.that.caused.the.panic | grep f0xxxxxx
    
    where f0xxxxxx is the instruction pointer value. The odds are you will not get an exact match since the symbols in the kernel symbol table are for the entry points of functions and the instruction pointer address will be somewhere inside a function, not at the start. If you do not get an exact match, omit the last digit from the instruction pointer value and try again, i.e.:
    % nm -n /kernel.that.caused.the.panic | grep f0xxxxx
    
    If that does not yield any results, chop off another digit. Repeat until you get some sort of output. The result will be a possible list of functions which caused the panic. This is a less than exact mechanism for tracking down the point of failure, but it is better than nothing.
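
The repeated grep can also be replaced by a direct lookup: since nm -n prints symbols sorted by address, the function containing the instruction pointer is the last symbol whose address is not greater than it. The sketch below assumes the usual three-column nm output (address, type, name); the symbol names and addresses in the sample are invented for illustration and do not come from any real kernel.

```python
import bisect

# Made-up excerpt in the format printed by nm -n (address, type, name).
SAMPLE_NM_OUTPUT = """\
f014a6b0 T hypothetical_mount
f014a7a0 T hypothetical_lookup
f014a8c4 T hypothetical_unmount
f014aa10 t hypothetical_helper
"""


def parse_symbols(nm_text):
    """Turn nm -n output into a sorted list of (address, name) pairs."""
    symbols = []
    for line in nm_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # e.g. undefined symbols, which have no address column
        address, _kind, name = fields
        symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols


def containing_symbol(symbols, instruction_pointer):
    """Return the (address, name) of the last symbol at or below the pointer."""
    addresses = [address for address, _name in symbols]
    index = bisect.bisect_right(addresses, instruction_pointer) - 1
    if index < 0:
        return None  # pointer lies below every known symbol
    return symbols[index]


symbols = parse_symbols(SAMPLE_NM_OUTPUT)
ip = 0xF014A7E5  # the 0xf0xxxxxx part of the instruction pointer
match = containing_symbol(symbols, ip)
if match is not None:
    address, name = match
    print(f"{name}+{hex(ip - address)}")
```

Like the grep approach, this only identifies the nearest preceding symbol, so static functions missing from the symbol table can still lead to a wrong guess.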

I see people constantly show panic messages like this but rarely do I see someone take the time to match up the instruction pointer with a function in the kernel symbol table.

The best way to track down the cause of a panic is by capturing a crash dump, then using gdb(1) to generate a stack trace on the crash dump.

In any case, the method I normally use is this:

  1. Set up a kernel config file, optionally adding options DDB if you think you need the kernel debugger for something. (I use this mainly for setting breakpoints if I suspect an infinite loop condition of some kind.)

  2. Use config -g KERNELCONFIG to set up the build directory.

  3. cd /sys/compile/KERNELCONFIG; make

  4. Wait for kernel to finish compiling.

  5. make install

  6. reboot

The make(1) process will have built two kernels: kernel and kernel.debug. kernel was installed as /kernel, while kernel.debug can be used as the source of debugging symbols for gdb(1).

To make sure you capture a crash dump, you need to edit /etc/rc.conf and set dumpdev to point to your swap partition. This will cause the rc(8) scripts to use the dumpon(8) command to enable crash dumps. You can also run dumpon(8) manually. After a panic, the crash dump can be recovered using savecore(8); if dumpdev is set in /etc/rc.conf, the rc(8) scripts will run savecore(8) automatically and put the crash dump in /var/crash.

Note: FreeBSD crash dumps are usually the same size as the physical RAM size of your machine. That is, if you have 64MB of RAM, you will get a 64MB crash dump. Therefore you must make sure there is enough space in /var/crash to hold the dump. Alternatively, you can run savecore(8) manually and have it recover the crash dump to another directory where you have more room. It is possible to limit the size of the crash dump by using options MAXMEM=(foo) to set the amount of memory the kernel will use to something a little more sensible. For example, if you have 128MB of RAM, you can limit the kernel's memory usage to 16MB so that your crash dump size will be 16MB instead of 128MB.

Once you have recovered the crash dump, you can get a stack trace with gdb(1) as follows:

% gdb -k /sys/compile/KERNELCONFIG/kernel.debug /var/crash/vmcore.0
(gdb) where

Note that there may be several screens worth of information; ideally you should use script(1) to capture all of them. Using the unstripped kernel image with all the debug symbols should show the exact line of kernel source code where the panic occurred. Usually you have to read the stack trace from the bottom up in order to trace the exact sequence of events that led to the crash. You can also use gdb(1) to print out the contents of various variables or structures in order to examine the system state at the time of the crash.

Now, if you are really insane and have a second computer, you can also configure gdb(1) to do remote debugging such that you can use gdb(1) on one system to debug the kernel on another system, including setting breakpoints, single-stepping through the kernel code, just like you can do with a normal user-mode program. I have not played with this yet as I do not often have the chance to set up two machines side by side for debugging purposes.

[Bill adds: "I forgot to mention one thing: if you have DDB enabled and the kernel drops into the debugger, you can force a panic (and a crash dump) just by typing 'panic' at the ddb prompt. It may stop in the debugger again during the panic phase. If it does, type 'continue' and it will finish the crash dump." -ed]

18.15. Why has dlsym() stopped working for ELF executables?

The ELF toolchain does not, by default, make the symbols defined in an executable visible to the dynamic linker. Consequently dlsym() searches on handles obtained from calls to dlopen(NULL, flags) will fail to find such symbols.

If you want to search, using dlsym(), for symbols present in the main executable of a process, you need to link the executable using the -export-dynamic option to the ELF linker (ld(1)).

18.16. How can I increase or reduce the kernel address space?

By default, the kernel address space is 256 MB on FreeBSD 3.X and 1 GB on FreeBSD 4.X. If you run a network-intensive server (e.g. a large FTP or HTTP server), you might find that 256 MB is not enough.

So how do you increase the address space? There are two aspects to this. First, you need to tell the kernel to reserve a larger portion of the address space for itself. Second, since the kernel is loaded at the top of the address space, you need to lower the load address so it does not bump its head against the ceiling.

The first goal is achieved by increasing the value of NKPDE in src/sys/i386/include/pmap.h. Here is what it looks like for a 1 GB address space:

#ifndef NKPDE
#ifdef SMP
#define NKPDE                   254     /* addressable number of page tables/pde's */
#else
#define NKPDE                   255     /* addressable number of page tables/pde's */
#endif  /* SMP */
#endif

To find the correct value of NKPDE, divide the desired address space size (in megabytes) by four, then subtract one for UP and two for SMP.
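
The rule above is simple enough to check with a few lines of Python. The helper name and the error check are mine, added for illustration; the arithmetic is exactly the divide-by-four, subtract-one-or-two rule just described.

```python
PDE_SPAN_MB = 4  # the rule above divides the size in megabytes by four


def nkpde(address_space_mb, smp=False):
    """Compute NKPDE for a kernel address space of the given size in MB."""
    if address_space_mb <= 0 or address_space_mb % PDE_SPAN_MB != 0:
        raise ValueError("address space must be a positive multiple of 4 MB")
    # Subtract one for a uniprocessor (UP) kernel, two for SMP.
    return address_space_mb // PDE_SPAN_MB - (2 if smp else 1)


print(nkpde(1024))            # 1 GB, UP
print(nkpde(1024, smp=True))  # 1 GB, SMP
```

For a 1 GB address space this reproduces the 255 (UP) and 254 (SMP) values shown in the pmap.h excerpt.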

To achieve the second goal, you need to compute the correct load address: simply subtract the address space size (in bytes) from 0x100100000; the result is 0xc0100000 for a 1 GB address space. Set LOAD_ADDRESS in src/sys/i386/conf/Makefile.i386 to that value; then set the location counter in the beginning of the section listing in src/sys/i386/conf/kernel.script to the same value, as follows:

OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(btext)
SEARCH_DIR(/usr/lib); SEARCH_DIR(/usr/obj/elf/home/src/tmp/usr/i386-unknown-freebsdelf/lib);
SECTIONS
{
  /* Read-only sections, merged into text segment: */
  . = 0xc0100000 + SIZEOF_HEADERS;
  .interp     : { *(.interp)    }
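
The load address used in the script above follows from a single subtraction, sketched here in Python. The function name is illustrative; the constant 0x100100000 and the subtraction are taken from the text.

```python
LOAD_BASE = 0x100100000  # constant given in the text


def load_address(address_space_mb):
    """Return the kernel load address for an address space size given in MB."""
    size_in_bytes = address_space_mb * 1024 * 1024
    return LOAD_BASE - size_in_bytes


print(hex(load_address(1024)))  # 1 GB address space
print(hex(load_address(256)))   # 256 MB address space
```

The 1 GB case gives 0xc0100000, the value placed in LOAD_ADDRESS and in the location counter of kernel.script.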

Then reconfig and rebuild your kernel. You will probably have problems with ps(1), top(1) and the like; make world should take care of it (or a manual rebuild of libkvm, ps(1) and top(1) after copying the patched pmap.h to /usr/include/vm/).

NOTE: the size of the kernel address space must be a multiple of four megabytes.

[David Greenman adds: I think the kernel address space needs to be a power of two, but I am not certain about that. The old(er) boot code used to monkey with the high order address bits and I think expected at least 256MB granularity.]

This document, and other documents, can be found at ftp://ftp.FreeBSD.org/pub/FreeBSD/doc/.

For questions about FreeBSD, read the documentation before contacting <questions@FreeBSD.org>.
For questions about this documentation, send e-mail to <doc@FreeBSD.org>.